tpaviot opened this issue 10 years ago
It would be nice to add an option in CMakeLists.txt to enable using GCD (only on Apple, and enabled by default) so that people can easily build with and without GCD and compare performance.
@dbarbier I agree. Enabling GCD by default on APPLE would, however, break compilation for older OSX versions. I don't know how to detect OS X Mavericks with CMake, which would be the only way to safely enable GCD multithreading by default.
Modifications to Standard_Mutex.cxx miss some changes to Standard_Mutex.hxx, otherwise compilation fails. It's necessary to:
- include <dispatch/dispatch.h> to benefit from the dispatch_semaphore* functions/types
- include <libkern/OSAtomic.h> so that OSAtomicCompareAndSwap32Barrier is found

With commit 6fcd293, the branch compiles fine. However, I don't know if it does the job as expected.
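For reference, the two additions could be guarded in the header roughly like this (a sketch only, not the exact Standard_Mutex.hxx change):

```cpp
// Sketch: the extra includes Standard_Mutex.hxx needs on OS X
#if defined(__APPLE__)
  #include <dispatch/dispatch.h>   // dispatch_semaphore_* types and functions
  #include <libkern/OSAtomic.h>    // OSAtomicCompareAndSwap32Barrier
#endif
```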
Sorry I probably missed some patches. myCount should be volatile.
GCD is part of OSX since 10.6, and it is the oldest supported version, isn't it?
In the unlock function the semaphore has to be decremented, not released. I will submit the correct version when I get home.
@istvan-csanady Since 10.6, right; I thought it was older (maybe some recent major update confused me about that). We will check your parallel_for_each function; if it works, multithreading can be enabled by default on OSX.
Btw the mutex implementation is based on this: http://www.mr-edd.co.uk/blog/sad_state_of_osx_pthread_mutex_t
However, the attached source is not entirely correct as far as I remember, because the counter variable is not declared as volatile.
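For clarity, here is a minimal benaphore-style sketch of the approach described in that post (illustrative names, not the real Standard_Mutex code; it uses the increment/decrement form for brevity, whereas the actual patch relies on OSAtomicCompareAndSwap32Barrier):

```cpp
#include <dispatch/dispatch.h>
#include <libkern/OSAtomic.h>

// Illustrative benaphore: an atomic counter gates a lazily used semaphore,
// so the kernel is only involved when there is actual contention.
class FastMutexSketch
{
public:
  FastMutexSketch() : myCount(0), mySemaphore(dispatch_semaphore_create(0)) {}
  ~FastMutexSketch() { dispatch_release(mySemaphore); }

  void Lock()
  {
    // Fast path: the first thread just increments the counter and proceeds;
    // any other thread sees a value > 1 and blocks on the semaphore.
    if (OSAtomicIncrement32Barrier(&myCount) > 1)
      dispatch_semaphore_wait(mySemaphore, DISPATCH_TIME_FOREVER);
  }

  void Unlock()
  {
    // Wake exactly one waiter, and only if somebody is actually blocked.
    if (OSAtomicDecrement32Barrier(&myCount) > 0)
      dispatch_semaphore_signal(mySemaphore);
  }

private:
  volatile int32_t     myCount;     // declared volatile, as noted above
  dispatch_semaphore_t mySemaphore;
};
```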
Here are the updated files: https://www.cubby.com/pl/OCCT/_917a934c23ef42b0a1de1adedd637747
@istvan-csanady Thank you, I merged the modified Standard_Mutex.cxx and .hxx files into a single commit, cda5aa5a7309f271f8ded473ed5b24b7b65120c4.
The next steps are to:
Indeed, we should rename it, e.g. to Standard_ParallelFor? MTAPI is just a prefix that I use in a part of my project.
@istvan-csanady I updated the branch with your MT* files and modified the CMake files. I patched FastDiscret/IncrementalMesh with your parallel_for_each static method, but I get the following compilation error for both BRepMesh_FastDiscret and BRepMesh_IncrementalMesh:
[ 30%] Building CXX object adm/cmake/TKMesh/CMakeFiles/TKMesh.dir/__/__/__/src/BRepMesh/BRepMesh_FastDiscret.cxx.o
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:54: error:
call to implicitly-deleted copy constructor of 'BRepMesh_FastDiscret'
parallel_for_each (aFaces.begin(), aFaces.end(), *this);
^~~~~
/Users/thomas/Devel/oce/inc/BRepMesh_FastDiscret.hxx:234:33: note: copy
constructor of 'BRepMesh_FastDiscret' is implicitly deleted because field
'myVertices' has an inaccessible copy constructor
BRepMesh_DataMapOfVertexInteger myVertices;
(I'm running the clang version provided with OS X Mavericks.)
@tpaviot The usage of my parallel_for_each is slightly different from Intel TBB's: it uses a modern, C++11-based approach where the last argument is a lambda function. So to rewrite the code in BRepMesh_FastDiscret, you will have to write:
parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Face aFace) { (*this)(aFace); });
However, it could be rewritten to take only an object with operator() defined. But this approach fits the current state of C++ much better.
And just another comment on the parallel_for_each: with C++11 it is absolutely unnecessary to pull in the whole Intel TBB library only for parallel_for_each. With std::thread such a parallel_for could easily be implemented. I will consider writing a prototype, and if I do, parallelism could be turned on by default on every platform where C++11 is available. The only difficulty is managing the lifetime of the threads and preventing the creation of new threads every time a parallel_for_each is called. Probably we will need a thread pool.
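As a rough illustration of the idea (this is not the code discussed in this thread, just a bare-bones sketch): the range is split into one chunk per hardware thread and each chunk is processed by its own std::thread.

```cpp
#include <algorithm>
#include <iterator>
#include <thread>
#include <vector>

// Split [first, last) into one chunk per hardware thread and apply fn to each
// element, std::for_each style.  Threads are created and joined on every call,
// which is exactly the overhead a thread pool would avoid.
template <typename Iterator, typename Function>
void simple_parallel_for_each(Iterator first, Iterator last, Function fn)
{
  const std::size_t length = std::distance(first, last);
  if (length == 0) return;

  const std::size_t nbThreads =
    std::max<std::size_t>(1, std::thread::hardware_concurrency());
  const std::size_t chunkSize = (length + nbThreads - 1) / nbThreads;

  std::vector<std::thread> workers;
  for (Iterator it = first; it != last; )
  {
    Iterator chunkEnd = it;
    std::advance(chunkEnd,
                 std::min<std::size_t>(chunkSize, std::distance(it, last)));
    workers.push_back(std::thread([it, chunkEnd, &fn]() {
      std::for_each(it, chunkEnd, fn);
    }));
    it = chunkEnd;
  }
  for (std::size_t i = 0; i < workers.size(); ++i)
    workers[i].join();
}
```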
One can already build with OpenMP if the TBB dependency is not wanted; I am not sure that we want to provide our own thread pool.
@dbarbier That's right, I forgot that OCC has an OpenMP-based implementation as well. But since OpenMP is not supported by all of the supported compilers, it would be reasonable to provide a parallel_for_each implementation. std::async could be a good choice if we don't want to write a thread pool (which is not trivial, of course).
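A very small sketch of the std::async variant (one task per element, default launch policy, so the library decides how much actually runs concurrently; again just an illustration):

```cpp
#include <future>
#include <vector>

// Launch one asynchronous task per element and wait for all of them.
// With the default launch policy the runtime may run some tasks deferred,
// so no threads are managed by hand, at the cost of less control.
template <typename Iterator, typename Function>
void async_for_each(Iterator first, Iterator last, Function fn)
{
  std::vector<std::future<void> > tasks;
  for (Iterator it = first; it != last; ++it)
    tasks.push_back(std::async(fn, *it));
  for (std::size_t i = 0; i < tasks.size(); ++i)
    tasks[i].get(); // waits for completion and propagates exceptions
}
```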
@istvan-csanady Your suggestion for the parallel_for_each call in BRepMesh_FastDiscret does not compile on my machine:
parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Face aFace) {(*this)(aFace);});
This produces the following output:
In file included from /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:102:
/Users/thomas/Devel/oce/src/BRepMesh/MTAPIParallel.hxx:28:13: error: no matching
function for call to object of type 'const <lambda at
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53>'
function(iterator);
^~~~~~~~
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:5: note: in
instantiation of function template specialization
'parallel_for_each<std::__1::__wrap_iter<TopoDS_Face *>, <lambda at
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53> >'
requested here
parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Fac...
^
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53: note:
candidate function not viable: no known conversion from 'const
std::__1::__wrap_iter<TopoDS_Face *>' to 'TopoDS_Face' for 1st argument
parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Fac...
^
1 error generated.
Sorry, here is the correct one: the argument of the lambda is an iterator.
In BRepMesh_FastDiscret:
parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](std::vector<TopoDS_Face>::iterator aFaceIt) { (*this)(*aFaceIt); });
Similarly in BRepMesh_IncrementalMesh:
parallel_for_each(aFaces.begin(), aFaces.end(), [this](std::vector<TopoDS_Face>::iterator aFaceIt) { (*this)(*aFaceIt); });
Of course the CreateMutexesForSubShapes call is necessary in both cases.
@istvan-csanady I had to slightly modify what you suggested in order to compile TKMesh. Can you please review ff53abe? Not tested yet.
@tpaviot Seems to be OK.
@istvan-csanady OK, that works. I added a simple test case (BRepMesh_IncrementalMesh for the famous cylinder_head geometry); see 5b31e63dc9cc7fb1469772f79a8891a15918cbb0.
Results of this test are reproduced below:
$ ./MultiThread_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MultiThreadTestSuite
[ RUN ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread
[ OK ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread (26665 ms)
[ RUN ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread
[ OK ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread (22577 ms)
[----------] 2 tests from MultiThreadTestSuite (49242 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (49242 ms total)
[ PASSED ] 2 tests.
Your GCD MT implementation results in a 15.3% reduction in meshing time. This test was conducted on an old 2009 MacBook Pro (2.53 GHz Intel Core 2 Duo, 8 GB RAM). I'll try to run other tests and compare with Intel TBB performance. Note that both FastDiscret and IncrementalMesh are single-threaded by default (the myInParallel boolean is set to Standard_False by default). Maybe we should set it to Standard_True by default on OSX/GCD. We may also notice speed improvements in boolean operations.
@tpaviot If I understand right, this test case computes the mesh of a cylinder. I suggest adding a test case where a shape with many faces is triangulated, because the performance improvement could be measured better with such a shape.
@istvan-csanady No, the cylinder head is not just a simple cylinder: it's usually part of an engine (see http://www.opencascade.org/showroom/shapegallery/gal4/). It's a complex shape made of several faces, which is why I chose it for the test case.
I ran the same test with Intel TBB:
$ ./MultiThread_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MultiThreadTestSuite
[ RUN ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread
[ OK ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread (26109 ms)
[ RUN ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread
[ OK ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread (22301 ms)
[----------] 2 tests from MultiThreadTestSuite (48410 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (48410 ms total)
[ PASSED ] 2 tests.
A 14.6% time reduction. This is very close to the previous result; I need more tests before drawing conclusions about the relative benefits of GCD and TBB.
@istvan-csanady I finished merging your patch with the current master branch. On Apple, MT is enabled by default for the BRepMesh incremental and fast discrete meshes. If you don't want parallel meshing, you have to explicitly SetParallel your mesh to Standard_False. The test case uses 2 complex shapes from the official OCCT gallery (cylinder head and F1). Do you have any other experimental code that could be added to the MT test case (not only meshing, of course)?
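A minimal usage sketch of the opt-out (assuming an already configured mesher object; SetParallel is the switch mentioned above):

```cpp
#include <BRepMesh_IncrementalMesh.hxx>

// Opt out of the new OS X default and fall back to sequential meshing.
void DisableParallelMeshing (BRepMesh_IncrementalMesh& theMesher)
{
  theMesher.SetParallel (Standard_False);
}
```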
@tpaviot No, I don't.
@tpaviot Shouldn't we rename the MT* files to something that fits the naming conventions of OCC?
@istvan-csanady I don't really know about OCC naming conventions, and so far I have left your files as you provided them.
Hello, I suggest renaming them to src/Standard/Standard_GrandCentralDispatch.[ch]xx.
There are other problems with this branch:
- The files contain the notice "Copyright (c) 2013 icsanady. All rights reserved." We are thus not allowed to distribute them. @istvan-csanady: if you wrote these files, can you please put a less restrictive copyright notice?
- People may want to use it without GCD: we should remove #if(defined(__APPLE__) && defined(USE_GCD)) from the hxx file, and provide a dummy sequential alternative in the cxx when compiling without defining USE_GCD.
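Something along these lines is what I have in mind; a purely illustrative sketch (the function name and signature are made up, not the actual OCE code):

```cpp
// Standard_GrandCentralDispatch.hxx -- installed header, no USE_GCD in here
#include <functional>
void Standard_ParallelForEachIndex (int theCount,
                                    const std::function<void (int)>& theTask);

// Standard_GrandCentralDispatch.cxx -- the macro only selects the implementation
#if defined(__APPLE__) && defined(USE_GCD)
  #include <dispatch/dispatch.h>
#endif

void Standard_ParallelForEachIndex (int theCount,
                                    const std::function<void (int)>& theTask)
{
#if defined(__APPLE__) && defined(USE_GCD)
  // GCD path: dispatch_apply returns only when every iteration has run.
  const std::function<void (int)> aTask = theTask;  // captured by the block below
  dispatch_apply (theCount,
                  dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                  ^(size_t anIndex) { aTask (static_cast<int> (anIndex)); });
#else
  // Dummy sequential alternative for builds without USE_GCD.
  for (int anIndex = 0; anIndex < theCount; ++anIndex)
    theTask (anIndex);
#endif
}
```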
@dbarbier I agree, you can use these files for whatever you want. Those notices are just automatically generated headers.
OK, I'll take care of cleaning everything up.
@dbarbier I renamed/moved the files to Standard and updated the licence headers. However, I don't get your argument related to the use of USE_GCD. A sequential alternative is already available in BRepMesh_FastDiscret (lines 285--286), so why should we add another one to Standard_GrandCentralDispatch? You write "people may want to use it": which people? People running Linux or Windows? If they are OSX users, there's no problem at all, I guess.
@tpaviot At the moment, I guess that src/Standard/Standard_GrandCentralDispatch.hxx is installed along with the other header files, right? Then developers may use it for their own projects, and having conditional macros in installed header files is troublesome.
OTOH you can also decide not to install this header file, which is fine; for instance by changing its suffix.
@tpaviot, this feature is stalled, and TBB support is working out alright on OSX... Perhaps this feature should be closed?
Istvan recently posted the following topic on the official OCCT dev portal forum:
Apple released GCD about two years ago. GCD is basically a parallel programming toolkit, with super-efficient, lightweight asynchronous tools like dispatch queues, semaphores, asynchronous I/O tools, etc. Here is the reference: https://developer.apple.com/library/Mac/DOCUMENTATION/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html
I have created a patch that uses dispatch_semaphore_t instead of pthread_mutex_t in Standard_Mutex, and it seems to give some performance improvement over the pthread_mutex implementation. I have attached the patch for further examination.
The other thing where it can be useful is the replacement of Intel TBB's parallel_for_each with dispatch queues; they can be easily implemented, reducing the third-party dependencies of OCCT. I also attached a simple implementation of it. It uses std::function, which is a C++11 feature, but C++11 is supported by Apple since 10.7+ and iOS 5.0+, so this should not be a problem. Otherwise, it can be rewritten to work just like TBB's parallel_for_each.
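To make the idea concrete, here is a small illustrative sketch of a dispatch-queue-based parallel_for_each over iterators (this is not the attached implementation; it deliberately uses dispatch_apply_f and a context struct so it stays plain C++ without blocks):

```cpp
#include <dispatch/dispatch.h>
#include <functional>
#include <iterator>

// Sketch only: dispatch_apply_f runs the worker once per index on a
// concurrent queue and returns when every iteration has completed.
template <typename Iterator>
struct GCDForEachContext
{
  Iterator                              myFirst;
  const std::function<void (Iterator)>* myFunction;
};

template <typename Iterator>
void GCDForEachWorker (void* theContext, size_t theIndex)
{
  GCDForEachContext<Iterator>* aCtx =
    static_cast<GCDForEachContext<Iterator>*> (theContext);
  Iterator anIt = aCtx->myFirst;
  std::advance (anIt, theIndex);
  (*aCtx->myFunction) (anIt);   // the user lambda receives an iterator
}

template <typename Iterator>
void gcd_parallel_for_each (Iterator theFirst, Iterator theLast,
                            const std::function<void (Iterator)>& theFunction)
{
  GCDForEachContext<Iterator> aCtx = { theFirst, &theFunction };
  dispatch_apply_f (static_cast<size_t> (std::distance (theFirst, theLast)),
                    dispatch_get_global_queue (DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                    &aCtx,
                    &GCDForEachWorker<Iterator>);
}
```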
These should not be considered tested and reviewed code; they are rather ideas and suggestions for using this state-of-the-art technology where it is available.
Code is under the ic/osx-gcd-multithreading (https://github.com/tpaviot/oce/tree/ic/osx-gcd-multithreading) dev branch.