Grand Dispatch Central for OSX multi threading

tpaviot commented 10 years ago

Istvan recently posted the following topic on the official OCCT dev portal forum:

Apple released GCD about two years ago. GCD is a basically parallel programming toolkit, with super-efficient, lightweight asynchronous tools, like dispatch queues, semaphores, asynchronous I/O tools etc. Here is the reference: https://developer.apple.com/library/Mac/DOCUMENTATION/Performance/Reference/GCD_libdispatch_Ref/Reference/reference.html

I have created a patch, that uses dispatch_semaphore_t instead of pthread_mutex_t in Standard_Mutex, and it seems to give some performance improvement over the phtread_mutex implementation. I have attached the patch for further examination.

The other thing where it can be useful, is the replacement of Intel TBB parallel_for_each, with dispatch queues, they can be easily implemented, reducing the 3rd party dependencies of OCCT. I also attached a simple implementation of it. It uses std::function, which is a C++11 feature, but C++11 is supported by Apple since 10.7+ and iOS 5.0+, so this should not be a problem. Otherwise, it can be rewritten work just like TBB's parallel_for_each.

These should not be considered as tested, and reviewed codes, these rather ideas, and suggestions for using this state of the art technology where it is available.

Code is under the ic/osx-gcd-multithreading (https://github.com/tpaviot/oce/tree/ic/osx-gcd-multithreading) dev branch.

dbarbier commented 10 years ago

It would be nice to add an option in CMakeLists.txt to enable using GCD (only on Apple, and enabled by default) so that people can easily build with and without GCD and compare performance.

tpaviot commented 10 years ago

@dbarbier I agree. Enabling GCD by default on APPLE shoud however break compilation for older OSX versions. I don't know how to detect OSX Maverick with CMake, which would be the only way to safely enable GCD multithreading by default.

tpaviot commented 10 years ago

Modifications to Standard_Mutex.cxx miss some changes to Standard_Mutex.hxx otherwise compilation fails. It's necessary to:

include <dispatch/dispatch.h> to benefit from the dispatch_semaphore* functions/types
include <libkern/OSAtomic.h> so that OSAtomicCompareAndSwap32Barrier is found
define myMutex and myCount as private members of the Standard_Mutex class ;
rewrite the UnLock method.

With commit 6fcd293, branch compiles fine. However, I don't know if it does the job as expected.

istvan-csanady commented 10 years ago

Sorry I probably missed some patches. myCount should be volatile.

istvan-csanady commented 10 years ago

GCD is part of OSX since 10.6, and it is the oldest supported version, isn't it?

istvan-csanady commented 10 years ago

In the unlock function the semaphore has to be decremented not released, I will submit the correct version when I arrive home.

tpaviot commented 10 years ago

@istvan-csanady since 10.6 right, I thought it was older (maybe some recent major update confused me about that). We will check your parallel_for_each function, if it works multithreading can be enabled by default on OSX.

istvan-csanady commented 10 years ago

Btw the mutex implementation is based on this: http://www.mr-edd.co.uk/blog/sad_state_of_osx_pthread_mutex_t

However the attached source is not entirely correct as far is I remember, because the ciunter variable is not declared as volatile.

istvan-csanady commented 10 years ago

Here are the updated files: https://www.cubby.com/pl/OCCT/_917a934c23ef42b0a1de1adedd637747

tpaviot commented 10 years ago

@istvan-csanady thank you, I merged modified Standard_Mutex.cxx and hxx files into one single commit cda5aa5a7309f271f8ded473ed5b24b7b65120c4.

The next step are to:

add your MTAPIParallel cpp and header files to the Standard folder. I think we should rename those files to something more explicit ;
modify CMake files in order to compile and use, by default, this multi threading implementation on OSX ;
add one or more test cases into our testing suite to be able to check everything is going fine (and measure performance enhancements).

istvan-csanady commented 10 years ago

Indeed, we should rename it, eg. Standard_ParallelFor? MTAPI is just a prefix that I use in a part of my project.

tpaviot commented 10 years ago

@istvan-csanady I updated the branch with your MT* files and modified the cmake files. I patched FastDiscret/IncrmentalMesh with your parallel_for_each static method but I have the following compilation error for both BRepMesh_FastDiscret and BRepMesh_IncrementalMesh:

[ 30%] Building CXX object adm/cmake/TKMesh/CMakeFiles/TKMesh.dir/__/__/__/src/BRepMesh/BRepMesh_FastDiscret.cxx.o
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:54: error: 
      call to implicitly-deleted copy constructor of 'BRepMesh_FastDiscret'
    parallel_for_each (aFaces.begin(), aFaces.end(), *this);
                                                     ^~~~~
/Users/thomas/Devel/oce/inc/BRepMesh_FastDiscret.hxx:234:33: note: copy
      constructor of 'BRepMesh_FastDiscret' is implicitly deleted because field
      'myVertices' has an inaccessible copy constructor
BRepMesh_DataMapOfVertexInteger myVertices;

(I run the clang version provided with OSX Maverick).

istvan-csanady commented 10 years ago

@tpaviot the usage of my parallel for each is slightly different from Intel TBBs, it uses a modern, C++11 based approach, the last argument is a lambda function. So to rewrite the code in BRepMesh_FastDiscret, you will have to write:

parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Face aFace) { (*this)(aFace); });

istvan-csanady commented 10 years ago

However, it could be rewritten, to only take an object with an operator() defined. But this approach fits much better to C++'s current state.

istvan-csanady commented 10 years ago

And just another comment on the parallel_for_each: with C++11 it is absolutely unnecessary to pull the whole Intel TBB library only because of the parallel_for_each. With std::thread such a parallel_for could be easily implemented. I will consider to write a prototype, and if I will, then parallelism could be turned on by default on every platform where C++11 is available. The only difficulty is to manage the lifetime of threads, and preventing the creation of new threads every time when a parallel_for_each is called. Probably we will need a thread pool.

dbarbier commented 10 years ago

One can already build with openmp if TBB dependency is not wanted, I am not sure that we want to provide our own thread pool.

istvan-csanady commented 10 years ago

@dbarbier Thats right, I forgot that OCC has an OpenMP based implementation as well. But since OpenMP is not supported by all the supported compilers, it would be reasonable to provide a parallel_for_each implementation. std::async could be a good choice, if we don't want to write a thread pool (that is not trivial of course).

tpaviot commented 10 years ago

@istvan-csanady Your suggestion for the parallel for each call from BRepMesh_FastDiscret does not compile on my machine:

parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Face aFace) {(*this)(aFace);});

This produces the following output:

In file included from /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:102:
/Users/thomas/Devel/oce/src/BRepMesh/MTAPIParallel.hxx:28:13: error: no matching
      function for call to object of type 'const <lambda at
      /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53>'
            function(iterator);
            ^~~~~~~~
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:5: note: in
      instantiation of function template specialization
      'parallel_for_each<std::__1::__wrap_iter<TopoDS_Face *>, <lambda at
      /Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53> >'
      requested here
    parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Fac...
    ^
/Users/thomas/Devel/oce/src/BRepMesh/BRepMesh_FastDiscret.cxx:270:53: note: 
      candidate function not viable: no known conversion from 'const
      std::__1::__wrap_iter<TopoDS_Face *>' to 'TopoDS_Face' for 1st argument
    parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](TopoDS_Fac...
                                                    ^
1 error generated.

istvan-csanady commented 10 years ago

Sorry, here is the correct one... The argument of the lambda is an iterator In BRepMesh_FastDiscret: parallel_for_each(aFaces.begin(), aFaces.end(), [ this ](std::vector::iterator iter){ Process (*iter); });

Similarly in BRepMesh_IncrementalMesh: parallel_for_each(aFaces.begin(), aFaces.end(), [this](std::vector::iterator iter){ myMesh->Process (*iter); });

Of course the CreateMutexesForSubShapes call is necessary in both cases.

tpaviot commented 10 years ago

@istvan-csanady I had to slightly modify what you suggest in order to compile TKMesh. Can you please review ff53abe ? Not tested yet.

istvan-csanady commented 10 years ago

@tpaviot Seems to be OK.

tpaviot commented 10 years ago

@istvan-csanady ok, that works. I added a simple test case (BRepMesh_IncrementalMesh for the famous cylinder_head geometry), see 5b31e63dc9cc7fb1469772f79a8891a15918cbb0.

Results of this test are reproduced below:

$ ./MultiThread_test 
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MultiThreadTestSuite
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread (26665 ms)
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread (22577 ms)
[----------] 2 tests from MultiThreadTestSuite (49242 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (49242 ms total)
[  PASSED  ] 2 tests.

Your GCD MT implementation results in an speed increase of 15.3%. This test was conducted on an old 2009 MacBoocPro 2.53GHz Intel Core 2 Duo/8Gb RAM. I'll try to do other tests and compare with Intel TBB performances. Note that both FastDiscret and IncrementalMesh are single threaded by default (the myInParallel boolean is set to Standard_False by default). Maybe we should set it to Standard_True by default on OSX/GCD. We may notice speed improvements in boolean operations.

istvan-csanady commented 10 years ago

@tpaviot If I understand right, this test case calculates the mesh of a cylinder. I suggest to add a test case where a shape, that has many faces is triangulated, because the performance improvement could be measured better in such a shape.

tpaviot commented 10 years ago

@istvan-csanady No, cylinder head is not just a simple cylinder: it's usually part of an engine (see http://www.opencascade.org/showroom/shapegallery/gal4/). It's a complex shape made of several faces, that's why I chose it for the use case.

tpaviot commented 10 years ago

I ran the same test with Intel TBB:

$ ./MultiThread_test 
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MultiThreadTestSuite
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadSingleThread (26109 ms)
[ RUN      ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread
[       OK ] MultiThreadTestSuite.testMeshCylinderHeadMultiThread (22301 ms)
[----------] 2 tests from MultiThreadTestSuite (48410 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (48410 ms total)
[  PASSED  ] 2 tests.

14.6% in time reduction. This is very close to the previous result, I need more tests to conclude about the compared benefits from using GCB or TBB.

tpaviot commented 10 years ago

@istvan-csanady I finished merging your patch with current master branch. On Appel, MT is enabled by default for BRepMesh incremental and fast discrete meshes. If you don't want to use parallel meshing, you have to explicitely SetParallel your mesh to Standard_False. The test case use 2 complex shapes from the official occt gallery (cylinder head and F1). Do you have any other experimental code that could be added to the MT test case (not only meshing of course) ?

istvan-csanady commented 10 years ago

@tpaviot No, I don't.

istvan-csanady commented 10 years ago

@tpaviot Shouldn't we rename the MT* files to something that fits to the naming convention of OCC?

tpaviot commented 10 years ago

@istvan-csanady I don't really know about OCC naming conventions, and so far I left your files as you provided them.

dbarbier commented 10 years ago

Hello, I suggest to rename them into src/Standard/Standard_GrandCentralDispatch.[ch]xx

There are other problems with this branch:

Files contain this copyright notice:

  Copyright (c) 2013 icsanady. All rights reserved.

We are thus not allowed to distribute them. @istvan-csanady: if you wrote these files, can you please put a less restrictive copyright notice?

Declarations in MTAPIParallel.hxx depend on macro USE_GCD. If this header file is installed, users may want to use it, and they will have to define USE_GCD in their project too. It would be better to remove conditions

#if(defined(__APPLE__) && defined(USE_GCD))

from the hxx file, and provide a dummy sequential alternative in cxx when compiling without defining USE_GCD.

istvan-csanady commented 10 years ago

@dbarbier I agree to use these files for whatever you want. Those notes are just automatically generated headers.

tpaviot commented 10 years ago

ok, I take care of cleaning up everything.

tpaviot commented 10 years ago

@dbarbier I renamed/moved files to Standard and updated licence headers. However, I don't get your argument related to the use of USE_GCD. I sequential alternative is already available in BRepMesh_FastDiscret (lines 285--286), why should we add another one to Standard_GrandCentralDispacth ? You write "people may want to use it": which people? people running linux or windows ?? If they are OSX users, there's no problem at all I guess.

dbarbier commented 10 years ago

@tpaviot At the moment, I guess that src/Standard/Standard_GrandCentralDispatch.hxx is installed along with other header files, right? Then developers may use it for their own projects, and having conditional macros in header files is then troublesome. OTOH you can also decide to not install this header file, this is fine; for instance by changing its suffix.

jf--- commented 8 years ago

@tpaviot , this feature is stalled, and TBB support is working out alright on OSX... Perhaps this feature should be closed?

tpaviot / oce

Grand Dispatch Central for OSX multi threading #442