Closed: wds15 closed this issue 5 years ago
How does this interact with GPU, MPI, and existing multi-threading? It's only helpful to keep parallelizing until all the cores are engaged. So I think anything like this would need to be controllable.
@syclik is the one coordinating decisions like this in the math lib, but this also crosses into language issues that @seantalts is in charge of.
@wds15: I agree with Bob. Before considering the PR, you'll need to break down how this interacts with GPU, MPI, and multi-threading. And we should update the top-level comment as we gain more knowledge about whether this will work.
I would really like the issues to have a clear specification of what happens. It helps with the releases and allows for the implementer to figure out how to get it done. Reading this issue now, I'm not sure how this is going to impact the existing code base.
I think the answer here may be as simple as using the system thread pool. That is, it'll just be thread contention, nothing special. I think we still need to call that out, as it's a serious issue for efficiency tuning.
@bob-carpenter ... not sure how you mean that. So far, in terms of threading, we rely on very high-level building blocks, since we are bound to use a portable threading facility.
However, my benchmarks do show that when using threads one should really use a so-called scalable memory allocator (at least under macOS). Otherwise performance degrades due to memory-allocation bottlenecks: the arena allocator is much less effective because the child threads go out of scope. This is why I am interested in the parallel STL from Intel - to my knowledge it uses a thread pool, so the arena allocator is no longer reset as frequently. At least this is my understanding.
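To make the allocator point concrete, here is a minimal sketch (not part of the proposal) of routing a container's allocations through TBB's scalable allocator; linking against `tbbmalloc_proxy` is an alternative that replaces `malloc`/`free` process-wide without code changes:

```cpp
// Illustrative only: using TBB's scalable allocator for a buffer.
// Assumes the TBB headers and the tbbmalloc library are available
// (link with -ltbbmalloc).
#include <tbb/scalable_allocator.h>
#include <cstddef>
#include <vector>

using scalable_vector = std::vector<double, tbb::scalable_allocator<double>>;

scalable_vector make_buffer(std::size_t n) {
  // Allocations go through the scalable allocator, which keeps
  // per-thread memory pools and avoids a global allocator lock.
  return scalable_vector(n, 0.0);
}
```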
In case `map_rect` is called and each job is accessing the GPU resource, then this needs locking of the GPU resource... I do not know how the GPU code is organized here, maybe @rok-cesnovar can comment. However, this is nothing new.
OpenCL is thread-safe by default, so nothing bad would happen. However, we currently use a sequential execution policy on the GPU, so if each thread calls the GPU, those calls would still execute sequentially when they hit the GPU code.
After the Cholesky PR I'd like to sit down and see how much of a speedup we would get from using an out-of-sequence policy while turning on MPI/threading.
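For illustration only: if the GPU resource were not thread-safe, the "locking of the GPU resource" mentioned above could be as simple as a mutex around each submission. All names here are hypothetical, and per the comment above OpenCL is already thread-safe, so this may never be needed:

```cpp
#include <mutex>

// Hypothetical placeholder for work submitted to the GPU.
void submit_gpu_job(int job_id);

// Single process-wide lock serializing access to the GPU resource.
std::mutex gpu_mutex;

void run_job(int job_id) {
  // Each worker thread takes the lock before touching the GPU,
  // so GPU calls from parallel map_rect jobs execute one at a time.
  std::lock_guard<std::mutex> guard(gpu_mutex);
  submit_gpu_job(job_id);
}
```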
@wds15 this can be closed right?
yeah, this is out of date
boy, I have been looking at this stuff for a while now...
Description
The C++17 standard brings parallel versions of most algorithms, including `for_each`. While most compilers do not yet support this standard out of the box, the Intel Parallel STL provides a free and portable implementation of it. The Intel Parallel STL will be used as the foundation for the parallel C++17 implementation in the forthcoming gcc 9.0. The advantages of using this implementation are that it is free, portable, and follows the C++17 interface.

The proposal here is to change things as follows. Here is what we do right now:

- `map_rect_concurrent` is based on the C++11 `async` facility (a generic sketch of this pattern follows after this list).
- If `STAN_THREADS` is defined, the function will keep `STAN_NUM_THREADS` threads busy and execute the `map_rect` call in parallel.
- If `STAN_MPI` and `STAN_THREADS` are defined, then `map_rect` will use MPI as a backend. The threading bit is not used unless the user performs a nested `map_rect` call. In the nested case, things work just fine as the nested operation happens within each MPI process.
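For context, a rough, generic sketch of the `std::async` fan-out pattern referenced above; this is only an illustration of the C++11 facility, not the actual `map_rect_concurrent` code:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Apply f to every index in [0, n), splitting the range over num_chunks
// asynchronous tasks and waiting for all of them to finish.
template <typename F>
void async_for_each_index(std::size_t n, std::size_t num_chunks, F f) {
  std::vector<std::future<void>> futures;
  const std::size_t chunk = (n + num_chunks - 1) / num_chunks;
  for (std::size_t start = 0; start < n; start += chunk) {
    const std::size_t end = std::min(n, start + chunk);
    futures.push_back(std::async(std::launch::async, [=]() {
      for (std::size_t i = start; i < end; ++i) f(i);
    }));
  }
  for (auto& fut : futures) fut.get();  // re-throws any exception
}
```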
The intent of this issue is to change things along these lines:

- Use `for_each` with execution policies as introduced in C++17.
- This `for_each` will replace the `async` based threading in `map_rect_concurrent`.
- If `STAN_THREADS` is NOT defined, then the `sequential_policy` is used.
- If `STAN_THREADS` is defined, then the `parallel_unsequenced_policy` is used.

I am proposing 3 mutually exclusive implementations of the `for_each` (a sketch of the compile-time selection follows after this list):

- If `STAN_PSTL_CPP17` is defined, then the vanilla C++17 implementation is used => the compiler needs to support this (supported by the forthcoming gcc 9, MSVC and very recent Intel compilers).
- If `STAN_PSTL_INTEL` is defined, then the Intel Parallel STL is used. The Intel Parallel STL is relatively easy to install and only requires a C++11 compiler with OpenMP SIMD support (available since g++ 4.9). I am fairly optimistic that this can even be achieved on Windows.
- Otherwise, a `for_each` implementation that is now part of `/stan/math/parallel/for_each.hpp` is used, which has the same interface as the C++17 one. So this one supports the same execution policies as in C++17, but the implementation is based on C++11 `async` facilities, just as before.
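A sketch of how the compile-time selection between these three backends might look; the macro names come from the list above, while the include paths shown for the Intel Parallel STL are my assumption of the commonly documented ones and may differ by version:

```cpp
// Illustration only: dispatch between the three mutually exclusive backends.
#if defined(STAN_PSTL_CPP17)
// Vanilla C++17 parallel algorithms; requires compiler support
// (forthcoming gcc 9, recent MSVC and Intel compilers).
#include <algorithm>
#include <execution>
#elif defined(STAN_PSTL_INTEL)
// Intel Parallel STL; needs only a C++11 compiler with OpenMP SIMD support.
#include <pstl/algorithm>
#include <pstl/execution>
#else
// Fallback: the proposed C++11 async-based implementation that mimics the
// C++17 interface (stan/math/parallel/for_each.hpp).
#endif
```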
In summary, the proposal here does not change anything in terms of threading for the `stan-math` library at all. We simply refactor `map_rect_concurrent` in such a way that it uses a C++17 `for_each` algorithm. In case the C++17 implementation is not available (either from the standard itself or from the Intel Parallel STL), then a C++11 version is provided.

As a side product, this will introduce a `for_each` which allows parallel execution. The parallel execution is permitted regardless of whether the `STAN_THREADS` flag is set or not. This is done intentionally in order to allow parallelization of code where we know it is safe to do so. A minimal sketch of such a call follows below.
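As a minimal sketch of the targeted interface, here is plain C++17 (not the stan-math wrapper) selecting the execution policy based on `STAN_THREADS`, mirroring the proposal above:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

// Run one job per element, in parallel when STAN_THREADS is defined.
void run_jobs(std::vector<int>& job_ids) {
  auto work = [](int id) { (void)id; /* evaluate one independent job */ };
#ifdef STAN_THREADS
  std::for_each(std::execution::par_unseq, job_ids.begin(), job_ids.end(), work);
#else
  std::for_each(std::execution::seq, job_ids.begin(), job_ids.end(), work);
#endif
}
```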
The C++11 `for_each` version will use `STAN_NUM_THREADS` threads. The number of threads used by the other backends needs to be managed separately. For the Intel Parallel STL, we will need to initialize the Intel Threading Building Blocks base library accordingly; this needs to happen in cmdstan (I will file an issue for this), and a sketch of that initialization follows below. For the vanilla C++17 based version it is an implementation detail of the compiler (so we need to see how to control that).
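For illustration, one way cmdstan could cap the TBB worker pool from the `STAN_NUM_THREADS` environment variable; `tbb::task_scheduler_init` is the classic TBB API for this (newer TBB releases use `tbb::global_control` instead), and the environment-variable handling here is just an assumption about the wiring:

```cpp
#include <tbb/task_scheduler_init.h>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Limit TBB's worker pool to STAN_NUM_THREADS (falling back to the
// hardware default when the variable is unset or unparsable).
int main() {
  int num_threads = tbb::task_scheduler_init::automatic;
  if (const char* env = std::getenv("STAN_NUM_THREADS")) {
    try {
      num_threads = std::stoi(env);
    } catch (const std::exception&) { /* keep the automatic default */ }
  }
  tbb::task_scheduler_init init(num_threads);
  // ... run the sampler; parallel for_each calls now use at most
  // num_threads TBB worker threads ...
  return 0;
}
```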
Interactions:

No new interactions between the different extra features are introduced. I still list the ones I am aware of:

- MPI: if `map_rect` is used from the Stan language, then things work as described above (threading will be nested in MPI).
- GPU / threading with `STAN_THREADS` defined: if `map_rect` is called and each job is accessing the GPU resource, then this needs locking of the GPU resource... I do not know how the GPU code is organized here, maybe @rok-cesnovar can comment. However, this is nothing new.
- GPU / threading with `STAN_THREADS` NOT defined: `map_rect` will not parallelize using threads. Code may still use the parallel `for_each` call... this is only safe if the AD stack is not touched and the GPU resource is thread-safe.
- GPU / MPI: `map_rect` can make use of many GPUs only if the workers have access to them; `map_rect` calls cannot work on the GPU as long as the GPU resource is only available on the main node, but not on the workers.

I think that covers the relevant setups.
Example
No behavior will be changed, although the respective PR should fix issue #1079 as well.
Expected Output
Things should run faster whenever the Intel Parallel STL is used instead of the C++11 based implementation.
Current Math Version
v2.18.0