stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Make optional use of parallel C++17 algorithms in map_rect_concurrent #1084

Closed: wds15 closed this issue 5 years ago

wds15 commented 5 years ago

Description

The C++17 standard brings parallel versions of most algorithms, including for_each. While most compilers do not yet support this standard out of the box, the Intel Parallel STL provides a free and portable implementation of it. The Intel Parallel STL will also be used as the foundation for the parallel C++17 implementation in the forthcoming gcc 9. The advantages of using this implementation are:

The proposal here is to change things as follows. Here is what we do right now:

The intent of this issue is to change things along these lines:

I am proposing three mutually exclusive implementations for the for_each (a sketch of the compile-time selection follows the list):

  1. If STAN_PSTL_CPP17 is defined, then the vanilla C++17 implementation is used; the compiler needs to support this (the forthcoming gcc 9, MSVC, and very recent Intel compilers do).
  2. If STAN_PSTL_INTEL is defined, then the Intel Parallel STL is used. The Intel Parallel STL is relatively easy to install and only requires a C++11 compiler with OpenMP SIMD support (available since g++ 4.9). I am fairly optimistic that this can even be achieved on Windows.
  3. If neither of the above is defined, then a for_each implementation provided in /stan/math/parallel/for_each.hpp is used, which has the same interface as the C++17 version. It supports the same execution policies as C++17, but the implementation is based on the C++11 async facilities, just as before.
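
A minimal sketch of how that compile-time selection could look; the macro names and the fallback header path are taken from this issue, while the Intel include paths are an assumption rather than the merged code:

```cpp
// Sketch only: select a for_each backend at compile time.
#if defined(STAN_PSTL_CPP17)
// Vanilla C++17 parallel algorithms; requires a compiler/library that ships them.
#include <algorithm>
#include <execution>
#elif defined(STAN_PSTL_INTEL)
// Intel Parallel STL (assumed include paths; check the PSTL release for the
// exact header layout and namespaces).
#include <pstl/algorithm>
#include <pstl/execution>
#else
// Proposed C++11 async-based fallback with a C++17-like for_each interface.
#include <stan/math/parallel/for_each.hpp>
#endif
```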

In summary, the proposal does not change anything about threading in the stan-math library at all. We simply refactor map_rect_concurrent so that it uses a C++17 for_each algorithm. If the C++17 algorithm is not available (either from the compiler itself or via the Intel Parallel STL), a C++11-based version is provided instead.

As a side-product, this will introduce a for_each which allows parallel execution. Parallel execution is permitted regardless of whether the STAN_THREADS flag is set. This is intentional, in order to allow parallelization of code where we know it is safe to do so.
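
As an illustration of what the refactored map_rect_concurrent would conceptually call, here is a hedged, self-contained sketch using only the standard C++17 interface (the shard functor and data layout here are made up for illustration):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
  // Each element stands in for one map_rect job/shard.
  std::vector<double> shard_results(16, 0.0);

  // With a parallel execution policy the library may run iterations
  // concurrently; with std::execution::seq this degenerates to a serial loop.
  std::for_each(std::execution::par, shard_results.begin(), shard_results.end(),
                [](double& r) { r = 2.0 * r + 1.0; });

  return 0;
}
```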

The C++11 for_each version will use STAN_NUM_THREADS threads. The number of threads used by the other backends needs to be managed separately. For the Intel Parallel STL, we will need to initialize the underlying Intel Threading Building Blocks library accordingly; this needs to happen in cmdstan (I will file an issue for this). For the vanilla C++17 version it is an implementation detail of the compiler (so we need to see how to control that).
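
For the C++11 fallback, the thread count would come from the STAN_NUM_THREADS environment variable, roughly along these lines; this is a hypothetical helper for illustration, and the actual stan-math utility may differ in details such as error handling:

```cpp
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical helper: read STAN_NUM_THREADS, treating -1 as "use all cores".
inline int num_threads_from_env() {
  const char* val = std::getenv("STAN_NUM_THREADS");
  if (val == nullptr) return 1;  // default: no extra threads
  int n = std::stoi(std::string(val));
  if (n == -1) return static_cast<int>(std::thread::hardware_concurrency());
  return n > 0 ? n : 1;
}
```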

Interactions:

No new interactions between the different extra features are introduced. I still list them as I am aware of them:

MPI:

GPU / threading - STAN_THREADS defined:

GPU / threading - STAN_THREADS NOT defined:

GPU / MPI:

I think that covers the relevant setups.

Example

No behavior will be changed, although the respective PR should fix issue #1079 as well.

Expected Output

Things should run faster whenever the Intel Parallel STL is used instead of the C++11 based implementation.


Current Math Version

v2.18.0

bob-carpenter commented 5 years ago

How does this interact with GPU, MPI, and existing multi-threading? It's only helpful to keep parallelizing until all the cores are engaged. So I think anything like this would need to be controllable.

@syclik is the one coordinating decisions like this in the math lib, but this also crosses into language issues that @seantalts is in charge of.

syclik commented 5 years ago

@wds15: I agree with Bob. Before considering the PR, you'll need to break down how this interacts with GPU, MPI, and multi-threading. And we should update the top-level comment as we gain more knowledge about whether this will work.

I would really like the issues to have a clear specification of what happens. It helps with the releases and allows for the implementer to figure out how to get it done. Reading this issue now, I'm not sure how this is going to impact the existing code base.

bob-carpenter commented 5 years ago

I think the answer here may be as simple as using the system thread pool. That is, it'll just be thread contention, nothing special. I think we still need to call that out, as it's a serious issue for efficiency tuning.


wds15 commented 5 years ago

@bob-carpenter ... I'm not sure what you mean by that. So far we rely on very high-level building blocks for threading, since we are bound to use a portable threading facility.

However, my benchmarks do show that when using threads, one should really use a so-called scalable memory allocator (at least under macOS). Otherwise performance is degraded by memory-allocation bottlenecks, and the arena allocator is much less effective because the child threads go out of scope. This is why I am interested in the Parallel STL from Intel: to my knowledge they use a thread pool, so the arena allocator is no longer reset as frequently. At least this is my understanding.
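
For reference, a hedged sketch of what using TBB's scalable memory allocator for per-thread buffers could look like (the container and types are made up for illustration; this is not code from the PR):

```cpp
#include <tbb/scalable_allocator.h>
#include <vector>

int main() {
  // Buffers allocated through the scalable allocator avoid contention on the
  // global heap when many threads allocate and free concurrently.
  std::vector<double, tbb::scalable_allocator<double>> buffer(1024, 0.0);
  buffer[0] = 1.0;
  return 0;
}
```

A common alternative is to link TBB's malloc proxy library so that all malloc/new calls go through the scalable allocator without touching the code.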

SteveBronder commented 5 years ago

> In case map_rect is called and each job is accessing the GPU resource, then this needs locking of the GPU resource... I do not know how the GPU code is organized here, maybe @rok-cesnovar can comment. However, this is nothing new.

OpenCL is thread-safe by default, so nothing bad would happen. Though we currently use a sequential execution policy on the GPU, so if each thread calls the GPU, those calls would still execute sequentially when they hit the GPU code.

After the Cholesky PR I'd like to sit down and see how much of a speedup we would get from using an out-of-sequence policy while turning on MPI/threading.
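
For context, the in-order versus out-of-order behavior comes down to how the OpenCL command queue is created. A hedged sketch using the OpenCL C++ wrapper, purely for illustration and not the actual stan-math GPU code:

```cpp
#include <CL/cl.hpp>
#include <vector>

int main() {
  // Pick the first platform/device (error handling omitted for brevity).
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> devices;
  platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
  cl::Context context(devices[0]);

  // In-order queue: commands execute one after another, which is effectively
  // the sequential execution policy mentioned above.
  cl::CommandQueue in_order(context, devices[0]);

  // Out-of-order queue: the runtime may overlap independent kernels.
  cl::CommandQueue out_of_order(context, devices[0],
                                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
  return 0;
}
```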

rok-cesnovar commented 5 years ago

@wds15 this can be closed right?

wds15 commented 5 years ago

yeah, this is out of date

boy, I have been looking at this stuff for a while now...