stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Make optional use of parallel C++17 algorithms in map_rect_concurrent #1084

Closed: wds15 closed this issue 5 years ago

wds15 commented 5 years ago

Description

The C++17 standard brings parallel versions of most algorithms, including for_each. While most compilers do not yet support this standard out of the box, the Intel Parallel STL provides a free and portable implementation of it. The Intel Parallel STL will also be used as the foundation for the parallel C++17 implementation in the forthcoming gcc 9. The advantages of using this implementation are:

The proposal here is to change things as follows. Here is what we do right now:

The intent of this issue is to change things along these lines:

I am proposing three mutually exclusive implementations for the for_each (a sketch of the compile-time selection follows the list):

  1. If STAN_PSTL_CPP17 is defined, then the vanilla C++17 implementation is used; the compiler needs to support this (the forthcoming gcc 9, MSVC, and very recent Intel compilers do).
  2. If STAN_PSTL_INTEL is defined, then the Intel Parallel STL is used. The Intel Parallel STL is relatively easy to install and only requires a C++11 compiler with OpenMP SIMD support (available since g++ 4.9). I am fairly optimistic that this can even be achieved on Windows.
  3. If neither of the above is defined, then a for_each implementation provided in /stan/math/parallel/for_each.hpp is used, which has the same interface as the C++17 version. It supports the same execution policies as C++17, but the implementation is based on the C++11 async facilities, just as before.
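
A minimal sketch of how that compile-time selection could look; the macro names and the fallback header path are taken from this issue, while the Intel include paths are an assumption rather than the merged code:

```cpp
// Sketch only: select a for_each backend at compile time.
#if defined(STAN_PSTL_CPP17)
// Vanilla C++17 parallel algorithms; requires a compiler/library that ships them.
#include <algorithm>
#include <execution>
#elif defined(STAN_PSTL_INTEL)
// Intel Parallel STL (assumed include paths; check the PSTL release for the
// exact header layout and namespaces).
#include <pstl/algorithm>
#include <pstl/execution>
#else
// Proposed C++11 async-based fallback with a C++17-like for_each interface.
#include <stan/math/parallel/for_each.hpp>
#endif
```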

In summary, the proposal does not change anything about threading in the stan-math library at all. We simply refactor map_rect_concurrent so that it uses a C++17 for_each algorithm. If the C++17 algorithm is not available (either from the compiler itself or via the Intel Parallel STL), a C++11-based version is provided instead.

As a side-product, this will introduce a for_each which allows parallel execution. Parallel execution is permitted regardless of whether the STAN_THREADS flag is set. This is intentional, in order to allow parallelization of code where we know it is safe to do so.
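
As an illustration of what the refactored map_rect_concurrent would conceptually call, here is a hedged, self-contained sketch using only the standard C++17 interface (the shard functor and data layout here are made up for illustration):

```cpp
#include <algorithm>
#include <execution>
#include <vector>

int main() {
  // Each element stands in for one map_rect job/shard.
  std::vector<double> shard_results(16, 0.0);

  // With a parallel execution policy the library may run iterations
  // concurrently; with std::execution::seq this degenerates to a serial loop.
  std::for_each(std::execution::par, shard_results.begin(), shard_results.end(),
                [](double& r) { r = 2.0 * r + 1.0; });

  return 0;
}
```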

The C++11 for_each version will use STAN_NUM_THREADS threads. The number of threads used by the other backends needs to be managed separately. For the Intel Parallel STL, we will need to initialize the underlying Intel Threading Building Blocks library accordingly; this needs to happen in cmdstan (I will file an issue for this). For the vanilla C++17 version it is an implementation detail of the compiler (so we need to see how to control that).
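
For the C++11 fallback, the thread count would come from the STAN_NUM_THREADS environment variable, roughly along these lines; this is a hypothetical helper for illustration, and the actual stan-math utility may differ in details such as error handling:

```cpp
#include <cstdlib>
#include <string>
#include <thread>

// Hypothetical helper: read STAN_NUM_THREADS, treating -1 as "use all cores".
inline int num_threads_from_env() {
  const char* val = std::getenv("STAN_NUM_THREADS");
  if (val == nullptr) return 1;  // default: no extra threads
  int n = std::stoi(std::string(val));
  if (n == -1) return static_cast<int>(std::thread::hardware_concurrency());
  return n > 0 ? n : 1;
}
```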

Interactions:

No new interactions between the different extra features are introduced. I still list them as I am aware of them:

MPI:

GPU / threading - STAN_THREADS defined:

GPU / threading - STAN_THREADS NOT defined:

GPU / MPI:

I think that covers the relevant setups.

Example

No behavior will be changed, although the respective PR should fix issue #1079 as well.

Expected Output

Things should run faster whenever the Intel Parallel STL is used instead of the C++11 based implementation.


Current Math Version

v2.18.0

bob-carpenter commented 5 years ago

How does this interact with GPU, MPI, and existing multi-threading? It's only helpful to keep parallelizing until all the cores are engaged. So I think anything like this would need to be controllable.

@syclik is the one coordinating decisions like this in the math lib, but this also crosses into language issues that @seantalts is in charge of.

syclik commented 5 years ago

@wds15: I agree with Bob. Before considering the PR, you'll need to break down how this interacts with GPU, MPI, and multi-threading. And we should update the top-level comment as we gain more knowledge about whether this will work.

I would really like the issues to have a clear specification of what happens. It helps with the releases and allows for the implementer to figure out how to get it done. Reading this issue now, I'm not sure how this is going to impact the existing code base.

bob-carpenter commented 5 years ago

I think the answer here may be as simple as using the system thread pool. That is, it'll just be thread contention, nothing special. I think we still need to call that out, as it's a serious issue for efficiency tuning.


wds15 commented 5 years ago

@bob-carpenter ... I'm not sure what you mean by that. So far we rely on very high-level building blocks for threading, since we are bound to use a portable threading facility.

However, my benchmarks do show that when using threads, one should really use a so-called scalable memory allocator (at least under macOS). Otherwise performance is degraded by memory-allocation bottlenecks, and the arena allocator is much less effective because the child threads go out of scope. This is why I am interested in the Parallel STL from Intel: to my knowledge they use a thread pool, so the arena allocator is no longer reset as frequently. At least this is my understanding.
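
For reference, a hedged sketch of what using TBB's scalable memory allocator for per-thread buffers could look like (the container and types are made up for illustration; this is not code from the PR):

```cpp
#include <tbb/scalable_allocator.h>
#include <vector>

int main() {
  // Buffers allocated through the scalable allocator avoid contention on the
  // global heap when many threads allocate and free concurrently.
  std::vector<double, tbb::scalable_allocator<double>> buffer(1024, 0.0);
  buffer[0] = 1.0;
  return 0;
}
```

A common alternative is to link TBB's malloc proxy library so that all malloc/new calls go through the scalable allocator without touching the code.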

SteveBronder commented 5 years ago

> In case map_rect is called and each job is accessing the GPU resource, then this needs locking of the GPU resource... I do not know how the GPU code is organized here, maybe @rok-cesnovar can comment. However, this is nothing new.

OpenCL is thread-safe by default, so nothing bad would happen. Though we currently use a sequential execution policy on the GPU, so if each thread calls the GPU, those calls would still execute sequentially when they hit the GPU code.

After the Cholesky PR I'd like to sit down and see how much of a speedup we would get from using an out-of-sequence policy while turning on MPI/threading.
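
For context, the in-order versus out-of-order behavior comes down to how the OpenCL command queue is created. A hedged sketch using the OpenCL C++ wrapper, purely for illustration and not the actual stan-math GPU code:

```cpp
#include <CL/cl.hpp>
#include <vector>

int main() {
  // Pick the first platform/device (error handling omitted for brevity).
  std::vector<cl::Platform> platforms;
  cl::Platform::get(&platforms);
  std::vector<cl::Device> devices;
  platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &devices);
  cl::Context context(devices[0]);

  // In-order queue: commands execute one after another, which is effectively
  // the sequential execution policy mentioned above.
  cl::CommandQueue in_order(context, devices[0]);

  // Out-of-order queue: the runtime may overlap independent kernels.
  cl::CommandQueue out_of_order(context, devices[0],
                                CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);
  return 0;
}
```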

rok-cesnovar commented 5 years ago

@wds15 this can be closed right?

wds15 commented 5 years ago

yeah, this is out of date

boy, I have been looking at this stuff for a while now...