stan-dev / math

The Stan Math Library is a C++ template library for automatic differentiation of any order using forward, reverse, and mixed modes. It includes a range of built-in functions for probabilistic modeling, linear algebra, and equation solving.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Use TBB to parallelise unary functions of large containers #1918

Open andrjohns opened 4 years ago

andrjohns commented 4 years ago

Description

Vectorised unary functions (e.g., exp, log) currently evaluate containers either by looping or, for arithmetic types only, by using Eigen's vectorised functions. We could instead use TBB's parallel_for to evaluate the container in parallel.

A minimal example of this would be applying exp to every element of a large Eigen vector:

  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  // initialise the TBB scheduler (default number of threads)
  tbb::task_scheduler_init init;
  // each task applies std::exp over the chunk of indices the scheduler hands it
  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, m_d.size()),
    [&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i < r.end(); ++i)
        m_res[i] = std::exp(m_d[i]);
    });

Benchmarking shows roughly 2x the performance of Eigen's vectorised exp, and roughly 5x that of a simple loop (benchmarking code at the end of this post):

Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 1.53, 0.87, 0.79
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ParExp       10259973 ns     10225812 ns           71
EigExp       23918425 ns     23917527 ns           28
LoopExp      53930647 ns     53929832 ns           10

This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.

I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising

Current Version:

v3.2.0

Code used for benchmarking:

#include <stan/math/mix.hpp>
#include <benchmark/benchmark.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

static void ParExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    tbb::task_scheduler_init init;
    tbb::parallel_for(
      tbb::blocked_range<size_t>(0, m_d.size()), 
      [&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i < r.end(); ++i)
          m_res[i] = std::exp(m_d[i]);
      });
  }
}
BENCHMARK(ParExp);

static void EigExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    m_res.array() = m_d.array().exp();
  }
}
BENCHMARK(EigExp);

static void LoopExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    for(size_t i = 0; i < m_d.size(); ++i) {
      m_res[i] = std::exp(m_d[i]);
    }
  }
}
BENCHMARK(LoopExp);

BENCHMARK_MAIN();
andrjohns commented 4 years ago

I'm now realising that the synchronisation has more to do with the parallelisation than the autodiff, so ignore my last question. I'm reading up on it now and will post again if I run into more issues.

bob-carpenter commented 4 years ago

I think the issue is that the autodiff needs to be synchronized if things are going to hit it in parallel.

By improper synch, I mean it's usually two different threads trying to read from or write to the same memory location.
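
A toy illustration of that kind of race (nothing Stan-specific, just a shared counter written by two threads without synchronisation):

  #include <iostream>
  #include <thread>

  int main() {
    long counter = 0;  // shared between threads, no synchronisation
    auto work = [&counter] {
      for (int i = 0; i < 1000000; ++i)
        ++counter;  // data race: unsynchronised read-modify-write from two threads
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    std::cout << counter << "\n";  // almost never prints 2000000
    return 0;
  }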

andrjohns commented 4 years ago

Unfortunately I'm not making much progress here. As a test, I tried running the old (pre testing framework) mix tests for exp, and they pass consistently. It's only when the tests are run using expect_ad that things fail, even though both are working with containers of var, fvar<var>, and fvar<fvar<var>>. Any idea where that behaviour could be coming from?

Branch is up here

bob-carpenter commented 4 years ago

What fails with expect_ad? In some cases, the relative error tests can be overly strict, especially at higher orders where finite diffs aren't so good. The other thing they do is test that all the throw conditions match.

andrjohns commented 4 years ago

By failing I meant the previous behaviour where the tests only pass intermittently. I've also tried running things with a mutex that restricts execution to run serially, and the behaviour is the same, which is odd.

SteveBronder commented 4 years ago

You can get race condition info by using clang++ and adding the -fsanitize=thread flag.

SteveBronder commented 4 years ago

Also, kind of random, but can you try this with the newest TBB? When I'm looking at this it seems like the leaks are happening in the TBB, though that could also be some connection between our stack and the TBB.

andrjohns commented 4 years ago

@wds15 or @bbbales2 Could I bother you for some parallelisation ideas? I've got a minimum working example of this parallel map with the exp function up here: https://github.com/stan-dev/math/compare/develop...andrjohns:feature/map

The tests are all consistently passing without segfaults (hooray), but the parallelised version is an order of magnitude slower than the apply_scalar_unary version in develop. In the branch above I have a rev test of a vector with 100000 var; on develop the test runs in ~25ms, but in the parallelised branch that jumps up to ~500ms.

Am I doing anything particularly dumb that's causing this performance hit?

bbbales2 commented 4 years ago

I think you should put the nested thing inside the parallel for.

If you don't put it in there, then if the same thread gets two different chunks of work, the first chunk will be efficient, but the second chunk will be doing autodiff for the first chunk unless the stack gets cleared after the first is finished.

Clearing the nested stack after the first is finished just means doing nesting inside the tbb block.
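
Roughly this shape (an illustrative sketch, not the branch code; it assumes STAN_THREADS so each thread has its own AD tape, and the nested_rev_autodiff guard here could equally be start_nested()/recover_memory_nested()):

  std::vector<double> input(1000, 1.0);  // stand-in for the real inputs

  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, input.size()),
    [&](const tbb::blocked_range<size_t>& r) {
      // the nested guard lives inside the functor, so the nested tape is
      // recovered when this chunk finishes, before the same thread picks
      // up another chunk of work
      stan::math::nested_rev_autodiff nested;
      for (size_t i = r.begin(); i < r.end(); ++i) {
        // ... per-element autodiff work for element i on the nested tape ...
      }
    });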

andrjohns commented 4 years ago

Nice! That brought the runtime down to about parity with develop. I'll work up an example with an actual Stan model and see how the performance differs

andrjohns commented 4 years ago

Quick update that things are progressing very well with this. I've got a generalised parallel_map that I've tested with one-argument (i.e., vectorised exp) and two-argument (i.e., vectorised pow) loops with reverse-mode autodiff, and all tests are passing consistently.

I've got some more testing (different combinations of arguments as well as speed-ups) and optimisations to look at, but will be ready to write up a design doc in the near future to get more feedback on the approach.

If anyone wanted to test locally, the current branch is here: https://github.com/stan-dev/math/compare/develop...andrjohns:feature/map

(Note that I fixed the threads to 4 for testing)

SteveBronder commented 4 years ago

This is really cool! I'll have a look this weekend

SteveBronder commented 4 years ago

Lol sorry!

andrjohns commented 4 years ago

Steve (or anyone else interested) - bit of background for when you get the time to have a look. On the branch above I've got the prim and fwd implementations of parallel_map added as well. Like with reduce_sum, the fwd version is not parallelised.

At the moment, I've got it implemented to parallelise exp with Eigen types of either var or double, but that's more to test that the values and/or gradients are all correct than any kind of roadmap. From (very rough) testing, I'm seeing that the parallelised exp with containers of var is markedly slower than the apply_scalar_unary version. I'm assuming this is because the rev versions need to traverse the container length twice (once in parallel to compute values/adjoints, and again serially to pack them into new vars).
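
To illustrate, the two passes for exp look roughly like this (a stripped-down sketch, not the actual branch code; sizing and error handling glossed over):

  using stan::math::var;
  Eigen::Matrix<var, Eigen::Dynamic, 1> x
      = Eigen::VectorXd::Random(1000).cast<var>();  // example input on the main tape

  const size_t n = x.size();
  std::vector<double> vals(n), partials(n);

  // pass 1 (parallel): compute values and partial derivatives as plain doubles
  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, n),
    [&](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i < r.end(); ++i) {
        vals[i] = std::exp(x[i].val());
        partials[i] = vals[i];  // d/dx exp(x) = exp(x)
      }
    });

  // pass 2 (serial): pack each result back into a var on the main AD tape
  Eigen::Matrix<var, Eigen::Dynamic, 1> out(n);
  for (size_t i = 0; i < n; ++i)
    out(i) = stan::math::precomputed_gradients(vals[i], {x[i]}, {partials[i]});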

andrjohns commented 4 years ago

Got some early benchmarks. I compared performance with vectors of a unary function that had SIMD-vectorisation for doubles (exp), a unary function that was just looping (lgamma), and a binary function that operated on two matrices (columns_dot_product).

The benchmarking and plotting code is here: parallel_map_benchmarking.zip. (uploading in a zip file due to GitHub restrictions)

Exp

When working with doubles, parallel_map didn't start outperforming develop until vector sizes exceeded 10000. For vars, the parallel_map performance was always less than half that of develop, regardless of size.

[plot: exp benchmark, parallel_map vs. develop]

Lgamma

When working with doubles, parallel_map provided significant speed-ups, with 2x the speed of develop at a size of 1000, and 8x the speed of develop at a size of 100000. For vars, there was a slight performance increase with a vector size greater than 1000, but it was again performing worse than develop at a size of 100000.

[plot: lgamma benchmark, parallel_map vs. develop]

Columns Dot Product

For doubles, with a matrix of 100x100, parallel_map was markedly worse than develop. Once the matrix size increased to 1000x1000, parallel_map was 6x faster than develop, with this increase reducing to ~2x at sizes of 5000x5000.

For vars, parallel_map was ~2x faster than develop at matrices of both 100x100 and 1000x1000, but was slower than develop with matrices of 5000x5000.

[plot: columns_dot_product benchmark, parallel_map vs. develop]

andrjohns commented 4 years ago

A bit of a mixed bag overall, and hopefully there are some optimisations to be made to parallel_map that can improve performance with vars

SteveBronder commented 4 years ago

Is this with 4 cores?

andrjohns commented 4 years ago

Yep, but performance is pretty similar at 8 cores as well

wds15 commented 4 years ago

I will take a look. Cool work!

andrjohns commented 4 years ago

Sorry to add the benchmark/plot overload, but I've also worked up a rough implementation of combining parallel_map with Eigen's SIMD-vectorisation for doubles.

Benchmarks are showing a 4.5x speed-up with large vector sizes (100000), compared to 2.5x with parallelism alone. I still need to make some minor tweaks to get it working with Matrices, but overall good to see more performance (even if not with vars).

[plot: exp benchmark, parallel_map with Eigen SIMD vs. parallelism alone]

andrjohns commented 4 years ago

Forgot to say that the above benchmarks are using exp

wds15 commented 4 years ago

Hmm... this is not too easy to digest since there's quite some sweet C++ plumbing going on here... I like the ideas you use here; I am not yet sure if this level of complexity is what we need/want, but let's park this for now.

I think you are getting things right. There are three things which cross my mind:

  1. Can you try to pull out the save_varis thing (here https://github.com/andrjohns/math/blob/d2efb0da9b6ab8aa4d31af75adeb5a34d3418841/stan/math/rev/functor/parallel_map.hpp#L63) into a simple for loop which is run outside of the parallel_for from the TBB? Maybe this is better for memory locality reasons... maybe not.
  2. Given that each computation is relatively cheap, can you introduce a grainsize? (See the sketch after this list.)
  3. I would probably even try to start with the static partitioner from the TBB and ensure that on 4 cores you run 4 chunks as a starting point. This will eliminate any scheduling overhead for now, which is OK since the test cases should take similarly long to compute.
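
For 2. and 3., something along these lines is what I mean (an illustrative sketch only; lgamma over doubles stands in for whatever is being mapped, and it assumes a TBB version that provides tbb::static_partitioner):

  Eigen::VectorXd x = Eigen::VectorXd::Random(1000).cwiseAbs();
  Eigen::VectorXd res(x.size());
  const size_t grainsize = x.size() / 4;  // aim for one chunk per core on 4 cores

  // the grainsize goes into the blocked_range; the static partitioner removes
  // dynamic scheduling so the chunks are fixed up front
  tbb::static_partitioner partitioner;
  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, x.size(), grainsize),
    [&](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i < r.end(); ++i)
        res[i] = stan::math::lgamma(x[i]);
    },
    partitioner);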

Going parallel always means that we introduce quite some overhead in the rev case due to the need to do a lot of brute-force copying. Still, what bothers me at the moment is that the lgamma case is quite a bit slower for small vector sizes and that the var case even degrades in performance. I would suggest first concentrating on lgamma (maybe even a sum of two or three lgamma calls per iteration), which is an easy case to speed up due to its large computational cost. Once that is OK (grainsize, scheduling) then we go to things like exp. This is how I would do it.

EDIT: I also recall that lambda functors are not as well inlined by compilers, which is why for reduce_sum we used a struct as the functor passed to parallel_for... maybe that matters as well.

andrjohns commented 4 years ago

Thanks Sebastian! I'll have a look into these ideas and let you know how I go.

andrjohns commented 4 years ago

Real interesting stuff, good call on the multiple operations. It looks like this approach will work best when there's more than one input/operation at each iteration of the loop. I compared the performance of an elementwise lgamma(x) + lgamma(y) across vector sizes of 100 and 1000, and across increasing grainsizes with either the automatic, simple, or static partitioner.

It looks like 100 is too small a size for parallelisation to be beneficial. However, with a vector of size 1000, the automatic partitioner, and a grainsize of 5, there was a 2x speedup with vars!

[plots: lgamma(x) + lgamma(y) benchmarks with the automatic, simple, and static partitioners]

andrjohns commented 4 years ago

I'll have a look into converting the lambdas to structs tomorrow, but things are finally looking promising for reverse mode!

EDIT: I also tried pulling the save_varis out into a separate (serial) loop, but it tanked performance

wds15 commented 4 years ago

Great. This is in line with my expectation that this approach is great for things with a ton of work rather than just one ad node.

It would also be nice to plot the absolute runtime on log log plots.

andrjohns commented 4 years ago

It would also be nice to plot the absolute runtime on log log plots

What goes on the respective axes? Develop time vs. parallel time? Was there a similar plot from the reduce_sum testing that you could point me towards?

SteveBronder commented 4 years ago

I'd do x axis as sizes of the problem, y as Speedup vs. Develop, and then have colors by grainsize

andrjohns commented 4 years ago

Sounds good, will do!

SteveBronder commented 4 years ago

Word! If you are able to shoot off and do multiple cores, I'd do the combination of core count plus grainsize as the color

wds15 commented 4 years ago

What I meant was:

[example plot omitted]

but no worries.

BTW, another interesting example could be a finite mixture model. Say you have M mixture components (M = 10 or 20) and then you are interested in the log-lik of a largish data set. Since we do not have a reduce_log_sum_exp right now, this could be a good application for this parallel_map, which can be used to calculate the log-lik of each mixture component, return all M values, and then we log_sum_exp them in a numerically stable way.
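
To illustrate just the combination step (the per-component log-lik here is a placeholder for whatever parallel_map would compute):

  int M = 10;  // number of mixture components
  Eigen::VectorXd log_weight = Eigen::VectorXd::Constant(M, -std::log(M));

  // placeholder: per-component log-likelihood over the whole data set; in the
  // proposed scheme each of these would be computed via parallel_map
  auto component_log_lik = [&](int m) { return stan::math::var(0.0); };

  Eigen::Matrix<stan::math::var, Eigen::Dynamic, 1> lp(M);
  for (int m = 0; m < M; ++m)
    lp(m) = log_weight(m) + component_log_lik(m);

  // combine the M terms in a numerically stable way
  stan::math::var mixture_lp = stan::math::log_sum_exp(lp);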

(I still don't like parallel_map as a name since we never make it explicit if things run in parallel, but that's to discuss in a design doc, so no worries for now)

andrjohns commented 4 years ago

Ah, that makes sense, I'll work that up. The mixture model also sounds like a good match for testing, so I'll add it to the to-do list.

As for the naming, I'm not really tied to anything, so I'm happy for that to be a consensus decision at the design doc stage too.

Thanks for the help, looking forward to seeing how this develops!

SteveBronder commented 4 years ago

Alright so I took a hack at this and put up the code on this branch with benchmark code here

If the font is too teeny lmk and I'll blow it up, but the below plots are for calling lgamma with doubles and vars, for grainsizes of 16 to 512 and data sizes from 128 to 128,384.

[plot: lgamma benchmark for doubles and vars across grainsizes and data sizes]

But note I haven't checked that I'm calculating the adjoints correctly, though it didn't change much from @andrjohns' stuff so it should be fine. I think being fast when the data is in the thousands is good, and when someone has an expensive function this should work for smaller N. The main thing is chopping off the bulk startup time for var (I'm not sure how much of the TBB's startup time we can chop). Andrew, I got the segment version working for double. For var, I think we need to use simple_partitioner with the grainsize to get the number of vars for each iteration; then we can do a bulk copy once, do the calculation on a chunk of the input all at once, set their output adjoints to one, and call grad for that whole chunk. It feels like that would work, right? I'm not sure where else to get rid of startup costs.

Below I'm also zooming in on a grainsize of 64 and looking at loop_time/parallel_time. It looks like once var sizes hit over 1500 or so this becomes worthwhile.

[plot: loop_time/parallel_time at a grainsize of 64]

Is there some way we could not allocate that blob of partials for the adjoints? Like, could we just plop the adjoint values right into the varis? Though if we did that it would be really dangerous if we ever called set_zero_adjoint(). The only other things I can think of are rewriting accumulate_adjoints() and save_varis() etc. so that instead of doing a recursion they do a parameter pack expansion.
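
To make that last bit concrete, the generic shape of the change would be something like this (count_one and these names are made up, not the actual Stan Math helpers):

  #include <cstddef>

  // stand-in for "how many vars does this argument contribute"
  template <typename T>
  std::size_t count_one(const T& /* x */) { return 1; }

  // recursive form: one overload call per argument
  inline std::size_t count_vars_rec() { return 0; }
  template <typename T, typename... Ts>
  std::size_t count_vars_rec(const T& x, const Ts&... xs) {
    return count_one(x) + count_vars_rec(xs...);
  }

  // pack-expansion form: expand once into an array and sum, no recursion
  template <typename... Ts>
  std::size_t count_vars_pack(const Ts&... xs) {
    const std::size_t counts[] = {0, count_one(xs)...};  // leading 0 handles an empty pack
    std::size_t total = 0;
    for (std::size_t c : counts)
      total += c;
    return total;
  }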

SteveBronder commented 4 years ago

Oh also data and code for graphs here

SteveBronder commented 4 years ago

Forgot to mention all those graphs are for 28 threads. Is there a way to set the number of threads at runtime?

wds15 commented 4 years ago

Thanks for the benchmarks. I would really not want to rely on requiring any pre-defined chunking structure as you suggest by using simple_partitioner. Working on chunks is fine, but that should fit into the general idea of the TBB that the scheduler gets to choose how large work chunks are (just recall that this stuff ideally runs concurrently with other operations).

Getting rid of the recursion of accumulate_adjoints and save_varis is probably a good thing which would benefit reduce_sum as well... but here we are talking about a unary function, so there is not much recursion.

From my view we seem to have something which is already very powerful - as long as the element we iterate over is sufficiently expensive. Examples where this is probably fantastic are:

What I am saying is that expensive operations with many terms should really benefit a lot from the parallel_map, since the relative cost of AD tape handling effort becomes less of a concern.

I would suggest trying out an expensive example as suggested above and confirming that we are doing well with expensive things already now. Then we can move ahead with this and possibly come back to it if we really feel like we need to tune this thing for those tiny operations.

So get the low hanging fruit first is what I am saying here.

SteveBronder commented 4 years ago

I would really not want to rely on requiring any pre-defined chunking structure as you suggest by using simple_paritioner

Is there a guarantee that each loop chunk will be at most some maximum size? I only suggested the partitioner so that, when we figure out how many vars we need per loop chunk, we know a definite number.

https://github.com/stan-dev/math/compare/feature/par-map-v2#diff-75fb2e278dc34158a0dc5888d95951b9R36

Getting rid of the recursion of accumulate_adjoints and save_varis is probably a good thing which would benefit reduce_sum as well...but here we talk about an unary function such that there is not much of a recursion.

It's not the recursion per se, it's calling it multiple times in the inner parallel_for loop when we could call it once for a chunk of memory.

What I am saying is that expensive operations with many terms should really benefit a lot from the parallel_map, since the relative cost of AD tape handling effort becomes less of a concern.

I agree those would work right now, but what I'd like to think about is brms, where we can do all the distributions and random effects with these. At the end of the day that's just a multiply of var * data, so I'd like this to be decently fast for medium N in that simple case as well.

What size N should this kick off for in terms of an expensive function?

andrjohns commented 4 years ago

Thanks for working this up Steve! I believe that the adjoints don't get calculated correctly if the precomputed_gradients is inside the parallel_for call. But that might have been from an older implementation I was working on, so I'll double-check tonight.

I only suggested the partitioner so when we figure out how many vars we need per loop chunk we will know a definite number

Given that the number of vars per iteration will be constant, can we just multiply by the number of iterations to be processed in a given chunk: nvars * (r.end() - r.begin())?

andrjohns commented 4 years ago

Hey Steve, just looking through the code now. I like the use of parameter packs, definitely much cleaner. The changes you most recently pushed to remove the lambdas have broken things a bit, since the index functions expect a functor to be passed that they then apply to the indexed arguments. The only alternative to requiring a functor that I could think of was to have the index functions return a tuple of the indexed arguments, but I didn't really like that.

Unfortunately the approach you're using to deduce the return type:

std::decay_t<decltype(app_fun(args...))> result(max_size(args...));

won't generalise very well. With only a single 'size' argument, this would only work for vector return types. Also, this won't work for cases where the return size differs from the input sizes. Using columns_dot_product as an example: the inputs are both size R * C, so the deduced return type would be a row-vector of size R * C, when it should just be size C.

Also, I checked and including the precomputed_gradients_vari call in the parallel_map functor definitely causes the wrong adjoints to be returned

wds15 commented 4 years ago

The precomputed_gradients_vari must be created in the thread with the target AD tape.

@SteveBronder it would indeed make sense to handle save_varis in a blocked fashion according to the blocks the TBB scheduler assigns us to process, but I don't think we should attempt to control the exact blocking size in advance (other than roughly tuning it with grainsize).

To me the parallel_map is useful to run big things in parallel, not elementary operations.

The idea to parallelize elementary operations and thereby de facto automatically parallelize the entire Stan program is certainly tempting, but I really doubt that it is even possible to do. I can be proven wrong on that, of course. The issue is the overhead cost of parallelization (scheduling & AD tape munging).

Instead we should focus on a parallel map which is good enough for running large things in parallel. Something like this can still be tuned later for elementary operations. At least this is how I see it.

andrjohns commented 4 years ago

@wds15 Implementation question - I've just about got this working with two-index loops (i.e., using blocked_range2d), and that requires a separate grainsize for each dimension (i.e., a col-wise grainsize and a row-wise grainsize). Is there much value in allowing parallel_map to take two grainsize arguments, or should I just apply the single provided grainsize to both dimensions?

wds15 commented 4 years ago

I have never used 2D stuff, but I would definitely keep two grainsizes in this case and not simplify.

This 2D stuff would be interesting for reduce_sum as well...

But shouldn't we start with the simple stuff first? It's really a big thing if we have this parallel map (ok it does make sense to think ahead of time how it generalizes, sure).

andrjohns commented 4 years ago

Separate grainsizes works for me, and simplifies the templating a bunch as well.

My thinking behind getting it all built out now was that we can focus on optimising the single-index case and I can double-check that any changes won't get in the way of the other implementations. I really wanted the two-index case so that we could use it for matrix multiplication, like (untested):

  // Functor indexing into row i of the first argument and column j of the second
  auto ind_f = [&](int i, int j, const auto& fun,
                    const auto& x, const auto& y) {
    return fun(x.row(i), y.col(j));
  };

  // Functor defining function to be applied to indexed arguments
  auto f = [&](const auto& x, const auto& y) { return dot_product(x, y); };

  stan::math::parallel_map(f, ind_f, std::forward<stan::math::matrix_v>(out_par), 1, 1,
                         in1_par, in2_par);

Which will hopefully provide some speed-ups in cases where the matrices are too small for the copy-cost to GPUs to be worth it.

andrjohns commented 4 years ago

Incidentally, I literally just got the two-index case working with vars, so will try some multiplication benchmarking later this arvo

andrjohns commented 4 years ago

Parallelised matrix multiply results are... underwhelming:

2020-09-25 20:07:23
Running ./parallel_map_multiply
Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 1.83, 1.73, 1.64
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
mult_var_1000x1000_ParMap  127459554973 ns   106917498549 ns            1
mult_var_1000x1000_develop   24444338 ns     24444797 ns           29

I ran the model through VTune, and the vast majority of the runtime is due to the chain() call (via the callback_vari method in dot_product):

[VTune screenshot: runtime dominated by the chain() call via callback_vari]

I've also got the full VTune result here if anyone else wanted to take a look: r003hs.zip

I'm not really sure if there's a way to optimise past this, since the adjoints for each input need to be calculated in the parallel_for loop and extracted out

t4c1 commented 4 years ago

Calculating a matrix product via dot products is memory bound and will pretty much always be slower than a proper matrix multiply algorithm that uses the cache efficiently. You should instead make the calculation of a block, let's say 8 by 8 elements, the basic unit of work that you are dividing between processors. This way you can use the cache-efficient matrix multiplication algorithm even in parallel.
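
A rough sketch of that idea for the doubles case (purely illustrative; an Eigen block product serves as the per-task kernel):

  #include <tbb/parallel_for.h>
  #include <tbb/blocked_range2d.h>
  #include <Eigen/Dense>

  void block_multiply(const Eigen::MatrixXd& A, const Eigen::MatrixXd& B,
                      Eigen::MatrixXd& C, int block_size = 8) {
    tbb::parallel_for(
      tbb::blocked_range2d<Eigen::Index>(0, C.rows(), block_size,
                                         0, C.cols(), block_size),
      [&](const tbb::blocked_range2d<Eigen::Index>& r) {
        const Eigen::Index i0 = r.rows().begin(), i1 = r.rows().end();
        const Eigen::Index j0 = r.cols().begin(), j1 = r.cols().end();
        // each task fills one output block with a cache-friendly Eigen product
        C.block(i0, j0, i1 - i0, j1 - j0).noalias() =
            A.middleRows(i0, i1 - i0) * B.middleCols(j0, j1 - j0);
      });
  }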

andrjohns commented 4 years ago

Hmm that's a good idea, thanks Tadej! In that case I might put the matrix multiply on the backburner and focus on getting the framework proper off the ground first

andrjohns commented 4 years ago

I've got this in a stable and (mostly) documented state that's ready for others to look at or test.

I've dropped the ranged parallelism (i.e., vector segments and matrix blocks) since I couldn't get it to stop segfaulting, so I'll revisit that later once this is fully developed. For now, I've got both one-index and two-index loops parallelised and passing tests. There are basic prim and rev tests in place and I've also modified the exp mix tests.

Branch is here: https://github.com/stan-dev/math/compare/develop...andrjohns:feature/map