andrjohns opened 4 years ago
I'm now realising that synchronisation is more to do with the parallelisation than the autodiff, ignore my last question. I'm reading up on it now and will post again if I run into more issues
I think the issue is that the autodiff needs to be synchronized if things are going to hit it in parallel.
By improper synchronisation, I mean the usual case of two different threads trying to read from or write to the same memory location.
Unfortunately I'm not making much progress here. As a test, I tried running the old (pre testing framework) mix tests for `exp`, and they pass consistently. It's only when the tests are run using `expect_ad` that things fail, even though both are working with containers of `var`, `fvar<var>`, and `fvar<fvar<var>>`. Any idea where that behaviour could be coming from?
What fails with `expect_ad`? In some cases, the relative error tests can be overly strict, especially at higher orders where finite diffs aren't so good. The other thing they do is test that all the throw conditions match.
By failing I meant the previous behaviour where the tests only intermittently pass. I've also tried running things with a mutex that restricts the execution to run serially and the behaviour is the same, which is odd
You can get race condition info by using clang++ and adding the `-fsanitize=thread` flag.
Also kind of random, but can you try this with the newest TBB? When I'm looking at this it seems like the leaks are happening in the TBB, though that could also be some connection between our stack and the TBB.
@wds15 or @bbbales2 Could I bother you for some parallelisation ideas? I've got a minimum working example of this parallel map with the `exp` function up here: https://github.com/stan-dev/math/compare/develop...andrjohns:feature/map
The tests are all consistently passing without segfaults (hooray), but the parallelised version is orders of magnitude slower than the `apply_scalar_unary` version in develop. In the branch above I have a `rev` test of a vector with 100000 `var`; on develop the test runs in ~25ms, but in the parallelised branch that jumps up to ~500ms.

Am I doing anything particularly dumb that's causing this performance hit?
I think you should put the nested thing inside the parallel for.
If you don't put it in there, then if the same thread gets two different chunks of work, the first chunk will be efficient, but the second chunk will be doing autodiff for the first chunk unless the stack gets cleared after the first is finished.
Clearing the nested stack after the first is finished just means doing nesting inside the tbb block.
Nice! That brought the runtime down to about parity with `develop`. I'll work up an example with an actual Stan model and see how the performance differs.
Quick update that things are progressing very well with this. I've got a generalised `parallel_map` that I've tested with one-argument (i.e., vectorised `exp`) and two-argument (i.e., vectorised `pow`) loops with reverse-mode autodiff, and all tests are passing consistently.

I've got some more testing (different combinations of arguments as well as speed-ups) and optimisations to look at, but will be ready to write up a design doc in the near future to get more feedback on the approach.

If anyone wanted to test locally, the current branch is here: https://github.com/stan-dev/math/compare/develop...andrjohns:feature/map

(Note that I fixed the threads to 4 for testing)
This is really cool! I'll have a look this weekend.
Lol sorry!
Steve (or anyone else interested) - bit of background for when you get the time to have a look. On the branch above I've got the `prim` and `fwd` implementations of `parallel_map` added as well. Like with `reduce_sum`, the `fwd` version is not parallelised.
At the moment, I've got it implemented to parallelise `exp` with Eigen types of either `var` or `double`, but that's more to test that the values and/or gradients are all correct than any kind of roadmap. From (very rough) testing, I'm seeing that the parallelised `exp` with containers of `var` is markedly slower than the `apply_scalar_unary` version. I'm assuming this is because the `rev` versions need to traverse the container length twice (once in parallel to compute values/adjoints, and again serially to pack them into new `var`s).
Got some early benchmarks. I compared performance with vectors of a unary function that had SIMD-vectorisation for doubles (`exp`), a unary function that was just looping (`lgamma`), and a binary function that operated on two matrices (`columns_dot_product`).
The benchmarking and plotting code is here: parallel_map_benchmarking.zip (uploaded in a zip file due to GitHub restrictions).
When working with `double`s, `parallel_map` didn't start outperforming `develop` until vector sizes exceeded 10000. For `var`s, the `parallel_map` performance was always less than half that of `develop`, regardless of size.
When working with `double`s, `parallel_map` provided significant speed-ups, with 2x the speed of develop at a size of 1000, and 8x the speed of develop at a size of 100000. For `var`s, there was a slight performance increase with a vector size greater than 1000, but performance was again worse than develop at a size of 100000.
For `double`s, with a matrix of 100x100, `parallel_map` was markedly worse than develop. Once the matrix size increased to 1000x1000, `parallel_map` was 6x faster than develop, with this increase reducing to ~2x at sizes of 5000x5000.

For `var`s, `parallel_map` was ~2x faster than develop at matrices of both 100x100 and 1000x1000, but was slower than develop with matrices of 5000x5000.
A bit of a mixed bag overall, and hopefully there are some optimisations to be made to `parallel_map` that can improve performance with `var`s.
Is this with 4 cores?
Yep, but performance is pretty similar at 8 cores as well
I will take a look. Cool work!
Sorry to add to the benchmark/plot overload, but I've also worked up a rough implementation of combining `parallel_map` with Eigen's SIMD-vectorisation for `double`s.

Benchmarks are showing a 4.5x speed-up with large vector sizes (100000), compared to 2.5x with parallelism alone. I still need to make some minor tweaks to get it working with matrices, but overall good to see more performance (even if not with `var`s).
Forgot to say that the above benchmarks are using `exp`.
Hmm... this is not too easy to digest since there's quite some sweet C++ plumbing going on here... I like the ideas you use here; I am not yet sure if this level of complexity is what we need/want, but let's park that for now.
I think you are getting things right. There are two things which cross my mind:

1. Could you try moving the `save_varis` thing (here: https://github.com/andrjohns/math/blob/d2efb0da9b6ab8aa4d31af75adeb5a34d3418841/stan/math/rev/functor/parallel_map.hpp#L63) into a simple for loop which is run outside of the `parallel_for` from the TBB? Maybe this is better for memory locality reasons... maybe not.
2. Going parallel always means that we introduce quite some overhead in the rev case due to the need to do a lot of brute-force copying.

Still, what bothers me at the moment is that the `lgamma` case is quite a bit slower for small vector sizes and that the `var` case even degrades in performance. I would suggest to first concentrate on `lgamma` (maybe even a sum of two or even three `lgamma` calls per iteration), which is an easy case to speed up due to its large computational cost. Once that is OK (grainsize, scheduling), then we go to things like `exp`. This is how I would do it.
EDIT: I also recall that lambda functors are not as well inlined by compilers, which is why for `reduce_sum` we used a struct instead as the functor passed to `parallel_for`... maybe that matters as well.
Thanks Sebastian! I'll have a look into these ideas and let you know how I go.
Real interesting stuff, good call on the multiple operations. It looks like this approach will work best when there's more than one input/operation at each iteration of the loop. I compared the performance of an elementwise `lgamma(x) + lgamma(y)` across vector sizes of 100 & 1000, and across increasing grainsizes with either the automatic, simple, or static partitioners.

It looks like 100 is too small a size for parallelisation to be beneficial. However, with a vector of size 1000, the automatic partitioner, and a grainsize of 5, there was a 2x speedup with `var`s!
I'll have a look into converting the lambdas to structs tomorrow, but things are finally looking promising for reverse mode!
EDIT: I also tried pulling the `save_varis` out into a separate (serial) loop, but it tanked performance.
Great. This is in line with my expectation that this approach is great for things with a ton of work rather than just one AD node.
It would also be nice to plot the absolute runtime on log log plots.
> It would also be nice to plot the absolute runtime on log log plots
What goes on the respective axes? Develop time vs. parallel time? Was there a similar plot from the `reduce_sum` testing that you could point me towards?
I'd do x axis as sizes of the problem, y as Speedup vs. Develop, and then have colors by grainsize
Sounds good, will do!
Word! If you are able to shoot off and do multiple cores I'd do the combination of core plus grainsize as the color
What I meant was:
but no worries.
BTW, another interesting example could be a finite mixture model. Say you have M mixture components (M = 10 or 20) and you are interested in the log-likelihood of a largish data-set. Since we do not have a reduce_log_sum_exp right now, this could be a good application for this parallel_map, which can be used to calculate the log-lik of each mixture component, return all M values, and then we log_sum_exp them in a numerically stable way.
(I still don't like parallel_map as a name since we never make it explicit if things run in parallel, but that's to discuss in a design doc, so no worries for now)
Ah, that makes sense, I'll work that up. The mixture model also sounds like a good match for testing, so I'll add it to the to-do list.
As for the naming, I'm not really tied to anything, so I'm happy for that to be a consensus decision at the design doc stage too.
Thanks for help, looking forward to seeing how this develops!
Alright so I took a hack at this and put up the code on this branch with benchmark code here
If the font is too teeny lmk and I'll blow it up, but the below plots are for calling lgamma with doubles and vars for grain sizes of 16:512 and data from sizes 128:128,384
But note I haven't checked that I'm calculating the adjoints correctly, though it didn't change much from @andrjohns' stuff so should be fine. I think being fast when the data is in the thousands is good, and when someone has an expensive function this should work for smaller N. The main thing is chopping off the bulk startup time for `var` (I'm not sure how much of the TBB's startup time we can chop). Andrew, I got the segment version working for `double`; I think for `var` we need to use `simple_partitioner` with the grainsize to get the number of vars for each iteration, then we can do a bulk copy once, do the calculation on a chunk of the input all at once, set their output adjoints to one, and call `grad` for that whole chunk. It feels like that would work, right? I'm not sure where else to get rid of startup costs.
Below is also zooming in on a grainsize of 64 and looking at loop_time/parallel_time. It looks like once `var` types hit over 1500 or so this becomes a thing.
Is there some way we could not allocate that blob of `partials` for the adjoints? Like could we just plop the adjoint values right in the `vari`s? Tho' if we did that it would be real dangerous if we ever called `set_zero_adjoint()`. The only other things I can think of are rewriting `accumulate_adjoints()` and `save_varis()` etc. so that instead of doing a recursion they did a parameter pack expansion.
Oh also data and code for graphs here
Forgot to mention all those graphs are for 28 threads. Is there a way to set the number of threads at runtime?
Thanks for the benchmarks. I would really not want to rely on requiring any pre-defined chunking structure as you suggest by using `simple_partitioner`. Working on chunks is fine, but that should fit into the general idea of the TBB that the scheduler gets to choose how large work chunks are (just recall that this stuff ideally runs concurrently with other operations).
Getting rid of the recursion of `accumulate_adjoints` and `save_varis` is probably a good thing which would benefit `reduce_sum` as well... but here we talk about a unary function such that there is not much of a recursion.
From my view we seem to have something which is already very powerful - as long as the element we iterate over is sufficiently expensive. Examples where this is probably fantastic are
What I am saying is that expensive operations with many terms should really benefit a lot from the parallel_map, since the relative cost of AD tape handling effort becomes less of a concern.
I would suggest trying out an expensive example as suggested above and confirming that we are doing well with expensive things already. Then we can move ahead with this now and possibly come back to it if we really feel like we need to tune this thing for those tiny operations.
So get the low hanging fruit first is what I am saying here.
> I would really not want to rely on requiring any pre-defined chunking structure as you suggest by using `simple_partitioner`

Is there a guarantee that each loop size will be some max value? I only suggested the partitioner so that when we figure out how many vars we need per loop chunk we will know a definite number.
https://github.com/stan-dev/math/compare/feature/par-map-v2#diff-75fb2e278dc34158a0dc5888d95951b9R36
> Getting rid of the recursion of `accumulate_adjoints` and `save_varis` is probably a good thing which would benefit `reduce_sum` as well... but here we talk about a unary function such that there is not much of a recursion.
It's not the recursion per se, it's calling it multiple times in the inner `parallel_for` loop when we could call it once for a chunk of memory.
> What I am saying is that expensive operations with many terms should really benefit a lot from the parallel_map, since the relative cost of AD tape handling effort becomes less of a concern.
I agree those would work right now, but what I'd like to think about is brms, where we can do all the distributions and random effects with these. At the end of the day that's just a mul of `var * data`, so I'd like this to be decently fast for medium N in that simple case as well.
What size N should this kick off for in terms of an expensive function?
Thanks for working this up, Steve! I believe that the adjoints don't get calculated correctly if the `precomputed_gradients` is inside the `parallel_for` call. But that might have been from an older implementation I was working on, so I'll double-check tonight.
> I only suggested the partitioner so when we figure out how many vars we need per loop chunk we will know a definite number

Given that the number of vars per iteration will be constant, can we just multiply by the number of iterations to be processed in a given chunk: `nvars * (r.end() - r.begin())`?
Hey Steve, just looking through the code now. I like the use of parameter packs, definitely much cleaner. The changes you most recently pushed to remove the lambdas have broken things a bit, since the index functions are expecting a functor to be passed that it then applies to the indexed arguments. The only alternative to this (requiring a functor) that I could think of was to have the index functions return a tuple of the indexed arguments, but I didn't really like that.
Unfortunately the approach you're using to deduce the return type:

```cpp
std::decay_t<decltype(app_fun(args...))> result(max_size(args...));
```

won't generalise very well. With only a single 'size' argument, this would only work for vector return types. Also, this won't work for cases where the return size differs from the input sizes. Using `columns_dot_product` as an example: the inputs are both size R x C, so the deduced return type would be a row-vector of size R x C, when it should just be size C.
Also, I checked, and including the `precomputed_gradients_vari` call in the `parallel_map` functor definitely causes the wrong adjoints to be returned.
The `precomputed_gradients_vari` must be created in the thread with the target AD tape.
@SteveBronder it would indeed make sense to handle `save_varis` in a blocked fashion according to the blocks the TBB scheduler assigns us to process, but I don't think we should attempt to control the exact blocking size in advance (other than roughly tuning it with grainsize).
To me the parallel_map is useful to run big things in parallel, not elementary operations.
The idea to parallelize elementary operations, and by this de facto automatically parallelize the entire Stan program, is certainly tempting, but I really doubt that it is even possible to do. I can be proven wrong in that, of course. The issue is the overhead cost due to parallelization (scheduling & AD tape munging).
Instead we should focus on a parallel map which is good enough for running large things in parallel. Something like this can still be tuned later for elementary operations. At least this is how I see it.
@wds15 Implementation question - I've just about got this working with two-index loops (i.e., using `blocked_range2d`), and that requires a separate grainsize for each dimension (i.e., a col-wise grainsize and a row-wise grainsize). Is there much value in allowing `parallel_map` to take two grainsize arguments, or should I just apply the single provided grainsize to both dimensions?
I have never used 2D stuff, but I would definitely keep two grainsizes in this case and not simplify.
This 2D stuff would be interesting for `reduce_sum` as well...
But shouldn't we start with the simple stuff first? It's really a big thing if we have this parallel map (ok it does make sense to think ahead of time how it generalizes, sure).
Separate grainsizes works for me, and simplifies the templating a bunch as well.
My thinking behind getting it all built out now, was that we can focus on optimising the single-index case and I can double-check that any changes won't get in the way of the other implementations. I really wanted the two-index case so that we could use it for matrix multiplication, like (untested):
```cpp
// Functor defining how arguments should be indexed at each iteration
auto ind_f = [&](int i, int j, const auto& fun,
                 const auto& x, const auto& y) {
  return fun(x.row(i), y.col(j));
};

// Functor defining function to be applied to indexed arguments
auto f = [&](const auto& x, const auto& y) { return dot_product(x, y); };

stan::math::parallel_map(f, ind_f, std::forward<stan::math::matrix_v>(out_par),
                         1, 1, in1_par, in2_par);
```
Which will hopefully provide some speed-ups in cases where the matrices are too small for the copy-cost to GPUs to be worth it.
Incidentally, I literally just got the two-index case working with `var`s, so will try some multiplication benchmarking later this arvo.
Parallelised matrix multiply results are... underwhelming:
```
2020-09-25 20:07:23
Running ./parallel_map_multiply
Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 1.83, 1.73, 1.64
---------------------------------------------------------------------
Benchmark                           Time            CPU     Iterations
---------------------------------------------------------------------
mult_var_1000x1000_ParMap   127459554973 ns   106917498549 ns        1
mult_var_1000x1000_develop      24444338 ns       24444797 ns       29
```
I ran the model through VTune, and the vast majority of the runtime is due to the `chain()` call (via the `callback_vari` method in `dot_product`).

I've also got the full VTune result here if anyone else wanted to take a look: r003hs.zip
I'm not really sure if there's a way to optimise past this, since the adjoints for each input need to be calculated in the `parallel_for` loop and extracted out.
Calculating a matrix product via dot products is memory bound and will pretty much always be slower than a proper matrix multiply algorithm that uses cache efficiently. You should instead make the calculation of a block, let's say 8 by 8 elements, the basic unit of work that you are dividing between processors. This way you can use the cache-efficient matrix multiplication algorithm even in parallel.
Hmm that's a good idea, thanks Tadej! In that case I might put the matrix multiply on the backburner and focus on getting the framework proper off the ground first
I've got this in a stable and (mostly) documented state that's ready for others to look at or test.
I've dropped the ranged parallelism (i.e., vector segments and matrix blocks) since I couldn't get it to stop segfaulting, so I'll revisit that later once this is fully developed. For now, I've got both one-index and two-index loops parallelised and passing tests. There are basic `prim` and `rev` tests in place and I've also modified the `exp` `mix` tests.
Branch is here: https://github.com/stan-dev/math/compare/develop...andrjohns:feature/map
Description
Vectorised unary functions (e.g., `exp`, `log`) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB's `parallel_for` functionality to evaluate the container in parallel.

A minimal example of this would be applying `exp` to every element of a large Eigen vector.

Benchmarking shows a performance of ~2x that of Eigen's vectorised `exp`, and ~5x that of a simple loop (benchmarking code at end of post).

This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.
I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising
Current Version:
v3.2.0
Code used for benchmarking: