Open andrjohns opened 4 years ago
very nice! If it's not too much work, want to try out lgamma also?
Also is it possible to use Eigen's vectorized form inside the parallel for on a slice of the vector?
How many cores did the evaluation use?
How is this going to work when other things are being sent to TBB around this, like map-rect and its descendants?
How are the autodiff types synchronized? The benchmarks looked to be only for double values.
I'd suggest putting the random generation code outside the blocks being tested so it's not included in their timing and all functions are called on the same inputs.
How do you suggest integrating this into math? I mean, would you fire off the parallel version when the container size is sufficiently large, or would you provide additional function definitions which would also allow specifying a grainsize (which essentially also controls this chunking)?
Other than that this plays well with the other map_rect / reduce_sum functions as we just throw tasks at the TBB and everything is designed such that we can use nested parallelism.
In all honesty, I think what would be of great interest is a variadic map (just like reduce_sum, but a 1-1 map thing). The issue here is to figure out how to concatenate outputs.
Another thing would be to provide parallel lpdfs / lpmfs which kick in for large data sets. Maybe the glm functions are actually even better suited for that given that more work is happening. Right now the glm functions are accelerated on the GPU, but multiple CPUs can help as well.
Given that you are the master of the apply* functions, @andrjohns: what I think could be cool is to use parallel_reduce from the TBB to speed up vectorised unary reduce functions. This could presumably be implemented on the back of reduce_sum.
+1 to using Eigen expressions within the parallel for instead of loop + std::exp
@rok-cesnovar
If it's not too much work, want to try out lgamma also?
That saw some really nice speed-ups too:
ParLgamma_dbl 30094603 ns 30093781 ns 23
LoopLgamma_dbl 318332120 ns 318325288 ns 2
Also is it possible to use Eigen's vectorized form inside the parallel for on a slice of the vector?
I'm not sure, to be honest, since parallel_for has to be passed a loop for it to parallelise. One approach would be to do something like:
m_res.segment(i, 10).array() = m_d.segment(i, 10).array().exp();
But would have to ensure that each iteration doesn't overlap, somehow. @t4c1 any ideas here?
@bob-carpenter
How many cores did the evaluation use?
This was using the default and letting the TBB choose automatically (I have 8 physical cores available). It looks like the parallelised exp is slower than Eigen's (but faster than a loop) for 2 cores:
ParExp_dbl 27989277 ns 27987591 ns 25
EigExp_dbl 23901004 ns 23899449 ns 28
LoopExp_dbl 53516822 ns 53513638 ns 10
But faster for 4 cores:
ParExp_dbl 14673224 ns 14671461 ns 46
EigExp_dbl 23816895 ns 23814757 ns 27
LoopExp_dbl 53724174 ns 53722104 ns 10
How is this going to work when other things are being sent to TBB around this, like map-rect and its descendants?
As Sebastian mentioned, the TBB handles this kind of scheduling.
How are the autodiff types synchronized? The benchmarks looked to be only for double values.
I can't test autodiff types yet, as the perf-math library is segfaulting when assigning values to vars (i.e., just var x = 5.0; segfaults). I can post benchmarks once I've chased that down.
I'd suggest putting the random generation code outside the blocks being tested so it's not included in their timing and all functions are called on the same inputs.
The generation was already separate from the timing, but I moved to a single vector of inputs shared by all runs, and the timings were consistent.
@wds15
How do you suggest to integrate this into math?
To be honest, I hadn't thought that far ahead (got caught up in whether I could, not whether I should). My (likely simplistic) perspective was to use the parallel version depending on container size (similar to the GPU code). I feel like there's some value to 'naive' parallelism which users can just take advantage of without changing their code.
As for your other suggestions, I need to get a bit more familiar with the options in the TBB before I can talk about other applications with much confidence. I just saw the unary functions as a good first step, since they can be easily represented as a parallelisable loop. We can even build this into binary scalar functions, so things like lbeta(Matrix, Matrix) could be parallelised in a similar way. But I'm happy to go another direction that would generalise better.
But would have to ensure that each iteration doesn't overlap
Doesn't TBB ensure that the blocked ranges don't overlap?
parallel version depending on container size (similar to the GPU code)
I would actually like to get rid of that, as the values are arbitrary and would need to be modified every time some performance improvement is made to the sequential or parallel version (for the GPU Cholesky decomposition, the N for switching is still 1250, though recent improvements have lowered that to at least half of that).
@rok-cesnovar & @t4c1 Spoke too soon, parallel_for gives you the first index and upper bound for the loop:
for (size_t i = r.begin(); i < r.end(); ++i)
So I can use that to segment the Eigen vector and use Eigen's exp:
m_res.segment(r.begin(), r.end() - r.begin()).array()
= m_d.segment(r.begin(), r.end() - r.begin()).array().exp();
While there's no noticeable difference when using all 8 cores for the parallelism:
ParExp_dbl 9824678 ns 9816867 ns 72
ParExp_Eig_dbl 9719649 ns 9693837 ns 72
EigExp_dbl 23886641 ns 23887389 ns 27
It performs much better when dropping down to 2 cores:
ParExp_dbl 33495995 ns 33495759 ns 20
ParExp_Eig_dbl 12505482 ns 12504933 ns 52
EigExp_dbl 23954809 ns 23953778 ns 28
That will be good for delivering benefits to users with lower core-count CPUs.
My guess is that on 8 cores exp is so fast that it saturates the memory bus.
Is your 8 core run improving in speedup with larger container sizes? Otherwise the grainsize could be important to tune a bit to tell the TBB what are good chunk sizes here.
My only concern with parallelising these little functions is that it may not matter after all in a big Stan program. Think about Amdahl's law: you have to run a large percentage of your code in parallel to get any useful scaling. Speeding up exp, which is just one of many operations, is not very promising.
This is why I suggested to use the TBB for the glm functions.
I really don't want to discourage you, but maybe think about an entire Stan program where this will be useful.
I see what you mean, but I feel like this would benefit a wider range of users than the glm functions. Something like this can be built into the apply_scalar_unary and apply_scalar_binary frameworks, so that all unary and binary functions get parallelised. This would also speed up the other functions and distributions that use them, so I think this would parallelise more of a given program than the individual glms.
I hope you are right, but I am very doubtful about this. So maybe you can hack up a quick and dirty prototype and provide an end-to-end performance test for a Stan model?
The issue is that you end up running your program in super small slices in parallel, and it is hard for me to imagine that being fast. Hence, a prototype showing the merits of this approach is really what's needed from my view to proceed.
(I really appreciate more TBB use - don't get me wrong - but we don't want to waste our resources obviously)
I think it would suffice to see what effect parallelization of unary/binary functions has on the gradient of an _lpdf function. That would probably give us enough information. Not sure which would be a good candidate.
Thanks for the comments both! Either way, this will take some benchmarking to see if it's beneficial. I'll work up a rough implementation for unary and binary functions and then we can figure out what some good models for testing would be
Benchmarking some lpdfs which use this is a start, but not sufficient, I'm afraid. While a lot of work goes on in the lpdfs, these often do not constitute most of the work, from what I have seen. That is why I keep encouraging people to move more stuff into reduce_sum.
This is why I would target the glm functions which should indeed represent the bulk of work in a Stan program in case you use that.
... but let's find out!
Other than that this plays well with the other map_rect / reduce_sum functions as we just throw tasks at the TBB and everything is designed such that we can use nested parallelism.
What I'd like to see is an extension of the existing evaluation to more parallel top-level jobs, to see what happens when cores saturate:
The current eval was for only 1 top-level job. What I want is a proxy for running parallel chains vs. parallelizing within a chain.
Yeah, I would like to set up something like this (but don't have time): we run a reduce_sum over the outer array (corresponding to chains), and these then call an inner reduce_sum for each chain.
That will be interesting to evaluate, I think.
@wds15 Can I borrow your eyes for a minute?
I've got a basic implementation in apply_scalar_unary (Eigen types only, fixed at 2 cores). While it's working for Eigen types of fvar<double> and fvar<fvar<double>>, it segfaults when passed an Eigen type of var. Is this something you encountered with reduce_sum?
The branch is up here
It's a bit more tedious to do parallel reverse mode. So you have to do what we do for reduce_sum. In essence:
Oh, righto. That is tedious.
But thanks!
This is why I am interested in a variadic map... which should allow us to implement what you are doing much more easily.
I wonder what we would need to do to get rid of the deep copy. I've been trying to think about this, but that feels off-topic, so we should talk about it somewhere else.
I hadn't realised the amount that needs to be plumbed in for each implementation with reverse mode; you're right that a more general implementation will be a much better idea. I'll start getting my head around how reduce_sum handles things and see what I can do.
I think I've got a general framework (for primitives only at this point).
The function to be parallelised is passed as a functor which takes the index as an input. Within that functor, the user can specify which objects get iterated through:
decltype(auto) func = [&](size_t index){
outv[index] = stan::math::exp(inv[index]);
};
Equivalently, this also allows the user to iterate over only some of the arguments (e.g., a binary function with a container and a scalar as inputs):
decltype(auto) func_binary = [&](size_t index){
outv2[index] = stan::math::log_diff_exp(inv[index], offset);
};
This functor is then passed to a function, map_variadic, along with the length of the loop and any arguments to be passed:
template<typename F, typename... Args>
void map_variadic (F&& f, size_t length, Args&&... args) {
tbb::parallel_for(
tbb::blocked_range<size_t>(0, length),
[&f,&args...](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); ++i) {
f(i);
}
});
}
Putting it in a self-contained example, this looks like:
#include <stan/math.hpp>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <iostream>
#include <cmath>
template<typename F, typename... Args>
void map_variadic (F&& f, size_t length, Args&&... args) {
tbb::parallel_for(
tbb::blocked_range<size_t>(0, length),
[&f,&args...](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); ++i) {
f(i);
}
});
}
int main() {
Eigen::VectorXd inv = Eigen::VectorXd::Random(10000);
Eigen::VectorXd outv = Eigen::VectorXd::Random(10000);
Eigen::VectorXd outv2 = Eigen::VectorXd::Random(10000);
double offset = 0.5;
decltype(auto) func = [&](size_t index){
outv[index] = stan::math::exp(inv[index]);
};
decltype(auto) func_binary = [&](size_t index){
outv2[index] = stan::math::log_diff_exp(inv[index], offset);
};
map_variadic(func, inv.size(), inv, outv);
map_variadic(func_binary, inv.size(), inv, outv2, offset);
std::cout << "Serial Unary: " << stan::math::exp(inv[0]) << std::endl
<< "Parallel Unary: " << outv[0] << std::endl
<< "Serial Binary: " << stan::math::log_diff_exp(inv[0], offset) << std::endl
<< "Parallel Binary: " << outv2[0] << std::endl;
}
Which returns:
Serial Unary: 1.97462
Parallel Unary: 1.97462
Serial Binary: -1.12117
Parallel Binary: -1.12117
I still need to figure out the reverse side of things, but how does that look as a starting point?
Starts to look very nice. What I don't see are the deep copies and the nested autodiff bits. Moreover, note that for reduce_sum the passed-in functor is given via a type argument only. This ensures that no internal state of the functor is hidden from reduce_sum. The concern is that internal vars would not be deep copied, which would make the approach invalid. @nhuurre suggested on the forums a way to handle closures which would work with reduce_sum.
It's a bit delicate to get reverse mode multi-threaded things to work given the way our AD tape is structured.
BTW, the reason why I am very interested in an apply-reduce framework on the back of reduce_sum is that this will lead to a compressed AD tape, which usually improves CPU cache use (reduce_sum operates on slices of things and computes the pre-computed gradient for each sub-slice, so the reverse sweep is done on smaller data structures which hopefully live longer in the CPU cache).
I'm trying with deep copies but I'm still getting segfaults, even without returning the final result. I was expecting it to at least run without segfaulting, and then be able to extract the values and gradients.
Am I misunderstanding the order of things here? This is quickly getting deeper than I've gone into the reverse mode before, so I'm still catching up on the mechanics of everything.
#include <stan/math.hpp>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
template <typename Return, typename F, typename ArgsTuple>
auto map_variadic_var(Return&& result, F&& f, size_t length,
int grainsize, ArgsTuple&& args_tuple) {
tbb::parallel_for(
tbb::blocked_range<size_t>(0, length),
[&result,&f,&args_tuple](const tbb::blocked_range<size_t>& r) {
// Initialize nested autodiff stack
const stan::math::nested_rev_autodiff begin_nest;
// Create nested autodiff copies that do not point
// back to main autodiff stack
auto args_tuple_local_copy = apply(
[&](auto&&... args) {
return std::tuple<decltype(stan::math::deep_copy_vars(args))...>(
stan::math::deep_copy_vars(args)...);
},
args_tuple);
// Parallelise evaluation of copied vars
for (size_t i = r.begin(); i < r.end(); ++i) {
f(i, args_tuple_local_copy);
}
});
}
int main() {
stan::math::vector_v inv = Eigen::VectorXd::Random(10000);
stan::math::vector_v outv(10000);
// Functor with tuple indexing to denote variable(s) to be
// manipulated
decltype(auto) func = [&](size_t index, auto&& in_arg){
return stan::math::exp(std::get<0>(in_arg)[index]);
};
map_variadic_var(outv, func, inv.size(), 1, std::forward_as_tuple(inv));
}
you still have to extract the value (the result of the function) and the gradient. This then has to be put into a precomputed gradient thing. You should follow closely what we did for reduce_sum.
EDIT: The thing is that the AD tapes in the different threads must never be interlinked. As such, you can use AD on a sub-thread only in full independence of the others. That means deep copying things to the AD tape in the thread, doing the computation, then extracting all the stuff as doubles only, and finally putting this together on the original thread.
It's a pain, but the only way we can use the current AD tape design. Anything else requires a huge refactor of the AD system from my understanding.
EDIT2: And I am almost glad that things crash on you, as it prevents bad things from happening if this is used wrongly. So I consider this good ;).
Ah that makes sense, thanks for clarifying. Always happy when my code crashing is a good thing haha
Sorry to keep throwing code at you, but I've 'ported' the approach over to a method similar to the one used in reduce_sum, and while the prim version is working like a dream, I still can't get the reverse version to run without segfaulting. Would you mind having a look when you've got a minute? No rush or anything.
The branch is up here, and I've got two small tests in there for prim & var inputs.
You need to save the gradient for each element. You cannot accumulate these; you need the gradient for every element.
I'm guessing you mean the gradient for the result of the function? Like:
// Copy calculated gradient
adjs__[i] = sub_v.adj();
Or by every element, do you mean every input?
But I suppose my main question is why it would segfault if I don't extract the gradients? The segfault suggests I'm wandering into memory that I shouldn't, whereas not extracting the gradients would just return a result with an incorrect gradient, right?
You need to allocate the varis of the arguments for each element. I think you did not get that quite right.
Unfortunately I still can't get it running without segfaulting. I've made sure that the gradients are being copied instead of accumulated and that the varis for all args have been allocated and copied into, but still no dice.
The updated branch is up here.
It does work for me:
[14:52:41][sebi@Sebastians-MBP:~/work/math-andrjohns]$ ./runTests.py -j6 test/unit/math/rev/functor/map_var_test.cpp
------------------------------------------------------------
make -j6 test/unit/math/rev/functor/map_var_test
clang++ -DSTAN_THREADS -std=c++1y -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT -I lib/tbb_2019_U8/include -O3 -I . -I lib/eigen_3.3.7 -I lib/boost_1.72.0 -I lib/sundials_5.2.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -DBOOST_DISABLE_ASSERTS -c -o test/unit/math/rev/functor/map_var_test.o test/unit/math/rev/functor/map_var_test.cpp
clang++ -DSTAN_THREADS -std=c++1y -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT -I lib/tbb_2019_U8/include -O3 -I . -I lib/eigen_3.3.7 -I lib/boost_1.72.0 -I lib/sundials_5.2.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -DBOOST_DISABLE_ASSERTS -Wl,-L,"/Users/sebi/work/math-andrjohns/lib/tbb" -Wl,-rpath,"/Users/sebi/work/math-andrjohns/lib/tbb" test/unit/math/rev/functor/map_var_test.o lib/gtest_1.8.1/src/gtest_main.cc lib/gtest_1.8.1/src/gtest-all.o lib/tbb/libtbb.dylib lib/tbb/libtbbmalloc.dylib lib/tbb/libtbbmalloc_proxy.dylib -o test/unit/math/rev/functor/map_var_test
------------------------------------------------------------
test/unit/math/rev/functor/map_var_test --gtest_output="xml:test/unit/math/rev/functor/map_var_test.xml"
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from map_variadic
[ RUN ] map_variadic.tbb
[ OK ] map_variadic.tbb (15 ms)
[----------] 1 test from map_variadic (15 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (15 ms total)
[ PASSED ] 1 test.
But I had to move the test into the test/ directories, which ensures that the needed dependencies are managed accordingly. Also don't forget to set STAN_THREADS.
However, it's not quite there yet, though you are on a good path. What I think makes sense is to have as input an array of T and, in addition, any shared arguments.
The precomputed gradient then has to have as partials the varis of T for the respective element and the varis of the shared arguments (which must be repeated for every element).
I would suggest that we have as output format an array of type Tout (so a different type than the input is admissible); you would just need to enforce that the sizes are all consistent, as we would otherwise have ragged data structures.
This would be a huge feature to have as not everything will be easily cast into a straight reduce thing.
Argh, after all that it was just directories and compiler flags. Thanks for sorting that out!
What I think makes sense is to have as input an array of T and in addition any shared arguments.
Doesn't that limit the flexibility of the approach to only be able to slice across one argument? The way I've set things up now allows for arbitrary slicing, but the output has to be known and passed to the map. Is there a benefit to restricting the inputs like that?
Can you explain arbitrary slicing a bit more? What gets sliced, and how? And what does not get sliced?
I'll preface this by saying that 'slicing' was probably the wrong word here. The way I've envisioned the facility is that the user/dev can pass multiple arguments and specify which get iterated over.
So they specify a struct, which has an operator() that takes an index and any number of arguments. Within this struct, they specify which arguments are iterated and which are used 'as-is':
struct example_fun {
template <typename Arg1, typename Arg2, typename Arg3>
auto operator()(size_t index, Arg1&& arg1, Arg2&& arg2, Arg3&& arg3) {
return stan::math::distance(arg1[index], arg2) + arg3[index];
}
};
The call to map_variadic then passes the output container, grainsize, and input arguments. The output container is used to determine the output type and which map_variadic_impl specialisation to call.
stan::math::map_variadic<example_fun>(out_vec, 1, &msgs, in_arg1, in_arg2, in_arg3);
How does that sound?
To have out_vec passed into the map thing is odd to me. Why not find out the output type via decltype?
The choice to pass an index into the map (and the entire container over which we iterate) is very inefficient, since this means that you need to copy, for every element, all the varis of the thing we iterate over. So I would much prefer if we only pass in the element i and not more (in addition to the shared arguments).
This sounds like you want to start a design doc for this?
Such a map thing would be very useful! The map_variadic name is not nice to my eyes. Since this will be a really cool map, I would name it just that: map! For consistency with other things we did, my suggestion would be to have std::vector<Tin> for the input and std::vector<Tout> for the output. Or at least this is what should be exposed to users; a more general signature can be made available for internal use in stan-math, I think.
To have out_vec passed into the map thing is odd to me. Why not find out the output type via decltype?
Because the backend is parallel_for, this is just distributing iterations of a loop. At each iteration a scalar is returned; while we can decltype the scalar, there's no easy way to deduce the type and shape of the container. It also makes it far easier to determine the number of output varis without having to do anything crazy with the parameter pack.
So I would much prefer if we only pass in the element i and not more (in addition with the shared arguments).
While that would certainly be more efficient, I can't think of a way to do it without restricting things to only iterate over one argument, which would cut down on the flexibility considerably. If you've got suggestions on some things to try, I'd be happy to look into it though.
For consistency with other things we did, my suggestion would be to have std::vector<Tin> for the input and std::vector<Tout> for the output
That's a good idea, will do
Because the backend is parallel_for, this is just distributing iterations of a loop. At each iteration a scalar is returned, while we can decltype the scalar, there's no easy way to deduce the type and shape of container. It also makes it far easier to determine the number of output vari without having to do anything crazy with the parameter pack
It should be possible to return anything from the function, not just a scalar. That does make it a bit harder to code, as you need multiple precomputed_gradients; a Jacobian would then be calculated per output element... though we can enable this later if that's too hard for now.
While that would certainly be more efficient, I can't think of a way to do it without restricting things to only iterate over one argument, which would cut down on the flexibility considerably. If you've got suggestions on some things to try, I'd be happy to look into it though.
If you restrict the input to be std::vector<Tin>, then that's not too hard to do, no? In any case, we really have to do this, as we otherwise have N*N output varis for N iteration items... which would limit the performance way too much (a map is already not so efficient, as there is no reduce step which cuts down on the output varis).
N iteration items N*N output varis.
Is that the case here? The varis are allocated only once prior to iteration, so it's just one vari allocated per var:
vari** varis = ChainableStack::instance_->memalloc_.alloc_array<vari*>(
num_vars_all_terms);
But the user-provided function can access any of the varis. Hence, you need to insert all of them into the AD tape, since the user function had access to them. This is why I would only pass in the i-th element and not more.
Bump... is this moving?
Yep. The framework itself is in a place that I'm happy with, the plan is to add it to a couple of functions to check that the gradients are all consistent, and do some benchmarking. After those, I'll write up a design doc to get some wider discussion about a good way to implement/expose/use this.
Also, I can only do Stan stuff outside of work, so progress on things is generally piecemeal during the week
No rush... take your time. I was just curious.
If there is something to look at, just ping.
Got an interesting one for you. I've got things (kind of) working with an implementation that works with pointers rather than copies, and added it to apply_scalar_unary. However, the mix test for a given function (in this case exp) only passes intermittently. I could understand always passing or always failing, but I'm not sure what it means to only fail sometimes.
The branch is up here if you could have a look
I'm not sure what it means to only fail sometimes.
In my experience, that's usually due to improper synchronization or undefined default initialization.
Thanks Bob! To clarify, by undefined default initialization I'm guessing you mean initializing the output container of var? What do you mean by improper synchronization?
Apologies for the basic questions, I haven't touched much of the autodiff side of the Math library before
Description
Vectorised unary functions (e.g., exp, log) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB's parallel_for functionality to evaluate the container in parallel.
A minimal example of this would be applying exp to every element of a large Eigen vector.
Benchmarking shows performance of ~2x that of Eigen's vectorised exp, and ~5x that of a simple loop (benchmarking code at end of post).
This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.
I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising
Current Version:
v3.2.0
Code used for benchmarking: