Open andrjohns opened 4 years ago
very nice! If it's not too much work, want to try out lgamma also?
Also is it possible to use Eigen's vectorized form inside the parallel for on a slice of the vector?
How many cores did the evaluation use?
How is this going to work when other things are being sent to TBB around this, like map-rect and its descendants?
How are the autodiff types synchronized? The benchmarks looked to be only for double values.
I'd suggest putting the random generation code outside the blocks being tested so it's not included in their timing and all functions are called on the same inputs.
How do you suggest integrating this into math? I mean, would you fire off the parallel version when the container size is sufficiently large, or would you provide additional function definitions which would also allow specifying a grainsize (which essentially also controls this chunking)?
Other than that this plays well with the other map_rect / reduce_sum functions as we just throw tasks at the TBB and everything is designed such that we can use nested parallelism.
In all honesty, I think what would be of great interest is a variadic map (just like reduce_sum, but a 1-1 map thing). The issue here is to figure out how to concatenate outputs.
Another thing would be to provide parallel lpdfs / lpmfs which kick in for large data sets. Maybe the glm functions are actually even better suited for that given that more work is happening. Right now the glm functions are accelerated on the GPU, but multiple CPUs can help as well.
Given that you are the master of the apply* functions, @andrjohns: what I think could be cool is to use parallel_reduce from the TBB to speed up vectorised unary reduce functions. This could presumably be implemented on the back of reduce_sum.
+1 to using Eigen expressions within the parallel for instead of loop + std::exp
@rok-cesnovar
If it's not too much work, want to try out lgamma also?
That saw some really nice speed-ups too:
ParLgamma_dbl 30094603 ns 30093781 ns 23
LoopLgamma_dbl 318332120 ns 318325288 ns 2
Also is it possible to use Eigen's vectorized form inside the parallel for on a slice of the vector?
I'm not sure, to be honest, since parallel_for has to be passed a loop for it to parallelise. One approach would be to do something like:
m_res.segment(i, 10).array() = m_d.segment(i, 10).array().exp();
But would have to ensure that each iteration doesn't overlap, somehow. @t4c1 any ideas here?
@bob-carpenter
How many cores did the evaluation use?
This was using the default and letting the TBB choose automatically (I have 8 physical cores available). It looks like the parallelised exp is slower than Eigen's (but faster than a loop) for 2 cores:
ParExp_dbl 27989277 ns 27987591 ns 25
EigExp_dbl 23901004 ns 23899449 ns 28
LoopExp_dbl 53516822 ns 53513638 ns 10
But faster for 4 cores:
ParExp_dbl 14673224 ns 14671461 ns 46
EigExp_dbl 23816895 ns 23814757 ns 27
LoopExp_dbl 53724174 ns 53722104 ns 10
How is this going to work when other things are being sent to TBB around this, like map-rect and its descendants?
As Sebastian mentioned, the TBB handles this kind of scheduling.
How are the autodiff types synchronized? The benchmarks looked to be only for double values.
I can't test autodiff types yet, as the perf-math library is segfaulting when assigning values to vars (i.e., just var x = 5.0; segfaults). I can post benchmarks once I've chased that down.
I'd suggest putting the random generation code outside the blocks being tested so it's not included in their timing and all functions are called on the same inputs.
The generation was already separate from the timing, but I moved to a single vector of inputs shared by all runs, and the timings were consistent.
@wds15
How do you suggest to integrate this into math?
To be honest, I hadn't thought that far ahead (got caught up in whether I could, not whether I should). My (likely simplistic) perspective was to use the parallel version depending on container size (similar to the GPU code). I feel like there's some value to 'naive' parallelism which users can just take advantage of without changing their code.
As for your other suggestions, I need to get a bit more familiar with the options in the TBB before I can talk about other applications with much confidence. I just saw the unary functions as a good first step, since they can be easily represented as a parallelisable loop. We can even build this into binary scalar functions, so things like lbeta(Matrix, Matrix) could be parallelised in a similar way. But I'm happy to go another direction that would generalise better.
But would have to ensure that each iteration doesn't overlap
Doesn't TBB ensure that the blocked ranges don't overlap?
parallel version depending on container size (similar to the GPU code)
I would actually like to get rid of that, as the values are arbitrary and would need to be modified every time some performance improvement is made to the sequential or parallel version (for the GPU Cholesky decomposition, the N for switching is still 1250, though recent improvements have lowered that to at least half of that).
@rok-cesnovar & @t4c1 Spoke too soon, parallel_for gives you the first index and upper bound for the loop:
for (size_t i = r.begin(); i < r.end(); ++i)
So I can use that to segment the Eigen vector and use Eigen's exp:
m_res.segment(r.begin(), r.end() - r.begin()).array()
= m_d.segment(r.begin(), r.end() - r.begin()).array().exp();
While there's no noticeable difference when using all 8 cores for the parallelism:
ParExp_dbl 9824678 ns 9816867 ns 72
ParExp_Eig_dbl 9719649 ns 9693837 ns 72
EigExp_dbl 23886641 ns 23887389 ns 27
It performs much better when dropping down to 2 cores:
ParExp_dbl 33495995 ns 33495759 ns 20
ParExp_Eig_dbl 12505482 ns 12504933 ns 52
EigExp_dbl 23954809 ns 23953778 ns 28
That will be good for delivering benefits to users with lower core-count CPUs.
My guess is that on 8 cores exp is so fast that it saturates the memory bus.
Is your 8 core run improving in speedup with larger container sizes? Otherwise the grainsize could be important to tune a bit to tell the TBB what are good chunk sizes here.
My only concern with parallelising these little functions is that it may not matter after all in a big Stan program. Think about Amdahl's law: you have to run a large percentage of your code in parallel to get any useful scaling. Speeding up exp, which is just one of many operations, is not very promising.
This is why I suggested to use the TBB for the glm functions.
I really don't want to discourage you, but maybe think about an entire Stan program where this will be useful.
I see what you mean, but I feel like this would benefit a wider range of users than the glm functions. Something like this can be built into the apply_scalar_unary and apply_scalar_binary frameworks, so that all unary and binary functions get parallelised. This would also speed up the other functions and distributions that use them, so I think this would parallelise more of a given program than the individual glms.
I hope you are right, but I am very doubtful about this. So maybe you can hack up a quick and dirty prototype and provide an end-to-end performance test for a Stan model?
The issue is that you end up running your program in super small slices in parallel, and it is hard for me to imagine that being fast. Hence, a prototype showing the merits of this approach is really what's needed from my view to proceed.
(I really appreciate more TBB use - don't get me wrong - but we don't want to waste our resources obviously)
I think it would suffice to see what effect parallelization of unary/binary functions has on the gradient of an _lpdf function. That would probably give us enough information. Not sure which would be a good candidate.
Thanks for the comments both! Either way, this will take some benchmarking to see if it's beneficial. I'll work up a rough implementation for unary and binary functions and then we can figure out what some good models for testing would be
Benchmarking some lpdfs which use this is a start, but not sufficient, I'm afraid. While a lot of work goes on in the lpdfs, these often do not constitute most of the work, from what I have seen. That is why I keep encouraging people to move more stuff into reduce_sum.
This is why I would target the glm functions which should indeed represent the bulk of work in a Stan program in case you use that.
... but let's find out!
Other than that this plays well with the other map_rect / reduce_sum functions as we just throw tasks at the TBB and everything is designed such that we can use nested parallelism.
What I'd like to see is an extension of the existing evaluation to more parallel top-level jobs, to see what happens when cores saturate:
The current eval was for only 1 top-level job. What I want is a proxy for running parallel chains vs. parallelizing within a chain.
Yeah, I would like to set up something like this (but don't have time): we run a reduce_sum over the outer array (corresponding to chains), and these then call an inner reduce_sum for each chain.
That will be interesting to evaluate, I think.
@wds15 Can I borrow your eyes for a minute?
I've got a basic implementation in apply_scalar_unary (Eigen types only, fixed at 2 cores). While it's working for Eigen types of fvar<double> and fvar<fvar<double>>, it segfaults when passed an Eigen type of var. Is this something you encountered with reduce_sum?
The branch is up here
It's a bit more tedious to do parallel reverse mode. So you have to do what we do for reduce_sum. In essence:
Oh, righto. That is tedious.
But thanks!
This is why I am interested in a variadic map... which should allow us to implement what you are doing much more easily.
I wonder what we would need to do to get rid of the deep copy. I've been trying to think about this, but that feels off-topic, so we should talk about it somewhere else.
I hadn't realised the amount that needs to be plumbed in for each implementation with reverse mode; you're right that a more general implementation will be a much better idea. I'll start getting my head around how reduce_sum handles things and see what I can do.
I think I've got a general framework (for primitives only at this point).
The function to be parallelised is passed as a functor which takes the index as an input. Within that functor, the user can specify which objects get iterated through:
decltype(auto) func = [&](size_t index){
outv[index] = stan::math::exp(inv[index]);
};
Equivalently, this also allows the user to iterate over only some of the arguments (e.g., a binary function with a container and a scalar as inputs):
decltype(auto) func_binary = [&](size_t index){
outv2[index] = stan::math::log_diff_exp(inv[index], offset);
};
This functor is then passed to a function, map_variadic, along with the length of the loop and any arguments to be passed:
template<typename F, typename... Args>
void map_variadic (F&& f, size_t length, Args&&... args) {
tbb::parallel_for(
tbb::blocked_range<size_t>(0, length),
[&f,&args...](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); ++i) {
f(i);
}
});
}
Putting it in a self-contained example, this looks like:
#include <stan/math.hpp>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <iostream>
#include <cmath>
template<typename F, typename... Args>
void map_variadic (F&& f, size_t length, Args&&... args) {
tbb::parallel_for(
tbb::blocked_range<size_t>(0, length),
[&f,&args...](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); ++i) {
f(i);
}
});
}
int main() {
Eigen::VectorXd inv = Eigen::VectorXd::Random(10000);
Eigen::VectorXd outv = Eigen::VectorXd::Random(10000);
Eigen::VectorXd outv2 = Eigen::VectorXd::Random(10000);
double offset = 0.5;
decltype(auto) func = [&](size_t index){
outv[index] = stan::math::exp(inv[index]);
};
decltype(auto) func_binary = [&](size_t index){
outv2[index] = stan::math::log_diff_exp(inv[index], offset);
};
map_variadic(func, inv.size(), inv, outv);
map_variadic(func_binary, inv.size(), inv, outv2, offset);
std::cout << "Serial Unary: " << stan::math::exp(inv[0]) << std::endl
<< "Parallel Unary: " << outv[0] << std::endl
<< "Serial Binary: " << stan::math::log_diff_exp(inv[0], offset) << std::endl
<< "Parallel Binary: " << outv2[0] << std::endl;
}
Which returns:
Serial Unary: 1.97462
Parallel Unary: 1.97462
Serial Binary: -1.12117
Parallel Binary: -1.12117
I still need to figure out the reverse side of things, but how does that look as a starting point?
Starts to look very nice. What I don't see are the deep copies and the nested autodiff bits. Moreover, note that for reduce_sum the passed-in functor is given via a type argument only. This ensures that no internal state of the functor is hidden from reduce_sum. The concern is that internal vars would not be deep copied, which would make the approach invalid. @nhuurre suggested on the forums a way to handle closures which would work with reduce_sum.
It's a bit delicate to get reverse mode multi-threaded things to work given the way our AD tape is structured.
BTW, the reason why I am very interested in an apply-reduce framework on the back of reduce_sum is that this will lead to a compressed AD tape, which usually improves CPU cache use (reduce_sum operates on slices of things and computes the pre-computed gradient for each sub-slice, so the reverse sweep is done on smaller data structures which hopefully live longer in the CPU cache).
I'm trying with deep copies but I'm still getting segfaults, even without returning the final result. I was expecting it to at least run without segfaulting, and then be able to extract the values and gradients.
Am I misunderstanding the order of things here? This is quickly getting deeper than I've gone into the reverse mode before, so I'm still catching up on the mechanics of everything.
#include <stan/math.hpp>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
template <typename Return, typename F, typename ArgsTuple>
auto map_variadic_var(Return&& result, F&& f, size_t length,
int grainsize, ArgsTuple&& args_tuple) {
tbb::parallel_for(
tbb::blocked_range<size_t>(0, length),
[&result,&f,&args_tuple](const tbb::blocked_range<size_t>& r) {
// Initialize nested autodiff stack
const stan::math::nested_rev_autodiff begin_nest;
// Create nested autodiff copies that do not point
// back to main autodiff stack
auto args_tuple_local_copy = apply(
[&](auto&&... args) {
return std::tuple<decltype(stan::math::deep_copy_vars(args))...>(
stan::math::deep_copy_vars(args)...);
},
args_tuple);
// Parallelise evaluation of copied vars
for (size_t i = r.begin(); i < r.end(); ++i) {
f(i, args_tuple_local_copy);
}
});
}
int main() {
stan::math::vector_v inv = Eigen::VectorXd::Random(10000);
stan::math::vector_v outv(10000);
// Functor with tuple indexing to denote variable(s) to be
// manipulated
decltype(auto) func = [&](size_t index, auto&& in_arg){
return stan::math::exp(std::get<0>(in_arg)[index]);
};
map_variadic_var(outv, func, inv.size(), 1, std::forward_as_tuple(inv));
}
you still have to extract the value (the result of the function) and the gradient. This then has to be put into a precomputed gradient thing. You should follow closely what we did for reduce_sum.
EDIT: The thing is that the AD tapes in the different threads must never be interlinked. As such, you can use AD on a sub-thread only in full independence of the others. That means deep copying things to the AD tape in the thread, doing the computation, then extracting all the stuff as doubles only, and finally putting this together on the original thread.
It's a pain, but the only way we can use the current AD tape design. Anything else requires a huge refactor of the AD system from my understanding.
EDIT2: And I am almost glad that things crash on you, as it prevents bad things from happening if this is used wrongly. So I consider this good ;).
Ah that makes sense, thanks for clarifying. Always happy when my code crashing is a good thing haha
Sorry to keep throwing code at you, but I've 'ported' the approach over to a method similar to the one used in reduce_sum, and while the prim version is working like a dream, I still can't get the reverse version to run without segfaulting. Would you mind having a look when you've got a minute? No rush or anything.
The branch is up here, and I've got two small tests in there for prim & var inputs.
You need to save the gradient for each element. You cannot accumulate these; you need the gradient for every element.
I'm guessing you mean the gradient for the result of the function? Like:
// Copy calculated gradient
adjs__[i] = sub_v.adj();
Or by every element, do you mean every input?
But I suppose my main question is why it would segfault if I don't extract the gradients? The segfault suggests I'm wandering into memory that I shouldn't, whereas not extracting the gradients would just return a result with an incorrect gradient, right?
You need to allocate the varis of the arguments for each element. I think you did not get that quite right.
Unfortunately I still can't get it running without segfaulting. I've made sure that the gradients are being copied instead of accumulated and that the varis for all args have been allocated and copied into, but still no dice.
The updated branch is up here.
It does work for me:
[14:52:41][sebi@Sebastians-MBP:~/work/math-andrjohns]$ ./runTests.py -j6 test/unit/math/rev/functor/map_var_test.cpp
------------------------------------------------------------
make -j6 test/unit/math/rev/functor/map_var_test
clang++ -DSTAN_THREADS -std=c++1y -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT -I lib/tbb_2019_U8/include -O3 -I . -I lib/eigen_3.3.7 -I lib/boost_1.72.0 -I lib/sundials_5.2.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -DBOOST_DISABLE_ASSERTS -c -o test/unit/math/rev/functor/map_var_test.o test/unit/math/rev/functor/map_var_test.cpp
clang++ -DSTAN_THREADS -std=c++1y -Wno-unknown-warning-option -Wno-tautological-compare -Wno-sign-compare -D_REENTRANT -I lib/tbb_2019_U8/include -O3 -I . -I lib/eigen_3.3.7 -I lib/boost_1.72.0 -I lib/sundials_5.2.0/include -I lib/gtest_1.8.1/include -I lib/gtest_1.8.1 -DBOOST_DISABLE_ASSERTS -Wl,-L,"/Users/sebi/work/math-andrjohns/lib/tbb" -Wl,-rpath,"/Users/sebi/work/math-andrjohns/lib/tbb" test/unit/math/rev/functor/map_var_test.o lib/gtest_1.8.1/src/gtest_main.cc lib/gtest_1.8.1/src/gtest-all.o lib/tbb/libtbb.dylib lib/tbb/libtbbmalloc.dylib lib/tbb/libtbbmalloc_proxy.dylib -o test/unit/math/rev/functor/map_var_test
------------------------------------------------------------
test/unit/math/rev/functor/map_var_test --gtest_output="xml:test/unit/math/rev/functor/map_var_test.xml"
Running main() from lib/gtest_1.8.1/src/gtest_main.cc
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from map_variadic
[ RUN ] map_variadic.tbb
[ OK ] map_variadic.tbb (15 ms)
[----------] 1 test from map_variadic (15 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (15 ms total)
[ PASSED ] 1 test.
But I had to move the test into the test/ directories, which ensures that the needed dependencies are managed accordingly. Also don't forget to set STAN_THREADS.
However, it's not quite there yet, though you are on a good path. What I think makes sense is to have as input an array of T and, in addition, any shared arguments.
The precomputed gradient then has to have as partials the varis of T for the respective element and the varis of the shared arguments (which must be repeated for every element).
I would suggest that we have as output format an array of type Tout (so a different type than the input is admissible); you would just need to enforce that the sizes are all consistent, as we would otherwise have ragged data structures.
This would be a huge feature to have as not everything will be easily cast into a straight reduce thing.
Argh, after all that it was just directories and compiler flags. Thanks for sorting that out!
What I think makes sense is to have as input an array of T and in addition any shared arguments.
Doesn't that limit the flexibility of the approach to only be able to slice across one argument? The way I've set things up now allows for arbitrary slicing, but the output has to be known and passed to the map. Is there a benefit to restricting the inputs like that?
Can you explain arbitrary slicing a bit more? What gets sliced, and how? And what does not get sliced?
I'll preface this by saying that 'slicing' was probably the wrong word here. The way I've envisioned the facility is that the user/dev can pass multiple arguments and specify which get iterated over.
So they specify a struct, which has an operator() that takes an index and any number of arguments. Within this struct, they specify which arguments are iterated and which are used 'as-is':
struct example_fun {
template <typename Arg1, typename Arg2, typename Arg3>
auto operator()(size_t index, Arg1&& arg1, Arg2&& arg2, Arg3&& arg3) {
return stan::math::distance(arg1[index], arg2) + arg3[index];
}
};
The call to map_variadic then passes the output container, grainsize, and input arguments. The output container is used to determine the output type and which map_variadic_impl specialisation to call.
stan::math::map_variadic<example_fun>(out_vec, 1, &msgs, in_arg1, in_arg2, in_arg3);
How does that sound?
To have out_vec passed into the map thing is odd to me. Why not find out the output type via decltype?
The choice to pass an index into the map (and the entire container over which we iterate) is very inefficient, since this means that you need to copy, for every element, all the varis of the thing we iterate over. So I would much prefer if we only pass in the element i and not more (in addition to the shared arguments).
This sounds like you want to start a design doc for this?
Such a map thing would be very useful! The map_variadic name is not nice to my eyes. Since this will be a really cool map, I would name it just that: map! For consistency with other things we did, my suggestion would be to have std::vector<Tin> for the input and std::vector<Tout> for the output. Or at least this is what should be exposed to users; a more general signature can be made available for internal use in stan-math, I think.
To have out_vec passed into the map thing is odd to me. Why not find out the output type via decltype?
Because the backend is parallel_for, this is just distributing iterations of a loop. At each iteration a scalar is returned; while we can decltype the scalar, there's no easy way to deduce the type and shape of the container. It also makes it far easier to determine the number of output varis without having to do anything crazy with the parameter pack.
So I would much prefer if we only pass in the element i and not more (in addition with the shared arguments).
While that would certainly be more efficient, I can't think of a way to do it without restricting things to only iterate over one argument, which would cut down on the flexibility considerably. If you've got suggestions on some things to try, I'd be happy to look into it though.
For consistency with other things we did, my suggestion would be to have std::vector<Tin> for the input and std::vector<Tout> for the output
That's a good idea, will do
Because the backend is parallel_for, this is just distributing iterations of a loop. At each iteration a scalar is returned, while we can decltype the scalar, there's no easy way to deduce the type and shape of container. It also makes it far easier to determine the number of output vari without having to do anything crazy with the parameter pack
It should be possible to return anything from the function, not just a scalar. That does make it a bit harder to code, as you need multiple precomputed_gradients; a Jacobian would then be calculated per output element... though we can enable this later if that's too hard for now.
While that would certainly be more efficient, I can't think of a way to do it without restricting things to only iterate over one argument, which would cut down on the flexibility considerably. If you've got suggestions on some things to try, I'd be happy to look into it though.
If you restrict the input to be std::vector<Tin>, then that's not too hard to do, no? In any case, we really have to do this, as we otherwise have N*N output varis for N iteration items... which would limit the performance way too much (a map is already not so efficient, as there is no reduce step which cuts down on the output varis).
N iteration items N*N output varis.
Is that the case here? The varis are allocated only once prior to iteration, so it's just one vari allocated per var:
vari** varis = ChainableStack::instance_->memalloc_.alloc_array<vari*>(
num_vars_all_terms);
But the user-provided function can access any of the varis. Hence, you need to insert all of them into the AD tape, since the user function had access to them. This is why I would only pass in the i-th element and not more.
Bump... is this moving?
Yep. The framework itself is in a place that I'm happy with, the plan is to add it to a couple of functions to check that the gradients are all consistent, and do some benchmarking. After those, I'll write up a design doc to get some wider discussion about a good way to implement/expose/use this.
Also, I can only do Stan stuff outside of work, so progress on things is generally piecemeal during the week
No rush... take your time. I was just curious.
If there is something to look at, just ping.
Got an interesting one for you. I've got things (kind of) working with an implementation that works with pointers rather than copies, and added it to apply_scalar_unary. However, the mix test for a given function (in this case exp) only passes intermittently. I could understand always passing or always failing, but I'm not sure what it means to only fail sometimes.
The branch is up here if you could have a look
I'm not sure what it means to only fail sometimes.
In my experience, that's usually due to improper synchronization or undefined default initialization.
Thanks Bob! To clarify, by undefined default initialization I'm guessing you mean initializing the output container of var? What do you mean by improper synchronization?
Apologies for the basic questions, I haven't touched much of the autodiff side of the Math library before
Description
Vectorised unary functions (e.g., exp, log) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB's parallel_for functionality to evaluate the container in parallel.
A minimal example of this would be applying exp to every element of a large Eigen vector.
Benchmarking shows performance of ~2x that of Eigen's vectorised exp, and ~5x that of a simple loop (benchmarking code at end of post).
This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.
I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising
Current Version:
v3.2.0
Code used for benchmarking: