Great to see this here.
It will likely take a moment until it lands in brms given that RStan is still at 2.19. Should we prepare, in the meantime, a short vignette showing with an example how the Stan code from the current brms needs to be modified in order to run with CmdStan? Is that an option? Or maybe a wiki page as an alternative?
Would issue #891 help with this? :-)
Obviously yes.
Did the tutorial/wiki that @wds15 mentioned ever get created?
Inspired by @wds15's talk at StanCon, I started thinking about reduce_sum in brms a little more and I would like your (read: anybody's) input.
1) Over what variable should the automatic slicing be done? In a standard brms model, we could slice over y (the response variable) or over mu (the mean predictor term). A standard likelihood of brms could look as follows:
vector[N] y;
vector[N] mu;
real sigma;
target += normal_lpdf(y | mu, sigma);
As far as I understand, slicing over mu would be computationally preferable, as y can be copied by reference even if passed to reduce_sum via an additional argument, while mu would be passed by value if it were an additional argument. Does anybody know how big an influence this decision has?
2) I know it is preferable to parallelize as much as possible, but the computation of the predictor term mu may involve so many different variables that the housekeeping required to stuff them all into reduce_sum would be too much for brms to handle, I think. Given that I currently plan to use reduce_sum only over the likelihood call, with mu being computed outside of it in a serial way, would reduce_sum still be worthwhile? Does there exist some benchmark data for this case?
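To make the two slicing options from point 1 concrete, here is a minimal hypothetical sketch (the function names are made up; reduce_sum requires the sliced argument to be an array, so the sliced quantity is written as real[]):
functions {
  // Option A: slice over mu (passed as real[]); y stays a shared data argument
  real partial_slice_mu(real[] mu_slice, int start, int end, vector y, real sigma) {
    return normal_lpdf(y[start:end] | to_vector(mu_slice), sigma);
  }
  // Option B: slice over y; mu is a shared argument and gets copied (it holds parameters)
  real partial_slice_y(real[] y_slice, int start, int end, vector mu, real sigma) {
    return normal_lpdf(to_vector(y_slice) | mu[start:end], sigma);
  }
}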
Yep, slicing over mu rather than over y is much better.
It would still be worthwhile if the computational cost of the likelihood is very large compared to computing mu serially (likelihoods with gamma function calls like the Poisson should work OK, and similar ones... but probably not a normal lpdf). This is probably a good starting point. Still, is it thinkable to write the entire model block out as a function and just pass everything into it?
I can't offer any benchmarks on this and would suggest just finding out... I am looking forward to these results (read: results someone else worked out). It's all about Amdahl's law, as I explained...
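(For reference, Amdahl's law bounds the benefit: if a fraction $p$ of the work per gradient evaluation is parallelized over $s$ cores, the attainable speedup is at most $1 / ((1 - p) + p/s)$, so any part computed serially, such as mu here, caps what parallelizing only the likelihood can gain.)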
Thanks! Let's aim high and parallelize the whole model block. Take, for example, the model block of a simple varying intercept-slope model, with priors removed (we don't want to include priors in reduce_sum, do we?):
model {
// initialize linear predictor term
vector[N] mu = Intercept + Xc * b;
for (n in 1:N) {
// add more terms to the linear predictor
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
}
// likelihood including all constants
if (!prior_only) {
target += normal_lpdf(Y | mu, sigma);
}
}
Here is how it would look:
functions {
real partial_log_lik(real[] Y, int start, int end, real Intercept,
matrix fXc, vector b, int[] fJ_1, vector fZ_1_1,
vector fZ_1_2, vector r_1_1, vector r_1_2,
real sigma) {
int N = end - start + 1;
matrix[N, cols(fXc)] Xc = fXc[start:end];
int J_1[N] = fJ_1[start:end];
vector[N] Z_1_1 = fZ_1_1[start:end];
vector[N] Z_1_2 = fZ_1_2[start:end];
real ptarget = 0;
// initialize linear predictor term
vector[N] mu = Intercept + Xc * b;
for (n in 1:N) {
// add more terms to the linear predictor
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
}
ptarget += normal_lpdf(Y | mu, sigma);
return ptarget;
}
}
model {
// likelihood including all constants
if (!prior_only) {
target += reduce_sum(partial_log_lik, Y, 1, ...)
}
}
where ... are all the additional arguments passed to partial_log_lik.
I do a lot of housekeeping at the start of partial_log_lik, where I subset the original vectors etc. so that they become the partial vectors we need. The full vectors to be subsetted get the prefix f so that the subsetted vectors have the original names. That way, only the definition header of partial_log_lik differs from the original model block's code, which should make the required changes to brms' Stan code generation much less cumbersome.
Above, I use Y as the first argument not because it is necessarily the best thing to reduce over, but because it is the only thing that is always available no matter what the model looks like.
@wds15 What do you think of this approach? Am I missing something important in the spec?
This starts to look very good! My thoughts on this:
matrix[N, cols(fXc)] Xc = fXc[start:end];
is very expensive for the AD tape. Xc will be instantiated as a parameter and add significantly to the AD cost. To avoid this, you would have to avoid declaring Xc and instead write the expression directly... so replace
vector[N] mu = Intercept + Xc * b;
with
vector[N] mu = Intercept + fXc[start:end] * b;
I will write a brms branch where I support just a small subset of brms models in the way described above. That should take just a few hours but would enable us to rapidly test a lot of cases (instead of doing all the hand-writing).
With regard to Xc, can you explain why this specification is a problem for the AD tape while the same for, say, Z_1_1 is not? Also, your suggestion would kill the principle of leaving the fundamental model block unchanged and just adding a definition header, so I would like to avoid it if possible.
The Z_* already come in as parameters, and it looks to me as if doing this re-indexing (basically that is what it is) makes it easier for you to work within brms' code generation... but any real container (real, vector, matrix) in a user function is always instantiated as var (even if you assign data to it). At that point, Xc is written to the AD tape. If you instead do what I suggested, then Xc is never written to the AD tape, since you use it directly in a matrix-vector multiplication. Doing so keeps the matrix off the AD tape and uses a heavily optimised matrix-vector product where only terms of the size of the vector end up on the AD tape.
However, if it is really hard to integrate that in your code gen, then just keep it as is for now (there is hope that the Stan compiler gets smarter and the problem goes away automatically).
I see. So, in brms, Z_* is data as well, so I suspect the same applies to those variables too, right? That is, ideally don't redefine them but use them directly. The problem with this is that Z_1_1[start:end] would be repeated in every evaluation of the loop, which is probably not what we want either...
I will go ahead and do the header-only-change for now, even if inefficient in some cases so that we have a version to try out.
Oh... if that's also data... then yes, avoid the temporary re-assignments if you can.
So instead of
Z_1_1[n]
use
fZ_1_1[start + n - 1]
Then you keep this thing as data.
This is going to be a code-generation nightmare :-D Let's see if I can find a principled way to enable this indexing on the fly.
I usually write partial sum functions with a loop over the slice, but with two indices: one spanning n = 1...N, for N being the slice size, and another for the global context, n_overall = n + start - 1. With these two defined, it's often easy to write down what you want... but I understand that this may be harder for your brms code-gen things.
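For illustration, here is a minimal sketch of that two-index pattern (the names are hypothetical; the full-length data arrays are indexed with the global index nn, so no temporary copies are made):
functions {
  real partial_log_lik(int[] seq, int start, int end, vector Y,
                       int[] J, vector Z, vector r, real sigma) {
    int N = end - start + 1;  // size of this slice
    vector[N] mu;
    for (n in 1:N) {
      int nn = n + start - 1;  // global index into the full-length data
      mu[n] = r[J[nn]] * Z[nn];
    }
    return normal_lpdf(Y[start:end] | mu, sigma);
  }
}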
Thanks. I will see what I can make possible in a first step.
Maybe start by taking care of Xc... which is of the order of N * number of regressors... but Z is just of size N.
Yeah, but in order to make that scale to all the complexity of brms models, we should consistently use one or the other approach all the way through and not mix things up. I will play around with it and report back once there is something that works (ish).
Ok, one follow-up question, given that the header-change approach would be so much simpler: Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions?
I don't know... with some luck this is even already available with the recently released stanc3 optimisations, which can be turned on. I have not been following this closely enough to know whether it is there or still in the making.
I would have hoped that a simple text-replacement approach would work, to start with at least.
Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions?
This is in 2.24 already. If you specify --O (not --o) it will optimize this case. Optimizations are currently experimental, but at least these optimizations work as they should (at least as far as we tested them, of course).
So
functions {
real foo(data matrix X, vector paramv) {
matrix[2,2] Xs = X[1:2];
return sum( Xs * paramv);
}
}
without optimization:
template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
using local_scalar_t__ = stan::promote_args_t<T1__>;
const static bool propto__ = true;
(void) propto__;
local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
(void) DUMMY_VAR__; // suppress unused var warning
try {
Eigen::Matrix<local_scalar_t__, -1, -1> Xs;
Xs = Eigen::Matrix<local_scalar_t__, -1, -1>(2, 2);
stan::math::fill(Xs, DUMMY_VAR__);
current_statement__ = 1;
assign(Xs, nil_index_list(),
rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
"assigning variable Xs");
current_statement__ = 2;
return sum(multiply(Xs, paramv));
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
}
with optimization:
template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
using local_scalar_t__ = stan::promote_args_t<T1__>;
const static bool propto__ = true;
(void) propto__;
local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
(void) DUMMY_VAR__; // suppress unused var warning
try {
Eigen::Matrix<double, -1, -1> lcm_sym2__;
double lcm_sym1__;
{
Eigen::Matrix<double, -1, -1> Xs;
Xs = Eigen::Matrix<double, -1, -1>(2, 2);
stan::math::fill(Xs, std::numeric_limits<double>::quiet_NaN());
assign(lcm_sym2__, nil_index_list(),
rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
"assigning variable lcm_sym2__");
current_statement__ = 2;
return sum(multiply(lcm_sym2__, paramv));
}
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
}
Note that Xs is a matrix of doubles which is a huge win in this case.
Thanks @rok-cesnovar ... but a requirement is that you declare things as
real foo(data matrix X, vector paramv) {
matrix[2,2] Xs = X[1:2];
return sum( Xs * paramv);
}
So X must be declared as data matrix, and at that point one can of course only call it with data.
Is that an option for you @paul-buerkner ?
(BTW, this is super cool to know... I have been waiting for this feature for ages)
Yes, you need the data qualifier. Otherwise stanc3 can't infer the type. In theory it could probably check which inputs it is used with and create extra signatures, but not at the moment.
Nice! So I just need to put the data qualifier before the variable type and things should be fine? That would be awesome!
Yes + call the cmdstan_model with an argument. I will post an example call when I am back at my laptop.
Great, thanks @rok-cesnovar! @wds15 After some work today, I got a simple version working already, which used the subsetting-on-the-fly approach you advocated before knowing of the new stanc optimization. I will post the version tomorrow or so when I am convinced of it. I am no longer sure which of the two discussed versions would be easier to implement and maintain. I will play around with it more.
If you get it to work with the subsetting as suggested first, then this will be more flexible... since the data declaration in the arguments makes any function less flexible, which could limit things with respect to missing-data handling, for example.
Forgot about this, sorry..
library(cmdstanr)
file_path <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(file_path, stanc_options = list(O=TRUE))
You can run cmdstan_model with quiet = FALSE to double check.
You should see something like (note the --O):
--- Translating Stan model to C++ code ---
bin/stanc --O --name='bernoulli_model' --o=/tmp/Rtmp2EuAeC/model-1c3a2fc5c314.hpp /tmp/Rtmp2EuAeC/model-1c3a2fc5c314.stan
brms now supports threading via reduce_sum
for most models. However, this is super experimental and a lot of cleaning will be required. Here is an example for a very simple model:
data("sleepstudy", package = "lme4")
make_stancode(Reaction ~ Days + (Days | Subject), sleepstudy,
family = gaussian(), threads = 2)
// generated with brms 2.13.10
functions {
/* integer sequence of values
* Args:
* start: starting integer
* end: ending integer
* Returns:
* an integer sequence from start to end
*/
int[] sequence(int start, int end) {
int seq[end - start + 1];
for (n in 1:num_elements(seq)) {
seq[n] = n + start - 1;
}
return seq;
}
// compute partial sums of the log-likelihood
real partial_log_lik(int[] seq, int start, int end, vector Y, matrix Xc, vector b, real Intercept, real sigma, int[] J_1, vector Z_1_1, vector Z_1_2, vector r_1_1, vector r_1_2) {
real ptarget = 0;
int N = end - start + 1;
// initialize linear predictor term
vector[N] mu = Intercept + Xc[start:end] * b;
for (n in 1:N) {
// add more terms to the linear predictor
int nn = n + start - 1;
mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn] + r_1_2[J_1[nn]] * Z_1_2[nn];
}
ptarget += normal_lpdf(Y[start:end] | mu, sigma);
return ptarget;
}
}
data {
int<lower=1> N; // total number of observations
vector[N] Y; // response variable
int<lower=1> K; // number of population-level effects
matrix[N, K] X; // population-level design matrix
// data for group-level effects of ID 1
int<lower=1> N_1; // number of grouping levels
int<lower=1> M_1; // number of coefficients per level
int<lower=1> J_1[N]; // grouping indicator per observation
// group-level predictor values
vector[N] Z_1_1;
vector[N] Z_1_2;
int<lower=1> NC_1; // number of group-level correlations
int prior_only; // should the likelihood be ignored?
}
transformed data {
int Kc = K - 1;
matrix[N, Kc] Xc; // centered version of X without an intercept
vector[Kc] means_X; // column means of X before centering
for (i in 2:K) {
means_X[i - 1] = mean(X[, i]);
Xc[, i - 1] = X[, i] - means_X[i - 1];
}
}
parameters {
vector[Kc] b; // population-level effects
real Intercept; // temporary intercept for centered predictors
real<lower=0> sigma; // residual SD
vector<lower=0>[M_1] sd_1; // group-level standard deviations
matrix[M_1, N_1] z_1; // standardized group-level effects
cholesky_factor_corr[M_1] L_1; // cholesky factor of correlation matrix
}
transformed parameters {
matrix[N_1, M_1] r_1; // actual group-level effects
// using vectors speeds up indexing in loops
vector[N_1] r_1_1;
vector[N_1] r_1_2;
// compute actual group-level effects
r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
r_1_1 = r_1[, 1];
r_1_2 = r_1[, 2];
}
model {
// likelihood including all constants
if (!prior_only) {
int seq[N] = sequence(1, N);
target += reduce_sum(partial_log_lik, seq, 1, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, Z_1_2, r_1_1, r_1_2);
}
// priors including all constants
target += student_t_lpdf(Intercept | 3, 288.7, 59.3);
target += student_t_lpdf(sigma | 3, 0, 59.3)
- 1 * student_t_lccdf(0 | 3, 0, 59.3);
target += student_t_lpdf(sd_1 | 3, 0, 59.3)
- 2 * student_t_lccdf(0 | 3, 0, 59.3);
target += std_normal_lpdf(to_vector(z_1));
target += lkj_corr_cholesky_lpdf(L_1 | 1);
}
generated quantities {
// actual population-level intercept
real b_Intercept = Intercept - dot_product(means_X, b);
// compute group-level correlations
corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector<lower=-1,upper=1>[NC_1] cor_1;
// extract upper diagonal of correlation matrix
for (k in 1:M_1) {
for (j in 1:(k - 1)) {
cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
}
}
}
fit1 <- brm(Reaction ~ Days + (Days | Subject), sleepstudy,
family = gaussian(), threads = 2,
chains = 1, backend = "cmdstanr")
On my Windows machine, threading slows down sampling drastically, which may be because of Windows, the simple model, or my implementation of reduce_sum. @wds15 Would you mind running some benchmarks on the brms implementation so that we can get an understanding of how well we are doing at the moment?
Hmm... I did run a poisson example and I also see unreasonable slow downs. I need to debug this for a moment.
Thank you for looking into it!
Interesting... so I looked at the Poisson example epilepsy:
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
data = epilepsy, family = poisson(),
threads = 1,
chains = 1, backend = "cmdstanr")
and the single-thread runtime is 3.8s while the plain reduce_sum runtime was a whopping 30s. I tried a few things, but what finally made it work was simply introducing a grainsize, which I increased from the fixed value of 1 to 60 or so. Then the performance was OK (4.9s with 1 core / ~3s with 4 cores), but the example is still too small for real benefits (at least I was able to beat the single-thread runtime).
So we need the grainsize... and it needs to be picked by the user.
... looking at this more closely, I think what would really help is to slice not over the data items 1-N, but instead over the patients. To do that, one would probably have to re-sort the data or create some clever index maps - which I am not sure can be done in brms. If it's a possibility, then we should try it. The intuition behind this is that it minimizes unnecessary copies of the random effects (in fact, you would slice over the random effect itself) and also increases the work done per unit, which very likely makes setting grainsize less important. Should this be an option, I would just make it a convention that the random effect with the largest number of levels is picked to slice over (that's my guess, which we should test a bit).
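Purely as an illustration of the idea (this is not something brms generates), a rough sketch of slicing over grouping levels, assuming the data were re-sorted by patient and that obs_start/obs_end are hypothetical index maps giving each patient's block of rows:
functions {
  // The sliced argument is the array of random effects (one entry per patient),
  // so each partial sum handles whole patients and copies little besides r_slice.
  real partial_log_lik_groups(real[] r_slice, int start, int end, int[] y,
                              vector eta, int[] obs_start, int[] obs_end) {
    real ptarget = 0;
    for (g in 1:(end - start + 1)) {
      int gg = g + start - 1;  // global index of the grouping level
      for (n in obs_start[gg]:obs_end[gg]) {
        ptarget += poisson_log_lpmf(y[n] | eta[n] + r_slice[g]);
      }
    }
    return ptarget;
  }
}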
BTW... not for now... but it would be nice to let users choose between reduce_sum and reduce_sum_static. The static version is deterministic, and setting grainsize well is a must for it, while reduce_sum should work OK with a mediocre grainsize and adapt better to more/fewer CPU cores.
Overall this looks super promising!
Thank you for looking into that! So grainsize it is then. How do we choose it appropriately? Or rather, how do we advise users to choose it?
Slicing over random effects is not an option for brms (in my current opinion), because in general we have not only one but multiple (nested or crossed) effects, and we can obviously slice over only one. Add even more structure, such as splines or other special terms, and we won't be able to slice appropriately over any particular kind of parameter vector. And even if we could, we would slice over one but not over the several others. There will be stuff copied over no matter what we do. I understand that this is something where we gain speed for some simple models, but I don't see it being an option for brms, unfortunately.
I am happy to also support reduce_sum_static. Is there any difference other than the grainsize? We also need to think of a user interface that allows everything relevant to be specified intuitively in one place, that is, the number of threads, the grainsize, and the static option. I am thinking of an argument called threading (or so) and a corresponding function which takes and validates all the arguments to fully handle threading.
The random effects of the model usually group the data into larger chunks. When we slice over these, we gain a few things here:
- grainsize should be far less important, as more work is done per unit
Maybe I will try this out on the epilepsy example to have a test drive of it.
In any case: We can add this additional bit later once convinced of the merits of it. What is there already is very useful as it looks to me.
reduce_sum_static will always form the exact same expression graph, such that the same partial sums are always formed and summed together, which makes everything exactly reproducible. The chunk size will always be close to, but never larger than, grainsize.
reduce_sum uses grainsize just as a rough order for the chunk size, and it tries to adapt to the given CPUs and the current load to make the best use of the available resources (but the result is non-deterministic).
Choosing grainsize... the Intel TBB answer: https://software.intel.com/en-us/node/506060 ... practically it's hard for me to give a definite answer... maybe start with (# of data rows) / (2 * maximal # of cores being used)... but don't nail me on that; it's very dependent on your problem & the CPU you use.
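(As a purely illustrative number: with 1000 data rows and at most 4 cores, that heuristic would suggest a grainsize around 1000 / (2 * 4) = 125.)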
Thanks for the additional information! This is really helpful.
With regard to slicing by random effects, I don't doubt its merits (but would still love to see some benchmarks). However, I don't see at this point how I can make it possible without a huge amount of special-case coding that will haunt me for years when having to maintain it. If someone comes up with an idea of how this can be represented without substantially breaking the existing structure of the code generation, I will of course consider it. But just to manage expectations, I think slicing by random effects is relatively unlikely to make it into brms, though perhaps there is an elegant way of doing it.
This is really cool. Thank you both for working on it! I just wanted to chime in with one suggestion about being cautious when advertising this feature:
Basically, unless we have really solid advice on getting the settings right (e.g. grainsize) we should discourage people from using it unless they really have a problem that is computationally too demanding without it. Within chain parallelization is such an alluring feature that people are going to want to use it even when they will spend more time trying to get it right than if they just ran the model once without it. That is, there's a risk that in their effort to save time with within chain parallelization users will actually end up wasting time having to try different settings, some of which make it slower than not using it at all.
So I definitely think it's worth offering this, but I suggest advertising with caution until it basically works out of the box with not much user tweaking required. Just imagine the number of forum posts and github issues that will come in if people get really excited about this but then it requires a lot of handholding to get right.
I definitely agree with being cautious! Even without this feature, we are kind of getting overrun with brms questions already...
I have updated the branch once more to give the user control over the grainsize. Here is an example:
set.seed(1234)
dat <- data.frame(
y = rnbinom(1000, size = 10, mu = 5),
x1 = rnorm(1000),
x2 = rnorm(1000),
g = factor(rep(1:100, each = 10))
)
fit1 <- brm(y ~ s(x1) + s(x2) + (1 | g), dat, family = negbinomial(),
chains = 1, backend = "cmdstanr", threads = NULL,
control = list(adapt_delta = 0.95))
fit2 <- brm(y ~ s(x1) + s(x2) + (1 | g), dat, family = negbinomial(),
chains = 1, backend = "cmdstanr",
threads = threading(4, grainsize = 250),
control = list(adapt_delta = 0.95))
In this case, the speedups are quite OK (130 vs. 67 seconds in my latest run). However, performance gains seem to be super variable, especially across likelihood families (and surely across model complexity, but I haven't tested this extensively yet). For example, when changing negbinomial to poisson in the above example, I barely see any speedups.
I really think we need more systematic numerical experiments, at least of the size of a student project (or did you do something like this already?). Given the relevance of within-chain parallelization for Stan and PPLs in general, this may even become a paper. No matter the details, we definitely need to learn much more about the behavior of reduce_sum, I think.
I agree that we need to set up good defaults in brms along with clear documentation. Given how brms is applying reduce_sum, it seems that grainsize = 1 (our current recommended starting value) is not an ideal choice for many problems. So my thinking is
I agree that a full publication would be great and it's really worth publishing this, but we need a student for that.
How about we (maybe even I) set up a small simulation study... though I am rather out of time and would welcome anyone's help here.
In any case - it is amazing to see this landing in brms!
Aki suggested we might want to ask on twitter or discourse if anyone is interested and has the time to work with us on the experiments.
This is stuff I think I can help with. I've done lots of hierarchical multivariate simulation work. I also happened to just get access to GCE's 224-core instances, so I can explore to pretty large scales. Lemme read back through this thread (only read the last two entries) and get up to speed on what parameters and scenarios you're looking to explore.
Nice! That is great! I think we need to have a call at some point to actually define a proper scope. Right now the discussion is mostly driven by having identified that we need to understand reduce_sum better, but we didn't define what exactly we are planning to do. I will put you on my list of people to invite to the call.
@mike-lawrence great!
This is a really interesting problem. It is pretty difficult to do such tuning for a general problem, but if we limit the space to a still vast but less general space of brms models, I think it's possible to get useful results for users and a publication.
The input (data) is:
The tunable parameters are:
Is it for brms an option to have a routine which does some quick profiling to test on the fly for a good grainsize given some problem?
Other than that, we should keep it simple to get going. It's about getting something reasonable to work for almost any model... not the optimum.
Ideally an advanced user can always provide a little extra code to get far more optimized parallelization.
Can you describe a little more what you mean by quick profiling on the fly?
Basically, to try out a few grainsizes with small numbers of iterations and pick the fastest one. This is probably a bit out of scope for brms and would rather be part of the reduce_sum backend (if a user picks grainsize = -1, for example).
Something like:
Given the iterative nature of MCMC, we could explore that.
It's not quite the same, but in the GPU implementation tuning research field, a quasi-random search guided by good prior knowledge outperforms any "smart" auto-tuning algorithms (based on simulated annealing, etc.) because of the cost of those algorithms.
I agree that the goal is always going to be good performance not optimal.
Thanks! That makes sense. I am not sure how I could put this into brms though. As you point out, it should probably better be part of reduce_sum itself if feasible.
Would it work to simply run with method="optimizing" and a single iteration?
I think that a few optimization iterations would give a good estimate, yes. There is some variance in execution time, so we don't want to put too much weight on a single measurement.
There should not be too much variance in the measurements for problems where this actually makes sense to use. These problems are large and there should not be too much variation.
See the blog post of Sebastian Weber (@wds15): https://statmodeling.stat.columbia.edu/2020/05/05/easy-within-chain-parallelisation-in-stan/