paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0

Within-chain parallelization via "reduce_sum" #892

Closed paul-buerkner closed 3 years ago

paul-buerkner commented 4 years ago

See the blog post of Sebastian Weber (@wds15): https://statmodeling.stat.columbia.edu/2020/05/05/easy-within-chain-parallelisation-in-stan/

jgabry commented 4 years ago

Would it work to simply run with method="optimizing" and a single iteration?

Not with brms (doesn't support optimization), but yeah with Stan in general running a few iterations of optimization could be useful.

paul-buerkner commented 4 years ago

For this purpose, I could theoretically support optimization internally but not make it user facing I guess. Still not sure where this profiling thing should live ideally.

jgabry commented 4 years ago

I think ideally brms wouldn't have to do this and it would be done in the background by reduce_sum().

wds15 commented 4 years ago

I think a better place would be Stan services for such a profiling thing.

Anyway, I think we can make a first version available with just a reasonably large grainsize as default. My expectation is that this should make things work for most data sets and problems... which should be put to a rough test before being released. Thus, we should aim for simple and robust as a start. I am happy to suggest a benchmark for this.

paul-buerkner commented 4 years ago

Some benchmarks would be really nice! Right now, when I try threading in brms with a model with some new structure I can barely predict what will happen and we definitely need to get a better understanding before we advocate or recommend anything.

wds15 commented 4 years ago

I just noticed that you added the parallelisation to GP models... which is great to have there, but I would be curious to see an example. If the Cholesky decomposition is outside of the reduce step, which I would expect, then not a lot of the program runs in parallel. So only when the bulk of the work of a brms program is in the reduce step can we expect good speedups.

paul-buerkner commented 4 years ago

Try out gp(..., k = <number of basis functions>, gr = FALSE) for approximate GPs without grouping. There we make full use of reduce_sum. You need to use the latest github version though as I just optimized the code yesterday.
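
For concreteness, here is a minimal sketch of such a call; the data frame and predictor are made up for illustration, and the settings (k, c, number of threads) are arbitrary choices rather than recommendations:

dat <- data.frame(y = rnorm(300), x = runif(300))

fit_gp <- brm(
  y ~ gp(x, k = 20, gr = FALSE, c = 5/4),  # approximate GP without grouping
  data = dat,
  chains = 1,
  backend = "cmdstanr",
  threads = threading(4)  # grainsize left at its default
)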

More generally, brms now allows reduce_sum for all models that are supportable in theory, that is, all models without residual autocorrelation. For some of them, not much speed up is expected but at least it is possible to specify them that way.

paul-buerkner commented 4 years ago

Here is an example which could be interesting to better understand the improvements through threading. The density of the Conway-Maxwell-Poisson (COM-Poisson) distribution is computationally very expensive and even smallish data sets take ages to fit, which makes it a good candidate for threading. Unfortunately, in the one example I have run so far, speedups are quite small (around 30% with 4 threads), which I find strange as I would have expected much more. However, there is also substantial variability between runs for this family, so my results may not be representative. Here is the code in case anybody is interested in running more thorough experiments for this model:

data(Articles, package = "Rchoice")

# Focusing only on the students with at least one publication
phdpublish <- subset(Articles, art > 0)
phdpublish <- transform(phdpublish, art = art - 1)

# Standardise all non-binary covariates
phdpublish <- cbind(
  phdpublish[, 1:3], 
  scale(phdpublish[, -(1:3)])
)

# model without threading
fit1 <- brm(
  art ~ fem + mar + kid5 + phd + ment,
  data = phdpublish,
  family = brmsfamily("com_poisson"),
  prior = prior(normal(0, 5), class = "b"),
  chains = 1,
  backend = "cmdstanr"
)
summary(fit1)

# model with threading
fit2 <- brm(
  art ~ fem + mar + kid5 + phd + ment,
  data = phdpublish,
  family = brmsfamily("com_poisson"),
  prior = prior(normal(0, 5), class = "b"),
  chains = 1,
  backend = "cmdstanr",
  threads = threading(4, grainsize = 80)
)
summary(fit2)
wds15 commented 4 years ago

I would not go straight to 4 cores, but start with 2. For a fixed seed (which you really need), I am getting these numbers:

It's clearly faster, but not amazing. I don't immediately see where this is coming from... certainly this density is different from the others as it's calculated in a very basic way with lots of ifs and no vectorization.

paul-buerkner commented 4 years ago

Thank you for checking out this example! Indeed the results don't look so amazing, although this model should be an ideal candidate.

So I would think speedups should be really high. @wds do you have an intuition on why this is not the case?

wds15 commented 4 years ago

Maybe there is a function being called which triggers thread synchronization (like the lgamma did). It could be worthwhile to file an issue for this in stan-math... unless @rok-cesnovar or @bbales2 sees anything immediate.

rok-cesnovar commented 4 years ago

Nothing jumps out, sorry. Will try to profile a bit, definitely an interesting case.

wds15 commented 4 years ago

@paul-buerkner What could be the case with this distribution is that performance is extremely memory-bound here. I am speculating that what happens is that a very large AD stack is created, with not too much to compute for each term. Since memory access is vastly slower than CPU calculation, the parallelism does not buy too much. If this is the explanation, then the speedups should improve on machines with larger CPU caches.

paul-buerkner commented 4 years ago

Thank you for the possible explanation. The more I delve into these things and ask you guys questions, the more I realize how difficult the topic actually is. Thank you for helping me on that journey. I think we should really set up a more formal study as this seems to have so many layers that we (and users) should be aware of. Should we schedule a call for this purpose? So far, I have @wds, @rok-cesnovar, @mike-lawrence on my list. Is there anybody I missed who may be interested in being involved in this project at this point?

wds15 commented 4 years ago

A meeting would probably be good... it's just that my availability is quite bad. Basically I could join this in the evenings (not this week, but next week)... let's work this out by email? BTW... it's @wds15 for my github handle.

It would help if you, @paul-buerkner, could define upfront what the goals are here from your perspective.

From my view we need

  1. set up sensible defaults for most stuff in brms
  2. give sensible guidance as to when to use and when NOT to use reduce_sum (point to Amdahl's law for expectations)
  3. give users a quick recipe (possibly in the form of an example) which lets them find out whether their problem works OK with reduce_sum and when it does NOT work well

All of the 3 things should not be too much work to get done. This is intentionally not aiming to give users the best possible performance for their models. Instead users need to understand when to use it and when NOT to use it.

Optionally we should eventually work out a path how users can tune the performance easily (possibly with some stancode extra stuff), but that's probably not for now.

paul-buerkner commented 4 years ago

Sorry for misspelling your github name. The goals you mentioned are mostly what I had thought of as well. I will formulate them in my own words in order of my perceived importance.

  1. (Related to your 2). Understand which (aspects of) models benefit from the use of reduce_sum and which hinder its use. Understand the degree to which this is the case, to give users guidance on which combinations of model aspects make it likely that reduce_sum will be beneficial for their problem (where we need to argue what kind of speedup we see as beneficial). This is of course an open-ended problem, so we need to restrict it somehow to certain general-purpose aspects of models. Still, I think the project would strongly benefit if it tried to make claims beyond brms' implementation, even if we end up using brms for most of the Stan code generation. I agree with your notion that the results should probably be formulated as a YES or NO decision for the user to reduce complexity, but degrees of improvement beyond yes or no could still be very interesting in themselves.
  2. (Related to your 3). Develop an ideally automated diagnostic which can infer whether reduce_sum is likely to be beneficial for the problem, based on what we learn from 1. Since brms knows so much about its own models, we should be able to develop at least a rough automatic diagnostic for this purpose.
  3. (Related to your 1). Set sensible defaults for grainsize and potentially other things related to reduce_sum when the user chooses to apply it.

I am less optimistic than you about how much work this will require. At least, it sounds like a lot of work to me when we want to get things right in a comprehensive enough manner.

wds15 commented 4 years ago

Hmm... I am not yet sure I understand your motivations here. You seem to suggest that we actually need to understand deeply which models will run well with reduce_sum and which ones will not - without having run them. I don't quite see the need for us to be able to tell this upfront. As you see, this topic is really complicated. The basic mechanism behind it is Amdahl's law, which limits us because our program only runs partially in parallel. The crux is that we do not have any way to determine the exact fraction of code that runs in parallel versus serially in a given Stan program (though maybe we can even get at that by running with prior_PD=FALSE/TRUE).
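
As a quick reference for these expectations, Amdahl's law in one line of R (p is the fraction of the work that runs in parallel, n the number of cores; the numbers are only illustrative):

amdahl_speedup <- function(p, n) 1 / ((1 - p) + p / n)
amdahl_speedup(p = 0.9, n = 3)  # 2.5x at best, even with 90% of the work parallelized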

In any case, from a user perspective we need to answer "Does my program benefit from reduce_sum?". The quickest way to find out is to just try it out. That's quick to do and gives you the best answer you can get, since only such a test is specific to your model, your data set, your CPU, your OS, your compiler. These are way too many things for us to figure out in terms of how they interact, and a quick test drive will tell you. So related to your point 2, I would just run a benchmark - ideally as fast as possible (based on subsets of the data).

The important thing is that we set up this first test drive, which answers the question, in a robust way. It must not depend on a wrongly chosen grainsize, for example.

The other thing we have to make sure is to manage expectations from users. Doubling the speed with 3 cores is already a good thing; getting more is very much dependent on details.

So an empirical answer to the speedup question is the way to go from my view, which should be manageable.

jgabry commented 4 years ago

In any case, from a user perspective we need to answer "Does my program benefit from reduce_sum?". The quickest way to find out about that is to just try it out.

If there's no way to tell ahead of time, then doesn't this result in a lot of users just wasting their time? Unless we think the number of cases where it's beneficial is definitely greater than the number of cases where it isn't, in which case we could maybe justify it by saying that on average it's worth trying.

Or am I misunderstanding? Do you mean that they could just run their model for a few iterations to find out? If so that seems much easier to justify recommending.

wds15 commented 4 years ago

@jgabry We need to caution users that reduce_sum is only intended for big models. I also mean to just run the model for a few iterations with and without threading to see if they get anywhere useful (doing that with brms is simple given how easily brms generates the model code). We should probably investigate what a sufficient number of iterations is for this, and whether we can also sub-sample to smaller data sets. In any case, I do not know what the "average" model in brms is, but I would assume that many models run by users should not use reduce_sum, because the data sets are too small and it's not worth the trouble... still, for those cases where you really need reduce_sum and you have the resources, it's a killer feature.
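
A rough sketch of such a quick test drive, reusing the COM-Poisson example from above; the iteration counts, seed, and grainsize are arbitrary choices, and the timings include model compilation, so they only give a coarse impression:

run_once <- function(threads = NULL) {
  system.time(
    brm(art ~ fem + mar + kid5 + phd + ment,
        data = phdpublish, family = brmsfamily("com_poisson"),
        prior = prior(normal(0, 5), class = "b"),
        chains = 1, iter = 150, warmup = 100, seed = 1234,
        backend = "cmdstanr", threads = threads)
  )[["elapsed"]]
}
t_serial   <- run_once()
t_threaded <- run_once(threading(2, grainsize = 80))
t_serial / t_threaded  # values clearly above 1 indicate a benefit from threading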

@paul-buerkner I was suggesting to use one of the random effects as the slicing argument (ideally the fastest-running random effect, or even all random effects lumped together). I do understand why this would not be an ideal fit for your code generation pipeline, but I wanted to test whether it's worth the trouble... and it looks like it really is. Below is a very simple example demonstrating the utility of doing so. It's rather simple-minded and coded in a brute-force way, but you should get the gist of it and we can hopefully make it more generic. The non-optimized model gives me a 1.07x speedup on 3 cores (basically nothing) while the optimized version runs 2.4x faster on 3 cores. Either brms can generate this structure directly... or we make it easy for users to supply model-specific partial sum functions.

This little example shows - unfortunately - how memory bound the performance is... at least for the normal lpdf case. Things may look different for a Poisson regression, which I would try next.

normal_threads-opt_stan.txt bench-normal_R.txt

jgabry commented 4 years ago

still, for those cases where you really need reduce_sum and you have the resources, it's a killer feature.

Yeah definitely!

paul-buerkner commented 4 years ago

Thank you @wds15 for your comments. The points you raise are precisely the reason why I wanted to schedule a call with you and the others who expressed interest in investigating reduce_sum more closely. I have not gotten a response from you, but perhaps I sent the invitation to the wrong email address? Would you mind checking whether you have received my email? Or send me an email from the address where I can reach you?

With regard to building something that predicts how well reduce_sum will work, I see that the user simply trying it out in some quick way is an easy and probably preferable way to do it. But if we want to go beyond that (independent of a particular brms user applying reduce_sum), learning more about the underlying aspects and how they play with each other would be very worthwhile, don't you think? (hence the call to clarify the sensibility and scope of this idea)

With regard to the random effects slicing: of course I see your point, but I am worried that the results of the simple models you try things out with won't carry over to more complex cases, e.g., with multiple grouping factors etc. So overly simple examples in an (as you say) very complex environment are perhaps not the best way to determine the merits of certain aspects of reduce_sum in models where it really matters (or perhaps they are, I don't know). To better understand this, I think we really need something a bit more systematic, hence my idea to discuss the potential of a formal project investigating this.

wds15 commented 4 years ago

The model with just one random effect probably over-simplifies things in many circumstances, yes... but the example is designed to hit exactly the memory issues we see with reduce_sum when using random effects as shared arguments. The problem with doing so is that every partial sum evaluation ends up copying all of the random effects, which is very wasteful as you only need a subset of the random effects for a given partial sum. The normal likelihood is very cheap to evaluate, so the memory overhead becomes visible quickly.
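
One way to see this copying is to inspect the partial sum function that brms generates; the random-effects vectors appear as additional shared arguments which are passed along with every partial-sum call. A sketch with a toy formula and data, assuming make_stancode() accepts a threads argument in the version you are using:

library(brms)
toy <- data.frame(y = rnorm(100), x = rnorm(100), g = rep(1:10, each = 10))
scode <- make_stancode(
  y ~ x + (1 | g), data = toy, family = gaussian(),
  threads = threading(2, grainsize = 10)
)
scode  # look at the generated partial sum function and its argument list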

Now, turning to a Poisson likelihood instead, I would expect things to change. A Poisson likelihood must evaluate a log-gamma call, which is very expensive, so this should offset any memory overhead due to wasteful copying. Running a problem with 10^4 rows and 100 groups (and a standard log link) gives me these timings (units are seconds):

> ## running with grainsize=10*N=10^5
> time1["elapsed"]
elapsed 
145.091 
> ## running with 3 cores, grainsize=200
> time2["elapsed"]
elapsed 
 57.095 
> time1["elapsed"]/time2["elapsed"]
 elapsed 
2.541221 
> ## running optimized version, grainsize=1
> time3["elapsed"]
elapsed 
 47.867 
> time1["elapsed"]/time3["elapsed"]
 elapsed 
3.031128 
> 

Thus, the inefficient copying of random effects is still hurting us here, but not as much anymore, since the calculation itself takes some time. To put this into perspective: we gain 2.5x on 3 cores with an out-of-the-box brms program. That is fantastic!

More random effects (with many levels) will increase the memory copying burden more, of course. All in all these results are very promising to me.

bench-poisson_R.txt poisson_threads-opt_stan.txt

wds15 commented 4 years ago

Thinking about the problem of copying all random effects for each partial sum, I figured that we should evaluate how increasing the number of chunks slows down the program when run on a single core. For the Poisson model this gives the expected result, in that the brms auto-generated code (with copying) slows down significantly when increasing the number of chunks:

[plot: poisson_brms_slowdown]

The normal multilevel model behaves a bit oddly at 2 chunks, where it slows down a lot, but otherwise everything is as before:

[plot: normal_brms_slowdown]

From this perspective we should choose a grainsize of around 1/10th of the number of rows (more chunks allow for more granular load balancing). I assume this is problem specific, but a graph like this will help a lot in choosing the grainsize. This will minimize the additional overhead due to chunking when using random effects as part of the shared arguments.

To produce the above plots, I used reduce_sum_static in order to have well-defined partial sum sizes.
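
A small sketch of that rule of thumb applied to the COM-Poisson example from above; the static argument of threading() is assumed to map to reduce_sum_static and can be dropped if your brms version does not have it:

grainsize <- ceiling(nrow(phdpublish) / 10)  # roughly 10 chunks
fit_chunked <- brm(
  art ~ fem + mar + kid5 + phd + ment,
  data = phdpublish, family = brmsfamily("com_poisson"),
  prior = prior(normal(0, 5), class = "b"),
  chains = 1, backend = "cmdstanr",
  threads = threading(2, grainsize = grainsize, static = TRUE)
)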

paul-buerkner commented 4 years ago

Thanks! These results are very helpful!

jgabry commented 4 years ago

Thanks! These results are very helpful!

Indeed, thanks Sebastian!

wds15 commented 4 years ago

I am glad it helps.

I was still bothered by the COM-Poisson model, which I wanted to understand better. First, providing a non-normalized version of this density in brms would make sampling incredibly faster (just a few seconds of total sampling time)! There is no need for any parallelization in the example if you avoid calculating the normalizing constant Z.

Anyway, what turned out to make this model scale well is rewriting the calculation of the normalizing constant. The original code calculates the normalizing constant by summing terms until new terms become small enough. Moreover, the log factorial was calculated as a log sum (term by term). I now fixed the number of terms to 20 and used the log-gamma function to calculate the factorial. Now the timings are much better (for 250 warmup / 250 sampling iterations):

> time1["elapsed"]
elapsed 
 33.532 
> ## running with 3 cores, grainsize=N/8=80
> time2["elapsed"]
elapsed 
  9.197 
> time1["elapsed"]/time2["elapsed"]
 elapsed 
3.645972 
> 

I think we see more than 3x speedup on 3 cores due to a better fit into the innermost CPU caches.

To me this shows that deeply nested AD tapes inside each chunk are not good for performance. After all, we do a nested AD sweep with each chunk evaluation (but I am still surprised to see such a slowdown; hopefully this can be improved in stan-math).

New log_Z function (if 20 terms are enough... I don't know!!!):

  // log normalizing constant of the COM Poisson distribution
  // implementation inspired by code of Ben Goodrich
  // Args:
  //   log_mu: log location parameter
  //   shape: positive shape parameter
  real log_Z_com_poisson(real log_mu, real nu) {
    int num_terms = 20;
    vector[num_terms] log_Z;
    //real lfac = 0;
    //real term;
    //int M;
    //real log_thres;
    if (nu == 1) {
      return exp(log_mu);
    }
    // nu == 0 or Inf will fail in this parameterization
    if (nu <= 0) {
      reject("nu must be positive");
    }
    if (nu == positive_infinity()) {
      reject("nu must be finite");
    }
    /*
    if (log_mu * nu >= log(1.5) && log_mu >= log(1.5)) {
      return log_Z_com_poisson_approx(log_mu, nu);
    }
    */
    // direct computation of the truncated series
    /*
    M = 10000;
    log_thres = log(1e-16);
    // check if the Mth term of the series is small enough
    if (nu * (M * log_mu - lgamma(M + 1)) > log_thres) {
      reject("nu is too close to zero.");
    }
    log_Z = log1p_exp(nu * log_mu);  // first 2 terms of the series
    lfac = 0;
    term = 0;
    k = 2;
    while (term > log_thres) { 
      lfac += log(k);
      term = nu * (k * log_mu - lfac);
      log_Z = log_sum_exp(log_Z, term);
      k += 1;
    }
    */
    log_Z[1] = 0.0;
    log_Z[2] = nu * log_mu;
    for(k in 2:(num_terms-1)) {
      //lfac += log(1.0*k);
      //log_Z[k+1] = nu * (k * log_mu - lfac);
      log_Z[k+1] = nu * (k * log_mu - lgamma(k+1));
    }
    return log_sum_exp(log_Z);
  }
paul-buerkner commented 4 years ago

Thank you for looking into it and for the interesting insights about the relevance of the AD tape.

Unfortunately, we cannot just provide an unnormalized density of the COM-Poisson distribution and hope something correct will come out, as the normalizing constant depends on the parameters. The normalizing constant is the single main problem that plagues this otherwise really nice distribution. Also, just summing the first 20 terms will not work if the mean/mode of the distribution is large enough (e.g., > 20), because then many terms beyond the 20th still make relevant contributions.
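
A quick numeric illustration of that point, using the series terms nu * (k * log_mu - lgamma(k + 1)) from the code above with hypothetical parameter values:

log_mu <- log(30); nu <- 1.2   # a mode of roughly 30
k <- 0:60
log_terms <- nu * (k * log_mu - lgamma(k + 1))
k[which.max(log_terms)]        # about 30, so a 20-term truncation misses the largest terms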

That said, I don't claim that the current COM-Poisson implementation is anywhere near optimal (I keep it semi-internal for a reason, after all). Improving the implementation, and ideally providing a mean rather than a mode parameterization, is a whole project on its own that I would love to tackle at some point, but it may be too much to address at the same time as reduce_sum (other than learning from the COM-Poisson distribution what is difficult for reduce_sum).

wds15 commented 4 years ago

Oh dear... must have been late yesterday. Of course, normalization constants with parameters need to stay.

Anyway, I played a bit with the implementation and now things look much better to me. For 50 warmup / 50 iterations I get these timings:

> ## original brms, 1 core, running with grainsize=10*N=6400
> time1["elapsed"]
elapsed 
103.215 
> ## tuned version, 1 core, running with grainsize=10*N=6400
> time1c["elapsed"]
elapsed 
 37.055 
> ## running with 3 cores, grainsize=1
> time2["elapsed"]
elapsed 
 17.903 
> time1c["elapsed"]/time2["elapsed"]
 elapsed 
2.069765 

So on 3 cores we get a doubling in speed - that's fair.

This is the Stan function implementing the calculation of the normalizing constant:

  // log normalizing constant of the COM Poisson distribution
  // implementation inspired by code of Ben Goodrich
  // Args:
  //   log_mu: log location parameter
  //   shape: positive shape parameter
  real log_Z_com_poisson(real log_mu, real nu) {
    real log_Z;
    //real lfac;
    //real term;
    int k = 2;
    int M = 10000;
    int converged = 0;
    int num_terms = 50;
    //real log_thres = log(1e-16); approx -36
    if (nu == 1) {
      return exp(log_mu);
    }
    // nu == 0 or Inf will fail in this parameterization
    if (nu <= 0) {
      reject("nu must be positive");
    }
    if (nu == positive_infinity()) {
      reject("nu must be finite");
    }
    if (log_mu * nu >= log(1.5) && log_mu >= log(1.5)) {
      return log_Z_com_poisson_approx(log_mu, nu);
    }
    // direct computation of the truncated series
    // check if the Mth term of the series is small enough
    if (nu * (M * log_mu - lgamma(M + 1)) > -36.0) {
      reject("nu is too close to zero.");
    }
    log_Z = log1p_exp(nu * log_mu);  // first 2 terms of the series
    //lfac = 0;
    //term = 0;
    //k = 2;
    while (converged == 0) {
      vector[num_terms+1] log_Z_terms;
      int i = 1;
      log_Z_terms[1] = log_Z;
      while(i <= num_terms) {
        log_Z_terms[i+1] = nu * (k * log_mu - lgamma(k + 1));
        k += 1;
        if ( log_Z_terms[i+1] <= -36.0 ) {
          converged = 1;
          break;
        }
        i += 1;
      }
      log_Z = log_sum_exp(log_Z_terms[1:i]);
    }
    return log_Z;
  }
paul-buerkner commented 4 years ago

Nice! Thank you so much! Can you explain what you changed apart from using the lgamma function instead of incrementing log values?

To clarify, I see that computing log_Z has been split into nested while loops where the inner loop uses at most 50 terms. What is the advantage this creates?

wds15 commented 4 years ago

Doing so takes advantage of the vectorized log_sum_exp. In the version before, each term was added one by one; now it happens in chunks of 50 terms, so we benefit from the analytic gradients of log_sum_exp in its vectorized form. Using 50 terms was a guess; it depends on how many terms get added up most of the time, but it seems to work just fine.
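
For intuition, a small base-R check (not Stan) that evaluating the series in chunks of log-sum-exp calls gives the same value as a single pass over all terms; the speed gain in Stan then comes from the analytic gradient of the vectorized log_sum_exp rather than from the arithmetic itself:

log_sum_exp <- function(x) { m <- max(x); m + log(sum(exp(x - m))) }
log_terms <- -(0:200) * 0.3   # some decaying series on the log scale
chunks <- split(log_terms, ceiling(seq_along(log_terms) / 50))
all.equal(log_sum_exp(log_terms),
          log_sum_exp(vapply(chunks, log_sum_exp, numeric(1))))  # TRUE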

paul-buerkner commented 4 years ago

I have updated the COM poisson implementation in brms according to your suggestions. Thank you again :-)

wds15 commented 4 years ago

There is an off-by-one issue in the function I wrote earlier which drops the last term of the series (which is anyway zero); see the corrected version below.

... but now, there is a far bigger problem which we should try to catch for users: The data-set used in your COM-Poisson example is sorted by outcome:

> phdpublish$art
  [1]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [26]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [51]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 [76]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[101]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[126]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[151]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[176]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[201]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
[226]  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1
[251]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[276]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[301]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[326]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[351]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[376]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
[401]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2
[426]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
[451]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
[476]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
[501]  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
[526]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
[551]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
[576]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
[601]  4  4  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  6
[626]  6  6  6  6  6  6  7  8  8  9 10 11 11 15 18
> 

thus, the computational load per partial sum is vastly different - and even worse, the computational "heaviness" of the partial sums is small for the first ones and gets increasingly larger towards the very few last rows. At least, this is what I think is happening. When I now do

set.seed(56456)
N <- nrow(phdpublish)
phdpublish_unsorted <- phdpublish[sample.int(N), ]

then things improve more:

> ## original brms, 1 core, running with grainsize=10*N=6400
> time1["elapsed"]
elapsed 
103.215 
> ## tuned version, 1 core, running with grainsize=10*N=6400
> time1c["elapsed"]
elapsed 
  66.96 
> ## running with 3 cores, grainsize=50 (static)
> time2["elapsed"]
elapsed 
  26.64 
> time1c["elapsed"]/time2["elapsed"]
 elapsed 
2.513514 
> 

It's no surprise that things do not work well if the computational load is not even across the data-set.

Here is the slightly improved COM-Poisson:

  // log normalizing constant of the COM Poisson distribution
  // implementation inspired by code of Ben Goodrich
  // Args:
  //   log_mu: log location parameter
  //   shape: positive shape parameter
  real log_Z_com_poisson(real log_mu, real nu) {
    real log_Z;
    //real lfac;
    //real term;
    int k = 2;
    int M = 10000;
    int num_terms = 50;
    //real log_thres = log(1e-16); approx -36
    if (nu == 1) {
      return exp(log_mu);
    }
    // nu == 0 or Inf will fail in this parameterization
    if (nu <= 0) {
      reject("nu must be positive");
    }
    if (nu == positive_infinity()) {
      reject("nu must be finite");
    }
    if (log_mu * nu >= log(1.5) && log_mu >= log(1.5)) {
      return log_Z_com_poisson_approx(log_mu, nu);
    }
    // direct computation of the truncated series
    // check if the Mth term of the series is small enough
    if (nu * (M * log_mu - lgamma(M + 1)) > -36.0) {
      reject("nu is too close to zero.");
    }
    log_Z = log1p_exp(nu * log_mu);  // first 2 terms of the series
    //lfac = 0;
    //term = 0;
    //k = 2;
    while (1) {
      vector[num_terms+1] log_Z_terms;
      int i = 1;
      log_Z_terms[1] = log_Z;
      while(i <= num_terms) {
        log_Z_terms[i+1] = nu * (k * log_mu - lgamma(k + 1));
        k += 1;
        if ( log_Z_terms[i+1] <= -36.0 ) {
          // series is converged
          return log_sum_exp(log_Z_terms[1:(i+1)]);
        }
        i += 1;
      }
      log_Z = log_sum_exp(log_Z_terms);
    }
    return negative_infinity();
  }
paul-buerkner commented 4 years ago

Thank you for the call today. Here is a list of what needs to be done before this feature can be merged (please make me aware of anything I forgot):

@wds15 Changing the grainsize without recompiling the model does not yet work, because updating brms models with the cmdstanr backend does not yet work (but will hopefully work before I merge this PR). So, for now, it would be OK to show the toy examples with, say, 4 or 5 different grainsizes and just recompile the models. Would that be OK for you as well?

The grainsize can be controlled via the threading function as shown in the example below:

# this model just serves as an illustration
# threading may not actually speed things up here
fit <- brm(count ~ zAge + zBase * Trt + (1|patient),
           data = epilepsy, family = negbinomial(),
           chains = 1, threads = threading(2, grainsize = 100),
           backend = "cmdstanr")
summary(fit)
jgabry commented 4 years ago

Allow to update cmdstanr models without recompilation (currently does not work as it should)

Can you say more about this? What kind of update? Is this something we need to change in cmdstanr or is it related to how brms is interfacing with cmdstanr? I can help with this if necessary.

wds15 commented 4 years ago

@paul-buerkner I can work around the recompilation issue either by going brute force (doing many compiles and then just subtracting the compilation time) or by being clever about it and converting things directly to cmdstanr style (brms spits out the Stan code and the Stan data). The brute-force way is cleaner on the brms side, while the second approach will work with what you have. Any preference?

paul-buerkner commented 4 years ago

@jgabry I think it is just a brms issue. I will let you know once I fixed the parts on the brms side. Let's see if there is still some stuff remaining afterwards.

@wds15 I prefer you using the brute-force way for now. I will convert the code to the new version once I fixed the update problem.

paul-buerkner commented 4 years ago

Updating threads and grainsize is now possible without recompilation. Here is an example:

# this model just serves as an illustration
# threading may not actually speed things up here
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
           data = epilepsy, family = negbinomial(),
           chains = 1, threads = threading(2, grainsize = 100),
           backend = "cmdstanr")

fit2 <- update(fit1, threads = threading(2, grainsize = 50))
jgabry commented 4 years ago

Nice!

paul-buerkner commented 4 years ago

Unit tests are now in place as well. After adding some illustrative examples and deciding on the default grainsize, the reduce_sum branch is ready for review/merging.

wds15 commented 4 years ago

Cool beans! I will try to hurry up with my bits.

paul-buerkner commented 4 years ago

No worries, I will not be able to work much next week anyway.

wds15 commented 3 years ago

Below is a small code snippet which users can run to evaluate their own models. Maybe it is possible to include such code in a brms example section?

The notebook can be rendered as a spin file. This creates multiple versions of the report. Parameters which can be varied are the sample size N, the number of groups G, and the likelihood (normal or Poisson).

The report creates in essence two plots: one demonstrating the slowdown due to increasing the chunking, the other demonstrating the change in execution time at multiple grainsizes (2x default, default, default/2). The report runs the benchmarks at 25 and 50 iterations each. The point is that the number of iterations needs to be large enough for a stable estimate, such that doubling the number of iterations should result in similar curves for a given problem - otherwise users need to increase it further.

The benchmark_threading function is relatively generic and should be useful to many other users. One point that could be improved in the function is to read out ESS/s for lp__ from the cmdstan output; right now I use the total runtime as the metric.
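
A stripped-down sketch of that kind of benchmark, reusing the epilepsy fit from above and total runtime as the metric; benchmark_threading itself lives in the attached notebook, and the iteration count and candidate grainsizes below are arbitrary:

time_grainsize <- function(fit, grainsize, cores = 2, iter = 50) {
  system.time(
    update(fit, iter = iter, seed = 1234,
           threads = threading(cores, grainsize = grainsize))
  )[["elapsed"]]
}
sapply(c(25, 50, 100), function(gs) time_grainsize(fit1, gs))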

The default grainsize of number of rows / 2 * number of physical cores seems to work ok from what I see so far.

BTW... at first I was mainly looking at the speedup vs. cores plots... but then I noticed that it's a lot better to look at the absolute runtime, since the runtime at 1 core can go up with smaller grainsizes, which makes high relative speedups easier to achieve even though the total wallclock time is larger.

Have a look at the code and let me know what you think. I did follow the tidyverse approach, which I hope fits into brms.

evaluate-tuning-20200908.zip

paul-buerkner commented 3 years ago

Thank you so much @wds15! I am still on vacation but will take a more detailed look in a week or so. With regard to making this a brms example, I prefer adding as few dependencies to the package as possible just for this example. How easy would it be to strip the code of its dependency on dplyr and tidyr? The other dependencies are either part of brms already, such as ggplot2, or can easily be replaced, such as the dependency on mvtnorm.

Just one minor clarification: the default rule should be #rows / (2 * #cores), right? Above, you didn't specify brackets, but from the context I assume that was a typo.

wds15 commented 3 years ago

Oh... I would have expected dplyr and tidyr to be already in your list of deps... but if not, then of course - let's take them out. This will make the code look quite a bit different, but fine for me. So I should use *apply stuff, I suppose.

I meant #rows / (2 * #cores).. sure. cores is the number of physical cores available. Maybe we should even put there

max(100, #rows / (2 * #cores))

to always have at least a grainsize of 100.
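
As a one-line helper, a sketch of the proposed default (n_cores being the number of physical cores):

default_grainsize <- function(n_rows, n_cores) {
  max(100, ceiling(n_rows / (2 * n_cores)))
}
default_grainsize(nrow(epilepsy), n_cores = 2)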

paul-buerkner commented 3 years ago

Thanks for the clarifications and for offering to adjust the code. I agree the tidyverse makes it prettier, but for an example in the brms docs it may just have too many new dependencies.

wds15 commented 3 years ago

Here is a version which does not require anything other than ggplot2. It's a bit more verbose to code up. Maybe this is better suited for a demo as part of brms? Just a thought.

evaluate-tuning-base-20200914.zip

paul-buerkner commented 3 years ago

Thank you! It looks like this should almost be a small vignette for brms, not an example or demo. Also, at the moment, it does not seem to be self-contained as it uses a "params" variable that seems to come out of your folder structure(?). What I am wondering is whether we need this detailed vignette/demo already now, or whether it makes more sense to add it later once threading in brms is less of an experimental feature. If we decide to add it later, the reduce_sum feature would be ready to merge now as everything else is in place.

wds15 commented 3 years ago

"params" is set to some default values when you run this script without anything. So it should run out of the box.

I am inclined to say... "release right away" as experimental, but then I fear we are overwhelmed with Qs from users.

How about I turn this into a mini-vignette for a still experimental feature "reduce_sum"? Some doc in form of a vignette should be a lot better than none. For the sake of simplicity I would reduce to just one likelihood presumably.

So unless you want to release a new brms right away, I can make an attempt in the next few days to prepare such a mini vignette.

I am definitely fine with releasing without this, but most material is already there such that it's not too much effort now to have a mini vignette (unless I overlook something here).

paul-buerkner commented 3 years ago

I would prefer releasing with vignette. Thank you!

wds15 commented 3 years ago

Great. Let's do that... it's probably easier if I fork brms and then make a PR against your repo (or you grant me direct access to this repo... whatever you prefer).