Great to see this here.
It will likely take a moment until it lands in brms given that RStan is still at 2.19. Should we prepare, in the meantime, a short vignette showing with an example how the Stan code from the current brms needs to be modified in order to run with CmdStan? Is that an option? Or maybe a wiki page as an alternative?
Would issue #891 help with this? :-)
Obviously yes.
Did the tutorial/wiki that @wds15 mentioned ever get created?
Inspired by @wds15's talk at StanCon, I started thinking about reduce_sum in brms a little more and I would like your (read: anybody's) input.
1) Over what variable should the automatic slicing be done? In a standard brms model, we could slice over y (the response variable) or over mu (the mean predictor term). A standard likelihood of brms could look as follows:
vector[N] y;
vector[N] mu;
real sigma;
target += normal_lpdf(y | mu, sigma);
As far as I understand, slicing over mu would be computationally preferable, as y can be copied by reference even if passed to reduce_sum via an additional argument, while mu would be passed by value if it were an additional argument. Does anybody know how big an influence this decision has?
2) I know it is preferable to parallelize as much as possible, but the computation of the predictor term mu may involve so many different variables that the housekeeping required to stuff them all into reduce_sum would be too much for brms to handle, I think. Given that I currently plan to use reduce_sum only over the likelihood call, with mu being computed outside of it in a serial way, would reduce_sum still be worthwhile? Does there exist some benchmark data for this case?
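To make the two slicing options from point 1 concrete, here is a minimal hypothetical sketch (the function names are made up; reduce_sum requires the sliced argument to be an array, so the sliced quantity is written as real[]):
functions {
  // Option A: slice over mu (passed as real[]); y stays a shared data argument
  real partial_slice_mu(real[] mu_slice, int start, int end, vector y, real sigma) {
    return normal_lpdf(y[start:end] | to_vector(mu_slice), sigma);
  }
  // Option B: slice over y; mu is a shared argument and gets copied (it holds parameters)
  real partial_slice_y(real[] y_slice, int start, int end, vector mu, real sigma) {
    return normal_lpdf(to_vector(y_slice) | mu[start:end], sigma);
  }
}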
Yep, slicing over mu rather than over y is much better.
It would still be worthwhile if the computational cost of the likelihood is very large compared to computing mu serially (likelihoods with gamma function calls like the Poisson should work OK, and similar ones... but probably not a normal lpdf). This is probably a good starting point. Still, is it thinkable to write the entire model block out as a function and just pass everything into it?
I can't offer any benchmarks on this and would suggest just finding out... I am looking forward to these results (read: results someone else worked out). It's all about Amdahl's law, as I explained...
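(For reference, Amdahl's law bounds the benefit: if a fraction $p$ of the work per gradient evaluation is parallelized over $s$ cores, the attainable speedup is at most $1 / ((1 - p) + p/s)$, so any part computed serially, such as mu here, caps what parallelizing only the likelihood can gain.)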
Thanks! Let's aim high and parallelize the whole model block. Take, for example, the model block of a simple varying intercept-slope model, with priors removed (we don't want to include priors in reduce_sum, do we?):
model {
// initialize linear predictor term
vector[N] mu = Intercept + Xc * b;
for (n in 1:N) {
// add more terms to the linear predictor
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
}
// likelihood including all constants
if (!prior_only) {
target += normal_lpdf(Y | mu, sigma);
}
}
Here is how it would look:
functions {
real partial_log_lik(real[] Y, int start, int end, real Intercept,
matrix fXc, vector b, int[] fJ_1, vector fZ_1_1,
vector fZ_1_2, vector r_1_1, vector r_1_2,
real sigma) {
int N = end - start + 1;
matrix[N, cols(fXc)] Xc = fXc[start:end];
int J_1[N] = fJ_1[start:end];
vector[N] Z_1_1 = fZ_1_1[start:end];
vector[N] Z_1_2 = fZ_1_2[start:end];
real ptarget = 0;
// initialize linear predictor term
vector[N] mu = Intercept + Xc * b;
for (n in 1:N) {
// add more terms to the linear predictor
mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
}
ptarget += normal_lpdf(Y | mu, sigma);
return ptarget;
}
}
model {
// likelihood including all constants
if (!prior_only) {
target += reduce_sum(partial_log_lik, Y, 1, ...)
}
}
where ... are all the additional arguments passed to partial_log_lik.
I do a lot of housekeeping at the start of partial_log_lik, where I subset the original vectors etc. so that they become the partial vectors we need. The full vectors to be subsetted get the prefix f so that the subsetted vectors have the original names. That way, only the definition header of partial_log_lik differs from the original model block's code, which should make the required changes to brms' Stan code generation much less cumbersome.
Above, I use Y as the first argument not because it is necessarily the best thing to reduce over, but because it is the only thing that is always available no matter what the model looks like.
@wds15 What do you think of this approach? Am I missing something important in the spec?
This starts to look very good! My thoughts on this:
matrix[N, cols(fXc)] Xc = fXc[start:end];
is very expensive for the AD tape. Xc will be instantiated as a parameter and add significantly to the AD cost. To avoid this, you would have to avoid declaring Xc and instead write the expression directly... so replace
vector[N] mu = Intercept + Xc * b;
with
vector[N] mu = Intercept + fXc[start:end] * b;
I will write a brms branch where I support just a small subset of brms models in the way described above. That should take just a few hours but would enable us to rapidly test a lot of cases (instead of doing all the hand-writing).
With regard to Xc, can you explain why this specification is a problem for the AD tape while the same for, say, Z_1_1 is not? Also, your suggestion would kill the principle of leaving the fundamental model block unchanged and just adding a definition header, so I would like to avoid it if possible.
The Z_* already come in as parameters, and it looks to me as if doing this re-indexing (basically that is what it is) makes it easier for you to work within brms' code generation... but any real container (real, vector, matrix) in a user function is always instantiated as var (even if you assign data to it). At that point, Xc is written to the AD tape. If you instead do what I suggested, then Xc is never written to the AD tape, since you use it directly in a matrix-vector multiplication. Doing so keeps the matrix off the AD tape and uses a heavily optimised matrix-vector product where only terms of the size of the vector end up on the AD tape.
However, if it is really hard to integrate that in your code gen, then just keep it as is for now (there is hope that the Stan compiler gets smarter and the problem goes away automatically).
I see. So, in brms, Z_* is data as well, so I suspect the same applies to those variables too, right? That is, ideally don't redefine them but use them directly. The problem with this is that Z_1_1[start:end] would be repeated in every evaluation of the loop, which is probably not what we want either...
I will go ahead and do the header-only-change for now, even if inefficient in some cases so that we have a version to try out.
Oh... if that's also data... then yes, avoid the temporary re-assignments if you can.
So instead of
Z_1_1[n]
use
fZ_1_1[start + n - 1]
Then you keep this thing as data.
This is going to be a code-generation nightmare :-D Let's see if I can find a principled way to enable this indexing on the fly.
I usually write partial sum functions with a loop over the slice, but with two indices: one spanning n = 1...N, for N being the slice size, and another for the global context, n_overall = n + start - 1. With these two defined, it's often easy to write down what you want... but I understand that this may be harder for your brms code-gen things.
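For illustration, here is a minimal sketch of that two-index pattern (the names are hypothetical; the full-length data arrays are indexed with the global index nn, so no temporary copies are made):
functions {
  real partial_log_lik(int[] seq, int start, int end, vector Y,
                       int[] J, vector Z, vector r, real sigma) {
    int N = end - start + 1;  // size of this slice
    vector[N] mu;
    for (n in 1:N) {
      int nn = n + start - 1;  // global index into the full-length data
      mu[n] = r[J[nn]] * Z[nn];
    }
    return normal_lpdf(Y[start:end] | mu, sigma);
  }
}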
Thanks. I will see what I can make possible in a first step.
Maybe start by taking care of Xc... which is of the order of N * number of regressors... but Z is just of size N.
Yeah, but in order to make that scale to all the complexity of brms models, we should consistently use one or the other approach all the way through and not mix things up. I will play around with it and report back once there is something that works (ish).
Ok, one follow-up question, given that the header-change approach would be so much simpler: Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions?
I don't know... with some luck this is even already available with the recently released stanc3 optimisations, which can be turned on. I have not been following this closely enough to know whether it is there or still in the making.
I would have hoped that a simple text-replacement approach would work, to start with at least.
Do you know of any plans to make the Stan compiler smarter in defining variables as data (when appropriate) in user-defined functions?
This is in 2.24 already. If you specify --O (not --o) it will optimize this case. Optimizations are currently experimental, but at least these optimizations work as they should (at least as far as we tested them, of course).
So
functions {
real foo(data matrix X, vector paramv) {
matrix[2,2] Xs = X[1:2];
return sum( Xs * paramv);
}
}
without optimization:
template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
using local_scalar_t__ = stan::promote_args_t<T1__>;
const static bool propto__ = true;
(void) propto__;
local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
(void) DUMMY_VAR__; // suppress unused var warning
try {
Eigen::Matrix<local_scalar_t__, -1, -1> Xs;
Xs = Eigen::Matrix<local_scalar_t__, -1, -1>(2, 2);
stan::math::fill(Xs, DUMMY_VAR__);
current_statement__ = 1;
assign(Xs, nil_index_list(),
rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
"assigning variable Xs");
current_statement__ = 2;
return sum(multiply(Xs, paramv));
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
}
with optimization:
template <typename T1__>
stan::promote_args_t<T1__>
foo(const Eigen::Matrix<double, -1, -1>& X,
const Eigen::Matrix<T1__, -1, 1>& paramv, std::ostream* pstream__) {
using local_scalar_t__ = stan::promote_args_t<T1__>;
const static bool propto__ = true;
(void) propto__;
local_scalar_t__ DUMMY_VAR__(std::numeric_limits<double>::quiet_NaN());
(void) DUMMY_VAR__; // suppress unused var warning
try {
Eigen::Matrix<double, -1, -1> lcm_sym2__;
double lcm_sym1__;
{
Eigen::Matrix<double, -1, -1> Xs;
Xs = Eigen::Matrix<double, -1, -1>(2, 2);
stan::math::fill(Xs, std::numeric_limits<double>::quiet_NaN());
assign(lcm_sym2__, nil_index_list(),
rvalue(X, cons_list(index_min_max(1, 2), nil_index_list()), "X"),
"assigning variable lcm_sym2__");
current_statement__ = 2;
return sum(multiply(lcm_sym2__, paramv));
}
} catch (const std::exception& e) {
stan::lang::rethrow_located(e, locations_array__[current_statement__]);
// Next line prevents compiler griping about no return
throw std::runtime_error("*** IF YOU SEE THIS, PLEASE REPORT A BUG ***");
}
}
Note that Xs is a matrix of doubles which is a huge win in this case.
Thanks @rok-cesnovar ... but a requirement is that you declare things as
real foo(data matrix X, vector paramv) {
matrix[2,2] Xs = X[1:2];
return sum( Xs * paramv);
}
So X must be declared as data matrix, and at that point one can of course only call it with data.
Is that an option for you @paul-buerkner ?
(BTW, this is super cool to know... I have been waiting for this feature for ages)
Yes, you need the data qualifier. Otherwise stanc3 can't infer the type. In theory it could probably check which inputs it is used with and create extra signatures, but not at the moment.
Nice! So I just need to put the data qualifier before the variable type and things should be fine? That would be awesome!
Yes + call the cmdstan_model with an argument. I will post an example call when I am back at my laptop.
Great, thanks @rok-cesnovar! @wds15 After some work today, I got a simple version working already, which used the subsetting-on-the-fly approach you advocated before knowing of the new stanc optimization. I will post the version tomorrow or so when I am convinced of it. I am no longer sure which of the two discussed versions would be easier to implement and maintain. I will play around with it more.
If you get it to work with the subsetting as suggested first, then this will be more flexible... since the data declaration in the arguments makes any function less flexible, which could limit things with respect to missing-data handling, for example.
Forgot about this, sorry..
library(cmdstanr)
file_path <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")
mod <- cmdstan_model(file_path, stanc_options = list(O=TRUE))
You can run cmdstan_model with quiet = FALSE to double check.
You should see something like (note the --O):
--- Translating Stan model to C++ code ---
bin/stanc --O --name='bernoulli_model' --o=/tmp/Rtmp2EuAeC/model-1c3a2fc5c314.hpp /tmp/Rtmp2EuAeC/model-1c3a2fc5c314.stan
brms now supports threading via reduce_sum
for most models. However, this is super experimental and a lot of cleaning will be required. Here is an example for a very simple model:
data("sleepstudy", package = "lme4")
make_stancode(Reaction ~ Days + (Days | Subject), sleepstudy,
family = gaussian(), threads = 2)
// generated with brms 2.13.10
functions {
/* integer sequence of values
* Args:
* start: starting integer
* end: ending integer
* Returns:
* an integer sequence from start to end
*/
int[] sequence(int start, int end) {
int seq[end - start + 1];
for (n in 1:num_elements(seq)) {
seq[n] = n + start - 1;
}
return seq;
}
// compute partial sums of the log-likelihood
real partial_log_lik(int[] seq, int start, int end, vector Y, matrix Xc, vector b, real Intercept, real sigma, int[] J_1, vector Z_1_1, vector Z_1_2, vector r_1_1, vector r_1_2) {
real ptarget = 0;
int N = end - start + 1;
// initialize linear predictor term
vector[N] mu = Intercept + Xc[start:end] * b;
for (n in 1:N) {
// add more terms to the linear predictor
int nn = n + start - 1;
mu[n] += r_1_1[J_1[nn]] * Z_1_1[nn] + r_1_2[J_1[nn]] * Z_1_2[nn];
}
ptarget += normal_lpdf(Y[start:end] | mu, sigma);
return ptarget;
}
}
data {
int<lower=1> N; // total number of observations
vector[N] Y; // response variable
int<lower=1> K; // number of population-level effects
matrix[N, K] X; // population-level design matrix
// data for group-level effects of ID 1
int<lower=1> N_1; // number of grouping levels
int<lower=1> M_1; // number of coefficients per level
int<lower=1> J_1[N]; // grouping indicator per observation
// group-level predictor values
vector[N] Z_1_1;
vector[N] Z_1_2;
int<lower=1> NC_1; // number of group-level correlations
int prior_only; // should the likelihood be ignored?
}
transformed data {
int Kc = K - 1;
matrix[N, Kc] Xc; // centered version of X without an intercept
vector[Kc] means_X; // column means of X before centering
for (i in 2:K) {
means_X[i - 1] = mean(X[, i]);
Xc[, i - 1] = X[, i] - means_X[i - 1];
}
}
parameters {
vector[Kc] b; // population-level effects
real Intercept; // temporary intercept for centered predictors
real<lower=0> sigma; // residual SD
vector<lower=0>[M_1] sd_1; // group-level standard deviations
matrix[M_1, N_1] z_1; // standardized group-level effects
cholesky_factor_corr[M_1] L_1; // cholesky factor of correlation matrix
}
transformed parameters {
matrix[N_1, M_1] r_1; // actual group-level effects
// using vectors speeds up indexing in loops
vector[N_1] r_1_1;
vector[N_1] r_1_2;
// compute actual group-level effects
r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)';
r_1_1 = r_1[, 1];
r_1_2 = r_1[, 2];
}
model {
// likelihood including all constants
if (!prior_only) {
int seq[N] = sequence(1, N);
target += reduce_sum(partial_log_lik, seq, 1, Y, Xc, b, Intercept, sigma, J_1, Z_1_1, Z_1_2, r_1_1, r_1_2);
}
// priors including all constants
target += student_t_lpdf(Intercept | 3, 288.7, 59.3);
target += student_t_lpdf(sigma | 3, 0, 59.3)
- 1 * student_t_lccdf(0 | 3, 0, 59.3);
target += student_t_lpdf(sd_1 | 3, 0, 59.3)
- 2 * student_t_lccdf(0 | 3, 0, 59.3);
target += std_normal_lpdf(to_vector(z_1));
target += lkj_corr_cholesky_lpdf(L_1 | 1);
}
generated quantities {
// actual population-level intercept
real b_Intercept = Intercept - dot_product(means_X, b);
// compute group-level correlations
corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
vector<lower=-1,upper=1>[NC_1] cor_1;
// extract upper diagonal of correlation matrix
for (k in 1:M_1) {
for (j in 1:(k - 1)) {
cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
}
}
}
fit1 <- brm(Reaction ~ Days + (Days | Subject), sleepstudy,
family = gaussian(), threads = 2,
chains = 1, backend = "cmdstanr")
On my Windows machine, threading slows down sampling drastically, which may be because of Windows, the simple model, or my implementation of reduce_sum. @wds15 Would you mind running some benchmarks on the brms implementation so that we can get an understanding of how well we are doing at the moment?
Hmm... I did run a poisson example and I also see unreasonable slow downs. I need to debug this for a moment.
Thank you for looking into it!
Interesting... so I looked at the Poisson example epilepsy:
fit1 <- brm(count ~ zAge + zBase * Trt + (1|patient),
data = epilepsy, family = poisson(),
threads = 1,
chains = 1, backend = "cmdstanr")
and the single-thread runtime is 3.8s while the plain reduce_sum runtime was a whopping 30s. I tried a few things, but what finally made it work was simply introducing a grainsize, which I increased from the fixed value of 1 to 60 or so. Then the performance was OK (4.9s with 1 core / ~3s with 4 cores), but the example is still too small for real benefits (at least I was able to beat the single-thread runtime).
So we need the grainsize... and it needs to be picked by the user.
... looking at this more closely, I think what would really help is to slice not over the data items 1-N, but instead over the patients. To do that, one would probably have to re-sort the data or create some clever index maps - which I am not sure can be done in brms. If it's a possibility, then we should try it. The intuition behind this is that it minimizes unnecessary copies of the random effects (in fact, you would slice over the random effect itself) and also increases the work done per unit, which very likely makes setting grainsize less important. Should this be an option, I would just make it a convention that the random effect with the largest number of levels is picked to slice over (that's my guess, which we should test a bit).
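Purely as an illustration of the idea (this is not something brms generates), a rough sketch of slicing over grouping levels, assuming the data were re-sorted by patient and that obs_start/obs_end are hypothetical index maps giving each patient's block of rows:
functions {
  // The sliced argument is the array of random effects (one entry per patient),
  // so each partial sum handles whole patients and copies little besides r_slice.
  real partial_log_lik_groups(real[] r_slice, int start, int end, int[] y,
                              vector eta, int[] obs_start, int[] obs_end) {
    real ptarget = 0;
    for (g in 1:(end - start + 1)) {
      int gg = g + start - 1;  // global index of the grouping level
      for (n in obs_start[gg]:obs_end[gg]) {
        ptarget += poisson_log_lpmf(y[n] | eta[n] + r_slice[g]);
      }
    }
    return ptarget;
  }
}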
BTW... not for now... but it would be nice to let users choose between reduce_sum and reduce_sum_static. The static version is deterministic, and setting grainsize well is a must for it, while reduce_sum should work OK with a mediocre grainsize and adapt better to more/fewer CPU cores.
Overall this looks super promising!
Thank you for looking into that! So grainsize it is then. How do we choose it appropriately? Or rather, how do we advise users to choose it?
Slicing over random effects is not an option for brms (in my current opinion), because in general we have not only one but multiple (nested or crossed) effects, and we can obviously slice over only one. Add even more structure, such as splines or other special terms, and we won't be able to slice appropriately over any particular kind of parameter vector. And even if we could, we would slice over one but not over the several others. There will be stuff copied over no matter what we do. I understand that this is something where we gain speed for some simple models, but I don't see it being an option for brms, unfortunately.
I am happy to also support reduce_sum_static. Is there any difference other than the grainsize? We also need to think of a user interface that allows everything relevant to be specified intuitively in one place, that is, the number of threads, the grainsize, and the static option. I am thinking of an argument called threading (or so) and a corresponding function which takes and validates all the arguments to fully handle threading.
The random effects of the model usually group the data into larger chunks. When we slice over these, we gain a few things here:
- grainsize should be far less important, as more work is done per unit
Maybe I will try this out on the epilepsy example to have a test drive of it.
In any case: We can add this additional bit later once convinced of the merits of it. What is there already is very useful as it looks to me.
reduce_sum_static will always form the exact same expression graph, such that the same partial sums are always formed and summed together, which makes everything exactly reproducible. The chunk size will always be close to, but never larger than, grainsize.
reduce_sum uses grainsize just as a rough order for the chunk size, and it tries to adapt to the given CPUs and the current load to make the best use of the available resources (but the result is non-deterministic).
Choosing grainsize... the Intel TBB answer: https://software.intel.com/en-us/node/506060 ... practically it's hard for me to give a definite answer... maybe start with (# of data rows) / (2 * maximal # of cores being used)... but don't nail me on that; it's very dependent on your problem & the CPU you use.
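(As a purely illustrative number: with 1000 data rows and at most 4 cores, that heuristic would suggest a grainsize around 1000 / (2 * 4) = 125.)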
Thanks for the additional information! This is really helpful.
With regard to slicing by random effects, I don't doubt its merits (but would still love to see some benchmarks). However, I don't see at this point how I can make it possible without a huge amount of special-case coding that will haunt me for years when having to maintain it. If someone comes up with an idea of how this can be represented without substantially breaking the existing structure of the code generation, I will of course consider it. But just to manage expectations, I think slicing by random effects is relatively unlikely to make it into brms, though perhaps there is an elegant way of doing it.
This is really cool. Thank you both for working on it! I just wanted to chime in with one suggestion about being cautious when advertising this feature:
Basically, unless we have really solid advice on getting the settings right (e.g. grainsize) we should discourage people from using it unless they really have a problem that is computationally too demanding without it. Within chain parallelization is such an alluring feature that people are going to want to use it even when they will spend more time trying to get it right than if they just ran the model once without it. That is, there's a risk that in their effort to save time with within chain parallelization users will actually end up wasting time having to try different settings, some of which make it slower than not using it at all.
So I definitely think it's worth offering this, but I suggest advertising with caution until it basically works out of the box with not much user tweaking required. Just imagine the number of forum posts and github issues that will come in if people get really excited about this but then it requires a lot of handholding to get right.
I definitely agree with being cautious! Even without this feature, we are kind of getting overrun with brms questions already...
I have updated the branch once more to give the user control over the grainsize. Here is an example:
set.seed(1234)
dat <- data.frame(
y = rnbinom(1000, size = 10, mu = 5),
x1 = rnorm(1000),
x2 = rnorm(1000),
g = factor(rep(1:100, each = 10))
)
fit1 <- brm(y ~ s(x1) + s(x2) + (1 | g), dat, family = negbinomial(),
chains = 1, backend = "cmdstanr", threads = NULL,
control = list(adapt_delta = 0.95))
fit2 <- brm(y ~ s(x1) + s(x2) + (1 | g), dat, family = negbinomial(),
chains = 1, backend = "cmdstanr",
threads = threading(4, grainsize = 250),
control = list(adapt_delta = 0.95))
In this case, the speedups are quite OK (130 vs. 67 seconds in my latest run). However, performance gains seem to be super variable, especially across likelihood families (and surely across model complexity, but I haven't tested this extensively yet). For example, when changing negbinomial to poisson in the above example, I barely see any speedups.
I really think we need more systematic numerical experiments, at least of the size of a student project (or did you do something like this already?). Given the relevance of within-chain parallelization for Stan and PPLs in general, this may even become a paper. No matter the details, we definitely need to learn much more about the behavior of reduce_sum, I think.
I agree that we need to set up good defaults in brms along with clear documentation. Given how brms is applying reduce_sum, it seems that grainsize = 1 (our current recommended starting value) is not an ideal choice for many problems. So my thinking is
I agree that a full publication would be great and it's really worth publishing this, but we need a student for that.
How about we (maybe even I) set up a small simulation study... though I am rather out of time and would welcome anyone's help here.
In any case - it is amazing to see this landing in brms!
Aki suggested we might want to ask on twitter or discourse if anyone is interested and has the time to work with us on the experiments.
This is stuff I think I can help with. I've done lots of hierarchical multivariate simulation work. I also happened to just get access to GCE's 224-core instances, so I can explore to pretty large scales. Lemme read back through this thread (only read the last two entries) and get up to speed on what parameters and scenarios you're looking to explore.
Nice! That is great! I think we need to have a call at some point to actually define a proper scope. Right now the discussion is mostly driven by having identified that we need to understand reduce_sum better, but we didn't define what exactly we are planning to do. I will put you on my list of people to invite to the call.
@mike-lawrence great!
This is a really interesting problem. It is pretty difficult to do such tuning for a general problem, but if we limit the space to a still vast but less general space of brms models, I think it's possible to get useful results for users and a publication.
The input (data) is:
The tunable parameters are:
Is it for brms an option to have a routine which does some quick profiling to test on the fly for a good grainsize given some problem?
Other than that, we should keep it simple to get going. It's about getting something reasonable to work for almost any model... not the optimum.
Ideally an advanced user can always provide a little extra code to get far more optimized parallelization.
Can you describe a little more what you mean by quick profiling on the fly?
Basically, to try out a few grainsizes with small numbers of iterations and pick the fastest one. This is probably a bit out of scope for brms and would rather be part of the reduce_sum backend (if a user picks grainsize = -1, for example).
Something like:
Given the iterative nature of MCMC, we could explore that.
It's not quite the same, but in the GPU implementation tuning research field, a quasi-random search guided by good prior knowledge outperforms any "smart" auto-tuning algorithms (based on simulated annealing, etc.) because of the cost of those algorithms.
I agree that the goal is always going to be good performance not optimal.
Thanks! That makes sense. I am not sure how I could put this into brms though. As you point out, it should probably better be part of reduce_sum itself if feasible.
Would it work to simply run with method="optimizing" and a single iteration?
I think that a few optimization iterations would give a good estimate, yes. There is some variance in execution time, so we don't want to put too much weight on a single measurement.
There should not be too much variance in the measurements for problems where this actually makes sense to use. These problems are large and there should not be too much variation.
See the blog post of Sebastian Weber (@wds15): https://statmodeling.stat.columbia.edu/2020/05/05/easy-within-chain-parallelisation-in-stan/