Thanks Ben. I will be opening all relevant PRs in a few days and will then make another call for feedback. The Math PR should work standalone and can be done independently, while the cmdstan/stanc3 PRs will be interconnected.
Math PR open: https://github.com/stan-dev/math/pull/2265
Stanc3 PR open: https://github.com/stan-dev/stanc3/pull/775
Both PRs are preliminary and await feedback before polishing and doc-ing.
Cmdstan branch ready: https://github.com/stan-dev/cmdstan/tree/profiling
Example of use:
data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0, upper=1> y[N];
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  {
    profile("normal_lpdf alpha");
    target += normal_lpdf(alpha | 0, 1);
  }
  {
    profile("normal_lpdf beta");
    target += normal_lpdf(beta | 0, 1);
  }
  {
    profile("bernoulli GLM");
    target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
  }
}
After running, this returns:
Iteration: 1900 / 2000 [ 95%] (Sampling)
Iteration: 2000 / 2000 [100%] (Sampling)
Elapsed Time: 0.036 seconds (Warm-up)
              0.042 seconds (Sampling)
              0.078 seconds (Total)
profile name: fwd pass, rev pass
bernoulli GLM: 0.0409984, 0.000453154
normal_lpdf alpha: 0.00238057, 0.000497744
normal_lpdf beta: 0.00349398, 0.00053457
@rok-cesnovar is there a Stan branch? I gotta clone these things using git clone --depth=1 --single-branch cuz my internet is slow.
Is it just develop?
It's the profiling branch on all repos.
I'm getting the wrong compiler:
bbales2@tadpole:~/cmdstan-profile$ head -n20 bin/stanc
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<link rel="dns-prefetch" href="https://github.githubassets.com">
<link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
<link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
<link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
<link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
<link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
<link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">
Command on build is:
curl -L https://github.com/rok-cesnovar/stanc3/releases/tag/profiling/linux-stanc -o bin/stanc --retry 5 --retry-delay 10
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 83478 0 83478 0 0 40444 0 --:--:-- 0:00:02 --:--:-- 40444
chmod +x bin/stanc
Ooo, but I found your stanc3 binaries over here: https://github.com/stan-dev/stanc3/actions/runs/431332425 and they work.
This is so clutch. I was playing around with a model and I put a timing thing in each block:
gq: 0.916925, 0
likelihood: 0.0116217, 0.00156159
prior: 0.0292104, 0.000513555
tdata: 0.000783834, 3.5e-08
tparm: 33.3017, 0.000547283
I don't think the reverse pass timing in tdata should have anything. Something weird must be happening there.
Also what are the units on all these printouts? Total time spent? Or seconds per gradient evaluation? Does this include warmup?
Currently, tdata is not supposed to work. I need to add Math support for a profile that doesn't add a vari.
Also what are the units on all these printouts? Total time spent? Or seconds per gradient evaluation? Does this include warmup?
It's total time spent in seconds; it includes everything.
Thanks for the first batch of feedback!
I'm getting the wrong compiler:
Fixed
I don't think the reverse pass timing in tdata should have anything.
Fixed, in tdata and generated quantities we don't create a vari now.
Also added a profile_file argument:
./logistic sample data file=logistic.data.json profile_file=profile.csv
I tarballed the latest version, now available here: https://github.com/rok-cesnovar/cmdstan/releases/download/profiling/cmdstan-profiling.tgz
It's total time spent in seconds; it includes everything.
I kinda think these should be per gradient. They should probably be split by warmup/sampling as well.
Also I want the distributions. Given we can have up to 1000x more gradients than lp evaluations, that really explodes the output size esp. on something we're trying to benchmark (which seems bad), but we could thin the output or something. This sounds like getting into Stan services lol.
Also I want the distributions.
I'll also be fine with summaries if that's what I get in the end, but while for some things we expect the per-gradient timing to be pretty constant (ODEs) for the higher order functions there could be interesting variation.
If I time my transformed parameters block like:
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  vector[N_2] r_2_1;  // actual group-level effects
  vector[N_3] r_3_1;  // actual group-level effects
  vector[N_4] r_4_1;  // actual group-level effects
  vector[N_5] r_5_1;  // actual group-level effects
  vector[N_6] r_6_1;  // actual group-level effects
  vector[N_7] r_7_1;  // actual group-level effects
  {
    profile("transformed parameters");
    r_1_1 = (sd_1[1] * (z_1[1]));
    r_2_1 = (sd_2[1] * (z_2[1]));
    r_3_1 = (sd_3[1] * (z_3[1]));
    r_4_1 = (sd_4[1] * (z_4[1]));
    r_5_1 = (sd_5[1] * (z_5[1]));
    r_6_1 = (sd_6[1] * (z_6[1]));
    r_7_1 = (sd_7[1] * (z_7[1]));
  }
}
And then like:
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  vector[N_2] r_2_1;  // actual group-level effects
  vector[N_3] r_3_1;  // actual group-level effects
  vector[N_4] r_4_1;  // actual group-level effects
  vector[N_5] r_5_1;  // actual group-level effects
  vector[N_6] r_6_1;  // actual group-level effects
  vector[N_7] r_7_1;  // actual group-level effects
  profile("transformed parameters");
  r_1_1 = (sd_1[1] * (z_1[1]));
  r_2_1 = (sd_2[1] * (z_2[1]));
  r_3_1 = (sd_3[1] * (z_3[1]));
  r_4_1 = (sd_4[1] * (z_4[1]));
  r_5_1 = (sd_5[1] * (z_5[1]));
  r_6_1 = (sd_6[1] * (z_6[1]));
  r_7_1 = (sd_7[1] * (z_7[1]));
}
I get different results. The second gives me timings for "transformed parameters" that are nearly as long as it took to do the whole inference.
I kinda think these should be per gradient.
Sure, that is doable. There is more to discuss there, specifically what to do with the forward/reverse passes: we can have an unequal number of forward and reverse passes, which then brings us to a few questions:
They should probably be split by warmup/sampling as well.
Hm, that would be a bigger change and would also mean we have to do this through services. That probably means pushing it to next release. Which I am fine with if that is what we decide.
Just curious, what is the added value of knowing the warmup/sampling split?
Also I want the distributions.
That is also doable, but it would have a larger memory footprint, or we would have to store the profile info in files, which again brings us to services, and profiling then has a bit of a larger effect. I do feel these might be useful, yeah.
Maybe we can do basic profiling the way it's proposed here and do a verbose one via services.
Just curious, what is the added value of knowing the warmup/sampling split?
ODEs take different amounts of time in different parts of parameter space and warmup can spend a lot of time in weird places.
That probably means pushing it to next release. Maybe we can do basic profiling the way as its proposed here and do a verbose one via services.
If it's too hard to give raw timing stats or split by warmup/sampling, I'm definitely in for doing some sort of summary output now and expanding it in the future. Just gotta make sure not to do something that's super backwards incompatible or whatever.
Just gotta make sure and not do something that's super backwards incompatible or whatever.
Strongly agree.
I get different results.
This one is due to transformed parameters actually being generated together with everything else in the model block. Will look at how we can get around this problem. I suspect the end-of-TP block is going to be a bit of a challenge.
This is great. I just timed the different bits of a hierarchical model (something generated from brms here):
name, total, fwd, rev
likelihood,46.8229,42.3114,4.51143
linear_model,35.5232,27.1445,8.37874
prior,0.901631,0.781348,0.120283
random effects,108.195,72.9507,35.2441
transformed parameters,0.666818,0.479504,0.187314
This is how the linear_model and random effects things were defined:
{
  profile("linear_model");
  mu = Intercept + Xc * b;
}
{
  profile("random effects");
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] = r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n] + r_3_1[J_3[n]] * Z_3_1[n] + r_4_1[J_4[n]] * Z_4_1[n] + r_5_1[J_5[n]] * Z_5_1[n] + r_6_1[J_6[n]] * Z_6_1[n] + r_7_1[J_7[n]] * Z_7_1[n];
  }
}
and then switched it to use the glm functions which combine the linear model and likelihood bit:
name, total, fwd, rev
likelihood,31.4064,28.2638,3.14264
prior,0.698845,0.609689,0.0891561
random effects,61.1205,40.9512,20.1693
transformed parameters,0.457823,0.321136,0.136686
Edit: edited to make the labels in the code and output consistent
Can the timing be recursive?
Like, I put a profile in my model block which contains other profiles.
Now I get:
name, total, fwd, rev
likelihood,31.4064,28.2638,3.14264
model,93.6092,70.1797,23.4295
prior,0.698845,0.609689,0.0891561
random effects,61.1205,40.9512,20.1693
transformed parameters,0.457823,0.321136,0.136686
But could we get:
name, total, fwd, rev
model,93.6092,70.1797,23.4295
- likelihood,31.4064,28.2638,3.14264
- random effects,61.1205,40.9512,20.1693
prior,0.698845,0.609689,0.0891561
transformed parameters,0.457823,0.321136,0.136686
And then if there's a profile inside a function, it might show two different entries depending on where it's used. Like how profilers let you group the output times.
Can the timing be recursive?
Yes, the only limitation on where you can place profiles is that you can't have
{
  profile("name");
  fun1();
  profile("name");
  fun2();
}
so two "active" profiles with the same name. This is allowed:
{
  profile("name");
  fun1();
}
{
  fun2();
  profile("name");
}
And name would be the sum of the two. You can also use it in a for loop.
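For instance, here is a minimal sketch (the data and parameter names are purely illustrative) that reuses one profile name in two places, including inside a for loop; the reported time for "name" is the sum over all of them:

model {
  {
    profile("name");  // first occurrence
    target += normal_lpdf(alpha | 0, 1);
  }
  for (n in 1:N) {
    profile("name");  // re-entered each iteration; times accumulate
    target += normal_lpdf(y[n] | alpha, 1);
  }
}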
But could we get:
Maybe, yeah.
This is so cool! Is the scope RAII-like, from the profile() command to the end of the current block? Then you can always introduce braces to test sequences of commands. I really like how clean this design is, though I can imagine the implementation for reverse mode is tricky, as you'll have to manage the stack rather than the compiler.
The Stan transpiler should be able to check for active profiles and disallow duplicates.
If the design goes onto design-docs, I'm happy to review. I'm also happy to write the doc if you'd like help with that.
Is the scope RAII-like, from the profile() command to the end of the current block?
Yes. This works great and is very clean because blocks in Stan = blocks in C++. The only problem we have is that transformed parameters are not actually encapsulated in a block in C++, which means that profile("TP") in the example below spills over into the model block, as it only goes out of scope at the end of the model block:
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  profile("TP");
  r_1_1 = (sd_1[1] * (z_1[1]));
}
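As in the earlier brms-generated example, wrapping the statements in their own braces scopes the profile and avoids the spill-over:

transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  {
    profile("TP");  // now closes at the end of these braces
    r_1_1 = (sd_1[1] * (z_1[1]));
  }
}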
though I can imagine the implementation for reverse mode is tricky, as you'll have to manage the stack rather than the compiler
I opened a Math PR https://github.com/stan-dev/math/pull/2265 so you can see the Math side of things. It's not review-ready yet, because I did not go into doxygen and tests before we finalize what we actually want.
If the design goes onto design-docs
I am doing the first round of feedback to hash out any trivial stuff I missed and to get additional ideas. I will open a design doc in a few days. Will make sure to tag you.
I'm also happy to write the doc if you'd like help with that.
Thank you! I will probably require help with docs as I am involved with a bunch of projects for 2.26 and might run out of time. And we want this doc-ed well.
The Stan transpiler should be able to check for active profiles and disallow duplicates.
For now, this triggers a runtime error (trying to re-activate an already active profile).
Hmm, I wonder if we should group the output by what block they're run in?
Like transformed data gets run once with doubles, generated quantities gets run for every draw with doubles, transformed parameters is messy with the thing @nhuurre found, but the model block is every gradient.
And so then per-gradient numbers are weird, cause not everything is a gradient. I guess per-call?
I ran an ODE model with the intent to check if the other solvers were running faster:
functions {
  vector dz_dt(real t,       // time
               vector z,     // system state {prey, predator}
               real alpha,
               real beta,
               real gamma,
               real delta) {
    real u = z[1];
    real v = z[2];
    vector[2] dzdt;
    dzdt[1] = (alpha - beta * v) * u;
    dzdt[2] = (-gamma + delta * u) * v;
    return dzdt;
  }
}
data {
  int<lower = 0> N;         // number of measurement times
  real ts[N];               // measurement times > 0
  real y_init[2];           // initial measured populations
  real<lower = 0> y[N, 2];  // measured populations
}
parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  real<lower = 0> gamma;
  real<lower = 0> delta;
  vector<lower = 0>[2] z_init;  // initial population
  real<lower = 0> sigma[2];     // measurement errors
}
transformed parameters {
  vector[2] z[N] = ode_rk45_tol(dz_dt, z_init, 0, ts,
                                1e-5, 1e-3, 500,
                                alpha, beta, gamma, delta);
}
model {
  {
    profile("priors");
    alpha ~ normal(1, 0.5);
    gamma ~ normal(1, 0.5);
    beta ~ normal(0.05, 0.05);
    delta ~ normal(0.05, 0.05);
    sigma ~ lognormal(-1, 1);
    z_init ~ lognormal(log(10), 1);
  }
  {
    profile("likelihood");
    for (k in 1:2) {
      y_init[k] ~ lognormal(log(z_init[k]), sigma[k]);
      y[ , k] ~ lognormal(log(z[, k]), sigma[k]);
    }
  }
}
generated quantities {
  vector[2] zb[N];
  {
    profile("adams");
    zb = ode_adams_tol(dz_dt, z_init, 0, ts,
                       1e-5, 1e-3, 500,
                       alpha, beta, gamma, delta);
  }
  {
    profile("bdf");
    zb = ode_bdf_tol(dz_dt, z_init, 0, ts,
                     1e-5, 1e-3, 500,
                     alpha, beta, gamma, delta);
  }
  {
    profile("rk45");
    zb = ode_rk45_tol(dz_dt, z_init, 0, ts,
                      1e-5, 1e-3, 500,
                      alpha, beta, gamma, delta);
  }
}
Anyway the idea is to see which solver is faster on the problem.
The output is:
name, total, fwd, rev
adams,0.117718,0.117718,0
bdf,0.115601,0.115601,0
likelihood,0.72739,0.675845,0.051545
priors,0.220735,0.207088,0.0136471
rk45,0.0271892,0.0271892,0
which definitely gives me the info I want, but all those numbers kinda come from different places and so maybe they should be arranged a bit more.
Like transformed data gets run once with doubles, generated quantities gets run for every draw with doubles, transformed parameters is messy..
Let's discuss this in the design doc. To me, the cumulative time is the easiest to process. For example, the per-run time could be the biggest for the transformed data, but we know that in the grand scheme of things it's most likely negligible.
nested profiles and labeling the call-site (tdata, tparam, model)
Hm, these both sound like good-to-haves, yes. But both of these can be achieved by the user labeling the profiles. So you could label GQ profiles as profile("GQ - rk45"). By labeling them automatically we remove some of the freedom the user would have. For example, one could want to profile both the tparam and GQ rk45 calls together, and we would prevent that with auto-grouping.
Same for
model,93.6092,70.1797,23.4295
- likelihood,31.4064,28.2638,3.14264
- random effects,61.1205,40.9512,20.1693
Just label the likelihood as either " - likelihood" or "model - likelihood", and if the profiles are shown in the order they appear, both would be shown the way you want them.
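For example, a sketch of that manual labeling (the profile names and the model-block contents here are purely illustrative):

model {
  profile("model");  // outer profile covering the whole block
  {
    profile("model - likelihood");
    target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
  }
  {
    profile("model - random effects");
    // ... random-effects contributions ...
  }
}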
We definitely need to order the profiles based on how they appear in the model. They are currently shown alphabetically, but that was not the intent; that is just the order they come out in if I iterate over the map.
Can we have the same tag in multiple blocks? Often the pieces I do are a combination of things going on in generated quantities, the model block and transformed parameters, so I'm not sure it makes sense to summarize by block.
Yes, the current implementation would accumulate profiles with the same name.
The only thing that is currently bothering me is how we could profile the parameter transforms (lower, upper, ...). I don't have a clean solution for that at the moment, so I am leaning towards not supporting that at this time.
Well, you could just allow a profile statement in the parameters block, but I don't think transforms are ever going to be a performance bottleneck.
This is the output now:
name,total_time,fwd_time,rev_time,chain_stack_total,nochain_stack_total,fwd_passes,AD_fwd_passes,AD_rev_passes
centering,5.335e-06,5.335e-06,0,0,0,1,0,0
fixed effects,1.72901,1.35503,0.373981,236822,118411000,1,118411,118411
likelihood,2.0658,1.9612,0.104593,118411,0,1,118411,118411
prior,0.205329,0.180555,0.0247746,1894576,0,1,118411,118411
random effects,5.15733,3.48409,1.67324,414438500,0,1,118411,118411
I tried to format this like the current Stan summary output:
Profile information for Stan model: mrp_prof_model
1 chain
                Time     Forward_Mode_Time  Reverse_Mode_Time  ChainStack  NoChainStack  NoADPasses  AdPasses
centering       5.3e-06  5.3e-06            0                  0           0             1           0
fixed effects   1.7      1.4                0.37               2.4e5       1.2e8         1           1.2e5
likelihood      2.1      2.0                0.10               1.2e5       0             1           1.2e5
prior           0.21     0.18               0.025              1.9e6       0             1           1.2e5
random effects  5.2      3.5                1.7                4.1e8       0             1           1.2e5
Notes:
I got rid of the FwdAdPass and RevADPass columns. Seems like they were duplicates: for every rev pass we do a forward pass.
It looks like the AdPasses column, while being interesting, is very boring
I am a bit sketch about printing the timing information to stdout along with everything. That is going to look bad with multiple chains regardless of how we swing it. In that regard, I think profile_file + some sort of stansummary program is in order, and no stdout printing.
The header from the regular Stansummary looks better:
Inference for Stan model: mrp_prof_model
2 chains: each with iter=(1000,1000); warmup=(0,0); thin=(1,1); 2000 iterations saved.
Warmup took (4.7, 4.4) seconds, 9.0 seconds total
Sampling took (4.9, 5.4) seconds, 10 seconds total
It would be annoying to save a lot of that meta information in comments in our beautiful new CSV format though.
There are timing numbers here, memory numbers, and AD pass numbers. Maybe we should separate out this information:
Profile information for Stan model: mrp_prof_model, 1 chain(s)
Timing information:
                Time     Forward_Mode_Time  Reverse_Mode_Time
centering       5.3e-06  5.3e-06            0
fixed effects   1.7      1.4                0.37
likelihood      2.1      2.0                0.10
prior           0.21     0.18               0.025
random effects  5.2      3.5                1.7
Memory information:
                ChainStack  NoChainStack
centering       0           0
fixed effects   2.4e5       1.2e8
likelihood      1.2e5       0
prior           1.9e6       0
random effects  4.1e8       0
Autodiff information:
                NoADPasses  AdPasses
centering       1           0
fixed effects   1           1.2e5
likelihood      1           1.2e5
prior           1           1.2e5
random effects  1           1.2e5
Yeah on thinking more, I think just reporting the total is the easiest thing to think about. That way it's the total time/memory/autodiff calls within chains and also the totals across chains.
The autodiff information section: there doesn't appear to be much information in here, cause the profile statements are not in loops or anything. If you time something in a loop you'll get some variation.
I wonder where the likelihood is evaluated without AD?
Also, I think we should write a separate stansummary for this stuff. I can help tomorrow. I think it should just have the sig_figs, csv_filename, and help options. I don't think we want to print this stuff by default.
For every rev pass we do a forward pass
Yes, except if an exception occurs, then there is no reverse pass.
It looks like the AdPasses column, while being interesting, is very boring
Yes, but if we want to check how much time we spend per gradient eval, or how much of the chain stack is used per gradient, we need this info.
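For example, dividing the totals in the CSV above by the AD pass counts gives the per-gradient figures:

random effects:  5.15733 s / 118411 AD passes ≈ 4.4e-5 s per gradient evaluation
                 414438500 chain-stack varis / 118411 passes = 3500 varis per pass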
I am a bit sketch about printing the timing information to stdout along with everything. That is going to look bad with multiple chains regardless of how we swing it. In that regard, I think profile_file + some sort of stansummary program is in order and no stdout printing
Agree. What should the default profile_file be? profile_output.csv?
It would be annoying to save a lot of that meta information in comments in our beautiful new CSV format though.
We don't want to add that to the comments, in my opinion.
There are timing numbers here, and there are memory numbers here, and there are ad pass numbers here. Maybe we should separate out this information:
I think we should, yes. I like the timing/memory/AD splits.
Do we offer a way to summarize the outputs over chains? My instinct is to just offer a sum. I don't know how to express variance in 4 numbers.
I like the bin/profilesummary idea. I would maybe prefer this to be handled by stansummary, but we would then have to add comments to the CSV, which I would prefer not to. So a separate one is fine, I guess.
If you indent code blocks you gotta indent every line I think
What should the default profile_file be?
My gut reaction is none, but it would probably be confusing to add profile information and have it appear nowhere. (Edit: like, to get profiling output you have to add profiling commands and tell the interface something, which is pretty annoying.)
Let's go with profile.csv and see how annoying that is in the release candidate.
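So, assuming that default, these two invocations would write the same profile file:

./logistic sample data file=logistic.data.json
./logistic sample data file=logistic.data.json profile_file=profile.csv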
Yes, except if an exception occurs, then there is no reverse pass.
I still think we should ditch this column, but we can keep it for the RC and check how people feel about it.
We don't want to add that to the comments, in my opinion.
Sure, no meta information is fine with me.
I like the timing/memory/AD splits.
Do we offer a way to summarize the outputs over chains?
I would maybe prefer this to be handled by stansummary
It is annoying to add a program, but it would be at least as annoying to try to detect csv-types or add another flag to stansummary.
I still think ditch this column.
Ok. It's always easier to add things than to remove them. Maybe I am just overthinking how useful ChainStack per gradient eval is. If we have doubts that we want it, it's best not to add it.
per gradient-eval
We can count reverse passes instead, so this is the exact number and it's easy to explain (no need to say "the number of reverse pass chain calls except for exceptions").
If we're ever in a situation where forward-pass AD counts differ much from reverse-pass counts, things have gone off the rails in other ways, so just one of the FwdAdCalls and RevAdCalls columns is necessary.
The Stanc3 PR is in, the Cmdstan PR is ready to be merged now I believe, and the docs PR that adds basic information is also ready. Expanded docs will be added during the freeze.
Summary:
This is to track the progress of the timing/profiling project.
A prototype is implemented following the design here
Current Version: v2.25.0