2.26 timing/profiling meta issue

bbbales2 commented 3 years ago

Summary:

This is to track the progress of the timing/profiling project.

A prototype is implemented following the design here

[x] Share a cmdstan/stanc3 build with the prototype timing/profiling features in place so that people can (current build here)
- Evaluate the language design
- Evaluate the I/O
[x] Finalize the language and I/O designs https://github.com/stan-dev/design-docs/pull/31
Pull requests to:
- [x] Math https://github.com/stan-dev/math/pull/2265
- [x] ~Stan https://github.com/stan-dev/stan/pull/2997~
- [x] Stanc3 https://github.com/stan-dev/stanc3/pull/775
- [x] cmdstan https://github.com/stan-dev/cmdstan/pull/968
- [x] cmdstanr?
- [ ] cmdstanpy?
Finish Pulls
- [x] Math
- [x] ~Stan~
- [x] Stanc3
- [x] cmdstan
Documentation
- [ ] User's guide (for an example)
- [ ] Reference guide (since these are language features, a description of how they're used goes there)
- [ ] Function reference (should have the functions here even though it's duplicating reference guide)
- [ ] Case study?

Current Version:

v2.25.0

rok-cesnovar commented 3 years ago

Thanks Ben. I will be opening all relevant PRs in a few days and then make another call for feedback. The Math PR should work standalone and can be done independently, while cmdstan/stanc3 PRs will be interconnected.

rok-cesnovar commented 3 years ago

Math PR Open: https://github.com/stan-dev/math/pull/2265 Stanc3 PR Open: https://github.com/stan-dev/stanc3/pull/775

Both PRs are preliminary and await feedback before polishing and doc-ing.

Cmdstan branch ready: https://github.com/stan-dev/cmdstan/tree/profiling

Example of use:

data {
  int<lower=0> N;
  int<lower=0> K;
  int<lower=0,upper=1> y[N];
  matrix[N, K] X;
}
parameters {
  real alpha;
  vector[K] beta;
}
model {
  {
    profile("normal_lpdf alpha");
    target += normal_lpdf(alpha | 0, 1);
  }

  {
    profile("normal_lpdf beta");
    target += normal_lpdf(beta | 0, 1);
  }
  {
    profile("bernoulli GLM");
    target += bernoulli_logit_glm_lpmf(y | X, alpha, beta);
  }
}

after running returns:

Iteration: 1900 / 2000 [ 95%]  (Sampling)
Iteration: 2000 / 2000 [100%]  (Sampling)

 Elapsed Time: 0.036 seconds (Warm-up)
               0.042 seconds (Sampling)
               0.078 seconds (Total)

profile name: fwd pass, rev pass
bernoulli GLM: 0.0409984, 0.000453154
normal_lpdf alpha: 0.00238057, 0.000497744
normal_lpdf beta: 0.00349398, 0.00053457

bbbales2 commented 3 years ago

@rok-cesnovar is there a Stan branch? I gotta clone these things using git clone --depth=1 --single-branch cuz my internet iz slow.

bbbales2 commented 3 years ago

Is it just develop?

rok-cesnovar commented 3 years ago

Its the profiling branch on all repos.

bbbales2 commented 3 years ago

I'm getting the wrong compiler:

bbales2@tadpole:~/cmdstan-profile$ head -n20 bin/stanc

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
  <link rel="dns-prefetch" href="https://github.githubassets.com">
  <link rel="dns-prefetch" href="https://avatars0.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars1.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars2.githubusercontent.com">
  <link rel="dns-prefetch" href="https://avatars3.githubusercontent.com">
  <link rel="dns-prefetch" href="https://github-cloud.s3.amazonaws.com">
  <link rel="dns-prefetch" href="https://user-images.githubusercontent.com/">

Command on build is:

curl -L https://github.com/rok-cesnovar/stanc3/releases/tag/profiling/linux-stanc -o bin/stanc --retry 5 --retry-delay 10
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 83478    0 83478    0     0  40444      0 --:--:--  0:00:02 --:--:-- 40444
chmod +x bin/stanc

bbbales2 commented 3 years ago

Ooo, but I found your stanc3 binaries over here: https://github.com/stan-dev/stanc3/actions/runs/431332425 and they work.

bbbales2 commented 3 years ago

This is so clutch. I was playing around with a model and I put a timing thing in each block:

gq: 0.916925, 0
likelihood: 0.0116217, 0.00156159
prior: 0.0292104, 0.000513555
tdata: 0.000783834, 3.5e-08
tparm: 33.3017, 0.000547283

I don't think the reverse pass timing in tdata should have anything. Something weird must be happening there.

Also what are the units on all these printouts? Total time spent? Or seconds per gradient evaluation? Does this include warmup?

rok-cesnovar commented 3 years ago

Currently, tdata is not supposed to work. I need to add a Math support for profile that doesn't add a vari.

Also what are the units on all these printouts? Total time spent? Or seconds per gradient evaluation? Does this include warmup?

Its total time spent in seconds, includes everything.

rok-cesnovar commented 3 years ago

Thanks for the first batch of feedback!

I'm getting the wrong compiler:

Fixed

I don't think the reverse pass timing in tdata should have anything.

Fixed, in tdata and generated quantities we dont create a vari now.

Also added profile_file

./logistic sample data file=logistic.data.json profile_file=profile.csv

I tarballed the latest version, now available here: https://github.com/rok-cesnovar/cmdstan/releases/download/profiling/cmdstan-profiling.tgz

bbbales2 commented 3 years ago

Its total time spent in seconds, includes everything.

I kinda think these should be per gradient. They should probably be split by warmup/sampling as well.

Also I want the distributions. Given we can have up to 1000x more gradients than lp evaluations, that really explodes the output size esp. on something we're trying to benchmark (which seems bad), but we could thin the output or something. This sounds like getting into Stan services lol.

bbbales2 commented 3 years ago

Also I want the distributions.

I'll also be fine with summaries if that's what I get in the end, but while for some things we expect the per-gradient timing to be pretty constant (ODEs) for the higher order functions there could be interesting variation.

bbbales2 commented 3 years ago

If I time my transformed parameters block like:

transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects                                                                                                                                                                
  vector[N_2] r_2_1;  // actual group-level effects                                                                                                                                                                
  vector[N_3] r_3_1;  // actual group-level effects                                                                                                                                                                
  vector[N_4] r_4_1;  // actual group-level effects                                                                                                                                                                
  vector[N_5] r_5_1;  // actual group-level effects                                                                                                                                                                
  vector[N_6] r_6_1;  // actual group-level effects                                                                                                                                                                
  vector[N_7] r_7_1;  // actual group-level effects                                                                                                                                                                
  {
    profile("transformed parameters");
    r_1_1 = (sd_1[1] * (z_1[1]));
    r_2_1 = (sd_2[1] * (z_2[1]));
    r_3_1 = (sd_3[1] * (z_3[1]));
    r_4_1 = (sd_4[1] * (z_4[1]));
    r_5_1 = (sd_5[1] * (z_5[1]));
    r_6_1 = (sd_6[1] * (z_6[1]));
    r_7_1 = (sd_7[1] * (z_7[1]));
  }
}

And then like:

transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects                                                                                                                                                                
  vector[N_2] r_2_1;  // actual group-level effects                                                                                                                                                                
  vector[N_3] r_3_1;  // actual group-level effects                                                                                                                                                                
  vector[N_4] r_4_1;  // actual group-level effects                                                                                                                                                                
  vector[N_5] r_5_1;  // actual group-level effects                                                                                                                                                                
  vector[N_6] r_6_1;  // actual group-level effects                                                                                                                                                                
  vector[N_7] r_7_1;  // actual group-level effects
  profile("transformed parameters");
  r_1_1 = (sd_1[1] * (z_1[1]));
  r_2_1 = (sd_2[1] * (z_2[1]));
  r_3_1 = (sd_3[1] * (z_3[1]));
  r_4_1 = (sd_4[1] * (z_4[1]));
  r_5_1 = (sd_5[1] * (z_5[1]));
  r_6_1 = (sd_6[1] * (z_6[1]));
  r_7_1 = (sd_7[1] * (z_7[1]));
}

I get different results. The second gives me timings for "transformed parameters" that are nearly as long as it took to do the whole inference.

rok-cesnovar commented 3 years ago

I kinda think these should be per gradient.

Sure, that is doable. There is more to discuss there, specifically what to do with forward/reverse pass. We can have an unequal number of forward and reverse passes. Which then brings us to a few questions:

do we count only those that succeed in both the forward and reverse pass?
if no, do we do per fwd pass and per rev pass?
would reporting n_fwd, n_rev work? So the users can then calculate it themselves?

They should probably be split by warmup/sampling as well.

Hm, that would be a bigger change and would also mean we have to do this through services. That probably means pushing it to next release. Which I am fine with if that is what we decide.

Just curious, what is the added value of knowing the warmup/sampling split?

Also I want the distributions.

That is also doable, but would have a larger memory footprint or we would have to store the profile info in files, which again brings us to services and profile then has a bit of a larger effect. I do feel these might be useful yeah.

Maybe we can do basic profiling the way as its proposed here and do a verbose one via services.

bbbales2 commented 3 years ago

Just curious, what is the added value of knowing the warmup/sampling split?

ODEs take different amounts of time in different parts of parameter space and warmup can spend a lot of time in weird places.

That probably means pushing it to next release. Maybe we can do basic profiling the way as its proposed here and do a verbose one via services.

If it's too hard to give raw timing stats or split by warmup/release, I'm definitely in for doing some sort of summary output now and expanding it in the future. Just gotta make sure and not do something that's super backwards incompatible or whatever.

rok-cesnovar commented 3 years ago

Just gotta make sure and not do something that's super backwards incompatible or whatever.

Strongly agree.

I get different results.

This one is due to transform parameters actually being generated together with everything in the model block. Will look at how we can get around this problem. I suspect the end-of-TP block is going to be a bit of a challenge.

bbbales2 commented 3 years ago

This is great I just timed the different parts of a bits of a hierarchical model (something generated from brms here):

name, total, fwd, rev
likelihood,46.8229,42.3114,4.51143
linear_model,35.5232,27.1445,8.37874
prior,0.901631,0.781348,0.120283
random effects,108.195,72.9507,35.2441
transformed parameters,0.666818,0.479504,0.187314

This is how the linear_model and random_effects things were defined:

{                                                                                                                                                                                                            
  profile("linear_model");                                                                                                                                                                                     
  mu = Intercept + Xc * b;                                                                                                                                                                                     
}

{
  profile("random effects");
  for (n in 1:N) {
    // add more terms to the linear predictor                                                                                                                                                                  
    mu[n] = r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n] + r_3_1[J_3[n]] * Z_3_1[n] + r_4_1[J_4[n]] * Z_4_1[n] + r_5_1[J_5[n]] * Z_5_1[n] + r_6_1[J_6[n]] * Z_6_1[n] + r_7_1[J_7[n]] * Z_7_1[n];
  }
}

and then switched it to use the glm functions which combine the linear model and likelihood bit:

name, total, fwd, rev
likelihood,31.4064,28.2638,3.14264
prior,0.698845,0.609689,0.0891561
random effects,61.1205,40.9512,20.1693
transformed parameters,0.457823,0.321136,0.136686

Edit: edited to make the labels in the code and output consistent

bbbales2 commented 3 years ago

Can the timing be recursive?

Like a put a profile in my model block which contains other profiles.

Now I get:

name, total, fwd, rev
likelihood,31.4064,28.2638,3.14264
model,93.6092,70.1797,23.4295
prior,0.698845,0.609689,0.0891561
random effects,61.1205,40.9512,20.1693
transformed parameters,0.457823,0.321136,0.136686

But could we get:

name, total, fwd, rev
model,93.6092,70.1797,23.4295
   - likelihood,31.4064,28.2638,3.14264
   - random effects,61.1205,40.9512,20.1693
prior,0.698845,0.609689,0.0891561
transformed parameters,0.457823,0.321136,0.136686

And then if there's a profile inside a function, it might show two different entries depending on where it's used. Like how profilers let you group the output times.

rok-cesnovar commented 3 years ago

Can the timing be recursive?

Yes, the only limitation for where you can place profiles is that you cant have

{
  profile("name");
  fun1();
  profile("name");
  fun2();  
}

so two "active" profiles with the same name. This is allowed:

{
  profile("name");
  fun1();
}
{
  fun2();  
  profile("name");
}

And name would be the sum of the two. You can also use it in a for loop.

But could we get:

Maybe, yeah.

bob-carpenter commented 3 years ago

This is so cool! Is the scope an RAII-like from the profile() command to the end of the current block? Then you can always introduce braces to test sequences of commands. I really like how clean this design is, though I can imagine the implementation for reverse-mode's tricky as you'll have to manage the stack rather than the compiler.

The Stan transpiler should be able to check for active profiles and disallow duplicates.

If the design goes onto design-docs, I'm happy to review. I'm also happy to write the doc if you'd like help with that.

rok-cesnovar commented 3 years ago

Is the scope an RAII-like from the profile() command to the end of the current block?

Yes. This works great and is very clean because blocks in Stan = blocks in C++. The only problem we have is that transformed parameters are not actually encapsulated in a block in C++ which means that profile("TP") in the below example spills over in the model block as it does not get out of scope (it only goes out of scope at the end of the model block.

transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects                                                                                                                                                                
  profile("TP");
  r_1_1 = (sd_1[1] * (z_1[1]));
}

though I can imagine the implementation for reverse-mode's tricky as you'll have to manage the stack rather than the compiler

I opened a Math PR https://github.com/stan-dev/math/pull/2265 so you can see the Math side of thing. Its not review-ready yet, because I did not go into doxygen and tests before we finalize what we actually want.

If the design goes onto design-docs

I am doing the first round of feedback to hash out any trivial stuff I missed and to get additional ideas. I will open a design doc in a few days. Will make sure to tag you.

I'm also happy to write the doc if you'd like help with that.

Thank you! I will probably require help with docs as I am involved with a bunch of projects for 2.26 and might run out of time. And we want this doc-ed well.

The Stan transpiler should be able to check for active profiles and disallow duplicates.

For now, this triggers a runtime error (trying to re-active an already active profile).

bbbales2 commented 3 years ago

Hmm, I wonder if we should group the output by what block they're run in?

Like transformed data gets run once with doubles, generated quantities gets run for every draw with doubles, transformed parameters is messy with the thing @nhuurre found, but the model block is every gradient.

And so then per-gradient numbers are weird, cause not everything is a gradient. I guess per-call?

I ran an ODE model with the intent to check if the other solvers were running faster:

functions {
  vector dz_dt(real t,       // time                                                                                                                                                                                                                                                                                                                                                        
               vector z,     // system state {prey, predator}                                                                                                                                                                                                                                                                                                                               
               real alpha,
               real beta,
               real gamma,
               real delta) {
    real u = z[1];
    real v = z[2];

    vector[2] dzdt;

    dzdt[1] = (alpha - beta * v) * u;
    dzdt[2] = (-gamma + delta * u) * v;

    return dzdt;
  }
}
data {
  int<lower = 0> N;          // number of measurement times                                                                                                                                                                                                                                                                                                                                 
  real ts[N];                // measurement times > 0                                                                                                                                                                                                                                                                                                                                       
  real y_init[2];            // initial measured populations                                                                                                                                                                                                                                                                                                                                
  real<lower = 0> y[N, 2];   // measured populations                                                                                                                                                                                                                                                                                                                                        
}
parameters {
  real<lower = 0> alpha;
  real<lower = 0> beta;
  real<lower = 0> gamma;
  real<lower = 0> delta;
  vector<lower = 0>[2] z_init;  // initial population                                                                                                                                                                                                                                                                                                                                       
  real<lower = 0> sigma[2];   // measurement errors                                                                                                                                                                                                                                                                                                                                         
}
transformed parameters {
  vector[2] z[N] = ode_rk45_tol(dz_dt, z_init, 0, ts,
                                1e-5, 1e-3, 500,
                                alpha, beta, gamma, delta);
}
model {
  {
    profile("priors");
    alpha ~ normal(1, 0.5);
    gamma ~ normal(1, 0.5);
    beta ~ normal(0.05, 0.05);
    delta ~ normal(0.05, 0.05);
    sigma ~ lognormal(-1, 1);
    z_init ~ lognormal(log(10), 1);
  }

  {
    profile("likelihood");
    for (k in 1:2) {
      y_init[k] ~ lognormal(log(z_init[k]), sigma[k]);
      y[ , k] ~ lognormal(log(z[, k]), sigma[k]);
    }
  }
}
generated quantities {
  vector[2] zb[N];
  {
    profile("adams");
    zb = ode_adams_tol(dz_dt, z_init, 0, ts,
                     1e-5, 1e-3, 500,
                     alpha, beta, gamma, delta);
  }

  {
    profile("bdf");
    zb = ode_bdf_tol(dz_dt, z_init, 0, ts,
                     1e-5, 1e-3, 500,
                     alpha, beta, gamma, delta);
  }

  {
    profile("rk45");
    zb = ode_rk45_tol(dz_dt, z_init, 0, ts,
                     1e-5, 1e-3, 500,
                     alpha, beta, gamma, delta);
  }
}

Anyway the idea is to see which solver is faster on the problem.

The output is:

name, total, fwd, rev
adams,0.117718,0.117718,0
bdf,0.115601,0.115601,0
likelihood,0.72739,0.675845,0.051545
priors,0.220735,0.207088,0.0136471
rk45,0.0271892,0.0271892,0

which definitely gives me the info I want, but all those numbers kinda come from different places and so maybe they should be arranged a bit more.

rok-cesnovar commented 3 years ago

Like transformed data gets run once with doubles, generated quantities gets run for every draw with doubles, transformed parameters is messy..

Lets discuss this in the design doc. To me, the cumulative time is the easiest to process. For example, the per-run time could be the biggest for the transformed data but we know in the grand scheme of things its most likely negligible.

nested profiles and labeling the call-site (tdata, tparam, model)

Hm, these both sound like good-to-have yes. But, both of these can be achieved by the user labeling the profiles. So you could label GQ profiles as profile("GQ - rk45"). By labeling them automatically we remove some of the freedom the user would have. For example one could want to profile both the tparam and GQ rk45 calls together. And we would prevent that with auto-grouping.

Same for

model,93.6092,70.1797,23.4295
   - likelihood,31.4064,28.2638,3.14264
   - random effects,61.1205,40.9512,20.1693

Just label the likelihood as either " - likelihood" or "model - likelihood" and if the profiles would be shown in the order they appear, both would be shown as you want them to.

We definitely need to order the profiles based on how they appear in the model. They are currently shown alphabetically, but that was not the intent. That is the order they show if I just iterate over the map.

bob-carpenter commented 3 years ago

Can we have the same tag in multiple blocks? Often the pieces I do are a combination of things going on in generated quantities, the model block and transformed parameters, so I'm not sure it makes sense to summarize by block.

rok-cesnovar commented 3 years ago

Yes, the current implementation would accumulate profiles with the same name.

rok-cesnovar commented 3 years ago

The only thing that is currently bothering me is how could we profile the parameters transforms (lower, upper, ...).

I dont have a clean solution for that at the moment so I am leaning towards not supporting that at this time.

nhuurre commented 3 years ago

Well, you could just allow a profile statement in the parameters block--but I don't think transforms are ever going to be a performance bottleneck.

bbbales2 commented 3 years ago

This is the output now:

name,total_time,fwd_time,rev_time,chain_stack_total,nochain_stack_total,fwd_passes,AD_fwd_passes,AD_rev_passes
centering,5.335e-06,5.335e-06,0,0,0,1,0,0
fixed effects,1.72901,1.35503,0.373981,236822,118411000,1,118411,118411
likelihood,2.0658,1.9612,0.104593,118411,0,1,118411,118411
prior,0.205329,0.180555,0.0247746,1894576,0,1,118411,118411
random effects,5.15733,3.48409,1.67324,414438500,0,1,118411,118411

I tried to format this like the current Stan summary output:

Profile information for Stan model: mrp_prof_model
1 chain

                  Time  Forward_Mode_Time  Reverse_Mode_Time  ChainStack  NoChainStack  NoADPasses  AdPasses

centering      5.3e-06            5.3e-06                  0           0             0           1         0
fixed effects      1.7                1.4               0.37       2.4e5         1.2e8           1     1.2e5
likelihood         2.1                2.0               0.10       1.2e5             0           1     1.2e5
prior             0.21               0.18              0.025       1.9e6             0           1     1.2e5
random effects     5.2                3.5                1.7       4.1e8             0           1     1.2e5

Notes:

I got rid of the FwdAdPass and RevADPass columns. Seems like they were duplicate. For every rev pass we do a forward pass
It looks like the AdPasses column, while being interesting, is very boring
I am a bit sketch about printing the timing information to stdout along with everything. That is going to look bad with multiple chains regardless of how we swing it. In that regard, I think profile_file + some sort of stansummary program is in order and no stdout printing

The header from the regular Stansummary looks better:

File Edit Options Buffers Tools Help                                                                                                                                                                               
Inference for Stan model: mrp_prof_model
2 chains: each with iter=(1000,1000); warmup=(0,0); thin=(1,1); 2000 iterations saved.

Warmup took (4.7, 4.4) seconds, 9.0 seconds total
Sampling took (4.9, 5.4) seconds, 10 seconds total

It would be annoying to save a lot of that meta information in comments in our beautiful new CSV format though.

There are timing numbers here, and there are memory numbers here, and there are ad pass numbers here. Maybe we should separate out this information:

Profile information for Stan model: mrp_prof_model, 1 chain(s)

Timing information:

                  Time  Forward_Mode_Time  Reverse_Mode_Time

centering      5.3e-06            5.3e-06                  0
fixed effects      1.7                1.4               0.37
likelihood         2.1                2.0               0.10
prior             0.21               0.18              0.025
random effects     5.2                3.5                1.7

Memory information:

               ChainStack  NoChainStack

centering               0             0
fixed effects       2.4e5         1.2e8
likelihood          1.2e5             0
prior               1.9e6             0
random effects      4.1e8             0

Autodiff information:

               NoADPasses  AdPasses

centering               1         0
fixed effects           1     1.2e5
likelihood              1     1.2e5
prior                   1     1.2e5
random effects          1     1.2e5

Do we offer a way to summarize the outputs over chains? My instinct is to just offer a sum. I don't know how to express variance in 4 numbers.

bbbales2 commented 3 years ago

Yeah on thinking more, I think just reporting the total is the easiest thing to think about. That way it's the total time/memory/autodiff calls within chains and also the totals across chains.

The autodiff information section -- there doesn't appear to be much information in here cause the profile statements are not in loops or anything. If you time something in a loop you'll get some variation.

I wonder where the likelihood is evaluated without AD?

Also I think write a separate stansummary for this stuff. I can help tomorrow. I think it should just have the sig_figs, csv_filename, and help options. I don't think we want to print this stuff by default.

rok-cesnovar commented 3 years ago

This is the output now:

name,total_time,fwd_time,rev_time,chain_stack_total,nochain_stack_total,fwd_passes,AD_fwd_passes,AD_rev_passes
centering,5.335e-06,5.335e-06,0,0,0,1,0,0
fixed effects,1.72901,1.35503,0.373981,236822,118411000,1,118411,118411
likelihood,2.0658,1.9612,0.104593,118411,0,1,118411,118411
prior,0.205329,0.180555,0.0247746,1894576,0,1,118411,118411
random effects,5.15733,3.48409,1.67324,414438500,0,1,118411,118411
I tried to format this like the current Stan summary output:

Profile information for Stan model: mrp_prof_model
1 chain

                  Time  Forward_Mode_Time  Reverse_Mode_Time  ChainStack  NoChainStack  NoADPasses  AdPasses

centering      5.3e-06            5.3e-06                  0           0             0           1         0
fixed effects      1.7                1.4               0.37       2.4e5         1.2e8           1     1.2e5
likelihood         2.1                2.0               0.10       1.2e5             0           1     1.2e5
prior             0.21               0.18              0.025       1.9e6             0           1     1.2e5
random effects     5.2                3.5                1.7       4.1e8             0           1     1.2e5
Notes:

For every rev pass we do a forward pass

Yes, except if an exception occurs, then there is no reverse pass.

It looks like the AdPasses column, while being interesting, is very boring

Yes, but if we want to check how much time we spend per gradient eval or how much of the chain stack is used per gradient we need this info.

I am a bit sketch about printing the timing information to stdout along with everything. That is going to look bad with multiple chains regardless of how we swing it. In that regard, I think profile_file + some sort of stansummary program is in order and no stdout printing

Agree. What should the default profile_file be? profile_output.csv?

It would be annoying to save a lot of that meta information in comments in our beautiful new CSV format though.

We dont want to add that to the comments in my opinion.

There are timing numbers here, and there are memory numbers here, and there are ad pass numbers here. Maybe we should separate out this information:

I think we should yes. I like the timing/memory/AD splits.

Do we offer a way to summarize the outputs over chains? My instinct is to just offer a sum. I don't know how to express variance in 4 numbers.

I like the bin/profilesummary idea. I would maybe prefer if this would be handled by stansummary but we would then have to add comments to the CSV which I would prefer not to. So a separate one is fine I guess.

bbbales2 commented 3 years ago

If you indent code blocks you gotta indent every line I think

bbbales2 commented 3 years ago

What should the default profile_file be?

My gut reaction is none, but it would probably be confusing to add profile information and for that to appear nowhere. (Edit: like, to get profiling output you have to add profiling commands and tell the interface something -- pretty annoying)

Let's go with profile.csv and see how annoying that is in the release candidate.

Yes, except if an exception occurs, then there is no reverse pass.

I still think ditch this column. We can keep it for the RC and check how people feel about it.

We dont want to add that to the comments in my opinion.

Sure, no meta information is fine with me.

I like the timing/memory/AD splits.

Do

I would maybe prefer if this would be handled by stansummary

It is annoying to add a program, but it would be at least as annoying to try to detect csv-types or add another flag to stansummary.

rok-cesnovar commented 3 years ago

I still think ditch this column.

Ok. Its always easier to add things then to remove them. Maybe I am just overthinking how much ChainStack per gradient-eval is useful. If we have doubts we want it, its best to not have add it.

bbbales2 commented 3 years ago

per gradient-eval

We can count reverse passes instead so this is the exact number and it easy to explain (no need to say "the number of reverse pass chain calls except for exceptions").

If we're ever in a situation where forward pass ADs differ much from reverse pass things have gone off the rails in other ways, so just one of the FwdAdCalls and RevAdCalls is necessary

rok-cesnovar commented 3 years ago

Stanc3 PR is in, Cmdstan PR is ready to be merged now I believe and the docs PR that adds basic information is also ready. Expanded docs will be added during the freeze.

stan-dev / cmdstan

2.26 timing/profiling meta issue #960

Summary:

Current Version: