Closed mitzimorris closed 1 month ago
Created a new PR because I accidentally included some version of the Stan math lib in https://github.com/stan-dev/stan/pull/3305 and don't have the GitHub chops to undo its fubarness.
Name | Old Result | New Result | Ratio | Performance change( 1 - new / old ) |
---|---|---|---|---|
arma/arma.stan | 0.37 | 0.41 | 0.9 | -10.81% slower |
low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.01 | 0.01 | 1.08 | 7.16% faster |
gp_regr/gen_gp_data.stan | 0.03 | 0.03 | 1.15 | 12.77% faster |
gp_regr/gp_regr.stan | 0.13 | 0.1 | 1.32 | 24.28% faster |
sir/sir.stan | 71.91 | 70.15 | 1.03 | 2.44% faster |
irt_2pl/irt_2pl.stan | 4.38 | 4.07 | 1.08 | 7.14% faster |
eight_schools/eight_schools.stan | 0.06 | 0.05 | 1.17 | 14.52% faster |
pkpd/sim_one_comp_mm_elim_abs.stan | 0.28 | 0.24 | 1.16 | 13.79% faster |
pkpd/one_comp_mm_elim_abs.stan | 20.33 | 18.96 | 1.07 | 6.75% faster |
garch/garch.stan | 0.46 | 0.41 | 1.14 | 11.94% faster |
low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.81 | 2.69 | 1.04 | 4.29% faster |
arK/arK.stan | 1.88 | 1.76 | 1.06 | 6.06% faster |
gp_pois_regr/gp_pois_regr.stan | 2.93 | 2.82 | 1.04 | 3.66% faster |
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.4 | 8.63 | 1.09 | 8.15% faster |
performance.compilation | 184.16 | 197.69 | 0.93 | -7.34% slower |
Mean result: 1.0835976351849117
Jenkins Console Log | Blue Ocean
Commit hash: 2c8bf6d3d4fe497e1d8ff310ce5e5f26b1925faa
CmdStan tests failed - correctly! The offending test is here: https://github.com/stan-dev/cmdstan/blob/33e27a4f7b72c7df558d982a33bfe33ce0b14211/src/test/interface/variational_output_test.cpp#L53
The ADVI output format is: the first line is the ADVI estimate, followed by 1000 samples. The summary should only look at the samples, not the ADVI mean.
Created a new PR because I accidentally included some version of the Stan math lib in #3305 and don't have the GitHub chops to undo its fubarness.
Seems like a good move. The only downside is that we lost the review history, which makes it difficult to track what has changed.
agreed - will continue to add in changes suggested/requested in #3305
Sounds good! Could you ping me when you think all of the 'old' review is resolved and I can take a look at the branch then?
@WardBrian @SteveBronder I've made changes and have a branch of CmdStan that works with updated code. can this go in today?
It's a bit tricky to approve because of the size. Overall:

- The changes to src/stan/io/stan_csv_reader.hpp look great; I would be happy to merge them if they were independent.
- I'm not sure why any changes were made to chains.hpp -- just deleting dead code?
- It would be nice if every function in chainset looked like the split_rank_normalized_ess functions, where the actual computation lived somewhere else. That way it is easy to independently test, discuss, change, etc.
- It also seems odd to me to not expose the older forms of ess etc. in the chainset class, if they're similarly small functions. I understand the primary consumer may not care as it stands, but if we want to say "here's a better API for chains.hpp", including the stuff chains.hpp can already do seems nice.
I'm not sure why any changes were made to chains.hpp -- just deleting dead code?
https://github.com/stan-dev/stan/pull/3266 added the call for split R-hat to chains.hpp.
When I first started working on this PR, I thought that we could add rank-normalized ESS and MCSE to chains.hpp as well. I can revert chains.hpp to the state it was in before https://github.com/stan-dev/stan/pull/3266.
I'm pretty nervous about changing both the code organization and the behavior at the same time (see above on quantiles, etc.). It also means we can't test the new one against the old one.
We know that the behavior will be different because the old version was implemented before the 2019 improved R-hat paper came along. For this PR, I ran the example output CSV files through CmdStanR, which uses posterior, which provides a correct implementation of the rank-normalized R-hat (Aki / Paul Bürkner / Jonah).
I'm not referring to the new ESS/R-Hat in that comment -- those are entirely new things being computed and reported to the user. But it does seem weird (and potentially impactful) if the 95% interval changes based on just updating stansummary and running on the same outputs as before
This is what the quantile code from chains.hpp computes:
```cpp
// From chains.hpp (the enclosing signature is added here for context;
// the snippet as posted started inside the function body):
double quantile(const Eigen::VectorXd& x, const double prob) {
  using boost::accumulators::accumulator_set;
  using boost::accumulators::left;
  using boost::accumulators::quantile;
  using boost::accumulators::quantile_probability;
  using boost::accumulators::right;
  using boost::accumulators::stats;
  using boost::accumulators::tag::tail;
  using boost::accumulators::tag::tail_quantile;
  const int M = x.rows();
  size_t cache_size = M;
  if (prob < 0.5) {
    accumulator_set<double, stats<tail_quantile<left>>> acc(
        tail<left>::cache_size = cache_size);
    for (int i = 0; i < M; i++)
      acc(x(i));
    return quantile(acc, quantile_probability = prob);
  }
  accumulator_set<double, stats<tail_quantile<right>>> acc(
      tail<right>::cache_size = cache_size);
  for (int i = 0; i < M; i++)
    acc(x(i));
  return quantile(acc, quantile_probability = prob);
}
```
The new code is more efficient - but possibly different behavior.
Your call on what to do.
the new code is more efficient
Do you have timings on that? The latest commit seems to just call quantile in a loop, so I'd expect them to end up pretty much even.
but possibly different behavior.
Definitely different -- here's a super minimal CSV file:
```
foo
100
101
102
103
104
105
106
107
108
109
110
```
current stansummary:
```
     Mean  MCSE  StdDev  5%   50%  95%  N_Eff  N_Eff/s  R_hat
foo  105   1.8   3.3     100  105  110  3.4    inf      2.8
```
the cmdstan PR based on this branch:
```
     Mean  MCSE  StdDev  MAD  5%   50%  95%  N_Eff_bulk  N_Eff_tail  R_hat
foo  105   1.5   3.3     4.4  100  105  109  5.0         5.0         2.1
```
This is not surprising - the super-minimal CSV file has only 11 rows, and we're advertising a different computation. We do this all the time.
Also, trying to run the minimal CSV file "foo.csv" through bin/stansummary from the 2.35.0 release bombs. How did you run this test?
Here's another test, current release vs. new branch - the quantiles are the same.
```
~/.cmdstan/cmdstan-2.35.0> bin/stansummary src/test/interface/example_output/bernoulli_chain_1.csv
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.
Warmup took 0.0090 seconds
Sampling took 0.023 seconds
                Mean     MCSE   StdDev     5%    50%   95%  N_Eff  N_Eff/s  R_hat
lp__            -7.3  3.7e-02     0.77   -9.1   -7.0  -6.8    443    19275    1.0
accept_stat__   0.90  4.6e-03  1.5e-01   0.57   0.96   1.0   1026    44597   1.00
stepsize__       1.0      nan  4.9e-15    1.0    1.0   1.0    nan      nan    nan
treedepth__      1.4  1.6e-02  4.9e-01    1.0    1.0   2.0    895    38919   1.00
n_leapfrog__     2.3  3.4e-02  9.6e-01    1.0    3.0   3.0    799    34758   1.00
divergent__     0.00      nan  0.0e+00   0.00   0.00  0.00    nan      nan    nan
energy__         7.8  5.1e-02  1.0e+00    6.8    7.5   9.9    411    17865    1.0
theta           0.26  6.1e-03     0.12  0.079   0.25  0.47    384    16683   1.00
```
new version:
```
(base) ~/github/stan-dev/cmdstan (feature/1263-new-rhat-summary)> bin/stansummary src/test/interface/example_output/bernoulli_chain_1.csv
Inference for Stan model: bernoulli_model
1 chains: each with iter=1000; warmup=1000; thin=1; 1000 iterations saved.
Warmup took 0.0090 seconds
Sampling took 0.023 seconds
                Mean    MCSE   StdDev    MAD     5%    50%   95%  N_Eff_bulk  N_Eff_tail  R_hat
lp__            -7.3   0.034  7.7e-01   0.30   -9.1   -7.0  -6.8         519         503    1.0
accept_stat__   0.90  0.0041  1.5e-01  0.064   0.57   0.96   1.0        1284         941   1.00
stepsize__       1.0     nan  2.4e-15   0.00    1.0    1.0   1.0         nan         nan    nan
treedepth__      1.4     nan  4.9e-01   0.00    1.0    1.0   2.0         nan         nan    nan
n_leapfrog__     2.3   0.034  9.6e-01   0.00    1.0    3.0   3.0         790         786   1.00
divergent__     0.00     nan  0.0e+00   0.00   0.00   0.00  0.00         nan         nan    nan
energy__         7.8   0.047  1.0e+00   0.74    6.8    7.5   9.9         489         486    1.0
theta           0.26  0.0063  1.2e-01   0.12  0.079   0.25  0.47         361         395    1.0
Samples were drawn using hmc with nuts.
For each parameter, N_Eff_bulk and N_Eff_tail measure the effective sample size
for the entire sample and for the the .05 and .95 tails, respectively,
and R_hat_bulk and R_hat_tail measure the potential scale reduction
on split chains, (at convergence will be very close to 1).
```
I used the current develop branch
I’m not saying we’re obligated to report the same numbers from now until the end of days, but I just think if we do make a change we should have a reason why the new number is better/more correct, not just that we like the code that calculates it more
I have checked the existing unit tests and the behavior on real output files. I think this is OK.
I’m not saying we’re obligated to report the same numbers from now until the end of days, but I just think if we do make a change we should have a reason why the new number is better/more correct, not just that we like the code that calculates it more
The quantiles shouldn't change. [edit: this is wrong---see below] But the R-hat and ESS will. That's one of the motivations for the PR---to switch over to our improved versions of these as described in this paper:
Vehtari, Gelman, Simpson, Carpenter, Bürkner. 2019. Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. https://arxiv.org/abs/1903.08008
I believe @avehtari has already switched over RStan and PyStan and we're just waiting for this PR to switch over CmdStan and hence CmdStanPy and CmdStanR. I believe arViz also does these new calculations.
We could keep our old definitions and report two versions of R-hat (old, old per second, new) and three versions of ESS (old, new bulk, new tail), but I think it'd be confusing for users.
As another comment, I don't think we should be reporting ESS for sampler parameters other than lp__.
I believe @avehtari has already switched over RStan and PyStan and we're just waiting for this PR to switch over CmdStan and hence CmdStanPy and CmdStanR. I believe arViz also does these new calculations.
CmdStanR can call CmdStan diagnostics, but CmdStanR also uses the posterior package, which has the new Rhat and bulk- and tail-ESS, and it is more handy to use the posterior package diagnostics than the CmdStan diagnostics. ArviZ, which is used by Python and Julia users, has these new functions, too. RStan is not yet using the posterior package.
We could keep our old definitions and report two versions of R-hat (old, old per second, new) and three versions of ESS (old, new bulk, new tail), but I think it'd be confusing for users.
The old ESS (called ess_basic() in the posterior package) is still needed for computing the MCSE of the mean estimate. I think the issue is that reporting everything by default would be too much, but CmdStan doesn't have a nice way for users to define which diagnostic quantities they want to see, and thus we need to make some choices.
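As a concrete illustration of that dependency, here is a minimal sketch (not Stan's or posterior's implementation; the helper names and the assumption that the ESS value is computed elsewhere, e.g. by something like ess_basic, are mine) of how the MCSE of the mean follows from the basic ESS:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sample standard deviation with the usual n - 1 denominator.
double sample_sd(const std::vector<double>& x) {
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= x.size();
  double ss = 0.0;
  for (double v : x) ss += (v - mean) * (v - mean);
  return std::sqrt(ss / (x.size() - 1));
}

// MCSE of the posterior mean: sd / sqrt(ESS). The ESS passed in should be
// the basic (non-rank-normalized) estimate, per the discussion above.
double mcse_mean(const std::vector<double>& draws, double ess) {
  return sample_sd(draws) / std::sqrt(ess);
}
```

For independent draws the ESS equals the number of draws, and mcse_mean reduces to the familiar sd / sqrt(n).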
As another comment, I don't think we should be reporting ESS for sampler parameters other than lp__.
I agree. Same for Rhat.
The quantiles shouldn't change.
My point is, they do. I believe the test cases we are using are not specific enough to see that, but doing a bit of testing on the side, the two implementations disagree on the 50% quantile by the 4th decimal place. Here's a CSV file: https://gist.github.com/WardBrian/41bbf1b57829efa32e6aa58c5c493f57
Current stansummary (with --sig_figs=4)
```
       Mean       MCSE     StdDev        5%        50%       95%  N_Eff  N_Eff/s  R_hat
theta  2.593e-01  7.266e-03  1.251e-01  8.379e-02  2.429e-01  4.830e-01  296.7  59330  1.003
```
new stansummary (with --sig_figs=4)
```
       Mean    MCSE       StdDev  MAD     5%       50%     95%     N_Eff_bulk  N_Eff_tail  R_hat
theta  0.2593  7.263e-03  0.1251  0.1343  0.08379  0.2425  0.4830  296.9       446.6       1.008
```
0.2429 != 0.2425
CmdStanPy uses --sig_figs=6 by default, so end users will see different numbers.
Again, we can decide we want to make this change, but it just happening incidentally doesn't sit right with me, hence me not merging as-is.
N_Eff_bulk N_Eff_tail
It would be better to use the terms ESS_bulk and ESS_tail, because 1) N has a different meaning in the paper describing these quantities, 2) N_eff makes people misread it as "Number of EFFective samples", which is wrong (the correct term is "Effective Sample Size"), and 3) CmdStanR, via posterior, is showing ESS_bulk and ESS_tail.
The old ESS which (called ess_basic() in posterior package) is still needed for computing MCSE for mean estimate.
@avehtari - could you check the implementations of mcse and mcse_sd?
It would be better to use terms ESS_bulk and ESS_tail
Yes, there is an existing CmdStan issue for this: https://github.com/stan-dev/cmdstan/issues/916, which @mitzimorris will address as part of the CmdStan PR that accompanies this one.
@WardBrian regarding the implementation of quantiles, we need a better understanding of how the boost accumulators library computes quantiles - is it returning the value of the nth draw or interpolating? And which is appropriate?
I spent yesterday looking at the boost code and don't understand it well enough to make that call. Changing to use C++ std::nth_element would make the code easier to understand and remove dependencies, but as you say, that's not a good reason to change behaviors.
I suspect the difference may just be as simple as the existing code prioritizes counting from the left if you're requesting the 50% or higher quantile, but I agree the boost reference is pretty opaque in this area
There is also stan::math::quantile -- calling this would probably be the ideal thing to do from a code standpoint, but it yields yet a third set of numbers...
The quantiles shouldn't change.
I spoke too soon. The problem with quantiles is that reasonable definitions can differ. When you don't have an exact item, a lot of algorithms are going to interpolate, and how you do that can differ. For example, if you only have two numbers [1, 3] and ask for a 50% quantile (i.e., a median), what is the answer? You could return 1 or 3, which is what you get if you turn 50% and the size of 2 into an index. Or you could interpolate between 1 and 3 and return 2. The problem is exactly the same at any edge, not just 50%. What do you do if I ask you for a 40% quantile of [1, 3]? There you might be tempted to linearly interpolate and say 1.8. That interpolation can often be made more stable if you use more elements around an element, so if we have four elements [2, 3, 5, 10], we might return a different answer than [3, 5] for the 50% quantile. It's a mess!
The bottom line, though, is that in these situations, we really don't care. The sampling algorithm didn't have the granularity to cleanly resolve the quantile, so if we use one of the neighboring values or interpolate, it should be fine for our use.
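To make the [1, 3] example above concrete, here is a sketch of two common conventions (neither is claimed to be what boost or stan::math actually does; the function names are mine): a non-interpolating rule that returns a neighboring order statistic, and the linear-interpolation rule R calls type 7:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Non-interpolating convention: the order statistic at floor(p * (n - 1)).
// Other reasonable variants round up or to the nearest index instead.
double quantile_lower(std::vector<double> x, double p) {
  std::sort(x.begin(), x.end());
  std::size_t idx = static_cast<std::size_t>(std::floor(p * (x.size() - 1)));
  return x[idx];
}

// Linear interpolation between neighboring order statistics (R's type 7).
double quantile_interp(std::vector<double> x, double p) {
  std::sort(x.begin(), x.end());
  double h = p * (x.size() - 1);
  std::size_t lo = static_cast<std::size_t>(std::floor(h));
  std::size_t hi = std::min(lo + 1, x.size() - 1);
  return x[lo] + (h - lo) * (x[hi] - x[lo]);
}
```

For [1, 3], the first rule gives 1 for both the 40% and 50% quantiles, while the interpolating rule gives 1.8 and 2 - exactly the kind of small disagreement showing up in the stansummary comparisons.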
There is also stan::math::quantile -- calling this would probably be the ideal thing to do from a code standpoint, but it yields yet a third set of numbers.
See above. That should also be fine. Using a quantile function that works on a sorted list will be faster if we compute a lot of quantiles. Otherwise, using a boost-style accumulator will be faster. With an accumulator, if you are looking for the rank-K element in a list, there's an O(N * log2(K)) algorithm (N for considering each element, log2(K) to insert it into position in the accumulator) that only consumes O(K) memory. It can get even more clever than that; see, e.g., the variants available here:
Calling stan::math::quantile seems like the best choice then. The code can also be made quite simple:

```cpp
Eigen::MatrixXd draws = samples(index);
Eigen::Map<Eigen::VectorXd> map(draws.data(), draws.size());
return stan::math::quantile(map, prob);
```

There is an overload for when prob is a single number and one for when it is a vector of probabilities.
I plugged the most recent version of this branch into CmdStan and compared CmdStanR's summary (via fit$print) against CmdStan's bin/stansummary on the test CSV files in cmdstan/src/test/interface/example_outputs; the median and quantile values, as well as mean, sd, mad, ESS, and R-hat, all match out to 9 digits.
Comments look great - thanks! Per discussion with @SteveBronder, I think the way to proceed here is with 3 PRs:
- chainset.hpp to the code base, a replacement for chains.hpp
I agree that would have been a better (or at least probably faster) way to go originally, but with this being as far along as it is, it would also be fine by me to proceed with this one big PR. Whichever you feel like, at this point.
I think 3 PRs would be better - a lot of work, but the grown-up thing to do here (especially since I've been advocating pragmatic programming lately).
@avehtari - could you check the implementations of mcse and mcse_sd?
These should use ess_mean without rank normalization. The rank normalized versions are generic diagnostics that work also with infinite variance, but by definition of MCSE mcse_mean and mcse_sd should not use rank-normalization.
closing this per above comment - https://github.com/stan-dev/stan/pull/3310#issuecomment-2389091218
Submission Checklist
./runTests.py src/test/unit
make cpplint
Summary
This PR uses the improved R-hat diagnostics (https://github.com/stan-dev/stan/pull/3266) and extends rank-normalization to the computation of ESS-bulk and ESS-tail. These diagnostics are computed on samples from one or more chains, which entails parsing the sample out of a stan-csv file and assembling it into an Eigen MatrixXd object.
The work done for this PR is:

- changes to stan/analyze/mcmc so that both ESS and Rhat can use rank-normalization
- a new stan/mcmc/chainset.hpp, which stores a set of samples as a std::vector of Eigen MatrixXds
- changes to stan_csv_reader to fix bug https://github.com/stan-dev/stan/issues/3301, and to distinguish between ADVI and NUTS-HMC samples (because CmdStan PR https://github.com/stan-dev/cmdstan/pull/972 was done in order to get summaries from ADVI outputs)

Intended Effect
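The rank-normalization mentioned above starts by replacing the pooled draws with their fractional ranks (ties averaged), which are then mapped through the inverse normal CDF. Only the ranking step is sketched here; this is an illustration, not the PR's code, and the function name is mine:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Fractional (average) ranks, 1-based; tied values share the mean of the
// rank positions they occupy. Rank-normalized R-hat/ESS then apply the
// inverse normal CDF to these ranks, which is omitted in this sketch.
std::vector<double> fractional_ranks(const std::vector<double>& x) {
  const std::size_t n = x.size();
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&x](std::size_t a, std::size_t b) { return x[a] < x[b]; });
  std::vector<double> ranks(n);
  std::size_t i = 0;
  while (i < n) {
    // Find the run of values tied with x[order[i]].
    std::size_t j = i;
    while (j + 1 < n && x[order[j + 1]] == x[order[i]]) ++j;
    double avg_rank = (static_cast<double>(i) + j) / 2.0 + 1.0;
    for (std::size_t k = i; k <= j; ++k) ranks[order[k]] = avg_rank;
    i = j + 1;
  }
  return ranks;
}
```

Because the subsequent normal transform is monotone in the rank, diagnostics built on these ranks are insensitive to heavy tails, which is the point of the 2019 paper.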
The consumer of this functionality is CmdStan's stansummary utility, which, given a list of Stan CSV filenames, assembles them into a chainset object and then calls functions on the chainset object to get summary stats and diagnostics.

How to Verify
Unit tests in this repo; unit tests in the CmdStan repo test end-to-end parsing and reporting.
Side Effects
N/A
Documentation
Documentation for CmdStan bin/stansummary to be done in the docs repo, and in the CmdStanPy docs (CmdStanPy's summary function wraps CmdStan's stansummary).

Copyright and Licensing
Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University
By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: