stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License
2.6k stars 370 forks source link

Feature/3299 diagnostics chainset #3310

Closed mitzimorris closed 1 month ago

mitzimorris commented 1 month ago

Submission Checklist

Summary

This PR uses the improved R-hat diagnostics (https://github.com/stan-dev/stan/pull/3266) and extends rank-normalization to the computation of ESS-bulk and ESS-tail. These diagnostics are computed on samples from one or more chains, which entails parsing the sample out of a stan-csv file and assembling it into an Eigen MatrixXd object.

The work done for this PR is:

Intended Effect

The consumer of this functionality is CmdStan's stansummary utility, which, given a list of Stan CSV filenames, assembles them into a chainset object and then calls functions on the chainset object to get summary stats and diagnostics.

How to Verify

unit tests in this repo; unit tests in CmdStan repo test end-to-end parsing and reporting

Side Effects

N/A

Documentation

Documentation for CmdStan bin/stansummary to be done in docs repo, and CmdStanPy docs (CmdStanPy's summary function wraps CmdStan's stansummary).

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

mitzimorris commented 1 month ago

Created a new PR because accidentally included some version of Stan math_lib in https://github.com/stan-dev/stan/pull/3305 and don't have the GitHub chops to undo its fubarness.

stan-buildbot commented 1 month ago

Name Old Result New Result Ratio Performance change( 1 - new / old )
arma/arma.stan 0.37 0.41 0.9 -10.81% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.08 7.16% faster
gp_regr/gen_gp_data.stan 0.03 0.03 1.15 12.77% faster
gp_regr/gp_regr.stan 0.13 0.1 1.32 24.28% faster
sir/sir.stan 71.91 70.15 1.03 2.44% faster
irt_2pl/irt_2pl.stan 4.38 4.07 1.08 7.14% faster
eight_schools/eight_schools.stan 0.06 0.05 1.17 14.52% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.28 0.24 1.16 13.79% faster
pkpd/one_comp_mm_elim_abs.stan 20.33 18.96 1.07 6.75% faster
garch/garch.stan 0.46 0.41 1.14 11.94% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.81 2.69 1.04 4.29% faster
arK/arK.stan 1.88 1.76 1.06 6.06% faster
gp_pois_regr/gp_pois_regr.stan 2.93 2.82 1.04 3.66% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 9.4 8.63 1.09 8.15% faster
performance.compilation 184.16 197.69 0.93 -7.34% slower

Mean result: 1.0835976351849117


Jenkins Console Log Blue Ocean Commit hash: 2c8bf6d3d4fe497e1d8ff310ce5e5f26b1925faa


Machine information No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 80 On-line CPU(s) list: 0-79 Thread(s) per core: 2 Core(s) per socket: 20 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz Stepping: 4 CPU MHz: 2400.000 CPU max MHz: 3700.0000 CPU min MHz: 1000.0000 BogoMIPS: 4800.00 Virtualization: VT-x L1d cache: 1.3 MiB L1i cache: 1.3 MiB L2 cache: 40 MiB L3 cache: 55 MiB NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78 NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79 Vulnerability Gather data sampling: Mitigation; Microcode Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Meltdown: Mitigation; PTI Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Mitigation; IBRS Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities G++: g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0 Copyright (C) 2019 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Clang: clang version 10.0.0-4ubuntu1 Target: x86_64-pc-linux-gnu Thread model: posix InstalledDir: /usr/bin
mitzimorris commented 1 month ago

CmdStan tests failed - correctly! offending test is here: https://github.com/stan-dev/cmdstan/blob/33e27a4f7b72c7df558d982a33bfe33ce0b14211/src/test/interface/variational_output_test.cpp#L53

ADVI output format is 1st line is the ADVI estimate, followed by 1000 samples. summary of samples should only look at the sample, not the ADVI mean.

WardBrian commented 1 month ago

Created a new PR because accidentally included some version of Stan math_lib in #3305 and don't have the GitHub chops to undo its fubarness.

Seems like a good move. Only downside is we lost the review history, which makes it difficult to track what has changed

mitzimorris commented 1 month ago

agreed - will continue to add in changes suggested/requested in #3305

WardBrian commented 1 month ago

Sounds good! Could you ping me when you think all of the 'old' review is resolved and I can take a look at the branch then?

mitzimorris commented 1 month ago

@WardBrian @SteveBronder I've made changes and have a branch of CmdStan that works with updated code. can this go in today?

WardBrian commented 1 month ago

It's a bit tricky to approve because of the size. Overall:

It would be nice if every function in chainset looked like the split_rank_normalized_ess functions, where the actual computation lived somewhere else. That way it is easy to independently test, discuss, change, etc.

It also seems odd to me to not expose the older forms of ess etc in the chainset class, if they're similarly small functions. I understand the primary consumer may not care as it stands, but if we want to say "here's a better API for chains.hpp", including the stuff chains.hpp can already do seems nice

mitzimorris commented 1 month ago

I'm not sure why any changes were made to chains.hpp -- just deleting dead code?

https://github.com/stan-dev/stan/pull/3266 added call for split r-hat to chains.hpp

when I first started working on this PR, I thought that we could add rank-normalized ESS and MCSE to chains.hpp as well. I can revert chains.hpp to the state that it was in before #https://github.com/stan-dev/stan/pull/3266.

mitzimorris commented 1 month ago

I'm pretty nervous about changing both the code organization and the behavior at the same time (see above on quantiles, etc). It also means we can't test the new one against the old one

we know that the behavior will be different because the old version was implemented before the 2019 improved r-hat paper came along. for this PR, I ran the example output CSV files through CmdStanR, which uses posterior, which provides a correct implementation of the rank-normalized rhat. (Aki / Paul Beurkner / Johah ).

WardBrian commented 1 month ago

I'm not referring to the new ESS/R-Hat in that comment -- those are entirely new things being computed and reported to the user. But it does seem weird (and potentially impactful) if the 95% interval changes based on just updating stansummary and running on the same outputs as before

mitzimorris commented 1 month ago

this is what the quantile code from chains.hpp computes:

    using boost::accumulators::accumulator_set;
    using boost::accumulators::left;
    using boost::accumulators::quantile;
    using boost::accumulators::quantile_probability;
    using boost::accumulators::right;
    using boost::accumulators::stats;
    using boost::accumulators::tag::tail;
    using boost::accumulators::tag::tail_quantile;
    double M = x.rows();
    size_t cache_size = M;

    if (prob < 0.5) {
      accumulator_set<double, stats<tail_quantile<left> > > acc(
          tail<left>::cache_size = cache_size);
      for (int i = 0; i < M; i++)
        acc(x(i));
      return quantile(acc, quantile_probability = prob);
    }
    accumulator_set<double, stats<tail_quantile<right> > > acc(
        tail<right>::cache_size = cache_size);
    for (int i = 0; i < M; i++)
      acc(x(i));
    return quantile(acc, quantile_probability = prob);
  }

the new code is more efficient - but possibly different behavoir.

your call on what to do.

WardBrian commented 1 month ago

the new code is more efficient

Do you have timing on that? The latest commit seems to just call quantile in a loop, so I'd expect them to end up pretty much even

but possibly different behavoir.

Definitely different -- here's a super minimal CSV file:

foo
100
101
102
103
104
105
106
107
108
109
110

current stansummary:

      Mean  MCSE  StdDev  5%  50%  95%  N_Eff  N_Eff/s  R_hat

foo    105   1.8     3.3 100  105  110    3.4      inf    2.8

the cmdstan PR based on this branch:

      Mean  MCSE  StdDev  MAD   5%  50%  95%  N_Eff_bulk  N_Eff_tail  R_hat

foo    105   1.5     3.3  4.4  100  105  109         5.0         5.0    2.1
mitzimorris commented 1 month ago

this is not surprising - the super-minimal CSV file has 10 rows. we're advertising different computation. we do this all the time.

mitzimorris commented 1 month ago

also, trying to run minimal CSV file "foo.csv" through bin/stansummary from 2.35.0 release bombs. how did you run this test?

mitzimorris commented 1 month ago

here's another test: current release vs new branch - quantiles are the same.

 ~/.cmdstan/cmdstan-2.35.0> bin/stansummary src/test/interface/example_output/bernoulli_chain_1.csv 
Inference for Stan model: bernoulli_model
1 chains: each with iter=(1000); warmup=(0); thin=(1); 1000 iterations saved.

Warmup took 0.0090 seconds
Sampling took 0.023 seconds

                Mean     MCSE   StdDev     5%   50%   95%  N_Eff  N_Eff/s  R_hat

lp__            -7.3  3.7e-02     0.77   -9.1  -7.0  -6.8    443    19275    1.0
accept_stat__   0.90  4.6e-03  1.5e-01   0.57  0.96   1.0   1026    44597   1.00
stepsize__       1.0      nan  4.9e-15    1.0   1.0   1.0    nan      nan    nan
treedepth__      1.4  1.6e-02  4.9e-01    1.0   1.0   2.0    895    38919   1.00
n_leapfrog__     2.3  3.4e-02  9.6e-01    1.0   3.0   3.0    799    34758   1.00
divergent__     0.00      nan  0.0e+00   0.00  0.00  0.00    nan      nan    nan
energy__         7.8  5.1e-02  1.0e+00    6.8   7.5   9.9    411    17865    1.0

theta           0.26  6.1e-03     0.12  0.079  0.25  0.47    384    16683   1.00

new version:

(base) ~/github/stan-dev/cmdstan (feature/1263-new-rhat-summary)> bin/stansummary src/test/interface/example_output/bernoulli_chain_1.csv 
Inference for Stan model: bernoulli_model
1 chains: each with iter=1000; warmup=1000; thin=1; 1000 iterations saved.

Warmup took 0.0090 seconds
Sampling took 0.023 seconds

                Mean    MCSE   StdDev    MAD     5%   50%   95%  N_Eff_bulk  N_Eff_tail  R_hat

lp__            -7.3   0.034  7.7e-01   0.30   -9.1  -7.0  -6.8         519         503    1.0
accept_stat__   0.90  0.0041  1.5e-01  0.064   0.57  0.96   1.0        1284         941   1.00
stepsize__       1.0     nan  2.4e-15   0.00    1.0   1.0   1.0         nan         nan    nan
treedepth__      1.4     nan  4.9e-01   0.00    1.0   1.0   2.0         nan         nan    nan
n_leapfrog__     2.3   0.034  9.6e-01   0.00    1.0   3.0   3.0         790         786   1.00
divergent__     0.00     nan  0.0e+00   0.00   0.00  0.00  0.00         nan         nan    nan
energy__         7.8   0.047  1.0e+00   0.74    6.8   7.5   9.9         489         486    1.0

theta           0.26  0.0063  1.2e-01   0.12  0.079  0.25  0.47         361         395    1.0

Samples were drawn using hmc with nuts.
For each parameter, N_Eff_bulk and N_Eff_tail measure the effective sample size 
for the entire sample and for the the .05 and .95 tails, respectively, 
and R_hat_bulk and R_hat_tail measure the potential scale reduction 
on split chains, (at convergence will be very close to 1).
WardBrian commented 1 month ago

I used the current develop branch

I’m not saying we’re obligated to report the same numbers from now until the end of days, but I just think if we do make a change we should have a reason why the new number is better/more correct, not just that we like the code that calculates it more

mitzimorris commented 1 month ago

I have checked the existing unit tests and the behavior on real output files. I think this is OK.

bob-carpenter commented 1 month ago

I’m not saying we’re obligated to report the same numbers from now until the end of days, but I just think if we do make a change we should have a reason why the new number is better/more correct, not just that we like the code that calculates it more

The quantiles shouldn't change. [edit: this is wrong---see below] But the R-hat and ESS will. That's one of the motivations for the PR---to switch over to our improved versions of these as described in this paper:

Vehtari, Gelman, Simpson, Carpenter, Bürkner. 2019. Rank-normalization, folding, and localization: An improved Rˆ for assessing convergence of MCMC https://arxiv.org/abs/1903.08008

I believe @avehtari has already switched over RStan and PyStan and we're just waiting for this PR to switch over CmdStan and hence CmdStanPy and CmdStanR. I believe arViz also does these new calculations.

We could keep our old definitions and report two versions of R-hat (old, old per second, new) and three versions of ESS (old, new bulk, new tail), but I think it'd be confusing for users.

bob-carpenter commented 1 month ago

As another comment, I don't think we should be reporting ESS for sampler parameters other than lp__.

avehtari commented 1 month ago

I believe @avehtari has already switched over RStan and PyStan and we're just waiting for this PR to switch over CmdStan and hence CmdStanPy and CmdStanR. I believe arViz also does these new calculations.

CmdStanR can call CmdStan diagnostics, but CmdStanR is also using posterior package, which has new Rhat and bulk- and tail-ESS, and it is more handy to use posterio package diagnostics than CmdStan diagnostics. ArviZ, which is used by Python and Julia users, has these new functions, too. RStan is not yet using posterior package.

We could keep our old definitions and report two versions of R-hat (old, old per second, new) and three versions of ESS (old, new bulk, new tail), but I think it'd be confusing for users.

The old ESS which (called ess_basic() in posterior package) is still needed for computing MCSE for mean estimate. I think the issue is that reporting everything by default would be too much, but CmdStan doesn't have a nice way for the users to define which diagnostic quantities they want to see, and thus we need to make some choices.

As another comment, I don't think we should be reporting ESS for sampler parameters other than lp__.

I agree. Same for Rhat.

WardBrian commented 1 month ago

The quantiles shouldn't change.

My point is, they do. I believe the test cases we are using are not specific enough to see that, but doing a bit of testing on the side:

Here's a CSV file: https://gist.github.com/WardBrian/41bbf1b57829efa32e6aa58c5c493f57

Current stansummary (with --sig_figs=4)

                     Mean       MCSE     StdDev         5%        50%        95%  N_Eff    N_Eff/s      R_hat
theta           2.593e-01  7.266e-03  1.251e-01  8.379e-02  2.429e-01  4.830e-01  296.7      59330      1.003

new stansummary (with --sig_figs=4)

                  Mean       MCSE  StdDev      MAD       5%     50%     95%  N_Eff_bulk  N_Eff_tail   R_hat
theta           0.2593  7.263e-03  0.1251   0.1343  0.08379  0.2425  0.4830       296.9       446.6   1.008

0.2429 != 0.2425

CmdStanPy uses --sig_figs=6 by default, so end users will see different numbers.

Again, we can decide we want to make this change, but it just happening incidentally doesn't sit right with me, hence me not merging as-is.

avehtari commented 1 month ago

N_Eff_bulk N_Eff_tail

It would be better to use terms ESS_bulk and ESS_tail, as 1) N has different meaning in the paper describing these quantities, and 2) N_eff makes people to misread it as "Number of EFFective samples" which is wrong, and the correct term is "Effective Sample Size", 3) CmdStanR using posterior is showing ESS_bulk and ESS_tail.

mitzimorris commented 1 month ago

The old ESS which (called ess_basic() in posterior package) is still needed for computing MCSE for mean estimate.

@avehtari - could you check the implementations of mcse and mcse_sd?

https://github.com/stan-dev/stan/blob/dbf6a9f80ae2cd38e64fb627080ead01bc81dad7/src/stan/mcmc/chainset.hpp#L424-L467

WardBrian commented 1 month ago

It would be better to use terms ESS_bulk and ESS_tail

Yes, there is an existing cmdstan issue for this https://github.com/stan-dev/cmdstan/issues/916, which @mitzimorris will address as part of the cmdstan pr that accompanies this

mitzimorris commented 1 month ago

@WardBrian regarding the implementation of quantiles, we need a better understanding how the boost accumulators library computes quantiles - it is returning the value of a the nth draw or interpolating? and which is appropriate?

I spent yesterday looking at the boost code and don't understand it well enough to make that call. changing to use C++ std::nth_element would make the code easier to understand and remove dependencies, but as you say, that's not a good reason to change behavoirs.

WardBrian commented 1 month ago

I suspect the difference may just be as simple as the existing code prioritizes counting from the left if you're requesting the 50% or higher quantile, but I agree the boost reference is pretty opaque in this area

WardBrian commented 1 month ago

There is also stan::math::quantile -- calling this would probably be the ideal thing to do from a code standpoint, but it yields yet a third set of numbers...

bob-carpenter commented 1 month ago

The quantiles shouldn't change.

I spoke too soon. The problem with quantiles is that reasonable definitions can differ. When you don't have an exact item, a lot of algorithms are going to interpolate, and how you do that can differ. For example if you only have two numbers [1, 3] and ask for a 50% quantile (i.e., a median), what is the answer? You could return 1 or 3, which is what you get if you turn 50% and the size of 2 into an index. Or you could interpolate between 1 and 3 and return 2. The problem's the exact same thing at an edge, but it might not be 50%. What do you do if I ask you for a 40% quantile of [1, 3]? There you might be tempted to linearly interpolate and say 1.8. That interpolation can often be made more stable if you use more elements around an element, so if we have four elements [2, 3, 5, 10], we might return a different answer than [3, 5] for the 50% quantile. It's a mess!

The bottom line, though, is that in these situations, we really don't care. The sampling algorithm didn't have the granularity to cleanly resolve the quantile so if we use one of the neightboring values or interpolate, it should be fine for our use.

There is also stan::math::quantile -- calling this would probably be the ideal thing to do from a code standpoint, but it yields yet a third set of numbers.

See above. That should also be fine. Using a quantile function that works on a sorted list will be faster if we do a lot of quantiles. Otherwise, using an accumulator boost style will be faster. With an accumulator, if you are looking for the rank K element in a list, there's an O(N * log2(K)) algorithm (N for considering each element, log2(K) to insert it into position in the accumulator) that only consumes log2(K) memory. It can get even more clever than that, see, e.g., the variants available here:

https://www.boost.org/doc/libs/1_86_0/doc/html/accumulators/user_s_guide.html#accumulators.user_s_guide.the_statistical_accumulators_library

WardBrian commented 1 month ago

Calling stan::math::quantile seems like the best choice then. The code can also be made quite simple:

    Eigen::MatrixXd draws = samples(index);
    Eigen::Map<Eigen::VectorXd> map(draws.data(), draws.size());

    return stan::math::quantile(map, prob);

There is an overload for if prob is a single number or if it is a vector

mitzimorris commented 1 month ago

plugged most recent version of this branch into CmdStan and comparing CmdStanR's summary (via fit$print) against CmdStan's bin/stansummary on the test CSV files in cmdstan/src/test/interface/example_outputs and median and quantile values, as well as mean, sd, mad, ess, and rhat all match out to 9 digits.

mitzimorris commented 1 month ago

comments look great - thanks! per discussion with @SteveBronder, I think the way to proceed here is with 3 PRs:

  1. fix issues with stan_csv_reader
  2. refactor code added to stan::analyze::mcmc in https://github.com/stan-dev/stan/pull/3266 to add ESS_bulk, ESS_tail and MCSE calculations (to chains.hpp)
  3. add chainset.hpp to code base; a replacement for chains.hpp.
WardBrian commented 1 month ago

I agree that would have been a better (or at least probably faster) way to go originally, but this being as far along as it is it would also be fine by me to proceed with this one big PR. Whichever you feel like, at this point

mitzimorris commented 1 month ago

I think 3 PRs would be better - a lot of work, but the grown-up thing to do here. (especially since I've been advocating pragmatic programming lately).

avehtari commented 1 month ago

@avehtari - could you check the implementations of mcse and mcse_sd?

These should use ess_mean without rank normalization. The rank normalized versions are generic diagnostics that work also with infinite variance, but by definition of MCSE mcse_mean and mcse_sd should not use rank-normalization.

mitzimorris commented 1 month ago

closing this per above comment - https://github.com/stan-dev/stan/pull/3310#issuecomment-2389091218