stan-dev / docs

Documentation for the Stan language and CmdStan
https://mc-stan.org/docs/

FAQ from wiki for User's guide #188

Open bob-carpenter opened 4 years ago

bob-carpenter commented 4 years ago

Stan Frequently Answered Questions (FAQ)

This is more of a collection of answers than questions, hence the name. It might be better to reformulate it eventually so users can index by question rather than by answer.

andrewgelman commented 4 years ago

Hi--I have a few comments on this; see below. Andrew

On May 23, 2020, at 7:08 PM, Bob Carpenter notifications@github.com wrote:

Do not use interval-bounded priors unless physically necessary. Although interval bounds must be used for parameters representing probabilities, they tend to provide poor priors both statistically and computationally. If values beyond the bounds are consistent with the other parameters and data, probability mass will bunch up at the boundaries, biasing estimates and pushing the sampler or optimizer out to plus or minus infinity on the unconstrained scale.

I'm not quite sure how to say this, but I'm concerned that the above paragraph could be confusing.
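To make the contrast concrete, here's a minimal sketch (the parameter names are hypothetical): a hard interval bound where the parameter space requires it, versus a soft prior on an unbounded parameter where the bound would only encode prior belief.

```stan
parameters {
  // Hard bound required by the parameter space: a probability.
  real<lower=0, upper=1> theta;

  // Avoid bounds that only encode prior knowledge, e.g.
  //   real<lower=-10, upper=10> beta;
  // Leave the parameter unbounded and use a soft prior instead.
  real beta;
}
model {
  beta ~ normal(0, 5);  // soft prior; no mass piles up at hard edges
}
```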

Do not use overly diffuse priors. In particular, priors like gamma(1e-4, 1e-4) or inv_gamma(1e-4, 1e-4) lead to the opposite problem from interval-bounded priors: they tend to pull estimates away from zero toward the tail. In general, make sure typical values drawn from the priors are reasonable for the parameters.

Again, I'm not quite sure how to revise this, but I think this is misleading. inv_gamma(1e-4, 1e-4) is not a diffuse prior at all! It's very concentrated, as can be seen by plotting the density function.
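For illustration, a sketch of one alternative (the parameter name is hypothetical): replace the gamma(1e-4, 1e-4) prior on a scale-like parameter with a weakly informative prior whose typical draws are reasonable.

```stan
parameters {
  real<lower=0> tau;
}
model {
  // tau ~ gamma(1e-4, 1e-4);  // concentrates mass near zero and far out in the tail

  // Weakly informative alternative for a scale expected to be of
  // order one; with the lower=0 constraint this is a half-normal.
  tau ~ normal(0, 1);
}
```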

Parameterization shape matters. There are multiple ways to parameterize the same density; some are more computationally efficient and numerically stable than others, and some produce such challenging posterior geometries that they induce biased samples. The ideal posterior is unit scaled (centered at zero with scale one) and has no correlation among the parameters. The usual sign that a parameterization needs to be fixed is divergent transitions. The usual fix is to standardize the scale and break dependencies among parameters in the prior.

We could cite this paper for more on the topic: http://www.stat.columbia.edu/~gelman/research/published/parameterization.pdf
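The standard example here is the non-centered parameterization of a hierarchical model, sketched below (data and parameter names are hypothetical):

```stan
data {
  int<lower=0> J;
  vector[J] y;
  vector<lower=0>[J] sigma;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] alpha_std;  // standardized group-level effects
}
transformed parameters {
  // Non-centered: alpha_std is a priori standard normal and
  // uncorrelated with mu and tau, flattening the funnel geometry.
  vector[J] alpha = mu + tau * alpha_std;
}
model {
  alpha_std ~ std_normal();  // instead of alpha ~ normal(mu, tau)
  y ~ normal(alpha, sigma);
}
```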

Sampling and optimization happen on the unconstrained scale. All parameters are transformed to the unconstrained scale for sampling with an appropriate Jacobian adjustment (which is turned off for optimization). For example, positive-constrained parameters are log transformed, interval-constrained parameters are log-odds (logit) transformed, simplex parameters are transformed with a stick-breaking construction, and covariance matrices are transformed with a Cholesky factorization.
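For intuition, here is a sketch of what the positive constraint does under the hood; declaring real<lower=0> sigma; is equivalent to writing the transform and Jacobian adjustment by hand:

```stan
parameters {
  real log_sigma;  // unconstrained scale that Stan actually samples
}
model {
  real sigma = exp(log_sigma);
  target += log_sigma;      // log Jacobian of the exp transform
  sigma ~ lognormal(0, 1);  // prior stated on the constrained scale
}
```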

There shouldn't be any divergent transitions. Divergences happen in Hamiltonian Monte Carlo when numerical instability is detected in the algorithm used to simulate the Hamiltonian dynamics of a fictional particle representing the parameters, specifically when the simulated Hamiltonian diverges to a value far away from the original Hamiltonian (it would be exactly conserved in a perfect simulation). If divergent transitions arise, first try increasing the target acceptance rate to 0.9, 0.95, or even 0.99 or higher (adapt_delta in most interfaces) and also reduce the initial step size to 0.1 or 0.01 (stepsize in most interfaces). This will cause slower but surer adaptation and sampling. If that does not solve the problem, reparameterization will be required to deal with the extreme curvature. A parameterization with less variation in posterior curvature (the Hessian matrix of second derivatives of the log density with respect to the parameters) will be easier to sample. Plots in the R interface can highlight where divergences occur to provide clues for reparameterization.

"Shouldn't be any" is too strong. I was fitting a model recently that had a couple divergent transitions in the warmup; I think it was something to do with the starting values. Maybe rephrase to "There shouldn't be many divergent transitions"? Or, "There shoudln't be many divergent transitions after warmup is done."

Push calculations to the transformed data or generated quantities blocks if possible. These are evaluated once and once per iteration, respectively, and do not require derivatives.

Clarify: "The transformed data block is evaluated just once at the beginning of the Stan call. The generated quantity block is evaluated once per iteration. In contrast, the transformed parameters and model blocks are evaluated at every step. There can be many steps in each iteration." Use specialized functions wherever possible. Do not create a diagonal matrix, then multiply---use the specialized multiplication (diag_pre_multiply, diag_post_multiply, and quad_form_diag.

Add "For example" to the beginning of the second sentence above.

Keep Stan code in standalone files. This makes it easier to index line numbers in parser or runtime errors and easier to share programs among different clients and the users' group.

An exception is Google Colab code, where it can be helpful to have the Stan code editable within R or Python blocks.

To debug, build a program up in steps from simpler, tested parts, and validate on simulated data. Do not start by implementing a long program and then feeding it real data. Instead, begin with simple programs tested on simulated data (for posterior interval coverage, not for point estimate error), then refine the program and the data simulator together until they are ready for real data.

We could point here, for example: https://statmodeling.stat.columbia.edu/2019/03/23/yes-i-really-really-really-like-fake-data-simulation-and-i-cant-stop-talking-about-it/
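A sketch of what a fake-data simulator can look like as a standalone Stan program (a hypothetical linear regression), run with the fixed-parameter sampler so that generated quantities does all the work:

```stan
// Draw parameters from the prior and data from the likelihood; then
// fit the real model to y and check interval coverage of the draws.
data {
  int<lower=1> N;
  vector[N] x;
}
generated quantities {
  real alpha = normal_rng(0, 1);
  real beta = normal_rng(0, 1);
  real<lower=0> sigma = abs(normal_rng(0, 1));  // half-normal draw
  vector[N] y;
  for (n in 1:N)
    y[n] = normal_rng(alpha + beta * x[n], sigma);
}
```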

bob-carpenter commented 4 years ago

Thanks for the feedback on this.

What I'd like to do is include this and the recommended priors wiki as an appendix to the user's guide. This would be part of it, but it sounds like we really need to better explain some of these issues.

The problem with ignoring divergences is that they occur in regions of high curvature and signal that the algorithm isn't exploring the region it should be exploring. Thus they can lead to biased estimates. To diagnose whether ignoring divergences is acceptable for the estimation task, you'd need to compare against a better answer, say with a much lower step size. We can do that with simulated data and see what happens, but it's tricky with real data for which the model is poorly specified.