stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Next Manual, 2.7.0 (revisited) #1508

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 9 years ago

This is the issue for the next manual. All manual-related requests and patches should go here. There will be a branch named after the issue as soon as someone creates one that can also be modified directly if you have push permission on stan-dev.

bob-carpenter commented 9 years ago

From @jonathan-g in #1507:

The 2.6.2 manual (Saturday 14th March, 2015) describes Phi on page 46 as $\int_{-\infty}^x \text{Normal}(y|0,1) dy$, but on page 312 as $\frac{1}{\sqrt{2\pi}} \int_0^x e^{-t^2/2} dt$. These are inconsistent in the lower limit of integration. I think the definition on page 312 should have $-\infty$ as the lower limit.
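With that fix, the two occurrences would agree: $\Phi(x) = \int_{-\infty}^x \text{Normal}(y|0,1) \, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2} \, dt$.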

randommm commented 9 years ago

In chapter 11. Latent Discrete Parameters:

The log_sum_exp should be over $\log \text{Uniform}(s \mid 1, T) + \sum_{t=1}^{T} \log \text{Poisson}(D_t \mid t < s \;?\; e : l)$ (a closing bracket is missing). So it would be: $\text{log\_sum\_exp}\big( \log \text{Uniform}(s \mid 1, T) + \sum_{t=1}^{T} \log \text{Poisson}(D_t \mid t < s \;?\; e : l) \big)$

"Although not computationally advantageous compared to working in expectation using lp, it is possible to use Stan to sample the value of s at every iteration."

This is misleading: we sample the discrete parameter precisely by "working in expectation using lp", so that is the computationally advantageous way of doing it.
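For context, here is a minimal sketch of the marginalization being discussed, using the notation from the quoted expression (priors on e and l are omitted in this sketch; the generated quantities block shows how s is sampled using the same lp):

data {
  int<lower=1> T;
  int<lower=0> D[T];
}
parameters {
  real<lower=0> e;
  real<lower=0> l;
}
transformed parameters {
  vector[T] lp;
  // lp[s] = log Uniform(s|1,T) + sum_{t=1}^T log Poisson(D[t] | t < s ? e : l)
  lp <- rep_vector(-log(T), T);
  for (s in 1:T)
    for (t in 1:T)
      lp[s] <- lp[s] + poisson_log(D[t], t < s ? e : l);
}
model {
  // priors on e and l omitted in this sketch
  increment_log_prob(log_sum_exp(lp));  // log_sum_exp over the entire bracketed sum
}
generated quantities {
  int<lower=1,upper=T> s;
  s <- categorical_rng(softmax(lp));  // sampling s from its marginal via lp
}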

bgoodri commented 9 years ago

In the subsection "Program Block: transformed parameters" and the subsection "Reject Statements" (and perhaps elsewhere) we should emphasize that inequality constraints are intended to catch programmer or numerical errors, not to define a probability model (in contrast to constraints in the parameters block).

The issue is fundamentally the same as in

parameters {
  real a;
  real<lower=a> b;
  real<lower=a,upper=b> theta;
}
model {
  theta ~ normal(0,1); // wrong, needs T[a,b]
}

which is just as wrong as

parameters {
  real a;
  real<lower=a> b;
  real theta;
}
model {
  if (theta < a || theta > b) reject("theta not in (a,b)");
  theta ~ normal(0,1); // still wrong, needs T[a,b]
}

The conceptual issue is that the prior does not integrate to 1 over the admissible parameter space; it integrates to 1 over all real numbers and integrates to something less than 1 over (a,b). In these simple univariate cases, we can overcome that with the T[,] notation, which essentially divides by whatever the prior integrates to over (a,b).
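Concretely, the first example above would then read something like:

parameters {
  real a;
  real<lower=a> b;
  real<lower=a,upper=b> theta;
}
model {
  theta ~ normal(0,1) T[a,b];  // truncation renormalizes the prior over (a,b)
}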

Unfortunately, users are using the inequality constraints in transformed parameters blocks and similarly reject() statements to enforce complicated inequalities on multivariate functions, in which case their priors integrate to something less than 1 over the region of the parameter space where the complicated inequalities are satisfied. But we don't generally know what value the prior integrates to, so we can't do an increment_log_prob(-log(value)) to compensate.

It is also the same problem as doing

parameters {
  real<lower=0,upper=1> correlations[K * (K - 1) / 2];
}
transformed parameters {
  corr_matrix[K] Lambda; // may get rejected by check_pos_definite
  int counter;
  counter <- 1;
  for (i in 1:K) {
    Lambda[i,i] <- 1;
    for (j in (i+1):K) {
      Lambda[i,j] <- correlations[counter];
      Lambda[j,i] <- Lambda[i,j];
      counter <- counter + 1;
     }
  }
}
model {
  correlations ~ beta(alpha,beta); // wrong
  ...
}

In this case, the prior on the correlations integrates to 1 over the (0,1)^(K * (K - 1) / 2) hypercube but for K > 2, many points in that hypercube violate the positive definiteness constraint on correlation matrices. So, the prior integrates to some unknown number less than 1 over the positive definite subspace that we can't compensate for.

And even if this is not a "big deal" in particular models where the amount of truncated posterior density is negligible or constant, we can't sample from that truncated posterior efficiently. So, users really need to utilize 1-to-1 mappings that guarantee the constraints are satisfied and only use reject() statements to check themselves.
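For the correlation-matrix example, the constrained corr_matrix type already provides such a mapping; a minimal sketch (the LKJ prior and its shape value 2 are placeholders for whatever prior is actually wanted):

parameters {
  corr_matrix[K] Lambda;  // built-in 1-to-1 mapping guarantees positive definiteness
}
model {
  Lambda ~ lkj_corr(2);  // prior defined directly on the space of correlation matrices
  ...
}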

jonathan-g commented 9 years ago

page 451, section "Warmup Times and Estimating the Mass Matrix": minor LaTeX problem due to using non-ASCII character instead of ASCII single-quote for apostrophe in source: "... standardizing the data doesnâ ̆ Z ́ t always help. Therefore, Stan ..."

bob-carpenter commented 9 years ago

Linas Mockus reported on stan-users:

Does anybody have experience implementing negbin k (negative binomial when k=1) in Stan? I was trying to implement it using the lmgamma function, but I assume there is a typo in the Stan manual:

That last one looks like it wasn't a typo. At least it matches the Wikipedia definition.

bob-carpenter commented 9 years ago

I may not have time to get this in, but it's a useful writeup from Michael for going forward:

Unfortunately we're overdue for a serious rewrite of the manual, especially given our rapidly evolving understanding of subtle theoretical MCMC issues. I have everything collected into slides, but turning it into text will have to wait until I have some free time. But I'll summarize the results here.

MCMC is just a method for numerically estimating integrals, which is hard in high dimensions because of something called concentration of measure. Essentially the only parameters that contribute to integrals lie on a hypersurface in parameter space, and if we want to estimate integrals then we need to find that hypersurface and then explore it. That's exactly what a Markov chain does: eventually we have a series of states distributed across the hypersurface, $\{ y_n \}$, that we can use as a grid for doing really simple quadrature,

$\int dy \, p(y) \, f(y) \approx \frac{1}{N} \sum_{n=1}^{N} f(y_n)$

Everything relies on our being able to find and then explore the entire hypersurface (which we call the typical set). This is incredibly hard to verify theoretically, so we're left with a bunch of necessary-but-not-sufficient diagnostics to identify when we're not exploring everything.

All of these diagnostics apply only to the post-warmup samples: because we dynamically adapt the sampler's tuning parameters during warmup, the warmup iterations will behave very differently. Again, ignore anything you see in warmup.

Traceplots:

The characteristic way pathologies manifest is the Markov chain getting stuck at the boundary of a pathological region of parameter space. If you run long enough you'll literally see the chain get stuck for long periods of time. You can also compare chains to identify multimodality and other problems.

R-hat:

If our Markov chain can explore the entire target distribution, then any realization of it should look the same. R-hat essentially performs an analysis of variance on a bunch of chains (and on subsets of each chain), looking for any deviations. The more chains you use, the more sensitive you'll be to potential pathologies.

n-divergent:

Hamiltonian Monte Carlo has a unique diagnostic: regions of the typical set that are hard to explore induce particular numerical divergences that we can capture and report. This is much more sensitive than looking at trace plots! If you see any divergences in sampling then you have to be careful. For a few divergences you can increase the target acceptance probability, but for any nontrivial number of divergences you'll probably have to consider tweaking your model to use stronger priors or different parameterizations.

In RStan you can grab the divergences with

fit <- stan(file='model_name.stan', data=input_data, iter=2000, chains=1, seed=4938483)

count_divergences <- function(fit) {
  sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
  sum(sapply(sampler_params, function(x) c(x[,'n_divergent__']))[,1])
}

Finally, the treedepth issue is not so much a diagnostic of a bad Markov chain as a safeguard we've built in to avoid infinite loops caused by ill-formed posteriors which require the sampler to explore all the way out to infinity and back (requiring an infinite tree depth and an infinite time). Sometimes your sampler will need to explore beyond our default safeguard value, in which case you have to increase it manually; this doesn't affect the validity of your results, just the performance of the algorithm.

I like looking at a histogram of the treedepths,

hist_treedepth <- function(fit) {
  sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
  hist(sapply(sampler_params, function(x) c(x[,'treedepth__']))[,1],
       breaks=0:20, main="", xlab="Treedepth")
  abline(v=10, col=2, lty=1)
}

Technically the sampler tries one more step after maxdepth, which is why you might see it exceed that value by one. I can never remember the indexing and ordering well enough to state whether you should see maxdepth or maxdepth + 1, but in the end it doesn't really matter, as the bad behavior (a histogram saturating at a large value) is pretty obvious.