stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual: 2.6.2 #1202

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 9 years ago

This is the issue for changes to the manual for the 2.6.0++ release.

bob-carpenter commented 9 years ago

From Andrew:

[ed. I think this is a more subtle issue --- there's a whole chapter pending on how to use optimization and it should go in there]

bob-carpenter commented 9 years ago

Having done a bunch of these, I now think it's a bad idea and that we should in fact pull back on all the pedantic definitions of earlier functions. But it's too much work to do it now.

bob-carpenter commented 9 years ago

From Michael Betancourt on stan-users:

Conceptually, MCMC warmup times are basically equivalent to the autocorrelation time; because HMC chains tend to be only weakly autocorrelated, they also tend to converge really, really quickly.

The HUGE caveat is that such an argument assumes uniformity of curvature across the parameter space, an assumption which is violated in many of the complex models we see. Very often the tails have large curvature while the bulk is relatively well-behaved; in other words, warmup is slow not because the actual convergence time is slow but rather because the cost of an HMC iteration is more expensive out in the tails.

Poor behavior in the tails is the kind of pathology that Andrew notes we can find by running only a few warmup iterations. By looking at the acceptance probabilities and step sizes of the first few iterations you can get an idea of how bad the problem is and whether you need to address it with modeling efforts.

The mass matrix can compensate for linear (i.e. global) correlations in the posterior which can dramatically improve the performance of HMC in some problems. Of course this requires that we know the global correlations a priori.

In complex models this is incredibly difficult (for example, nonlinear model components convolve the scales of the data, so standardizing the data doesn’t always help) so in Stan we learn these correlations online with an adaptive warmup. In models with strong nonlinear (i.e. local) correlations this learning can be slow, even with regularization. This is ultimately why warmup in Stan often needs to be so long, and why a sufficiently long warmup can yield such substantial performance improvements.

An important clarification: the mass matrix compensates for only linear (equivalently global or position-independent) correlations in the posterior. The hierarchical parameterizations, on the other hand, affect some of the nasty nonlinear (equivalently local or position-dependent) correlations common in hierarchical models. Only in Riemannian HMC does the metric (which can be thought of as a position-dependent mass matrix) start compensating for nonlinear correlations.

As Matt noted, one of the biggest difficulties with dense mass matrices is the estimation of the mass matrix itself which introduces a bit of a chicken-and-egg scenario.

My personal intuition/experience is that most of the hard models that we see are not dominated by linear correlations that require a dense mass matrix but rather more complex nonlinear correlations that are best tackled with better parameterizations or more advanced algorithms like Riemannian HMC.
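
[ed. As a concrete illustration of the hierarchical reparameterization Michael mentions, here is a minimal non-centered sketch in the eight-schools style. The model and data names are illustrative assumptions, not taken from this thread; it uses the 2.6-era <- assignment.]

data {
  int<lower=1> J;              // number of groups
  vector[J] y;                 // observed group-level estimates
  vector<lower=0>[J] sigma;    // known observation scales
}
parameters {
  real mu;                     // population mean
  real<lower=0> tau;           // population scale
  vector[J] theta_raw;         // standardized group effects
}
transformed parameters {
  vector[J] theta;
  theta <- mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  theta_raw ~ normal(0,1);     // unit-scale prior; avoids the mu/tau funnel
  y ~ normal(theta, sigma);
}

Sampling theta_raw on the unit scale and rebuilding theta deterministically removes the strong position-dependent coupling between theta and tau that a global mass matrix cannot compensate for.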

bob-carpenter commented 9 years ago

If z is a K-dimensional parameter vector constrained so that

sum(z) = 0,

then you can code up z as follows:

  parameters {
    simplex[K] z_raw;
    real<lower=0> z_scale;
  ...

  transformed parameters {
    vector[K] z;
    z <- (z_raw - 1.0 / K) * z_scale;
  ...

The parameterization involves K free parameters (K-1 for the simplex and one for the scale).

In the simplest case, if you have something like a 1PL IRT model, with likelihood:

y[m,n] ~ bernoulli(inv_logit(alpha[m] - beta[n]))

then alpha and beta aren't identified. But if you give them priors such as

 alpha ~ normal(0,1);
 beta ~ normal(0,10);

then you'll find that in the posterior alpha will sum to 0 and beta will float, because beta is less constrained. That's our recommended strategy for dealing with centering. It's not a hard constraint.
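
[ed. A minimal sketch of this soft-centering approach as a full program; the data layout with M students and N questions is an assumption for illustration.]

data {
  int<lower=1> M;                   // number of students
  int<lower=1> N;                   // number of questions
  int<lower=0,upper=1> y[M,N];      // 1 if student m answered question n correctly
}
parameters {
  vector[M] alpha;                  // student abilities
  vector[N] beta;                   // question difficulties
}
model {
  alpha ~ normal(0,1);              // tight prior softly centers the abilities at 0
  beta ~ normal(0,10);              // looser prior lets the difficulties absorb the location
  for (m in 1:M)
    for (n in 1:N)
      y[m,n] ~ bernoulli_logit(alpha[m] - beta[n]);
}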

An alternative is to pin one of the elements of alpha or beta, and let everything adjust relative to that. This is clunky, but it can be done.
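
[ed. A minimal sketch of the pinning alternative for beta, with the first element fixed at zero; the names are illustrative, the data and model blocks would be as in the 1PL example above, and it uses the 2.6-era <- assignment.]

parameters {
  vector[K-1] beta_free;
}
transformed parameters {
  vector[K] beta;
  beta[1] <- 0;                  // pin the first difficulty at zero
  for (k in 2:K)
    beta[k] <- beta_free[k-1];   // the rest is identified relative to the pinned element
}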

Ben also had a much simpler suggestion following up the above discussion:

parameters {
  vector[K-1] z_free;
}
transformed parameters {
  vector[K] z;
  for (k in 1:(K-1)) z[k] <- z_free[k];
  z[K] <- -sum(z_free);
}

Ben adds:

But won't it produce a different prior based on which of the z[k] is defined as the negative sum of the others?

If you imagine a KxK covariance matrix that is singular, then yes: depending on which row/column you exclude, the remaining elements of the (K-1)x(K-1) covariance matrix differ, but I don't think the distribution of z changes unless you are doing something informative with the priors that go into the linear predictor.

This paper actually argues for using sum-to-zero constraints in discrete choice models

http://www.burgette.org/sMNP-R0-FINAL.pdf

relative to the usual practice of defining utility relative to a baseline category, largely because which category you decide to call the baseline affects the predictions. We can fit a similar posterior in Stan, but without the Gibbs sampling.

demodw commented 9 years ago

Since there is no issue opened for the 2.7 manual, I'll leave what I have noticed here. This is from the v2.6 reference, downloaded today. Minor stuff, but I am not quite sure if I can add a fix for it myself.

bob-carpenter commented 9 years ago

Thanks.

2.6.0++ is where updates for the 2.6 manual should go.
When we decide if that's 2.6.1 or 2.7.0, we'll change the issue title.

On Feb 6, 2015, at 10:21 AM, demodw notifications@github.com wrote:

Since there is no issue opened for the 2.7 manual, I'll leave what I have noticed here. This is from the v2.6 reference, downloaded today. Minor stuff, but I am not quite sure if I can add a fix for it myself.

p. 90, third equation states p_y(ytheta, mu, sigma). Should it not be p_y(y|theta, mu, sigma)?

p. 96, second paragraph, it seems like something is missing: "Furthermore, the true values x are given a hierarchical prior; without"


bob-carpenter commented 9 years ago

Moved from #1300

This may be a bug in the Stan Manual 2.6.0 version for the Multilevel 2PL IRT model example on pages 52-53.

The parameters are defined as:

parameters {
  real delta;                  // mean student ability
  real alpha[J];               // ability for j - mean
  real beta[K];                // difficulty for k
  real<lower=0> gamma[K];      // discrimination of k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}

This is the same setup as in Manual version 2.5.0, with a delta parameter for mean ability.

However, in the model specification delta disappears and a mu_beta appears more or less in its place (though it is discussed in the text as the mean beta location).

model {
  alpha ~ normal(0,1);
  beta ~ normal(0,sigma_beta);
  gamma ~ lognormal(0,sigma_gamma);
  mu_beta ~ cauchy(0,5);
  sigma_alpha ~ cauchy(0,5);
  sigma_beta ~ cauchy(0,5);
  sigma_gamma ~ cauchy(0,5);
  for (n in 1:N)
    y[n] ~ bernoulli_logit(gamma[kk[n]]
                           * (alpha[jj[n]] - (beta[kk[n]] + mu_beta)));
}

delta is then again discussed in the text on page 54.
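
[ed. For reference, one internally consistent version of the declarations would replace delta with mu_beta and drop the prior on the undeclared sigma_alpha. This is only a sketch of the reported inconsistency, not necessarily the fix the manual adopts.]

parameters {
  real mu_beta;                // mean question difficulty
  real alpha[J];               // ability for student j
  real beta[K];                // difficulty for question k
  real<lower=0> gamma[K];      // discrimination of question k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}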

bob-carpenter commented 9 years ago

Moved from Aki's issue #1309

bob-carpenter commented 9 years ago

From Andy Choi on stan-users:

(pg 96, section 10.1):

"..... Furthermore, the ture values x are given a hierarchical prior; without

bob-carpenter commented 9 years ago

From Sebastian Weber via stan-users:

The manual currently says:

The `size()` function extracts the number of elements in an array. The function is overloaded to apply to arrays of integers, reals, matrices, vectors, and row vectors.

OR

akcchoi commented 9 years ago

pg 91:

missing last parenthesis: increment_log_prob(log_sum_exp(..., log(0.7) + normal_log(y,3,1));
bob-carpenter commented 9 years ago

Thanks for dropping this into the right issue tracker.

We're about to do a patch release to deal with some issues to make Stan CRAN-compatible, and I'll patch all the bugs in the manual then.

On Mar 3, 2015, at 6:36 PM, akcchoi notifications@github.com wrote:

pg 91:

missing last parenthesis: increment_log_prob(log_sum_exp(..., log(0.7) + normal_log(y,3,1));

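[ed. With the closing parenthesis restored the statement would read as below; the first mixture component is elided as "..." in the report, so the arguments shown for it here are an assumption for illustration.]

increment_log_prob(log_sum_exp(log(0.3) + normal_log(y, 0, 1),    // assumed first component
                               log(0.7) + normal_log(y, 3, 1)));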

akcchoi commented 9 years ago

Bob, Many thanks for the note and glad that I can help.

STAN is awesome! Andy

