stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual: 2.6.2 #1202

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 9 years ago

This is the issue for changes to the manual for the 2.6.0++ release.

bob-carpenter commented 9 years ago

From Andrew:

[ed. I think this is a more subtle issue --- there's a whole chapter pending on how to use optimization and it should go in there]

bob-carpenter commented 9 years ago

Having done a bunch of these, I now think it's a bad idea and that we should in fact pull back on all the pedantic definitions of earlier functions. But it's too much work to do it now.

bob-carpenter commented 9 years ago

From Michael Betancourt on stan-users:

Conceptually, MCMC warmup times are basically equivalent to the autocorrelation time; because HMC chains tend to be only weakly autocorrelated, they also tend to converge really, really quickly.

The HUGE caveat is that such an argument assumes uniformity of curvature across the parameter space, an assumption which is violated in many of the complex models we see. Very often the tails have large curvature while the bulk is relatively well-behaved; in other words, warmup is slow not because the actual convergence time is slow but rather because the cost of an HMC iteration is more expensive out in the tails.

Poor behavior in the tails is the kind of pathology that Andrew notes we can find by running only a few warmup iterations. By looking at the acceptance probabilities and step sizes of the first few iterations you can get an idea of how bad the problem is and whether you need to address it with modeling efforts.

The mass matrix can compensate for linear (i.e. global) correlations in the posterior which can dramatically improve the performance of HMC in some problems. Of course this requires that we know the global correlations a priori.

In complex models this is incredibly difficult (for example, nonlinear model components convolve the scales of the data, so standardizing the data doesn’t always help) so in Stan we learn these correlations online with an adaptive warmup. In models with strong nonlinear (i.e. local) correlations this learning can be slow, even with regularization. This is ultimately why warmup in Stan often needs to be so long, and why a sufficiently long warmup can yield such substantial performance improvements.

An important clarification: the mass matrix compensates for only linear (equivalently global or position-independent) correlations in the posterior. The hierarchical parameterizations, on the other hand, affect some of the nasty nonlinear (equivalently local or position-dependent) correlations common in hierarchical models. Only in Riemannian HMC does the metric (which can be thought of as a position-dependent mass matrix) start compensating for nonlinear correlations.

As Matt noted, one of the biggest difficulties with dense mass matrices is the estimation of the mass matrix itself which introduces a bit of a chicken-and-egg scenario.

My personal intuition/experience is that most of the hard models that we see are not dominated by linear correlations that require a dense mass matrix but rather more complex nonlinear correlations that are best tackled with better parameterizations or more advanced algorithms like Riemannian HMC.
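
[ed. As a concrete illustration of the hierarchical reparameterization Michael mentions, here is a minimal non-centered sketch in the eight-schools style. The model and data names are illustrative assumptions, not taken from this thread; it uses the 2.6-era <- assignment.]

data {
  int<lower=1> J;              // number of groups
  vector[J] y;                 // observed group-level estimates
  vector<lower=0>[J] sigma;    // known observation scales
}
parameters {
  real mu;                     // population mean
  real<lower=0> tau;           // population scale
  vector[J] theta_raw;         // standardized group effects
}
transformed parameters {
  vector[J] theta;
  theta <- mu + tau * theta_raw;   // implies theta ~ normal(mu, tau)
}
model {
  theta_raw ~ normal(0,1);     // unit-scale prior; avoids the mu/tau funnel
  y ~ normal(theta, sigma);
}

Sampling theta_raw on the unit scale and rebuilding theta deterministically removes the strong position-dependent coupling between theta and tau that a global mass matrix cannot compensate for.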

bob-carpenter commented 9 years ago

If z is a K-dimensional parameter vector constrained so that

sum(z) = 0,

then you can code up z as follows:

  parameters {
    simplex[K] z_raw;
    real<lower=0> z_scale;
  ...

  transformed parameters {
    vector[K] z;
    z <- (z_raw - 1.0 / K) * z_scale;
  ...

The parameterization involves K free parameters (K-1 for the simplex and one for the scale).

In the simplest case, if you have something like a 1PL IRT model, with likelihood:

y[m,n] ~ bernoulli(inv_logit(alpha[m] - beta[n]))

then alpha and beta aren't identified. But if you give them priors such as

 alpha ~ normal(0,1);
 beta ~ normal(0,10);

then you'll find that in the posterior alpha will sum to 0 and beta will float, because beta is less constrained. That's our recommended strategy for dealing with centering. It's not a hard constraint.
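
[ed. A minimal sketch of this soft-centering approach as a full program; the data layout with M students and N questions is an assumption for illustration.]

data {
  int<lower=1> M;                   // number of students
  int<lower=1> N;                   // number of questions
  int<lower=0,upper=1> y[M,N];      // 1 if student m answered question n correctly
}
parameters {
  vector[M] alpha;                  // student abilities
  vector[N] beta;                   // question difficulties
}
model {
  alpha ~ normal(0,1);              // tight prior softly centers the abilities at 0
  beta ~ normal(0,10);              // looser prior lets the difficulties absorb the location
  for (m in 1:M)
    for (n in 1:N)
      y[m,n] ~ bernoulli_logit(alpha[m] - beta[n]);
}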

An alternative is to pin one of the elements of alpha or beta, and let everything adjust relative to that. This is clunky, but it can be done.
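
[ed. A minimal sketch of the pinning alternative for beta, with the first element fixed at zero; the names are illustrative, the data and model blocks would be as in the 1PL example above, and it uses the 2.6-era <- assignment.]

parameters {
  vector[K-1] beta_free;
}
transformed parameters {
  vector[K] beta;
  beta[1] <- 0;                  // pin the first difficulty at zero
  for (k in 2:K)
    beta[k] <- beta_free[k-1];   // the rest is identified relative to the pinned element
}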

Ben also had a much simpler suggestion following up the above discussion:

parameters {
  vector[K-1] z_free;
}
transformed parameters {
  vector[K] z;
  for (k in 1:(K-1)) z[k] <- z_free[k];
  z[K] <- -sum(z_free);
}

Ben adds:

But won't it produce a different prior based on which of the z[k] is defined as the negative sum of the others?

If you imagine a KxK covariance matrix that is singular, then yes: depending on which row/column you exclude, the remaining elements of the (K-1)x(K-1) covariance matrix differ, but I don't think the distribution of z changes unless you are doing something informative with the priors that go into the linear predictor.

This paper actually argues for using sum-to-zero constraints in discrete choice models

http://www.burgette.org/sMNP-R0-FINAL.pdf

relative to the usual practice of defining utility relative to a baseline category, largely because which category you decide to call the baseline affects the predictions. We can fit a similar posterior in Stan, but without the Gibbs sampling.

demodw commented 9 years ago

Since there is no issue opened for the 2.7 manual, I'll leave what I have noticed here. This is from the v2.6 reference, downloaded today. Minor stuff, but I am not quite sure if I can add a fix for it myself.

bob-carpenter commented 9 years ago

Thanks.

2.6.0++ is where updates for the 2.6 manual should go.
When we decide if that's 2.6.1 or 2.7.0, we'll change the issue title.

On Feb 6, 2015, at 10:21 AM, demodw notifications@github.com wrote:

Since there is no issue opened for the 2.7 manual, I'll leave what I have noticed here. This is from the v2.6 reference, downloaded today. Minor stuff, but I am not quite sure if I can add a fix for it myself.

p. 90, third equation states p_y(ytheta, mu, sigma). Should it not be p_y(y|theta, mu, sigma)?

p. 96, second paragraph, it seems like something is missing: "Furthermore, the true values x are given a hierarchical prior; without"


bob-carpenter commented 9 years ago

Moved from #1300

This may be a bug in the Stan Manual 2.6.0 version for the Multilevel 2PL IRT model example on pages 52-53.

The parameters are defined as:

parameters {
  real delta;                  // mean student ability
  real alpha[J];               // ability for j - mean
  real beta[K];                // difficulty for k
  real<lower=0> gamma[K];      // discrimination of k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}

This is the same setup as in Manual version 2.5.0, with a delta parameter for mean ability.

However, in the model specification delta disappears and a mu_beta appears more or less in its place (though it is discussed in the text as the mean beta location).

model {
  alpha ~ normal(0,1);
  beta ~ normal(0,sigma_beta);
  gamma ~ lognormal(0,sigma_gamma);
  mu_beta ~ cauchy(0,5);
  sigma_alpha ~ cauchy(0,5);
  sigma_beta ~ cauchy(0,5);
  sigma_gamma ~ cauchy(0,5);
  for (n in 1:N)
    y[n] ~ bernoulli_logit(gamma[kk[n]]
                           * (alpha[jj[n]] - (beta[kk[n]] + mu_beta)));
}

delta is then again discussed in the text on page 54.
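
[ed. For reference, one internally consistent version of the declarations would replace delta with mu_beta and drop the prior on the undeclared sigma_alpha. This is only a sketch of the reported inconsistency, not necessarily the fix the manual adopts.]

parameters {
  real mu_beta;                // mean question difficulty
  real alpha[J];               // ability for student j
  real beta[K];                // difficulty for question k
  real<lower=0> gamma[K];      // discrimination of question k
  real<lower=0> sigma_beta;    // scale of difficulties
  real<lower=0> sigma_gamma;   // scale of log discrimination
}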

bob-carpenter commented 9 years ago

Moved from Aki's issue #1309

bob-carpenter commented 9 years ago

From Andy Choi on stan-users:

(pg 96, section 10.1):

"..... Furthermore, the ture values x are given a hierarchical prior; without

bob-carpenter commented 9 years ago

From Sebastian Weber via stan-users:

The manual currently says:

The `size()` function extracts the number of elements in an array. The function is overloaded to apply to arrays of integers, reals, matrices, vectors, and row vectors.

OR

akcchoi commented 9 years ago

pg 91:

missing last parenthesis: increment_log_prob(log_sum_exp(..., log(0.7) + normal_log(y,3,1));
bob-carpenter commented 9 years ago

Thanks for dropping this into the right issue tracker.

We're about to do a patch release to deal with some issues to make Stan CRAN-compatible, and I'll patch all the bugs in the manual then.

On Mar 3, 2015, at 6:36 PM, akcchoi notifications@github.com wrote:

pg 91:

missing last parenthesis: increment_log_prob(log_sum_exp(..., log(0.7) + normal_log(y,3,1));

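[ed. With the closing parenthesis restored the statement would read as below; the first mixture component is elided as "..." in the report, so the arguments shown for it here are an assumption for illustration.]

increment_log_prob(log_sum_exp(log(0.3) + normal_log(y, 0, 1),    // assumed first component
                               log(0.7) + normal_log(y, 3, 1)));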

akcchoi commented 9 years ago

Bob, Many thanks for the note and glad that I can help.

STAN is awesome! Andy

