stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual, 2.14++ #2122

Closed bob-carpenter closed 7 years ago

bob-carpenter commented 7 years ago

Summary:

This is where typos, bug reports, and requests go for the next manual after the 2.14 release.

We didn't update the manual for the 2.14 release, so I updated the issue title.

bob-carpenter commented 7 years ago

@alvaro1101 reports in issue #2121 that the model goes off the page in the mining disaster example in the latent discrete parameters chapter.

bob-carpenter commented 7 years ago

My response to a question from xiehw on the stan-users list.

If the bivariate distribution is easily marginalized into univariate distributions, like say a normal, you'd use those on the observed components. So that'd be:

vector[2] y[N];
int<lower=0, upper=1> y_observed[N, 2];

for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2])
    y[n] ~ multi_normal(mu, Sigma);
  else if (y_observed[n, 1])
    y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  else if (y_observed[n, 2])
    y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}

But... that's not very efficient. It's much better to vectorize all of this. So, in transformed data, build up three arrays of indices, one for each of the three cases above:

transformed data {
  int ns12[observed_12(y_observed)];
  int ns1[observed_1(y_observed)];
  int ns2[observed_2(y_observed)];

You need to do this with functions that count the cases, because array sizes must be declared up front. Then the rest of transformed data just fills in the indices (I'm going to use the syntax about to be released in 2.13):

  int n12 = 1;
  int n1 = 1;
  int n2 = 1;
  for (n in 1:N) {
    if (y_observed[n, 1] && y_observed[n, 2]) {
      ns12[n12] = n;
      n12 = n12 + 1;
    } else if (y_observed[n, 1]) {
      ns1[n1] = n;
      n1 = n1 + 1;
    } else if (y_observed[n, 2]) {
      ns2[n2] = n;    
      n2 = n2 + 1;
    }
  }

and then, in the model block, everything's nice and vectorizable using those indexes constructed once in transformed data:

  y[ns12] ~ multi_normal(mu, Sigma);
  y[ns1, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  y[ns2, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));

This is much more efficient than using latent variables for the missing data, but requires the multivariate distribution to be marginalized analytically.
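The index-partitioning idea above can be sketched outside Stan as well. A minimal Python illustration (0-based indices here, unlike Stan's 1-based arrays; the missingness patterns are made up) that splits observation indices into the three cases:

```python
# Partition observation indices by which of the two components are observed:
# both observed, only the first, or only the second.
y_observed = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]  # made-up patterns

ns12 = [n for n, (o1, o2) in enumerate(y_observed) if o1 and o2]
ns1 = [n for n, (o1, o2) in enumerate(y_observed) if o1 and not o2]
ns2 = [n for n, (o1, o2) in enumerate(y_observed) if not o1 and o2]
```

In Stan the counting has to happen first (in functions) because the arrays must be declared with fixed sizes; in Python the lists simply grow.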

mike-lawrence commented 7 years ago

In the multivariate regression example, in the "Optimization through Vectorization" section there's a bit that says

The code in the Stan program above can be sped up dramatically by replacing:
%
\begin{stancode}
  for (n in 1:N)
    y[n] ~ normal(x[n] * beta[jj[n]], sigma);
\end{stancode}
%
with a redeclaration of \code{beta} and a revised model block sampling
statement. 
%
\begin{stancode}
parameters {
  matrix[J, K] beta;
...
model {
  y ~ normal(rows_dot_product(x, beta[jj]), sigma);
  ...
\end{stancode}
%
which takes advantage of indexing introduced in Stan version 2.9 (see \refsection{language-multi-indexing}).

However, when you actually try this, you get a parser error:

No matches for: 
  rows_dot_product(vector[], matrix)

because beta is an array of vectors. One can declare beta as a matrix instead, but then the vectorized beta ~ multi_normal(...) statement doesn't work. So either the docs will have to choose one vectorization or the other, or (and this would be a non-docs issue) rows_dot_product() would have to be updated to treat an array of vectors like a matrix.
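For reference, the operation the manual's vectorization relies on is a row-wise dot product: output n is the dot product of row n of x with row n of beta[jj]. A small Python sketch (made-up data, 0-based jj, not Stan code):

```python
# rows_dot_product semantics: pair row n of x with row jj[n] of beta
# and take the dot product (hypothetical small example).
x = [[1.0, 2.0], [3.0, 4.0]]
beta = [[0.5, 0.5], [1.0, -1.0]]
jj = [1, 0]  # group index per observation (0-based here)

b = [beta[j] for j in jj]          # beta[jj] in Stan's multi-indexing
out = [sum(xn * bn for xn, bn in zip(rx, rb)) for rx, rb in zip(x, b)]
```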

bob-carpenter commented 7 years ago

@mike-lawrence Thanks, I'll fix that. It's an ongoing tension in programming: converting between representations useful for one purpose and those useful for another (e.g., matrices in row- vs. column-major order: one is good for the left term and the other for the right term in multiplications).

bob-carpenter commented 7 years ago

Fix problems brought up by @andrewgelman

I'm looking at the mixture model in the manual (p. 115 of the 2.12.0 manual). I see two potential problems:

[from Bob: Yes, add these. Include the argument that the posterior is symmetric iff there is a symmetric prior.]

bob-carpenter commented 7 years ago

Stefano on stan-users points out:

The Stan program for the change point model is shown in Figure 12.1. The transformed parameter lp[s] stores the quantity log p(s, D | e, l). Although the model in Figure 12.1 is easy to undertand, <- the (s) is missing (should be "understand").

simpsonm commented 7 years ago

Edit: I now see something a little more forceful in section 4.10. Perhaps that language should be used in the section(s) on user-defined functions as well.

Currently in section 22.1 the manual notes that you can use reject() statements in user-defined functions in order to do input validation. I think the manual should say more and in fact strongly encourage the user to use such statements, at least for more complex functions. Consider the following function:

real myfun(real x){
  real z = x;
  while(fabs(z) > 1.0){
    z = z/2.0;
  }
  return z;
}

This will compile fine, but at runtime if fabs(x) == positive_infinity() the while loop will not terminate and Stan will eat up all available memory before crashing (in RStan 2.12.1 anyway). This is easily solvable by checking if x is a finite number and rejecting if it isn't. For example this fixes the problem:

real myfun(real x){
  if(x == not_a_number() || x == positive_infinity() || x == negative_infinity())
    reject("x should be a finite real number; x = ", x);
  real z = x;
  while(fabs(z) > 1.0){
    z = z/2.0;
  }
  return z;
}

In this example it's not too hard to figure out what's going on, but in a more complex while loop it may not be obvious. Biasing users toward starting with input validation for their more complex functions seems like a good thing.

This suggestion is motivated by my own recent issues with a more complex function that uses a while loop, and had the manual been more assertive about using reject() statements I think I would have avoided the problem altogether. Or at least I'd like to think so.
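The non-termination is easy to demonstrate outside Stan, since the same IEEE-754 arithmetic applies: halving a finite value eventually brings it under 1, but halving infinity never changes it. A small Python illustration (not the original Stan code):

```python
import math

# Halving converges for any finite starting value...
z, steps = 1000.0, 0
while abs(z) > 1.0:
    z /= 2.0
    steps += 1

# ...but inf / 2 is still inf, so the loop guard abs(z) > 1.0
# would never become false for an infinite input.
inf_stays_inf = (math.inf / 2.0 == math.inf)
```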

bob-carpenter commented 7 years ago

@simpsonm Great idea. This is a key principle in writing APIs---they should verify their inputs and fail as early as possible with a meaningful error message. And thanks for working out the example.

You can't really do x == not_a_number() because comparing NaN to anything yields false. That's why there's an is_nan(x) function; we also have an is_inf(x), so the condition is most naturally

  if (is_nan(x) || is_inf(x)) ...

We should really have an is_finite().
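Python's math module shows the same floating-point semantics (this is Python, not Stan, but the IEEE-754 behavior carries over, and math.isfinite plays the role of the missing is_finite):

```python
import math

nan = float("nan")
nan_self_equal = (nan == nan)    # False: NaN compares unequal even to itself
nan_detected = math.isnan(nan)   # hence explicit predicates, like Stan's is_nan
inf_detected = math.isinf(math.inf)
finite_ok = math.isfinite(3.0) and not math.isfinite(nan) and not math.isfinite(math.inf)
```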

RobertMyles commented 7 years ago

On page 122, should the lines:

// thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(n - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(n - 1);

use N - 1 instead of n - 1, given that there is no n anywhere else in the program?

bob-carpenter commented 7 years ago

@RobertMyles Thanks. I'll fix https://github.com/stan-dev/stan/issues/2122#issuecomment-271621141

bob-carpenter commented 7 years ago

From Conner DiPaolo on stan-users:

[editorial note: This is all implied by the fact that LL' is positive definite.]

Also:

bob-carpenter commented 7 years ago

From #2212 by https://github.com/vasishth:

bob-carpenter commented 7 years ago

@seantalts brought this up while unit testing to_matrix:

real a[0, 3];

This will compile to a vector<vector<T> > type, but it will be empty. So if we do something like to_matrix(a), we'll get a 0 x 0 matrix.

jgabry commented 7 years ago

On stan-users John Hall points out that at the bottom of page 325 (under "Non-Centered Parameterization" heading) the following sentence

When there is a lot of data, such a hierarchical model can be made much more efficient by shifting the data’s correlation with the parameters to the hyperparameters.

should actually be referring to when there is not much data.

He suggests that the whole paragraph

When there is a lot of data, such a hierarchical model can be made much more efficient by shifting the data’s correlation with the parameters to the hyperparameters. Similar to the funnel example, this will be much more efficient in terms of effective sample size when there is not much data (see (Betancourt and Girolami, 2013)).

be changed to something like the following:

When there is not much data, a non-centered parameterization can be much more efficient in terms of effective sample size by shifting the data’s correlation with the parameters to the hyperparameters (see Betancourt and Girolami, 2013).

bob-carpenter commented 7 years ago

John Hall went on to suggest linking to Gelman and Hill 21.14, which provides a statistic to estimate how much pooling can/will be done.

lambda <- function(fit, ...) {
    extracted_fit <- rstan::extract(fit, permuted = TRUE, ...)
    N <- length(extracted_fit)
    result <- rep(NA, N)
    names(result) <- names(extracted_fit)
    for (i in 1:N) {
        extracted_fit_i <- extracted_fit[[i]]
        if (length(dim(extracted_fit_i)) == 1) next  # only calculate if more than 1 dimension
        e <- extracted_fit_i - mean(extracted_fit_i)
        result[i] <- 1 - var(apply(e, 2, mean)) / mean(apply(e, 1, var))
    }
    return(result)
}
bob-carpenter commented 7 years ago
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N_obs + N_mis> ii_obs[N_obs];
  int<lower = 1, upper = N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
  ...
parameters {
  real y[N_obs + N_mis];
  ...
transformed parameters {
  real y_mis[N_mis];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
  ...
jgabry commented 7 years ago

Do we now allow indexing and assignments in the parameters block? That is, does this

y[ii_obs] = y_obs;

really work in parameters now? Or was that supposed to go in transformed parameters?

bob-carpenter commented 7 years ago

The problem is with the vertical bar---with it, nothing indexes properly, which explains why all the functions other than the probability functions look OK.

bob-carpenter commented 7 years ago
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N_obs + N_mis> ii_obs[N_obs];
  int<lower = 1, upper = N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
}
transformed data {
  int<lower = 0> N;
  N = N_obs + N_mis;
}
parameters {
  real y_mis[N_mis];
  real<lower=0> sigma;
}
transformed parameters {
  real y[N_obs + N_mis];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}
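The observed/missing scatter in transformed parameters can be sketched in Python (0-based indices here, unlike Stan's 1-based ii_obs/ii_mis; the data values are made up):

```python
# Merge observed data and missing-value parameters into one combined array y,
# using precomputed index arrays for where each kind of entry belongs.
ii_obs = [0, 2, 4]
ii_mis = [1, 3]
y_obs = [1.0, 2.0, 3.0]
y_mis = [-1.0, -2.0]  # in the Stan program these are parameters

y = [0.0] * (len(y_obs) + len(y_mis))
for idx, val in zip(ii_obs, y_obs):
    y[idx] = val
for idx, val in zip(ii_mis, y_mis):
    y[idx] = val
```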
alvaro1101 commented 7 years ago

Hi,

There are a few instances of the deprecated assignment operator <- on pages 203 and 579.

bob-carpenter commented 7 years ago

The non-centered parameterization is

parameters {
  real mu;
  real<lower=0> lambda;
  real<lower=0> theta;
  ...
transformed parameters {
  real u = mu + y * z;  // u ~ double_exponential(mu, lambda);
  ...
model {
  y ~ exponential(inv(lambda));
  z ~ normal(0, 1);
  ...
to-mi commented 7 years ago

In the signature of normal_lccdf on page 493, the second | should probably be ,. Similarly for the signature of normal_rng immediately following. (In fact, most _cdf functions have , instead of | as the first delimiter, but this seems to be already noted in #2034.)

Descriptions of sampling statements, like "Increment log probability with bernoulli_lpmf(y, theta) ..." on page 478, should have | instead of ,.

bob-carpenter commented 7 years ago

From Andrew:

jmh530 commented 7 years ago

In Section 8.2 on QR factorization, I feel like the paragraph that recommends you use it when you do not have an informative prior on the location of beta is mistaken (also mentioned in 8.3).

Consider a probability distribution Y ~ N(XB, sigma) with a prior B ~ N(mu_B, sd_B). Y is then re-written as Y ~ N(Q theta, sigma). The equivalent prior on theta would be theta ~ MVN(mu_B R, t(R) diag(sd_B ^ 2) R), assuming mu_B is 1 x N.

My point is that you can still include a prior on the mean of beta, you just have to do the appropriate transformation in order to express it in terms of the new beta. This raises some other questions about the reasoning in that section, but I'll leave off here.
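The transformation described here can be checked numerically: theta = R beta, so X beta = Q theta, and a prior on beta maps through R to a prior on theta. A pure-Python sketch with a made-up X and beta (thin QR via classical Gram-Schmidt, illustration only):

```python
import math

# Hypothetical design matrix X (4 rows, 2 predictors) and coefficients beta.
X = [[1.0, 2.0], [1.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
beta = [0.5, -1.0]
n, k = 4, 2

# Thin QR via classical Gram-Schmidt on the columns of X.
cols = [[X[i][j] for i in range(n)] for j in range(k)]
Q_cols = []
R = [[0.0] * k for _ in range(k)]
for j in range(k):
    v = cols[j][:]
    for i, q in enumerate(Q_cols):
        R[i][j] = sum(q[t] * cols[j][t] for t in range(n))
        v = [v[t] - R[i][j] * q[t] for t in range(n)]
    R[j][j] = math.sqrt(sum(x * x for x in v))
    Q_cols.append([x / R[j][j] for x in v])

# theta = R beta, and X beta should equal Q theta elementwise.
theta = [sum(R[i][j] * beta[j] for j in range(k)) for i in range(k)]
X_beta = [sum(X[t][j] * beta[j] for j in range(k)) for t in range(n)]
Q_theta = [sum(Q_cols[j][t] * theta[j] for j in range(k)) for t in range(n)]
max_err = max(abs(a - b) for a, b in zip(X_beta, Q_theta))
```

Because theta is a deterministic linear map of beta, any Gaussian prior on beta induces a Gaussian prior on theta with transformed mean and covariance, which is the point being made above.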

jgabry commented 7 years ago

@jmh530 yeah you're right that it's possible to use the QR approach even with a non-zero prior mean on the coefficients, so perhaps the recommendation in the manual is worded too strongly. @bgoodri should comment on this though because he wrote that section and I forget if he had a reason for discouraging this (he's definitely aware that's it's possible). One reason might just be that it is strange to have a prior belief that there is a dense covariance matrix for the MVN on theta since theta represents the coefficients on the columns of the Q matrix and those columns are orthogonal.

jmh530 commented 7 years ago

True. When I was playing around with it, I had extracted the standard deviations from the covariance matrix and just used the univariate normal distribution.


bob-carpenter commented 7 years ago

Olivier Ma pointed out on stan-users that the code on p. 131 is trying to copy a column vector into a row of a matrix. It should be:

  row_vector[D] zeros = rep_row_vector(0, D);
bob-carpenter commented 7 years ago

Eric Innocents Eboulet pointed this out on stan-users:

aechase commented 7 years ago

In 8.16, the Posterior Predictive Checks section contains the following sample code:

generated quantities {
  vector[N] y_tilde;
  for (n in 1:N)
    y_tilde = normal_rng(alpha + beta * x[n], sigma);
}

I believe that should be y_tilde[n] = ...

betanalpha commented 7 years ago

From Trung Dung Tran: in "Non-Centered Parameterization" within section 26.6 (page 325) the sentence "When there is a lot of data, such a hierarchical model..." should read "When there is not a lot of data, such a hierarchical model...".

bob-carpenter commented 7 years ago

Fixed with #2266