stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual, 2.14++ #2122

Closed bob-carpenter closed 7 years ago

bob-carpenter commented 7 years ago

Summary:

This is where typos, bug reports, and requests go for the next manual after the 2.14 release.

We didn't update the manual for the 2.14 release, so I updated the issue title.

bob-carpenter commented 7 years ago

@alvaro1101 reports in issue #2121 that the model goes off the page in the mining disaster example in the latent discrete parameters chapter.

bob-carpenter commented 7 years ago

My response to a question from xiehw on the stan-users list.

If the bivariate distribution is easily marginalized into univariate distributions, like say a normal, you'd use those on the observed components. So that'd be:

vector[2] y[N];
int<lower=0, upper=1> y_observed[N, 2];

for (n in 1:N) {
  if (y_observed[n, 1] && y_observed[n, 2])
    y[n] ~ multi_normal(mu, Sigma);
  else if (y_observed[n, 1])
    y[n, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  else if (y_observed[n, 2])
    y[n, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));
}

But... that's not very efficient. It's much better to vectorize all of this. So, in transformed data, build up three arrays of indices, one for each of the three cases above:

transformed data {
  int ns12[observed_12(y_observed)];
  int ns1[observed_1(y_observed)];
  int ns2[observed_2(y_observed)];

You need to do this with functions that count the cases, because array sizes must be declared up front. Then the rest of transformed data just fills in the indices (I'm going to use the syntax about to be released in 2.13):

  int n12 = 1;
  int n1 = 1;
  int n2 = 1;
  for (n in 1:N) {
    if (y_observed[n, 1] && y_observed[n, 2]) {
      ns12[n12] = n;
      n12 = n12 + 1;
    } else if (y_observed[n, 1]) {
      ns1[n1] = n;
      n1 = n1 + 1;
    } else if (y_observed[n, 2]) {
      ns2[n2] = n;    
      n2 = n2 + 1;
    }
  }

and then, in the model block, everything's nice and vectorizable using those indexes constructed once in transformed data:

  y[ns12] ~ multi_normal(mu, Sigma);
  y[ns1, 1] ~ normal(mu[1], sqrt(Sigma[1, 1]));
  y[ns2, 2] ~ normal(mu[2], sqrt(Sigma[2, 2]));

This is much more efficient than using latent variables for the missing data, but requires the multivariate distribution to be marginalized analytically.
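The index-partitioning idea above can be sketched outside Stan as well. A minimal Python illustration (0-based indices here, unlike Stan's 1-based arrays; the missingness patterns are made up) that splits observation indices into the three cases:

```python
# Partition observation indices by which of the two components are observed:
# both observed, only the first, or only the second.
y_observed = [(1, 1), (1, 0), (0, 1), (1, 1), (0, 0)]  # made-up patterns

ns12 = [n for n, (o1, o2) in enumerate(y_observed) if o1 and o2]
ns1 = [n for n, (o1, o2) in enumerate(y_observed) if o1 and not o2]
ns2 = [n for n, (o1, o2) in enumerate(y_observed) if not o1 and o2]
```

In Stan the counting has to happen first (in functions) because the arrays must be declared with fixed sizes; in Python the lists simply grow.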

mike-lawrence commented 7 years ago

In the multivariate regression example, in the "Optimization through Vectorization" section there's a bit that says

The code in the Stan program above can be sped up dramatically by replacing:
%
\begin{stancode}
  for (n in 1:N)
    y[n] ~ normal(x[n] * beta[jj[n]], sigma);
\end{stancode}
%
with a redeclaration of \code{beta} and a revised model block sampling
statement. 
%
\begin{stancode}
parameters {
  matrix[J, K] beta;
...
model {
  y ~ normal(rows_dot_product(x, beta[jj]), sigma);
  ...
\end{stancode}
%
which takes advantage of indexing introduced in Stan version 2.9 (see \refsection{language-multi-indexing}).

However, when you actually try this, you get a parser error:

No matches for: 
  rows_dot_product(vector[], matrix)

because beta is an array of vectors. One can declare beta as a matrix instead, but then the vectorized beta ~ multi_normal(...) statement doesn't work. So either the docs will have to choose one vectorization or the other, or (and this would be a non-docs issue) rows_dot_product() would have to be updated to treat an array of vectors like a matrix.
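For reference, the operation the manual's vectorization relies on is a row-wise dot product: output n is the dot product of row n of x with row n of beta[jj]. A small Python sketch (made-up data, 0-based jj, not Stan code):

```python
# rows_dot_product semantics: pair row n of x with row jj[n] of beta
# and take the dot product (hypothetical small example).
x = [[1.0, 2.0], [3.0, 4.0]]
beta = [[0.5, 0.5], [1.0, -1.0]]
jj = [1, 0]  # group index per observation (0-based here)

b = [beta[j] for j in jj]          # beta[jj] in Stan's multi-indexing
out = [sum(xn * bn for xn, bn in zip(rx, rb)) for rx, rb in zip(x, b)]
```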

bob-carpenter commented 7 years ago

@mike-lawrence Thanks, I'll fix that. It's an ongoing tension in programming: converting between representations useful for one purpose and those useful for another (e.g., matrices in row- vs. column-major order: one is good for the left term and the other for the right term in multiplications).

bob-carpenter commented 7 years ago

Fix problems brought up by @andrewgelman

I'm looking at the mixture model in the manual (p. 115 of the 2.12.0 manual). I see two potential problems:

[from Bob: Yes, add these. Include the argument that the posterior is symmetric iff there is a symmetric prior.]

bob-carpenter commented 7 years ago

Stefano on stan-users points out:

The Stan program for the change point model is shown in Figure 12.1. The transformed parameter lp[s] stores the quantity log p(s, D | e, l). Although the model in Figure 12.1 is easy to undertand, <- the (s) is missing (should be "understand").

simpsonm commented 7 years ago

Edit: I now see something a little more forceful in section 4.10. Perhaps that language should be used in the section(s) on user-defined functions as well.

Currently in section 22.1 the manual notes that you can use reject() statements in user-defined functions in order to do input validation. I think the manual should say more and in fact strongly encourage the user to use such statements, at least for more complex functions. Consider the following function:

real myfun(real x){
  real z = x;
  while(fabs(z) > 1.0){
    z = z/2.0;
  }
  return z;
}

This will compile fine, but at runtime if fabs(x) == positive_infinity() the while loop will not terminate and Stan will eat up all available memory before crashing (in RStan 2.12.1 anyway). This is easily solvable by checking if x is a finite number and rejecting if it isn't. For example this fixes the problem:

real myfun(real x){
  if(x == not_a_number() || x == positive_infinity() || x == negative_infinity())
    reject("x should be a finite real number; x = ", x);
  real z = x;
  while(fabs(z) > 1.0){
    z = z/2.0;
  }
  return z;
}

In this example it's not too hard to figure out what's going on, but in a more complex while loop it may not be obvious. Biasing users toward starting with input validation for their more complex functions seems like a good thing.

This suggestion is motivated by my own recent issues with a more complex function that uses a while loop, and had the manual been more assertive about using reject() statements I think I would have avoided the problem altogether. Or at least I'd like to think so.
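The non-termination is easy to demonstrate outside Stan, since the same IEEE-754 arithmetic applies: halving a finite value eventually brings it under 1, but halving infinity never changes it. A small Python illustration (not the original Stan code):

```python
import math

# Halving converges for any finite starting value...
z, steps = 1000.0, 0
while abs(z) > 1.0:
    z /= 2.0
    steps += 1

# ...but inf / 2 is still inf, so the loop guard abs(z) > 1.0
# would never become false for an infinite input.
inf_stays_inf = (math.inf / 2.0 == math.inf)
```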

bob-carpenter commented 7 years ago

@simpsonm Great idea. This is a key principle in writing APIs---they should verify their inputs and fail as early as possible with a meaningful error message. And thanks for working out the example.

You can't really do x == not_a_number() because comparing NaN to anything yields false. That's why there's an is_nan(x) function; we also have an is_inf(x), so the condition is most naturally

  if (is_nan(x) || is_inf(x)) ...

We should really have an is_finite().
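Python's math module shows the same floating-point semantics (this is Python, not Stan, but the IEEE-754 behavior carries over, and math.isfinite plays the role of the missing is_finite):

```python
import math

nan = float("nan")
nan_self_equal = (nan == nan)    # False: NaN compares unequal even to itself
nan_detected = math.isnan(nan)   # hence explicit predicates, like Stan's is_nan
inf_detected = math.isinf(math.inf)
finite_ok = math.isfinite(3.0) and not math.isfinite(nan) and not math.isfinite(math.inf)
```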

RobertMyles commented 7 years ago

On page 122, should the lines:

// thin and scale the QR decomposition
  Q_ast = qr_Q(x)[, 1:K] * sqrt(n - 1);
  R_ast = qr_R(x)[1:K, ] / sqrt(n - 1);

use N - 1 instead of n - 1, given that there is no n anywhere else in the program?

bob-carpenter commented 7 years ago

@RobertMyles Thanks. I'll fix https://github.com/stan-dev/stan/issues/2122#issuecomment-271621141

bob-carpenter commented 7 years ago

From Conner DiPaolo on stan-users:

[editorial note: This is all implied by the fact that LL' is positive definite.]

Also:

bob-carpenter commented 7 years ago

From #2212 by https://github.com/vasishth:

bob-carpenter commented 7 years ago

@seantalts brought this up while unit testing to_matrix:

real a[0, 3];

This will compile to a vector<vector<T> > type, but it will be empty. So if we do something like to_matrix(a), we'll get a 0 x 0 matrix.

jgabry commented 7 years ago

On stan-users John Hall points out that at the bottom of page 325 (under "Non-Centered Parameterization" heading) the following sentence

When there is a lot of data, such a hierarchical model can be made much more efficient by shifting the data’s correlation with the parameters to the hyperparameters.

should actually be referring to when there is not much data.

He suggests that the whole paragraph

When there is a lot of data, such a hierarchical model can be made much more efficient by shifting the data’s correlation with the parameters to the hyperparameters. Similar to the funnel example, this will be much more efficient in terms of effective sample size when there is not much data (see (Betancourt and Girolami, 2013)).

be changed to something like the following:

When there is not much data, a non-centered parameterization can be much more efficient in terms of effective sample size by shifting the data’s correlation with the parameters to the hyperparameters (see Betancourt and Girolami, 2013).

bob-carpenter commented 7 years ago

John Hall went on to suggest linking to Gelman and Hill 21.14, which provides a statistic to estimate how much pooling can/will be done.

lambda <- function(fit, ...) {
    extracted_fit <- rstan::extract(fit, permuted = TRUE, ...)
    N <- length(extracted_fit)
    result <- rep(NA, N)
    names(result) <- names(extracted_fit)
    for (i in 1:N) {
        extracted_fit_i <- extracted_fit[[i]]
        if (length(dim(extracted_fit_i)) == 1) next  # only calculate if more than 1 dimension
        e <- extracted_fit_i - mean(extracted_fit_i)
        result[i] <- 1 - var(apply(e, 2, mean)) / mean(apply(e, 1, var))
    }
    return(result)
}
bob-carpenter commented 7 years ago
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N_obs + N_mis> ii_obs[N_obs];
  int<lower = 1, upper = N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
  ...
parameters {
  real y[N_obs + N_mis];
  ...
transformed parameters {
  real y_mis[N_mis];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
  ...
jgabry commented 7 years ago

Do we now allow indexing and assignments in the parameters block? That is, does this

y[ii_obs] = y_obs;

really work in parameters now? Or was that supposed to go in transformed parameters?

bob-carpenter commented 7 years ago

The problem is with the vertical bar---with it, nothing indexes properly, which explains why all the functions other than the probability functions look OK.

bob-carpenter commented 7 years ago
data {
  int<lower = 0> N_obs;
  int<lower = 0> N_mis;
  int<lower = 1, upper = N_obs + N_mis> ii_obs[N_obs];
  int<lower = 1, upper = N_obs + N_mis> ii_mis[N_mis];
  real y_obs[N_obs];
}
transformed data {
  int<lower = 0> N;
  N = N_obs + N_mis;
}
parameters {
  real y_mis[N_mis];
  real<lower=0> sigma;
}
transformed parameters {
  real y[N_obs + N_mis];
  y[ii_obs] = y_obs;
  y[ii_mis] = y_mis;
}
model {
  sigma ~ gamma(1, 1);
  y[2:N] ~ normal(y[1:(N - 1)], sigma);
}
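The observed/missing scatter in transformed parameters can be sketched in Python (0-based indices here, unlike Stan's 1-based ii_obs/ii_mis; the data values are made up):

```python
# Merge observed data and missing-value parameters into one combined array y,
# using precomputed index arrays for where each kind of entry belongs.
ii_obs = [0, 2, 4]
ii_mis = [1, 3]
y_obs = [1.0, 2.0, 3.0]
y_mis = [-1.0, -2.0]  # in the Stan program these are parameters

y = [0.0] * (len(y_obs) + len(y_mis))
for idx, val in zip(ii_obs, y_obs):
    y[idx] = val
for idx, val in zip(ii_mis, y_mis):
    y[idx] = val
```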
alvaro1101 commented 7 years ago

Hi,

There are a few instances of the deprecated assignment operator <- on pages 203 and 579.

bob-carpenter commented 7 years ago

The non-centered parameterization is

parameters {
  real mu;
  real<lower=0> lambda;
  real<lower=0> theta;
  ...
transformed parameters {
  real u = mu + y * z;  // u ~ double_exponential(mu, lambda);
  ...
model {
  y ~ exponential(inv(lambda));
  z ~ normal(0, 1);
  ...
to-mi commented 7 years ago

In the signature of normal_lccdf on page 493, the second | should probably be ,. Similarly for the signature of normal_rng immediately following. (In fact, most _cdf functions have , instead of | as the first delimiter, but this seems to be already noted in #2034.)

Descriptions of sampling statements, like "Increment log probability with bernoulli_lpmf(y, theta) ..." on page 478, should have | instead of ,.

bob-carpenter commented 7 years ago

From Andrew:

jmh530 commented 7 years ago

In Section 8.2 on QR factorization, I feel like the paragraph that recommends you use it when you do not have an informative prior on the location of beta is mistaken (also mentioned in 8.3).

Consider a probability distribution Y ~ N(XB, sigma) with a prior B ~ N(mu_B, sd_B). Y is then re-written as Y ~ N(Q theta, sigma). The equivalent prior on theta would be theta ~ MVN(mu_B R, t(R) diag(sd_B ^ 2) R), assuming mu_B is 1 x N.

My point is that you can still include a prior on the mean of beta, you just have to do the appropriate transformation in order to express it in terms of the new beta. This raises some other questions about the reasoning in that section, but I'll leave off here.
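The transformation described here can be checked numerically: theta = R beta, so X beta = Q theta, and a prior on beta maps through R to a prior on theta. A pure-Python sketch with a made-up X and beta (thin QR via classical Gram-Schmidt, illustration only):

```python
import math

# Hypothetical design matrix X (4 rows, 2 predictors) and coefficients beta.
X = [[1.0, 2.0], [1.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
beta = [0.5, -1.0]
n, k = 4, 2

# Thin QR via classical Gram-Schmidt on the columns of X.
cols = [[X[i][j] for i in range(n)] for j in range(k)]
Q_cols = []
R = [[0.0] * k for _ in range(k)]
for j in range(k):
    v = cols[j][:]
    for i, q in enumerate(Q_cols):
        R[i][j] = sum(q[t] * cols[j][t] for t in range(n))
        v = [v[t] - R[i][j] * q[t] for t in range(n)]
    R[j][j] = math.sqrt(sum(x * x for x in v))
    Q_cols.append([x / R[j][j] for x in v])

# theta = R beta, and X beta should equal Q theta elementwise.
theta = [sum(R[i][j] * beta[j] for j in range(k)) for i in range(k)]
X_beta = [sum(X[t][j] * beta[j] for j in range(k)) for t in range(n)]
Q_theta = [sum(Q_cols[j][t] * theta[j] for j in range(k)) for t in range(n)]
max_err = max(abs(a - b) for a, b in zip(X_beta, Q_theta))
```

Because theta is a deterministic linear map of beta, any Gaussian prior on beta induces a Gaussian prior on theta with transformed mean and covariance, which is the point being made above.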

jgabry commented 7 years ago

@jmh530 yeah you're right that it's possible to use the QR approach even with a non-zero prior mean on the coefficients, so perhaps the recommendation in the manual is worded too strongly. @bgoodri should comment on this though because he wrote that section and I forget if he had a reason for discouraging this (he's definitely aware that's it's possible). One reason might just be that it is strange to have a prior belief that there is a dense covariance matrix for the MVN on theta since theta represents the coefficients on the columns of the Q matrix and those columns are orthogonal.

jmh530 commented 7 years ago

True. When I was playing around with it, I had extracted the standard deviations from the covariance matrix and just used the univariate normal distribution.


bob-carpenter commented 7 years ago

Olivier Ma pointed out on stan-users that the code on p. 131 is trying to copy a column vector into a row of a matrix. It should be:

  row_vector[D] zeros = rep_row_vector(0, D);
bob-carpenter commented 7 years ago

Eric Innocents Eboulet pointed this out on stan-users:

aechase commented 7 years ago

In 8.16, the Posterior Predictive Checks section contains the following sample code:

generated quantities {
  vector[N] y_tilde;
  for (n in 1:N)
    y_tilde = normal_rng(alpha + beta * x[n], sigma);
}

I believe that should be y_tilde[n] = ...

betanalpha commented 7 years ago

From Trung Dung Tran: in "Non-Centered Parameterization" within section 26.6 (page 325) the sentence "When there is a lot of data, such a hierarchical model..." should read "When there is not a lot of data, such a hierarchical model...".

bob-carpenter commented 7 years ago

Fixed with #2266