stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual, 2.4.0++ #786

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 10 years ago

This is the issue for general changes to the 2.4.0 manual for the next release (2.4.0++), such as typos, new examples, etc. Changes related to new features should be bundled with the new feature itself.

bob-carpenter commented 10 years ago
randommm commented 10 years ago

The manual currently gives the signature as:

real gaussian_dlm_obs_log(vector y, matrix F, matrix G, matrix V, matrix W, vector m0, matrix C0)

but it should be:

real gaussian_dlm_obs_log(matrix y, matrix F, matrix G, matrix V, matrix W, vector m0, matrix C0)

See functions_signatures.h:

add("gaussian_dlm_obs_log",DOUBLE_T,MATRIX_T,MATRIX_T, ...

And the description itself:

The log of the density of the Gaussian Dynamic Linear model with observation matrix y ...
randommm commented 10 years ago
PeterLi2016 commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago
Show how a time series model such as

for (t in 2:T)
  y[t] ~ normal(y[t-1], sigma);

can be vectorized as

tail(y, T-1) ~ normal(head(y, T-1), sigma);
bob-carpenter commented 10 years ago

I noticed what I think is a typo in the Reference Manual (top of page 24, "Regression Models"). There it lists:

for (n in 1:N)
  y ~ normal(x[n] * beta, sigma);

but if I am not mistaken, it should read:

for (n in 1:N)
  y[n] ~ normal(x[n] * beta, sigma);
mitzimorris commented 10 years ago

In the Stan reference manual, Section 15.1, subsection "Type Declarations for Functions", the 2nd paragraph ends with an incomplete sentence: "Unlike type declarations for variables, function type declarations for matrix and vector types are not declared with their sizes. Like local variable declarations, function argument type declarations"

bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago

From Michael on stan-users list:

1) Markov chain convergence is a global property and does not depend on the choice of function. Rhat considers the composition of a Markov chain and a function, and if the Markov chain has converged then each Markov chain + function composition will have converged. Multivariate functions converge when all of their margins have converged by the Cramer-Wold theorem.

2) The transformation from the unconstrained space to the constrained space is just another function, so it does not affect convergence.

3) Different functions may have different autocorrelations, but if the Markov chain has equilibrated then all Markov chain + function compositions should be consistent with convergence. Formally any function that appears inconsistent is of concern and although it would be unreasonable to test every function, lp__ should at least be consistent.

The obvious difference in lp__ is that it tends to vary quickly with position and is consequently susceptible to outliers. Its magnitude is also typically large, so it is worth looking into how accurate the floating point computations in the Rhat calculation are.

From Andrew on stan-users list:

I think it's a mistake to declare convergence in any practical sense if Rhat for lp__ is high. If lp__ has not converged then the different chains are really in different parts of the space.

More from Michael on stan-users list:

Issue One: What is Convergence?

There is no hard cutoff between "transience" and "equilibrium". What happens is that as N -> infinity the distribution of possible states in the Markov chain approaches the target distribution, and in that limit the expected value of the Monte Carlo estimator of any integrable function converges to the true expectation. There is no notion of warmup here because in the N -> infinity limit the effects of the initial state are completely washed out.

The problem is what happens for finite N. If we can prove a strong geometric ergodicity property (which depends on the sampler and the target distribution) then one can show that there exists a finite time after which the chain forgets its initial state with a large probability. This is both the autocorrelation time and the warmup time -- but even if you can show it exists and is finite (which is nigh impossible) you can't compute an actual value analytically.

So what you do in practice is hope that N is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of samples will be enough.

Issue Two: Why Inconsistent Rhats?

There are two things to worry about here.

Firstly, as noted above, for any finite N there will always be some residual effect of the initial state, which typically manifests as some small (or large if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better Rhats. Functions vulnerable to such outliers may show fragility.

Secondly, Rhat makes very strong assumptions. In particular, it assumes that the functions being considered are Gaussian (or it uses only the first two moments and assumes some kind of independence; the point is that strong assumptions are made), and these assumptions do not always hold. In particular, the distribution of the log posterior density almost never looks Gaussian; instead it features long tails that can lead to large Rhats even in the large N limit.

The tweaks that Andrew keeps talking about all have the flavor of making the samples of interest more Gaussian and hence the Rhat calculation more accurate.

Conclusion:

"Convergence" is a global property and holds for all integrable functions at once, but Rhat requires additional assumptions so may not work for all functions equally well.

Note that if you just compare the expectations between chains then you can rely on the Markov chain asymptotics (the estimators are asymptotically Gaussian) and apply the standard tests.
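For concreteness, the statistic under discussion is the standard Gelman-Rubin potential scale reduction factor. For m chains of n draws each of a scalar function theta,

\[
B = \frac{n}{m-1} \sum_{j=1}^m \left( \bar{\theta}_{\cdot j} - \bar{\theta}_{\cdot\cdot} \right)^2,
\qquad
W = \frac{1}{m} \sum_{j=1}^m s_j^2,
\qquad
\hat{R} = \sqrt{ \frac{ \tfrac{n-1}{n} W + \tfrac{1}{n} B }{ W } },
\]

where s_j^2 is the sample variance within chain j. The implicit normality in this construction is exactly what the long tails of lp__ violate.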

ariddell commented 10 years ago

:+1: for lp__

bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago

This will require not resetting the page numbering after the front matter, which is the LaTeX default.

PeterLi2016 commented 10 years ago

More specifically, using the parameterizations in Stan, if X ~ Weibull(alpha, sigma), 1/X ~ Frechet(alpha, 1/sigma).
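A minimal sketch of the identity in a generated quantities block, assuming alpha and sigma are declared elsewhere (illustrative, not the manual's text):

generated quantities {
  real x;
  real x_inv;
  x <- weibull_rng(alpha, sigma);
  x_inv <- 1 / x;    // x_inv is distributed Frechet(alpha, 1 / sigma)
}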

bob-carpenter commented 10 years ago

On Aug 2, 2014, at 4:40 PM, Peter Li notifications@github.com wrote:

Add relationship between frechet and weibull distributions.

syclik commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago

From Andrew in a personal e-mail:

bob-carpenter commented 10 years ago

Start a new appendix with a list of all the error message text and a longer explanation of what they mean and possible causes.

  • [x] vanishing density (initialization errors)
  • [x] informational message (problem with numerics or bad specs)
bob-carpenter commented 10 years ago

Marco suggested on stan-users adding

  • [x] reference in the row() function doc to using it as an lvalue, mention that col() doesn't work as an lvalue, and point to the syntactic sugar operator[] (see the sketch after this list)
  • [x] cross-ref the lvalue discussion for operator[] in the assignment statement description
  • [x] create an lvalue table somewhere in the language ref (assignment uses the specialized assignments in Stan, which can do promotion of int to double and double to var where necessary)
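A tiny sketch of the lvalue behavior in question, assuming the usual operator[] semantics (illustrative only):

matrix[3, 4] m;
row_vector[4] rv;
m[1] <- rv;    // ok: operator[] gives row access an lvalue form matching row(m, 1)
               // there is no analogous lvalue form for col(m, 1)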
bob-carpenter commented 10 years ago
  • [x] fix link to MPL in Eigen license section of the appendix.
mitzimorris commented 10 years ago
  • [x] elaborate on:

Elementwise multiplication and division are documented in the table of operator precedences in the Expressions chapter (chapter 20, figure 20.1). There should also be a mention of these two operators in section 20.4, "Arithmetic and Matrix Expressions".
Is use of these operators more efficient than using a for loop? That is, for vector[N] a; vector[N] b; vector[N] c;, is c <- a .* b; more efficient than for (i in 1:N) { c[i] <- a[i] * b[i]; }?

randommm commented 10 years ago
  • [x] In section "Zero-Inflated Models", page 71

Change:

(y[n] == 0) ~ bernoulli(1,theta);

to:

(y[n] == 0) ~ bernoulli(theta);
bob-carpenter commented 10 years ago
  • [x] contrast hurdle model with zero-inflated Poisson
zero-inflated
  • Pr[y] = Bern(1 | theta) + Bern(0 | theta) * Poisson(0 | lambda) if y = 0
  • Pr[y] = Bern(0 | theta) * Poisson(y | lambda) if y > 0

implement as mixture with log-sum-exp

hurdle
  • Pr[y] = Bern(1 | theta) if y = 0
  • Pr[y] = Bern(0 | theta) * Poisson(y | lambda) / (1 - PoissonCDF(0 | lambda)) if y > 0

implement as conditional with second case given by truncation
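A minimal sketch of both likelihoods in Stan, assuming N, y, theta, and lambda are declared elsewhere (illustrative, not the manual's final text):

// zero-inflated Poisson: mixture, marginalized with log_sum_exp
for (n in 1:N) {
  if (y[n] == 0)
    increment_log_prob(log_sum_exp(bernoulli_log(1, theta),
                                   bernoulli_log(0, theta)
                                     + poisson_log(y[n], lambda)));
  else
    increment_log_prob(bernoulli_log(0, theta)
                         + poisson_log(y[n], lambda));
}

// hurdle: conditional, with the positive case truncated below at 1
for (n in 1:N) {
  (y[n] == 0) ~ bernoulli(theta);
  if (y[n] > 0) {
    y[n] ~ poisson(lambda);
    increment_log_prob(-log1m(exp(-lambda)));   // divide by 1 - Poisson(0 | lambda)
  }
}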

bob-carpenter commented 9 years ago

From Fraenzi:

p. 56 The log_sum_exp operation just multiplies the probabilities for each prior state j on the log scale in an arithmetically stable way.

  • [x] fix this so it clarifies that we're doing addition on the linear scale, so we need log_sum_exp on the log scale
bob-carpenter commented 9 years ago
  • [x] if_else doc is wrong in terms of order of conditions

overall, it looks very odd to have some functions documented in painstaking detail with values, boundaries, and derivatives, and others with just the text we used to have

bob-carpenter commented 9 years ago
  • [x] remove all the extraneous boundary conditions

doc for operator+() is wrong, and needs many more cases, because

finite + finite = -inf or +inf for some overflows and finite otherwise
inf + inf = inf
-inf + inf = NaN
inf + finite = inf
-inf + finite = -inf

The rules are symmetric here, too, but I didn't include the other versions.

The same for operator- and all the other operators.

bob-carpenter commented 9 years ago
  • [x] cut-and-paste typo throughout of 'opreator' instead of 'operator' (in the math)
bob-carpenter commented 9 years ago

Arthur Breitman reported a doc bug on stan-users:

  • [x] remove bogus .size() call and replace with cols() in the "recursive functions" section of the language spec (around p. 223 in most recent draft)
bob-carpenter commented 9 years ago
  • [x] remove "thing" from "size type can be anything type" in array size() function definition in the doc
bob-carpenter commented 9 years ago
  • [x] make sure the discussion of discontinuities due to step functions is more highlighted and in the right scope; it also applies to the new is_inf and is_nan functions, which take real arguments and return int;
  • [x] the if-then-else function also needs to link to the discontinuity warning
bob-carpenter commented 9 years ago

from Howard Zail on stan-users:

  • [x] See page 50: towards the bottom of the page h[t] is shown in the formula as a function of h[t]. Instead h[t] should be a function of h[t-1]

Ed. It was right as it was, but I elaborated on why. The key is to recognize that h[t] is a local variable that's being reassigned to scale and shift appropriately.

bob-carpenter commented 9 years ago

Add some version of Michael's comments about missing data (from stan-users):

When dealing with missing data the general strategy is to build a missing-data likelihood for each pattern of missingness. That way you can partition your data into each different pattern and then your total likelihood is just

L_{complete} * L_{missing 1} * … * L_{missing K}

In your case, for example, the natural patterns would be (1) complete, (2) missing just party choice, (3) missing just income, (4) missing both. The biggest problem with this strategy is actually computing the missing data likelihoods.

For example, let's say that our data is two-dimensional and the likelihood is modeled as a Gaussian:

p(x1, x2 | mu, sigma1, sigma2, rho) = Normal( x | mu, Sigma)

with x = (x1, x2), mu = (mu1, mu2), and Sigma = ( (sigma1^2, rho sigma1 sigma2), (rho sigma1 sigma2, sigma2^2) ).

In this case we can marginalize over x1 and x2 pretty easily to give the marginal likelihoods

p(x1 | mu, Sigma) = \int dx2 Normal(x | mu, Sigma) = Normal(x1 | mu1, sigma1)

and

p(x2 | mu, Sigma) = \int dx1 Normal(x | mu, Sigma) = Normal(x2 | mu2, sigma2)

Then we'd have for each complete datum,

x_{complete} ~ Normal(mu, Sigma)

and for the missing data

x_{missing x2} ~ Normal(mu1, sigma1)
x_{missing x1} ~ Normal(mu2, sigma2)

Techniques like imputation essentially approximate these marginal likelihood distributions, but they tend to generate inconsistent models that cause problems. Others can speak more on this better than I.

That's all straightforward, if tricky to implement in practice because of the integrals needed to compute the missing data likelihoods, but unfortunately it's not the end of the story. The problem is that we've completely ignored the reason why the data might be missing -- in particular, we're assuming that the data are missing at random and that the missingness is not correlated with the value of the data. But this is a bit suspect in survey modeling, especially with questions like income! Andrew discusses this in detail in "Bayesian Data Analysis", and ultimately the models become significantly more complex (with effects like censoring).

It's probably easiest if you start with the complete data model that you'd like to fit and then iterate from there. In particular, first get it working in Stan (we're happy to help) and then with the full model in hand you can have the conversation about missingness, censoring, and the like.
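A minimal sketch of the two-pattern bivariate version of this in Stan, with the data pre-split by missingness pattern (all names illustrative):

data {
  int<lower=0> N_complete;
  vector[2] x_complete[N_complete];   // both components observed
  int<lower=0> N_miss2;               // x2 missing, x1 observed
  real x1_obs[N_miss2];
  int<lower=0> N_miss1;               // x1 missing, x2 observed
  real x2_obs[N_miss1];
}
parameters {
  vector[2] mu;
  vector<lower=0>[2] sigma;
  real<lower=-1,upper=1> rho;
}
model {
  matrix[2,2] Sigma;
  Sigma[1,1] <- square(sigma[1]);
  Sigma[2,2] <- square(sigma[2]);
  Sigma[1,2] <- rho * sigma[1] * sigma[2];
  Sigma[2,1] <- Sigma[1,2];
  for (n in 1:N_complete)
    x_complete[n] ~ multi_normal(mu, Sigma);
  x1_obs ~ normal(mu[1], sigma[1]);   // marginal likelihood for the pattern missing x2
  x2_obs ~ normal(mu[2], sigma[2]);   // marginal likelihood for the pattern missing x1
}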

  • [x] skipping; need more elaborated model with terminology matching whatever Andrew likes
bob-carpenter commented 9 years ago
  • [x] for operators, include operator syntax, so it will look like
!x = operator!(x) = { ...

x || y  =  operator||(x,y) = { ...
bob-carpenter commented 9 years ago
  • [x] remove "implicit" in description of multinomial_rng and also fix font
bob-carpenter commented 9 years ago
  • [x] thank Kyle Foreman for document patches
bob-carpenter commented 9 years ago
  • [x] clean up use of bare text in LaTeX in negative binomial definition for E[] and var[]. Also, lower-case the latter.
  • [x] also deal with the messed up indentation in the sentences containing E[] and var[] --- don't know how they got centered.
  • [x] check to see if the "inverse scale" in our two negative binomials are the same, and if not, say how they're different
  • [x] change n to `y' in first negative binomial doc
  • [x] clean up negative_binomial_2 doc according to Marco's comment

    In fact, the parameter phi of neg_binomial_2 is the same as the parameter alpha (shape) of neg_binomial. It is also the parameter r of Wikipedia's definition of the negative binomial (number of failures until the experiment is stopped).
    But I don't think it should be called shape either, nor use the same Greek letter as neg_binomial (alpha). The reason is that, while they are the same parameter, their interpretation is very different in the three parameterizations mentioned above (because the other parameters are different):
    In neg_binomial, when we increase alpha, with all other things held constant, the mean and the variance increase (the same is true for Wikipedia's r).
    In neg_binomial_2, when we increase phi, with all other things held constant, the mean is held constant and the variance decreases.
    For this reason, I think the best name for phi is precision parameter, or even better, phi^-1 is the overdispersion parameter (this seems more in tune with the literature).
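For concreteness, the standard moments under the neg_binomial_2(mu, phi) parameterization are

E[y] = mu,    Var[y] = mu + mu^2 / phi,

so increasing phi shrinks the variance toward the Poisson limit mu, which is what makes the precision (or inverse overdispersion) reading natural.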
bob-carpenter commented 9 years ago
  • [x] include note with step function that we have comparison operators, so that step(a - b) gives the same result as (a > b).
bob-carpenter commented 9 years ago
  • [x] fix

From Dan Schrage:

I noticed a few typos in section 5.9 of the reference manual. In the subsection "Coding the Model in Stan," it looks like the parameter Omega used to be called Sigma, but there are still some stale references to Sigma in both the code and the text.

Here's the diff of the five changes I made:

<   Sigma_beta <- quad_form_diag(Sigma,tau);
---
>   Sigma_beta <- quad_form_diag(Omega,tau);
1648c1648
< \code{Sigma}, define it as a transformed parameter.  The function
---
> \code{Sigma\_beta}, define it as a transformed parameter.  The function
1650,1651c1650,1651
< \code{quad\_form\_diag(Sigma,tau)} is equivalent to
< \code{diag\_matrix(tau) * Sigma * diag\_matrix(tau)}, where
---
> \code{quad\_form\_diag(Omega,tau)} is equivalent to
> \code{diag\_matrix(tau) * Omega * diag\_matrix(tau)}, where
1668c1668
<   <- diag_pre_multiply(tau, diag_post_multiply(Sigma, tau));
---
>   <- diag_pre_multiply(tau, diag_post_multiply(Omega, tau));
bob-carpenter commented 9 years ago

Comment from user who'd rather remain anonymous:

  • [x] add footnote in marginalization chapter where the ternary operator is introduced indicating that it is not a LaTeX SNAFU, but rather just syntactic sugar for what R writes as ifelse(c,x,y) (though technically, we should call ifelse "R's syntactic bitters" rather than calling the conditional operator "C's syntactic sugar").
bob-carpenter commented 9 years ago
  • [x] fix comment about v2

From #1071, originally reported by David Chudzicki

The manual says: "Stan 1.0 does not do discrete sampling", with a footnote about plans for v2. But Stan is in v2 already, right? https://github.com/stan-dev/stan/blob/master/src/docs/stan-reference/introduction.tex#L146
bob-carpenter commented 9 years ago

Quoted material from Tomi Peltola, originally submitted as #1072

It seems to me that this section (pages 136-137 in 2.4.0 manual) has problems with tau and its square. Shouldn't it be tau^2 ~ Gamma, not tau ~ Gamma?

The sampling for tau is OK, but the conversion to scale needs to be fixed.

  • [x] use tau^{-1/2} as the scale, not tau^{-2}

Peltola adds:

The rescaling of alpha to get beta should then divide by tau instead of tau^2?
  • [x] change rescaling to multiply by the scale, i.e., tau^{-1/2}
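A minimal sketch of the corrected conversion, assuming tau is the precision and alpha the unscaled coefficient (illustrative names):

tau ~ gamma(a, b);         // precision
sigma <- inv_sqrt(tau);    // scale = tau^{-1/2}, not tau^{-2}
beta <- alpha * sigma;     // rescale by the scale, i.e., divide by sqrt(tau)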
bob-carpenter commented 9 years ago
  • [x] Add a function example for matrix power --- it's a nice example of recursion.
matrix matrix_pow(matrix a, int n) {
  if (n == 0) 
    return diag_matrix(rep_vector(1, rows(a)));  // a^0 is the identity
  else if (n == 1)
    return a;
  else
    return a * matrix_pow(a, n - 1);
}

and I could present an equivalent iterative version:

matrix matrix_pow(matrix a, int n) {
  if (n == 0) {
    return diag_matrix(rep_vector(1, rows(a)));  // a^0 is the identity
  } else if (n == 1) {
    return a;
  } else {
    matrix[rows(a),cols(a)] b;
    b <- a;
    for (m in 2:n)  // b already holds a^1
      b <- b * a;
    return b;
  }
}  
bob-carpenter commented 9 years ago
  • [x] Add clarifications from Daniel Lee on reproducibility:
You need to fix: Stan, the {Cmd, R, Py}Stan interface, compiler, compiler flags, operating system, libraries on the operating system, and, if running through a separate process like R or Python, that process needs to be compiled the same way, with the same versions and the same libraries.

Here, Stan is a particular version. That could be a tagged version or a git commit hash. The same goes for any of the interfaces like RStan, CmdStan, or PyStan. So each piece needs to be the same. (I'm sure that was assumed, but I just wanted to make it clear.)

There are a couple more caveats here. It not only needs the same compiler, but it also needs the same compiler flags and link libraries. Concretely, if you compiled a single Stan program using the same CmdStan code base, but changed the optimization flag (-O3 vs -O2 or -O0), the two programs may not return the identical stream of results. If, however, you compiled it today using one set of flags, took the computer away from the internet and didn't allow it to update anything, then came back in a decade and recompiled the Stan program in the same way, you should get the same results.

Regarding the same machine -- I think it just needs to be the same OS. But if you're talking about something like a desktop that isn't managed by an IT department that restricts everything, it's hard to guarantee that the libraries on the system are identical. This isn't usually a problem across machines using the same operating system, but we've run into some trouble with differences between Mac and Windows.

... same random seed and chain ID ...

Yes. But there's a slight subtlety here. The data needs to be the same down to the bit level. For example, if you are running in RStan, Rcpp handles the conversion between R's floating point numbers and C++ doubles. If Rcpp changes the conversion process or uses different types, there's a (small) chance that there are more or fewer digits saved, and that might not allow you to reproduce what you've generated in the past, though it should allow you to reproduce the new run. This holds, and should hold in the future.

I think we do need the same machine at the hardware level in terms of which CPU is being used.

bob-carpenter commented 9 years ago
  • [x] update version number in .tex files
  • [x] update version number for make
  • [x] update version number in Stan API
bob-carpenter commented 9 years ago
  • [x] fix ALL of the discussion of Cholesky factorizations and multivariate normals and whatnot in the manual to bring them up to date with Ben and Andrew's latest recommendations
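For reference, the direction this points: a minimal sketch of the Cholesky-factored multivariate normal parameterization, assuming K and mu are declared elsewhere (the LKJ shape 2 and the half-Cauchy scale are illustrative, not the final recommendation):

parameters {
  vector[K] beta;
  cholesky_factor_corr[K] L_Omega;   // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;            // scales
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);              // half-Cauchy via the lower bound
  beta ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}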