stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual, 2.4.0++ #786

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 10 years ago

This is the issue for general changes to the 2.4.0 manual for the next release (2.4.0++), such as typos, new examples, etc. Changes related to new features should be bundled with the new feature itself.

bob-carpenter commented 10 years ago
randommm commented 10 years ago

The manual currently gives the signature as:

real gaussian_dlm_obs_log(vector y, matrix F, matrix G, matrix V, matrix W, vector m0, matrix C0)

but it should be:

real gaussian_dlm_obs_log(matrix y, matrix F, matrix G, matrix V, matrix W, vector m0, matrix C0)

See functions_signatures.h:

add("gaussian_dlm_obs_log",DOUBLE_T,MATRIX_T,MATRIX_T, ...

And the description itself:

The log of the density of the Gaussian Dynamic Linear model with observation matrix y ...
randommm commented 10 years ago
PeterLi2016 commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago
Show how a time series model such as

for (t in 2:T)
  y[t] ~ normal(y[t-1], sigma);

can be vectorized as

tail(y, T-1) ~ normal(head(y, T-1), sigma);
bob-carpenter commented 10 years ago

I noticed what I think is a typo in the Reference Manual (top of page 24, "Regression Models"). There it lists:

for (n in 1:N)
  y ~ normal(x[n] * beta, sigma);

but if I am not mistaken, it should read:

for (n in 1:N)
  y[n] ~ normal(x[n] * beta, sigma);
mitzimorris commented 10 years ago

In the Stan reference manual, Section 15.1, subsection "Type Declarations for Functions", the 2nd paragraph ends with an incomplete sentence: "Unlike type declarations for variables, function type declarations for matrix and vector types are not declared with their sizes. Like local variable declarations, function argument type declarations"

bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago

From Michael on stan-users list:

1) Markov chain convergence is a global property and does not depend on the choice of function. Rhat considers the composition of a Markov chain and a function, and if the Markov chain has converged then each Markov chain + function composition will have converged. Multivariate functions converge when all of their margins have converged by the Cramer-Wold theorem.

2) The transformation from the unconstrained space to the constrained space is just another function, so it does not affect convergence.

3) Different functions may have different autocorrelations, but if the Markov chain has equilibrated then all Markov chain + function compositions should be consistent with convergence. Formally any function that appears inconsistent is of concern and although it would be unreasonable to test every function, lp__ should at least be consistent.

The obvious difference in lp__ is that it tends to vary quickly with position and is consequently susceptible to outliers. Its magnitude is also typically large, so it is worth looking into how accurate the floating point computations in the Rhat calculation are.

From Andrew on stan-users list:

I think it's a mistake to declare convergence in any practical sense if Rhat for lp__ is high. If lp__ has not converged then the different chains are really in different parts of the space.

More from Michael on stan-users list:

Issue One: What is Convergence?

There is no hard cutoff between "transience" and "equilibrium". What happens is that as N -> infinity the distribution of possible states in the Markov chain approaches the target distribution, and in that limit the expected value of the Monte Carlo estimator of any integrable function converges to the true expectation. There is no notion of warmup here because in the N -> infinity limit the effects of the initial state are completely washed out.

The problem is what happens for finite N. If we can prove a strong geometric ergodicity property (which depends on the sampler and the target distribution) then one can show that there exists a finite time after which the chain forgets its initial state with a large probability. This is both the autocorrelation time and the warmup time -- but even if you can show it exists and is finite (which is nigh impossible) you can't compute an actual value analytically.

So what you do in practice is hope that N is large enough for the expectations to be reasonably accurate. Removing warmup iterations improves the accuracy of the expectations but there is no guarantee that removing any finite number of samples will be enough.

Issue Two: Why Inconsistent Rhats?

There are two things to worry about here.

Firstly, as noted above, for any finite N there will always be some residual effect of the initial state, which typically manifests as some small (or large if the autocorrelation time is huge) probability of having a large outlier. Functions robust to such outliers (say, quantiles) will appear more stable and have better Rhats. Functions vulnerable to such outliers may show fragility.

Secondly, Rhat makes very strong assumptions. In particular, it assumes that the functions being considered are Gaussian (or it uses only the first two moments and assumes some kind of independence; the point is that strong assumptions are made), and these assumptions do not always hold. In particular, the distribution of the log posterior density almost never looks Gaussian; instead it features long tails that can lead to large Rhats even in the large N limit.

The tweaks that Andrew keeps talking about all have the flavor of making the samples of interest more Gaussian and hence the Rhat calculation more accurate.

Conclusion:

"Convergence" is a global property and holds for all integrable functions at once, but Rhat requires additional assumptions so may not work for all functions equally well.

Note that if you just compare the expectations between chains then you can rely on the Markov chain asymptotics (the estimators are asymptotically Gaussian) and apply the standard tests.
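For concreteness, the statistic under discussion is the standard Gelman-Rubin potential scale reduction factor. For m chains of n draws each of a scalar function theta,

\[
B = \frac{n}{m-1} \sum_{j=1}^m \left( \bar{\theta}_{\cdot j} - \bar{\theta}_{\cdot\cdot} \right)^2,
\qquad
W = \frac{1}{m} \sum_{j=1}^m s_j^2,
\qquad
\hat{R} = \sqrt{ \frac{ \tfrac{n-1}{n} W + \tfrac{1}{n} B }{ W } },
\]

where s_j^2 is the sample variance within chain j. The implicit normality in this construction is exactly what the long tails of lp__ violate.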

ariddell commented 10 years ago

:+1: for lp__

bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago

This will require not resetting the page numbering after the front matter, which is the LaTeX default.

PeterLi2016 commented 10 years ago

More specifically, using the parameterizations in Stan, if X ~ Weibull(alpha, sigma), 1/X ~ Frechet(alpha, 1/sigma).
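A minimal sketch of the identity in a generated quantities block, assuming alpha and sigma are declared elsewhere (illustrative, not the manual's text):

generated quantities {
  real x;
  real x_inv;
  x <- weibull_rng(alpha, sigma);
  x_inv <- 1 / x;    // x_inv is distributed Frechet(alpha, 1 / sigma)
}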

bob-carpenter commented 10 years ago

On Aug 2, 2014, at 4:40 PM, Peter Li notifications@github.com wrote:

Add relationship between frechet and weibull distributions.

syclik commented 10 years ago
bob-carpenter commented 10 years ago
bob-carpenter commented 10 years ago

From Andrew in a personal e-mail:

bob-carpenter commented 10 years ago

Start a new appendix with a list of all the error message text and a longer explanation of what they mean and possible causes.

  • [x] vanishing density (initialization errors)
  • [x] informational message (problem with numerics or bad specs)
bob-carpenter commented 10 years ago

Marco suggested on stan-users adding

  • [x] reference in the row() function doc to using it as an lvalue, mention that col() doesn't work as an lvalue, and point to the syntactic sugar operator[] (see the sketch after this list)
  • [x] cross-ref the lvalue discussion for operator[] in the assignment statement description
  • [x] create an lvalue table somewhere in the language ref (assignment uses the specialized assignments in Stan, which can do promotion of int to double and double to var where necessary)
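A tiny sketch of the lvalue behavior in question, assuming the usual operator[] semantics (illustrative only):

matrix[3, 4] m;
row_vector[4] rv;
m[1] <- rv;    // ok: operator[] gives row access an lvalue form matching row(m, 1)
               // there is no analogous lvalue form for col(m, 1)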
bob-carpenter commented 10 years ago
  • [x] fix link to MPL in Eigen license section of the appendix.
mitzimorris commented 10 years ago
  • [x] elaborate on:

Elementwise multiplication and division are documented in the table of operator precedences in the Expressions chapter (chapter 20, figure 20.1). There should also be a mention of these two operators in section 20.4, "Arithmetic and Matrix Expressions".
Is use of these operators more efficient than using a for loop? That is, for vector[N] a; vector[N] b; vector[N] c;, is c <- a .* b; more efficient than for (i in 1:N) { c[i] <- a[i] * b[i]; }?

randommm commented 10 years ago
  • [x] In section "Zero-Inflated Models", page 71

Change:

(y[n] == 0) ~ bernoulli(1,theta);

to:

(y[n] == 0) ~ bernoulli(theta);
bob-carpenter commented 10 years ago
  • [x] contrast hurdle model with zero-inflated Poisson
zero-inflated
  • Pr[y] = Bern(1 | theta) + Bern(0 | theta) * Poisson(0 | lambda) if y = 0
  • Pr[y] = Bern(0 | theta) * Poisson(y | lambda) if y > 0

implement as mixture with log-sum-exp

hurdle
  • Pr[y] = Bern(1 | theta) if y = 0
  • Pr[y] = Bern(0 | theta) * Poisson(y | lambda) / (1 - PoissonCDF(0 | lambda)) if y > 0

implement as conditional with second case given by truncation
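A minimal sketch of both likelihoods in Stan, assuming N, y, theta, and lambda are declared elsewhere (illustrative, not the manual's final text):

// zero-inflated Poisson: mixture, marginalized with log_sum_exp
for (n in 1:N) {
  if (y[n] == 0)
    increment_log_prob(log_sum_exp(bernoulli_log(1, theta),
                                   bernoulli_log(0, theta)
                                     + poisson_log(y[n], lambda)));
  else
    increment_log_prob(bernoulli_log(0, theta)
                         + poisson_log(y[n], lambda));
}

// hurdle: conditional, with the positive case truncated below at 1
for (n in 1:N) {
  (y[n] == 0) ~ bernoulli(theta);
  if (y[n] > 0) {
    y[n] ~ poisson(lambda);
    increment_log_prob(-log1m(exp(-lambda)));   // divide by 1 - Poisson(0 | lambda)
  }
}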

bob-carpenter commented 9 years ago

From Fraenzi:

p. 56 The log_sum_exp operation just multiplies the probabilities for each prior state j on the log scale in an arithmetically stable way.

  • [x] fix this so it clarifies that we're doing addition on the linear scale, so we need log_sum_exp on the log scale
bob-carpenter commented 9 years ago
  • [x] if_else doc is wrong in terms of order of conditions

overall, it looks very odd to have some functions documented in painstaking detail with values, boundaries, and derivatives, and others with just the text we used to have

bob-carpenter commented 9 years ago
  • [x] remove all the extraneous boundary conditions

doc for operator+() is wrong, and needs many more cases, because

finite + finite = -inf or +inf for some overflows and finite otherwise
inf + inf = inf
-inf + inf = NaN
inf + finite = inf
-inf + finite = -inf

The rules are symmetric here, too, but I didn't include the other versions.

The same for operator- and all the other operators.

bob-carpenter commented 9 years ago
  • [x] cut-and-paste typo throughout of 'opreator' instead of 'operator' (in the math)
bob-carpenter commented 9 years ago

Arthur Breitman reported a doc bug on stan-users:

  • [x] remove bogus .size() call and replace with cols() in the "recursive functions" section of the language spec (around p. 223 in most recent draft)
bob-carpenter commented 9 years ago
  • [x] remove "thing" from "size type can be anything type" in array size() function definition in the doc
bob-carpenter commented 9 years ago
  • [x] make sure the discussion of discontinuities due to step functions is more highlighted and in the right scope; it also applies to the new is_inf and is_nan functions, which take real arguments and return int;
  • [x] the if-then-else function also needs to link to the discontinuity warning
bob-carpenter commented 9 years ago

from Howard Zail on stan-users:

  • [x] See page 50: towards the bottom of the page h[t] is shown in the formula as a function of h[t]. Instead h[t] should be a function of h[t-1]

Ed. It was right as it was, but I elaborated on why. The key is to recognize that h[t] is a local variable that's being reassigned to scale and shift appropriately.

bob-carpenter commented 9 years ago

Add some version of Michael's comments about missing data (from stan-users):

When dealing with missing data the general strategy is to build a missing-data likelihood for each pattern of missingness. That way you can partition your data into each different pattern and then your total likelihood is just

L_{complete} * L_{missing 1} * … * L_{missing K}

In your case, for example, the natural patterns would be (1) complete, (2) missing just party choice, (3) missing just income, (4) missing both. The biggest problem with this strategy is actually computing the missing data likelihoods.

For example, let's say that our data is two-dimensional and the likelihood is modeled as a Gaussian:

p(x1, x2 | mu, sigma1, sigma2, rho) = Normal( x | mu, Sigma)

with x = (x1, x2), mu = (mu1, mu2), and Sigma = ( (sigma1^2, rho sigma1 sigma2), (rho sigma1 sigma2, sigma2^2) ).

In this case we can marginalize over x1 and x2 pretty easily to give the marginal likelihoods

p(x1 | mu, Sigma) = \int dx2 Normal(x | mu, Sigma) = Normal(x1 | mu1, sigma1)

and

p(x2 | mu, Sigma) = \int dx1 Normal(x | mu, Sigma) = Normal(x2 | mu2, sigma2)

Then we'd have for each complete datum,

x_{complete} ~ Normal(mu, Sigma)

and for the missing data

x_{missing x2} ~ Normal(mu1, sigma1)
x_{missing x1} ~ Normal(mu2, sigma2)

Techniques like imputation essentially approximate these marginal likelihood distributions, but they tend to generate inconsistent models that cause problems. Others can speak more on this better than I.

That's all straightforward, if tricky to implement in practice because of the integrals needed to compute the missing data likelihoods, but unfortunately it's not the end of the story. The problem is that we've completely ignored the reason why the data might be missing -- in particular, we're assuming that the data are missing at random and that the missingness is not correlated with the value of the data. But this is a bit suspect in survey modeling, especially with questions like income! Andrew discusses this in detail in "Bayesian Data Analysis", and ultimately the models become significantly more complex (with effects like censoring).

It's probably easiest if you start with the complete data model that you'd like to fit and then iterate from there. In particular, first get it working in Stan (we're happy to help) and then with the full model in hand you can have the conversation about missingness, censoring, and the like.
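A minimal sketch of the two-pattern bivariate version of this in Stan, with the data pre-split by missingness pattern (all names illustrative):

data {
  int<lower=0> N_complete;
  vector[2] x_complete[N_complete];   // both components observed
  int<lower=0> N_miss2;               // x2 missing, x1 observed
  real x1_obs[N_miss2];
  int<lower=0> N_miss1;               // x1 missing, x2 observed
  real x2_obs[N_miss1];
}
parameters {
  vector[2] mu;
  vector<lower=0>[2] sigma;
  real<lower=-1,upper=1> rho;
}
model {
  matrix[2,2] Sigma;
  Sigma[1,1] <- square(sigma[1]);
  Sigma[2,2] <- square(sigma[2]);
  Sigma[1,2] <- rho * sigma[1] * sigma[2];
  Sigma[2,1] <- Sigma[1,2];
  for (n in 1:N_complete)
    x_complete[n] ~ multi_normal(mu, Sigma);
  x1_obs ~ normal(mu[1], sigma[1]);   // marginal likelihood for the pattern missing x2
  x2_obs ~ normal(mu[2], sigma[2]);   // marginal likelihood for the pattern missing x1
}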

  • [x] skipping; need more elaborated model with terminology matching whatever Andrew likes
bob-carpenter commented 9 years ago
  • [x] for operators, include operator syntax, so it will look like
!x = operator!(x) = { ...

x || y  =  operator||(x,y) = { ...
bob-carpenter commented 9 years ago
  • [x] remove "implicit" in description of multinomial_rng and also fix font
bob-carpenter commented 9 years ago
  • [x] thank Kyle Foreman for document patches
bob-carpenter commented 9 years ago
  • [x] clean up use of bare text in LaTeX in negative binomial definition for E[] and var[]. Also, lower-case the latter.
  • [x] also deal with the messed up indentation in the sentences containing E[] and var[] --- don't know how they got centered.
  • [x] check to see if the "inverse scale" in our two negative binomials are the same, and if not, say how they're different
  • [x] change n to `y' in first negative binomial doc
  • [x] clean up negative_binomial_2 doc according to Marco's comment

    In fact, the parameter phi of neg_binomial_2 is the same as the parameter alpha (shape) of neg_binomial. It is also the parameter r of Wikipedia's definition of the negative binomial (number of failures until the experiment is stopped).
    But I don't think it should be called shape either, nor use the same Greek letter as neg_binomial (alpha). The reason is that, while they are the same parameter, their interpretation is very different in the three parameterizations mentioned above (because the other parameters are different):
    In neg_binomial, when we increase alpha, with all other things held constant, the mean and the variance increase (the same is true for Wikipedia's r).
    In neg_binomial_2, when we increase phi, with all other things held constant, the mean is held constant and the variance decreases.
    For this reason, I think the best name for phi is precision parameter, or even better, phi^-1 is the overdispersion parameter (this seems more in tune with the literature).
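For concreteness, the standard moments under the neg_binomial_2(mu, phi) parameterization are

E[y] = mu,    Var[y] = mu + mu^2 / phi,

so increasing phi shrinks the variance toward the Poisson limit mu, which is what makes the precision (or inverse overdispersion) reading natural.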
bob-carpenter commented 9 years ago
  • [x] include note with step function that we have comparison operators, so that step(a - b) gives the same result as (a > b).
bob-carpenter commented 9 years ago
  • [x] fix

From Dan Schrage:

I noticed a few typos in section 5.9 of the reference manual. In the subsection "Coding the Model in Stan," it looks like the parameter Omega used to be called Sigma, but there are still some stale references to Sigma in both the code and the text.

Here's the diff of the five changes I made:

<   Sigma_beta <- quad_form_diag(Sigma,tau);
---
>   Sigma_beta <- quad_form_diag(Omega,tau);
1648c1648
< \code{Sigma}, define it as a transformed parameter.  The function
---
> \code{Sigma\_beta}, define it as a transformed parameter.  The function
1650,1651c1650,1651
< \code{quad\_form\_diag(Sigma,tau)} is equivalent to
< \code{diag\_matrix(tau) * Sigma * diag\_matrix(tau)}, where
---
> \code{quad\_form\_diag(Omega,tau)} is equivalent to
> \code{diag\_matrix(tau) * Omega * diag\_matrix(tau)}, where
1668c1668
<   <- diag_pre_multiply(tau, diag_post_multiply(Sigma, tau));
---
>   <- diag_pre_multiply(tau, diag_post_multiply(Omega, tau));
bob-carpenter commented 9 years ago

Comment from user who'd rather remain anonymous:

  • [x] add footnote in marginalization chapter where the ternary operator is introduced indicating that it is not a LaTeX SNAFU, but rather just syntactic sugar for what R writes as ifelse(c,x,y) (though technically, we should call ifelse "R's syntactic bitters" rather than calling the conditional operator "C's syntactic sugar").
bob-carpenter commented 9 years ago
  • [x] fix comment about v2

From #1071, originally reported by David Chudzicki

The manual says: "Stan 1.0 does not do discrete sampling", with a footnote about plans for v2. But Stan is in v2 already, right? https://github.com/stan-dev/stan/blob/master/src/docs/stan-reference/introduction.tex#L146
bob-carpenter commented 9 years ago

Quoted material from Tomi Peltola, originally submitted as #1072

It seems to me that this section (pages 136-137 in 2.4.0 manual) has problems with tau and its square. Shouldn't it be tau^2 ~ Gamma, not tau ~ Gamma?

The sampling for tau is OK, but the conversion to scale needs to be fixed.

  • [x] use tau^{-1/2} as the scale, not tau^{-2}

Peltola adds:

The rescaling of alpha to get beta should then divide by tau instead of tau^2?
  • [x] change rescaling to multiply by the scale, i.e., tau^{-1/2}
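A minimal sketch of the corrected conversion, assuming tau is the precision and alpha the unscaled coefficient (illustrative names):

tau ~ gamma(a, b);         // precision
sigma <- inv_sqrt(tau);    // scale = tau^{-1/2}, not tau^{-2}
beta <- alpha * sigma;     // rescale by the scale, i.e., divide by sqrt(tau)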
bob-carpenter commented 9 years ago
  • [x] Add a function example for matrix power --- it's a nice example of recursion.
matrix matrix_pow(matrix a, int n) {
  if (n == 0) 
    return diag_matrix(rep_vector(1, rows(a)));  // a^0 is the identity
  else if (n == 1)
    return a;
  else
    return a * matrix_pow(a, n - 1);
}

and I could present an equivalent iterative version:

matrix matrix_pow(matrix a, int n) {
  if (n == 0) {
    return diag_matrix(rep_vector(1, rows(a)));  // a^0 is the identity
  } else if (n == 1) {
    return a;
  } else {
    matrix[rows(a),cols(a)] b;
    b <- a;
    for (m in 2:n)  // b already holds a^1
      b <- b * a;
    return b;
  }
}  
bob-carpenter commented 9 years ago
  • [x] Add clarifications from Daniel Lee on reproducibility:
You need to fix: Stan, the {Cmd, R, Py}Stan interface, compiler, compiler flags, operating system, libraries on the operating system, and, if running through a separate process like R or Python, that process needs to be compiled the same way, with the same versions and the same libraries.

Here, Stan is a particular version. That could be a tagged version or a git commit hash. The same goes for any of the interfaces like RStan, CmdStan, or PyStan. So each piece needs to be the same. (I'm sure that was assumed, but I just wanted to make it clear.)

There are a couple more caveats here. It not only needs the same compiler, but it also needs the same compiler flags and link libraries. Concretely, if you compiled a single Stan program using the same CmdStan code base, but changed the optimization flag (-O3 vs -O2 or -O0), the two programs may not return the identical stream of results. If, however, you compiled it today using one set of flags, took the computer away from the internet and didn't allow it to update anything, then came back in a decade and recompiled the Stan program in the same way, you should get the same results.

Regarding the same machine -- I think it just needs to be the same OS. But if you're talking about something like a desktop that isn't managed by an IT department that restricts everything, it's hard to guarantee that the libraries on the system are identical. This isn't usually a problem across machines using the same operating system, but we've run into some trouble with differences between Mac and Windows.

... same random seed and chain ID ...

Yes. But there's a slight subtlety here. The data needs to be the same down to the bit level. For example, if you are running in RStan, Rcpp handles the conversion between R's floating point numbers and C++ doubles. If Rcpp changes the conversion process or uses different types, there's a (small) chance that there are more or fewer digits saved, and that might not allow you to reproduce what you've generated in the past, though it should allow you to reproduce the new run. This holds, and should hold in the future.

I think we do need the same machine at the hardware level in terms of which CPU is being used.

bob-carpenter commented 9 years ago
  • [x] update version number in .tex files
  • [x] update version number for make
  • [x] update version number in Stan API
bob-carpenter commented 9 years ago
  • [x] fix ALL of the discussion of Cholesky factorizations and multivariate normals and whatnot in the manual to bring them up to date with Ben and Andrew's latest recommendations
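For reference, the direction this points: a minimal sketch of the Cholesky-factored multivariate normal parameterization, assuming K and mu are declared elsewhere (the LKJ shape 2 and the half-Cauchy scale are illustrative, not the final recommendation):

parameters {
  vector[K] beta;
  cholesky_factor_corr[K] L_Omega;   // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;            // scales
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);              // half-Cauchy via the lower bound
  beta ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}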