stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual 2.6.0 #1081

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 9 years ago

This is where issues for the manual after 2.5.0 go.

bob-carpenter commented 9 years ago

Also from Andrew:

Every time I see “Normal,” I wince—but maybe that’s just my problem!

I can live with "normal(0,1)", but don't like the "N(0,1)" used in BDA.

bob-carpenter commented 9 years ago
betanalpha commented 9 years ago
bob-carpenter commented 9 years ago

From Rob Goedman:

syclik commented 9 years ago
bob-carpenter commented 9 years ago

Anything else we want to do this for?

bob-carpenter commented 9 years ago

From @seldomworks in #1097:

There seems to be a typo on page 34 of stan-reference-2.5.0.pdf. The pdf has

model {
  y ~ normal(x*beta, sigma); // likelihood
}

I think it was intended to be

model {
  y ~ normal(x*beta + alpha, sigma); // likelihood
}
bob-carpenter commented 9 years ago

Where you have the single loop over s for the changepoint, you instead need to keep pairs. The easiest way to do that would be in a matrix:

  matrix[T,T] lp;
  lp <- rep_matrix(log_unif, T, T);
  for (s1 in 1:T)
    for (s2 in 1:T)
      for (t in 1:T)
        lp[s1,s2] <- lp[s1,s2] 
                     + poisson_log(D[t], if_else(t < s1, 
                                                 e, if_else(t < s2, m, l)));

and then the model needs to be changed to convert the matrix lp to a vector so it can be passed to log_sum_exp:

  increment_log_prob(log_sum_exp(to_vector(lp)));

The problem is that the computational complexity grows as the number of change points grows. You can see it intuitively from the loops.

Suppose there are N items. With a single change point, you need to consider all N positions. Each position requires O(N) work, so the overall complexity is O(N^2).

With two change points, you need to consider all (N choose 2) pairs of change points. That's a quadratic number of pairs, each requiring O(N) work, so the overall complexity is O(N^3).
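To make the pair version concrete, here is a plain-Python sketch of the same triple loop and the final log_sum_exp (the data D and the rates e, m, l are made-up toy values, and 0-based indexing is used):

```python
import math

def poisson_lpmf(n, lam):
    # log Poisson(n | lam)
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def log_sum_exp(xs):
    # numerically stable log(sum(exp(x) for x in xs))
    mx = max(xs)
    return mx + math.log(sum(math.exp(x - mx) for x in xs))

# Toy data and rates, purely for illustration.
D = [3, 2, 4, 10, 11, 25, 30]
e, m, l = 3.0, 10.0, 27.0
T = len(D)
log_unif = -2.0 * math.log(T)  # uniform prior over (s1, s2) pairs

# lp[s1][s2] = log p(s1, s2, D): the analogue of the Stan matrix lp.
lp = [[log_unif for _ in range(T)] for _ in range(T)]
for s1 in range(T):
    for s2 in range(T):
        for t in range(T):
            rate = e if t < s1 else (m if t < s2 else l)
            lp[s1][s2] += poisson_lpmf(D[t], rate)

# Marginalize out (s1, s2), as increment_log_prob(log_sum_exp(to_vector(lp)))
# does in the Stan program.
marginal = log_sum_exp([x for row in lp for x in row])
```

The triple loop runs T^3 times: O(T^2) pairs, each costing O(T), which is exactly the O(N^3) growth described above.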

bob-carpenter commented 9 years ago

From Sebastian Weber on stan-users:

syclik commented 9 years ago
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Related to a modeling issue brought up by Guido Biele on stan-users list:

bob-carpenter commented 9 years ago

From Andrew via e-mail:

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago
jrnold commented 9 years ago

Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.

bob-carpenter commented 9 years ago

From David Hallvig in issue #1138

The HMM section is part of the Time-Series Models chapter (Ch. 6), but the first paragraph of that chapter states that HMMs are covered in a later chapter (even though the reference given points to a subsection of Time-Series Models, i.e., 6.6).

Here's the relevant section:

\chapter{Time-Series Models}

\noindent
Time series data come arranged in temporal order. This chapter
presents two kinds of time series models, regression-like models such
as autoregression and moving average models, and hidden Markov models.

In later chapters, we discuss two alternative models which may be
applied to time-series data, 
%
\begin{itemize}
\item Gaussian processes (GP) in \refchapter{gaussian-processes} and 
\item hidden Markov models (HMM) in \refsection{hmms}.
\end{itemize}
bob-carpenter commented 9 years ago

Question from Jon Zelner on stan-users:

My implementation is essentially the same as the vanilla GP implementation on page 130 of the Stan reference (hence the lack of a code example). However, since certain symptoms may be less important to the lab-confirmed diagnosis than others, I have been trying to implement the automatic relevance determination on page 133 of the reference manual. When I try to do this, however, the parameter values tend to blow up. So, I was wondering if anyone out there with experience working with these kinds of models had a suggestion for a) priors for the relevance parameters and/or b) constraints that might make for a more useful model.

Response from Aki:

Note also that rho and eta are weakly identifiable; their ratio is better identified and affects the nonlinearity. However, it is usual to use independent priors (I don't remember if anyone uses a joint prior with dependency).

I usually like to think in terms of a suitable prior for the length scale l = 1/rho. If the length scale is larger than the scale of the data, the model is practically linear (with respect to the particular covariate), and increasing the length scale does not change the model. Thus you should use a prior that goes down for larger length-scale values. If the length scale is so small that the correlation between data points is zero, then decreasing the length scale further does not change the model. Usually I've had no need to keep the length scale from going to very small values, but sometimes it is necessary. I usually use a half-t prior for the length scale as a weakly informative prior.

Eta corresponds to how much of the variation is explained by the regression function and has a role similar to the prior variance for linear model weights. Thus we can use the same weakly informative priors as in linear models. I often use a half-t prior for eta.

Question from Herra Huu:

I'm a bit puzzled by Aki's response. It would make sense to me if the dimension of the input (== D) were one. But if I understood the original question correctly, here we would have: a) D > 1, and b) the covariance function f(x[i], x[j]) = eta * exp(-sum_{d=1}^{D} rho[d] * pow(x[i,d] - x[j,d], 2)).

Why, in this case, is the term “automatic relevance determination” misleading? I mean, if, for example, rho[1] == 0, then we could just drop the first input dimension and we would still get exactly the same results. In general, wouldn't it be true that the closer rho[d] is to zero, the less effect x[,d] has? (Well, the original scale of the inputs matters, so it's not exactly that straightforward unless we normalize the inputs, etc.)
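The rho[1] == 0 point above is easy to check numerically. Here is a small Python sketch (inputs and hyperparameter values are made up) using the covariance function exactly as written in the question, showing that a dimension with a zero inverse length scale can be dropped without changing the covariance:

```python
import math

def ard_cov(x_i, x_j, eta, rho):
    # f(x_i, x_j) = eta * exp(-sum_d rho[d] * (x_i[d] - x_j[d])^2)
    s = sum(r * (a - b) ** 2 for r, a, b in zip(rho, x_i, x_j))
    return eta * math.exp(-s)

eta = 1.5
x1 = [0.3, 1.2]
x2 = [2.0, -0.4]

# With rho[0] == 0, the first input dimension contributes nothing,
# so dropping it entirely gives the same covariance value.
k_full = ard_cov(x1, x2, eta, [0.0, 0.7])
k_dropped = ard_cov(x1[1:], x2[1:], eta, [0.7])
```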

Response from Andrew:

I accept that we should continue to use the term “automatic relevance determination” because it exists, and people use it. But I find the term a bit distracting because, from the perspective of Stan, it’s just hierarchical Bayesian modeling:

  1. It’s not any more “automatic” than any other Bayesian inference
  2. The “relevance” interpretation seems tied to some very specific model choices
  3. “Determination” is just inference.

Response from Aki:

In general, wouldn't it be true that the closer the rho[d] is to zero the less effect x[,d] would have?

A priori yes, but not a posteriori, as the actual dependencies between x and y also matter. What I tried to say is that with a covariate x1 having a linear effect and another covariate x2 having a nonlinear effect, it is possible that rho1 < rho2 even if the predictive relevance of x1 is higher. The rho is related to the relevance, but it is more accurate to say that it measures the nonlinearity (or the expected number of upcrossings, GPML p. 80). I couldn't quickly find a nice example with a GP, but figures 1 and 3 in http://becs.aalto.fi/en/research/bayes/publications/LampinenVehtari_NN2001_preprint.pdf illustrate the same issue with an MLP. We have done the same experiment with GPs, but I couldn't quickly find the figures.

syclik commented 9 years ago
rtrangucci commented 9 years ago

Currently:

parameters {
  real<lower = 0> nu;
  real<lower = 0> tau;
...
transformed parameters {
  real beta;
  beta <- alpha / sqrt(tau);
...
model {
  real half_nu;
  half_nu <- 0.5 * nu;
  tau ~ gamma(half_nu, half_nu);
  alpha ~ normal(0,1);

to:

parameters {
  real<lower = 0> nu;
  real<lower = 0> tau;
  real alpha;  
...
transformed parameters {
  real beta;
  beta <- alpha / sqrt(tau);
...
model {
  real half_nu;
  half_nu <- 0.5 * nu;
  tau ~ gamma(half_nu, half_nu);
  alpha ~ normal(0,1);

Declaration of alpha is the new part.

bob-carpenter commented 9 years ago

From Howard Zail on stan-users:

On page 353 in the Stan manual, there is some code for the Lasso. The key line is:

increment_log_prob(lambda * abs(beta[k]));

However, lambda is always non-negative. Optimizing a model will maximize the log probability, so the above code produces larger beta values for larger lambda values. The thinking behind Lasso is to produce smaller beta values for larger lambda. I think what is really required is:

increment_log_prob( - lambda * abs(beta[k]));

Probably the same problem exists with the Ridge regression code and the elastic net code in the Stan manual.
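A quick sign check in plain Python (toy numbers) makes the point: Stan maximizes the accumulated log probability, so the added term must decrease as |beta| grows for it to act as a penalty:

```python
lam = 2.0  # lambda, always non-negative

def as_printed(beta):
    # the manual's version: lambda * abs(beta)
    return lam * abs(beta)

def corrected(beta):
    # the proposed fix: -lambda * abs(beta)
    return -lam * abs(beta)

# As printed, a larger coefficient *increases* the objective being
# maximized; with the minus sign, larger coefficients are penalized.
printed_small, printed_big = as_printed(1.0), as_printed(3.0)
fixed_small, fixed_big = corrected(1.0), corrected(3.0)
```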

From Andre Pfeuffer on stan-users:

The formula on pg. 322 of stan-reference 2.3.0, r(\beta) = ..., shows a lambda1 and a lambda2. Lambda1 becomes lambda2 and vice versa in the code on pg. 323, so they're flipped. The sign should also be negative. I'm not sure whether the betahat transformation shown on the same page applies only to the lasso or is also needed for ridge regression.

From Bobby Jacob on stan-users:

I agree with Andre on the things he's mentioned. In the Stan manual 2.5, page 396, lambda 1 and 2 don't correspond to the last equation on page 395. Better to have lambda2 be associated with beta^2, and lambda1 with abs(beta). I think this is how the error was created in the subsequent program. Also, change abs to fabs to avoid warnings when running.

Andre is also right to say that the lambdas in the sample program on 396 should have negative signs on them. Each of the samples for the lasso and ridge had negative signs on the lambda because you are penalizing the log likelihood, as you are here.

randommm commented 9 years ago

Change

y[n] ~ bernoulli(theta);
if (y[n] > 0)
  y[n] ~ poisson(lambda) T[1,];

to

(y[n] == 0) ~ bernoulli(theta);
if (y[n] > 0)
  y[n] ~ poisson(lambda) T[1,]; 

or even

(y[n] > 0) ~ bernoulli(theta);
if (y[n] > 0)
  y[n] ~ poisson(lambda) T[1,];

if you change the convention of the pdf.
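For reference, the marginal distribution the corrected statements encode (using the first version, where theta is the probability of a zero) can be written out directly. This Python sketch, with made-up parameter values, checks that the resulting probabilities sum to one:

```python
import math

def hurdle_lpmf(y, theta, lam):
    # log p(y): zero with probability theta; otherwise Poisson(lam)
    # truncated to y >= 1, matching y[n] ~ poisson(lambda) T[1,].
    if y == 0:
        return math.log(theta)
    poisson_lpmf = y * math.log(lam) - lam - math.lgamma(y + 1)
    log_p_positive = math.log1p(-math.exp(-lam))  # log P(Poisson >= 1)
    return math.log1p(-theta) + poisson_lpmf - log_p_positive

theta, lam = 0.3, 2.5  # illustrative values
total = sum(math.exp(hurdle_lpmf(y, theta, lam)) for y in range(200))
# total is 1 up to a negligible truncated tail, confirming a proper distribution
```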

bob-carpenter commented 9 years ago

On page 248 of version 2.5.0-4 I see: The truncation has the same effect as the following direct update to the accumulated log probability (see the subsection of Section 24.3 contrasting log probability increment and sampling statements for more information).

increment_log_prob(normal_log(y, 0, 1));
increment_log_prob(-(1 - log(normal_cdf(-0.5, 0, 1))));

Should it instead read:

increment_log_prob(normal_log(y, 0, 1));
increment_log_prob(-log(1 - normal_cdf(-0.5, 0, 1)));
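The corrected version can be checked numerically: subtracting log(1 - normal_cdf(-0.5, 0, 1)) on the log scale renormalizes the density over [-0.5, infinity). A plain-Python sketch using the erf-based normal CDF:

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_lpdf(y):
    # log density of the standard normal
    return -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)

lb = -0.5
log_norm = math.log(1.0 - normal_cdf(lb))  # the corrected increment's magnitude

# Riemann-sum check that exp(lpdf - log_norm) integrates to 1 over [lb, inf);
# the upper limit 10 is far enough out for the tail to be negligible.
h = 1e-3
total = 0.0
y = lb
while y < 10.0:
    total += math.exp(normal_lpdf(y) - log_norm) * h
    y += h
```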
bob-carpenter commented 9 years ago
aadler commented 9 years ago

From above: “lowercase all math densities, e.g., replacing "Normal" with "normal", including all the ones named after people such as "Weibull" and "Wishart"”. Doesn't every manual of style require the capitalization of proper names?

andrewgelman commented 9 years ago

You can get Bob and me all wound up on this one! In my books, I capitalize Poisson and Wishart and Bernoulli but not normal and binomial and gamma. But I recall that Bob made a compelling argument that, in a software manual, typographical consistency is a more important concern.


bob-carpenter commented 9 years ago

Typographically, the Stan manual uses:

- sans-serif: mathematical probability functions
- typewriter: code functions
- serif: running text, other mathematical functions like log() or exp()

Gelman and Hill, and Gelman et al., use:

- typewriter: code functions
- serif: running text, mathematical probability functions, other mathematical functions

In Andrew's book with Jennifer, you see "dpois" in code, because it's BUGS code (which Andrew calls "Bugs"). And then you see three conventions for distributions: N for normal, upper cased for those named after a person, and lower cased for others, all in the running text font.

I'm OK with sans-serif or serif for mathematical probability functions, and OK with lower-case or upper-case for probability functions.

I'm less happy with "N" or "G" for normal or gamma because it looks inconsistent, and even less happy when people put it in a script (calligraphic in TeX-speak) font. And if we go lower case, I'd just as soon lower-case "poisson" and "weibull".

Andrew --- do you want to decide? There's some work in changing it because there are hundreds of pages packed with distribution names, but some of the changes are just macros.


aadler commented 9 years ago

I think it boils down to the following question: do you view the Stan manual as a book or text in its own right, or is it merely a convenient place to store documentation, which really could or should all be online? If the former, then you need to follow some accepted manual of style, none of which, as far as I know, allows a lowercase first letter for a person's name. If the latter (i.e., if Stan were interactive, you wouldn't have a document; you would have R-like web pages that pop up when calling help, but the software as such doesn't allow that), then I think you can be a bit more lax. The fact that you have a separate citation for the manual, distinct from that of the software, implies that you view the Stan manual as a stand-alone written text (eligible for an ISBN, perhaps?), and I think you should, at the very least, uppercase proper names.

As for what to do with normal or negative binomial, I don't have a good source for that. In my own writing, I tend to leave them lowercase as per standard English, but I can hear the argument that in the realm of statistics, they are the "proper names" of the distributions.

If you are transcribing code (dpois for example) then I believe that as long as it is explicit that it is code (blockquote or typewriter font) it needs to be printed exactly as it should be in a program. Capitalization would be contraindicated if the code is meant to be used lowercase.

From reading math textbooks, I'm comfortable with calligraphic N for normal, and I think 99% of those who use Stan would be as well. I'm less comfortable with G for gamma, as I've also seen the actual Greek letter used, and gamma is spelled out much more often than normal is. So if I had my druthers (which I don't, I know :) ), I'd go with the current Stan typography, which separates mathematical formulæ from text and code, and go with the "proper names upper / standard words lower" split for probability functions, as if there isn't anything special about them (treat them like the rest of the English language). As for N/G, I think you can get away with N, but it's probably better to spell everything out.

I certainly agree that regardless of the final decision, the manual must be consistent.

bob-carpenter commented 9 years ago

We think of it as a manual. There are really three parts to the manual:

Are you equating "online" with being in HTML format? Other people have said we should render in HTML for searchability.

Don't worry, we'll continue to capitalize names used as names. The question is only what to do with the mathematical function symbols, which are neither running text (where they're clearly capitalized) nor computer code (where they're clearly lowercased).


bob-carpenter commented 9 years ago

Moved from issue #1180 created by @ksvanhorn:

On p. 189, the Stan Modeling Language Manual gives example code for reparameterizing a Wishart distribution. The code is incorrect -- the last column of the matrix A is never initialized. Presumably something like the following is the missing initialization:

    for (i in 1:(K-1))
        A[i,K] <- 0;
    A[K,K] <- sqrt(c[K]);
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Krzysztof Sakrejda on stan-users suggested an alternative description for the center_lp function example:

Here is an example of a function to assign standard normal priors to a vector of coefficients, along with a center and scale, and return the translated and scaled coefficients.
bob-carpenter commented 9 years ago

http://www.r-project.org/doc/R-FDA.pdf

Something along the lines of the SDLC section would be great to have in your user manual. Basically, such a section would describe the software life cycle: how Stan is programmed, released, and managed.

blindglobe commented 9 years ago

Regarding the "process description chapter", I think you mean "Process for Software Development and Release", which would be wonderful for supporting corporate IT computer-systems validation work by providing justification that "STAN is developed in a way which makes it fit for the purpose of the process of performing a Bayesian statistical analyses". Sure, it might seem obvious, but for dotting the i's and crossing the t's in a critical review, it makes things that much easier if the process for software development is described and followed. I'd be happy to review and comment before you release, if you want (speaking as co-ghost-writer of the R-FDA document cited).

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Note that the next sentence of the manual is wrong. It should say

At $\nu = 1$, the LKJ correlation distribution reduces to the uniform distribution over correlation matrices of order $K$.
bob-carpenter commented 9 years ago

Contributed by Gokcen Eraslan via patch #1227, which we haven't merged because it was against master:

bob-carpenter commented 9 years ago

John Sutton mentioned on stan-users:

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Krzysztof Sakrejda pointed out on stan-users that this is wrong:

Arrays, on the other hand, should be traversed in 
row-major (or first-index fastest) order.
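Presumably the problem is the parenthetical: "row-major" storage conventionally means the *last* index varies fastest, not the first. A short Python illustration with a 2 x 3 array (nested lists standing in for the array):

```python
rows, cols = 2, 3
a = [[10 * i + j for j in range(cols)] for i in range(rows)]

# Row-major (last-index fastest): a[0][0], a[0][1], a[0][2], a[1][0], ...
row_major = [a[i][j] for i in range(rows) for j in range(cols)]

# First-index fastest (column-major): a[0][0], a[1][0], a[0][1], ...
first_index_fastest = [a[i][j] for j in range(cols) for i in range(rows)]

# The two orders differ, so "row-major" and "first-index fastest"
# are not synonyms.
```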
sakrejda commented 9 years ago

One more, in the section on "reparameterizing the Cauchy", pg. 182: the text "The inverse of the cumulative distribution function, F_X^{-1} : (0, 1) → R, is thus" is followed by an equation for F^{-1}(y) specified in terms of x.

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Fixing some typos.

bob-carpenter commented 9 years ago