stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

next manual 2.6.0 #1081

Closed bob-carpenter closed 9 years ago

bob-carpenter commented 9 years ago

This is where issues for the manual after 2.5.0 go.

bob-carpenter commented 9 years ago

Also from Andrew:

Every time I see “Normal,” I wince—but maybe that’s just my problem!

I can live with "normal(0,1)", but don't like the "N(0,1)" used in BDA.

bob-carpenter commented 9 years ago
betanalpha commented 9 years ago
bob-carpenter commented 9 years ago

From Rob Goedman:

syclik commented 9 years ago
bob-carpenter commented 9 years ago

Anything else we want to do this for?

bob-carpenter commented 9 years ago

From @seldomworks in #1097:

There seems to be a typo on page 34 of stan-reference-2.5.0.pdf. The pdf has

model {
  y ~ normal(x*beta, sigma); // likelihood
}

I think it was intended to be

model {
  y ~ normal(x*beta + alpha, sigma); // likelihood
}
bob-carpenter commented 9 years ago

Where you have the single loop over s for the changepoint, you instead need to keep pairs. The easiest way to do that would be in a matrix:

  matrix[T,T] lp;
  lp <- rep_matrix(log_unif, T, T);
  for (s1 in 1:T)
    for (s2 in 1:T)
      for (t in 1:T)
        lp[s1,s2] <- lp[s1,s2] 
                     + poisson_log(D[t], if_else(t < s1, 
                                                 e, if_else(t < s2, m, l)));

and then the model needs to be changed to convert the matrix lp to a vector so it can be passed to log_sum_exp:

  increment_log_prob(log_sum_exp(to_vector(lp)));

The problem is that the computational complexity grows as the number of change points grows. You can see it intuitively from the loops.

Suppose there are N items. With a single change point, you need to consider all N positions. Each position requires O(N) work, so the overall complexity is O(N^2).

With two change points, you need to consider all (N choose 2) pairs of change points. That's a quadratic number of pairs, each requiring O(N) work, so the overall complexity is O(N^3).
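To make the pair version concrete, here is a plain-Python sketch of the same triple loop and the final log_sum_exp (the data D and the rates e, m, l are made-up toy values, and 0-based indexing is used):

```python
import math

def poisson_lpmf(n, lam):
    # log Poisson(n | lam)
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def log_sum_exp(xs):
    # numerically stable log(sum(exp(x) for x in xs))
    mx = max(xs)
    return mx + math.log(sum(math.exp(x - mx) for x in xs))

# Toy data and rates, purely for illustration.
D = [3, 2, 4, 10, 11, 25, 30]
e, m, l = 3.0, 10.0, 27.0
T = len(D)
log_unif = -2.0 * math.log(T)  # uniform prior over (s1, s2) pairs

# lp[s1][s2] = log p(s1, s2, D): the analogue of the Stan matrix lp.
lp = [[log_unif for _ in range(T)] for _ in range(T)]
for s1 in range(T):
    for s2 in range(T):
        for t in range(T):
            rate = e if t < s1 else (m if t < s2 else l)
            lp[s1][s2] += poisson_lpmf(D[t], rate)

# Marginalize out (s1, s2), as increment_log_prob(log_sum_exp(to_vector(lp)))
# does in the Stan program.
marginal = log_sum_exp([x for row in lp for x in row])
```

The triple loop runs T^3 times: O(T^2) pairs, each costing O(T), which is exactly the O(N^3) growth described above.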

bob-carpenter commented 9 years ago

From Sebastian Weber on stan-users:

syclik commented 9 years ago
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Related to a modeling issue brought up by Guido Biele on stan-users list:

bob-carpenter commented 9 years ago

From Andrew via e-mail:

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago
jrnold commented 9 years ago

Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.

bob-carpenter commented 9 years ago

From David Hallvig in issue #1138

The HMM section is part of the Time-Series Models chapter (Ch. 6), but the first paragraph of that chapter states that HMMs are covered in a later chapter (even though the reference given points to a subsection of Time-Series Models, i.e., 6.6).

Here's the relevant section:

\chapter{Time-Series Models}

\noindent
Time series data come arranged in temporal order. This chapter
presents two kinds of time series models, regression-like models such
as autoregression and moving average models, and hidden Markov models.

In later chapters, we discuss two alternative models which may be
applied to time-series data, 
%
\begin{itemize}
\item Gaussian processes (GP) in \refchapter{gaussian-processes} and 
\item hidden Markov models (HMM) in \refsection{hmms}.
\end{itemize}
bob-carpenter commented 9 years ago

Question from Jon Zelner on stan-users:

My implementation is essentially the same as the vanilla GP implementation on page 130 of the Stan reference (hence the lack of a code example). However, since certain symptoms may be less important to the lab-confirmed diagnosis than others, I have been trying to implement the automatic relevance determination on page 133 of the reference manual. When I try to do this, however, the parameter values tend to blow up. So, I was wondering if anyone out there with experience working with these kinds of models had a suggestion for a) priors for the relevance parameters and/or b) constraints that might make for a more useful model.

Response from Aki:

Note also that rho and eta are weakly identifiable; their ratio is better identified and affects the nonlinearity. However, it is usual to use independent priors (I don't remember if anyone uses a joint prior with dependency).

I usually like to think in terms of a suitable prior for the length scale l = 1/rho. If the length scale is larger than the scale of the data, the model is practically linear (with respect to the particular covariate), and increasing the length scale does not change the model. Thus you should use a prior that goes down for larger length-scale values. If the length scale is so small that the correlation between data points is zero, then decreasing the length scale further does not change the model. Usually I've had no need to keep the length scale from going to very small values, but sometimes it is necessary. I usually use a half-t prior for the length scale as a weakly informative prior.

Eta corresponds to how much of the variation is explained by the regression function and has a role similar to the prior variance for linear model weights. Thus we can use the same weakly informative priors as in linear models. I often use a half-t prior for eta.

Question from Herra Huu:

I'm a bit puzzled by Aki's response. It would make sense to me if the dimension of the input (== D) were one. But if I understood the original question correctly, here we would have: a) D > 1, and b) the covariance function f(x[i], x[j]) = eta * exp(-sum_{d=1}^{D} rho[d] * pow(x[i,d] - x[j,d], 2)).

Why, in this case, is the term “automatic relevance determination” misleading? I mean, if, for example, rho[1] == 0, then we could just drop the first input dimension and we would still get exactly the same results. In general, wouldn't it be true that the closer rho[d] is to zero, the less effect x[,d] has? (Well, the original scale of the inputs matters, so it's not exactly that straightforward unless we normalize the inputs, etc.)
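The rho[1] == 0 point above is easy to check numerically. Here is a small Python sketch (inputs and hyperparameter values are made up) using the covariance function exactly as written in the question, showing that a dimension with a zero inverse length scale can be dropped without changing the covariance:

```python
import math

def ard_cov(x_i, x_j, eta, rho):
    # f(x_i, x_j) = eta * exp(-sum_d rho[d] * (x_i[d] - x_j[d])^2)
    s = sum(r * (a - b) ** 2 for r, a, b in zip(rho, x_i, x_j))
    return eta * math.exp(-s)

eta = 1.5
x1 = [0.3, 1.2]
x2 = [2.0, -0.4]

# With rho[0] == 0, the first input dimension contributes nothing,
# so dropping it entirely gives the same covariance value.
k_full = ard_cov(x1, x2, eta, [0.0, 0.7])
k_dropped = ard_cov(x1[1:], x2[1:], eta, [0.7])
```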

Response from Andrew:

I accept that we should continue to use the term “automatic relevance determination” because it exists, and people use it. But I find the term a bit distracting because, from the perspective of Stan, it’s just hierarchical Bayesian modeling:

  1. It’s not any more “automatic” than any other Bayesian inference
  2. The “relevance” interpretation seems tied to some very specific model choices
  3. “Determination” is just inference.

Response from Aki:

In general, wouldn't it be true that the closer the rho[d] is to zero the less effect x[,d] would have?

A priori yes, but not a posteriori, as the actual dependencies between x and y also matter. What I tried to say is that with a covariate x1 having a linear effect and another covariate x2 having a nonlinear effect, it is possible that rho1 < rho2 even if the predictive relevance of x1 is higher. The rho is related to the relevance, but it is more accurate to say that it measures the nonlinearity (or the expected number of upcrossings, GPML p. 80). I couldn't quickly find a nice example with a GP, but figures 1 and 3 in http://becs.aalto.fi/en/research/bayes/publications/LampinenVehtari_NN2001_preprint.pdf illustrate the same issue with an MLP. We have done the same experiment with GPs, but I couldn't quickly find the figures.

syclik commented 9 years ago
rtrangucci commented 9 years ago

Currently:

parameters {
  real<lower = 0> nu;
  real<lower = 0> tau;
...
transformed parameters {
  real beta;
  beta <- alpha / sqrt(tau);
...
model {
  real half_nu;
  half_nu <- 0.5 * nu;
  tau ~ gamma(half_nu, half_nu);
  alpha ~ normal(0,1);

to:

parameters {
  real<lower = 0> nu;
  real<lower = 0> tau;
  real alpha;  
...
transformed parameters {
  real beta;
  beta <- alpha / sqrt(tau);
...
model {
  real half_nu;
  half_nu <- 0.5 * nu;
  tau ~ gamma(half_nu, half_nu);
  alpha ~ normal(0,1);

Declaration of alpha is the new part.

bob-carpenter commented 9 years ago

From Howard Zail on stan-users:

On page 353 in the Stan manual, there is some code for the Lasso. The key line is:

increment_log_prob(lambda * abs(beta[k]));

However, lambda is always non-negative. Optimizing a model will maximize the log probability, so the above code produces larger beta values for larger lambda values. The thinking behind Lasso is to produce smaller beta values for larger lambda. I think what is really required is:

increment_log_prob( - lambda * abs(beta[k]));

Probably the same problem exists with the Ridge regression code and the elastic net code in the Stan manual.
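A quick sign check in plain Python (toy numbers) makes the point: Stan maximizes the accumulated log probability, so the added term must decrease as |beta| grows for it to act as a penalty:

```python
lam = 2.0  # lambda, always non-negative

def as_printed(beta):
    # the manual's version: lambda * abs(beta)
    return lam * abs(beta)

def corrected(beta):
    # the proposed fix: -lambda * abs(beta)
    return -lam * abs(beta)

# As printed, a larger coefficient *increases* the objective being
# maximized; with the minus sign, larger coefficients are penalized.
printed_small, printed_big = as_printed(1.0), as_printed(3.0)
fixed_small, fixed_big = corrected(1.0), corrected(3.0)
```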

From Andre Pfeuffer on stan-users:

The formula on pg. 322 of stan-reference 2.3.0, r(\beta) = ..., shows a lambda1 and a lambda2. Lambda1 becomes lambda2 and vice versa in the code on pg. 323, so they're flipped. The sign should also be negative. I'm not sure whether the betahat transformation shown on the same page applies only to the lasso or is also needed for ridge regression.

From Bobby Jacob on stan-users:

I agree with Andre on the things he's mentioned. In the Stan manual 2.5, page 396, lambda 1 and 2 don't correspond to the last equation on page 395. Better to have lambda2 be associated with beta^2, and lambda1 with abs(beta). I think this is how the error was created in the subsequent program. Also, change abs to fabs to avoid warnings when running.

Andre is also right to say that the lambdas in the sample program on 396 should have negative signs on them. Each of the samples for the lasso and ridge had negative signs on the lambda because you are penalizing the log likelihood, as you are here.

randommm commented 9 years ago

Change

y[n] ~ bernoulli(theta);
if (y[n] > 0)
  y[n] ~ poisson(lambda) T[1,];

to

(y[n] == 0) ~ bernoulli(theta);
if (y[n] > 0)
  y[n] ~ poisson(lambda) T[1,]; 

or even

(y[n] > 0) ~ bernoulli(theta);
if (y[n] > 0)
  y[n] ~ poisson(lambda) T[1,];

if you change the convention of the pdf.
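For reference, the marginal distribution the corrected statements encode (using the first version, where theta is the probability of a zero) can be written out directly. This Python sketch, with made-up parameter values, checks that the resulting probabilities sum to one:

```python
import math

def hurdle_lpmf(y, theta, lam):
    # log p(y): zero with probability theta; otherwise Poisson(lam)
    # truncated to y >= 1, matching y[n] ~ poisson(lambda) T[1,].
    if y == 0:
        return math.log(theta)
    poisson_lpmf = y * math.log(lam) - lam - math.lgamma(y + 1)
    log_p_positive = math.log1p(-math.exp(-lam))  # log P(Poisson >= 1)
    return math.log1p(-theta) + poisson_lpmf - log_p_positive

theta, lam = 0.3, 2.5  # illustrative values
total = sum(math.exp(hurdle_lpmf(y, theta, lam)) for y in range(200))
# total is 1 up to a negligible truncated tail, confirming a proper distribution
```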

bob-carpenter commented 9 years ago

On page 248 of version 2.5.0-4 I see: The truncation has the same effect as the following direct update to the accumulated log probability (see the subsection of Section 24.3 contrasting log probability increment and sampling statements for more information).

increment_log_prob(normal_log(y, 0, 1));
increment_log_prob(-(1 - log(normal_cdf(-0.5, 0, 1))));

Should it instead read:

increment_log_prob(normal_log(y, 0, 1));
increment_log_prob(-log(1 - normal_cdf(-0.5, 0, 1)));
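The corrected version can be checked numerically: subtracting log(1 - normal_cdf(-0.5, 0, 1)) on the log scale renormalizes the density over [-0.5, infinity). A plain-Python sketch using the erf-based normal CDF:

```python
import math

def normal_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_lpdf(y):
    # log density of the standard normal
    return -0.5 * y * y - 0.5 * math.log(2.0 * math.pi)

lb = -0.5
log_norm = math.log(1.0 - normal_cdf(lb))  # the corrected increment's magnitude

# Riemann-sum check that exp(lpdf - log_norm) integrates to 1 over [lb, inf);
# the upper limit 10 is far enough out for the tail to be negligible.
h = 1e-3
total = 0.0
y = lb
while y < 10.0:
    total += math.exp(normal_lpdf(y) - log_norm) * h
    y += h
```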
bob-carpenter commented 9 years ago
aadler commented 9 years ago

From above: “lowercase all math densities, e.g., replacing "Normal" with "normal", including all the ones named after people such as "Weibull" and "Wishart"”. Doesn't every manual of style require the capitalization of proper names?

andrewgelman commented 9 years ago

You can get Bob and me all wound up on this one! In my books, I capitalize Poisson and Wishart and Bernoulli but not normal and binomial and gamma. But I recall that Bob made a compelling argument that, in a software manual, typographical consistency is a more important concern.


bob-carpenter commented 9 years ago

Typographically, the Stan manual uses:

- sans-serif: mathematical probability functions
- typewriter: code functions
- serif: running text, other mathematical functions like log() or exp()

Gelman and Hill, and Gelman et al., use:

- typewriter: code functions
- serif: running text, mathematical probability functions, other mathematical functions

In Andrew's book with Jennifer, you see "dpois" in code, because it's BUGS code (which Andrew calls "Bugs"). And then you see three conventions for distributions: N for normal, upper cased for those named after a person, and lower cased for others, all in the running text font.

I'm OK with sans-serif or serif for mathematical probability functions, and OK with lower-case or upper-case for probability functions.

I'm less happy with "N" or "G" for normal or gamma because it looks inconsistent, and even less happy when people put it in a script (calligraphic in TeX-speak) font. And if we go lower case, I'd just as soon lower-case "poisson" and "weibull".

Andrew --- do you want to decide? There's some work in changing it because there are hundreds of pages packed with distribution names, but some of the changes are just macros.


aadler commented 9 years ago

I think it boils down to the following question: do you view the Stan manual as a book or text in its own right, or is it merely a convenient place to store documentation, which really could or should all be online? If the former, then you need to follow some accepted manual of style, none of which, as far as I know, allows a lowercase first letter for a person's name. If the latter (i.e., if Stan were interactive, you wouldn't have a document; you would have R-like web pages that pop up when calling help, but the software as such doesn't allow that), then I think you can be a bit more lax. The fact that you have a separate citation for the manual, distinct from that of the software, implies that you view the Stan manual as a stand-alone written text (eligible for an ISBN, perhaps?), and I think you should, at the very least, uppercase proper names.

As for what to do with normal or negative binomial, I don't have a good source for that. In my own writing, I tend to leave them lowercase as per standard English, but I can hear the argument that in the realm of statistics, they are the "proper names" of the distributions.

If you are transcribing code (dpois for example) then I believe that as long as it is explicit that it is code (blockquote or typewriter font) it needs to be printed exactly as it should be in a program. Capitalization would be contraindicated if the code is meant to be used lowercase.

From reading math textbooks, I'm comfortable with calligraphic N for normal, and I think 99% of those who use Stan would be as well. I'm less comfortable with G for gamma, as I've also seen the actual Greek letter used, and gamma is spelled out much more often than normal is. So if I had my druthers (which I don't, I know :) ), I'd go with the current Stan typography, which separates mathematical formulæ from text and code, and go with the "proper names upper / standard words lower" split for probability functions, as if there isn't anything special about them (treat them like the rest of the English language). As for N/G, I think you can get away with N, but it's probably better to spell everything out.

I certainly agree that regardless of the final decision, the manual must be consistent.

bob-carpenter commented 9 years ago

We think of it as a manual. There are really three parts to the manual:

Are you equating "online" with being in HTML format? Other people have said we should render in HTML for searchability.

Don't worry, we'll continue to capitalize names used as names. The question is only what to do with the mathematical function symbols, which are neither running text (where they're clearly capitalized) nor computer code (where they're clearly lowercased).


bob-carpenter commented 9 years ago

Moved from issue #1180 created by @ksvanhorn:

On p. 189, the Stan Modeling Language Manual gives example code for reparameterizing a Wishart distribution. The code is incorrect -- the last column of the matrix A is never initialized. Presumably something like the following is the missing initialization:

    for (i in 1:(K-1))
        A[i,K] <- 0;
    A[K,K] <- sqrt(c[K]);
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Krzysztof Sakrejda on stan-users suggested an alternative description for the center_lp function example:

Here is an example of a function to assign standard normal priors to a vector of coefficients, along with a center and scale, and return the translated and scaled coefficients.
bob-carpenter commented 9 years ago

http://www.r-project.org/doc/R-FDA.pdf

Something along the lines of the SDLC section would be great to have in your user manual. Basically, such a section would describe the software life cycle: how Stan is programmed, released, and managed.

blindglobe commented 9 years ago

Regarding the "process description chapter", I think you mean "Process for Software Development and Release", which would be wonderful for supporting corporate IT computer-systems validation work by providing justification that "STAN is developed in a way which makes it fit for the purpose of the process of performing a Bayesian statistical analyses". Sure, it might seem obvious, but for dotting the i's and crossing the t's in a critical review, it makes things that much easier if the process for software development is described and followed. I'd be happy to review and comment before you release, if you want (speaking as co-ghost-writer of the R-FDA document cited).

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Note that the next sentence of the manual is wrong. It should say

At $\nu = 1$, the LKJ correlation distribution reduces to the uniform distribution over correlation matrices of order $K$.
bob-carpenter commented 9 years ago

Contributed by Gokcen Eraslan via patch #1227, which we haven't merged because it was against master:

bob-carpenter commented 9 years ago

John Sutton mentioned on stan-users:

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Krzysztof Sakrejda pointed out on stan-users that this is wrong:

Arrays, on the other hand, should be traversed in 
row-major (or first-index fastest) order.
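Presumably the problem is the parenthetical: "row-major" storage conventionally means the *last* index varies fastest, not the first. A short Python illustration with a 2 x 3 array (nested lists standing in for the array):

```python
rows, cols = 2, 3
a = [[10 * i + j for j in range(cols)] for i in range(rows)]

# Row-major (last-index fastest): a[0][0], a[0][1], a[0][2], a[1][0], ...
row_major = [a[i][j] for i in range(rows) for j in range(cols)]

# First-index fastest (column-major): a[0][0], a[1][0], a[0][1], ...
first_index_fastest = [a[i][j] for j in range(cols) for i in range(rows)]

# The two orders differ, so "row-major" and "first-index fastest"
# are not synonyms.
```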
sakrejda commented 9 years ago

One more, in the section on "reparameterizing the Cauchy", pg. 182: the text "The inverse of the cumulative distribution function, F_X^{-1} : (0, 1) → R, is thus" is followed by an equation for F^{-1}(y) specified in terms of x.

bob-carpenter commented 9 years ago
bob-carpenter commented 9 years ago

Fixing some typos.

bob-carpenter commented 9 years ago