
fixing manual typos/brainos for 2.0 #211

Closed: bob-carpenter closed this issue 10 years ago

bob-carpenter commented 11 years ago
alpha <- sd(y) * (alpha_std + beta_std * mean(x) / sd(x)) + mean(y);

should read:

alpha <- sd(y) * (alpha_std - beta_std * mean(x) / sd(x)) + mean(y);
transformed data {
...
  real<lower=0> max_cov; real<lower=0> min_cov;

fix to be upper=0 for min_cov.

Can min_cov have a lower bound of zero when it is being reassigned to the negative of max_cov in the last line of the transformed data block?
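A minimal sketch of the corrected block (the assignment line is assumed from the description above, not quoted from the manual):

transformed data {
  real<lower=0> max_cov;
  real<upper=0> min_cov;
  ...
  min_cov <- -max_cov;
}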

bob-carpenter commented 11 years ago

Also clarify the cumulative distribution functions to indicate clearly that they use <= (less than or equal) and not just < (less than).
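That is, the definition to state explicitly is

  F_X(x) = Pr[X <= x]

rather than Pr[X < x].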

bob-carpenter commented 10 years ago

From Ross Boylan on the users' list:

There are some points at which the documentation on subscripts and arrays/matrices/vectors is unclear or, possibly, incorrect. This is with the 1.3.0 pdf "Stan Modeling Language" dated 12 April, 2013.

e <- d[1]

e is an array with leading dimension 4 while d, after dropping the first dimension, has leading dimension 3. I don't think they are conformable.

x <- d[4, 3, 5, 2]
x <- e[3, 5, 2]

are said to be the same, but I think the second line is equivalent to d[1, 3, 5, 2] and so they are different. Have I misunderstood?

real a[3, 4];        // 2D array of reals
matrix[3, 4] b;      // matrix with the same dimensions
row_vector[4] r;
b <- a;              // assign 2D array to matrix?
b + a;               // add matrix and array?
r <- a[2];           // assign array row (real[4]) to row_vector?

I would take that as a sign that such operations are illegal, and I think I've seen error messages when I inadvertently tried those operations.

It would be helpful to be explicit.

I think the only reasonable reading is that section 6.3 covers all subscriptable types, with the order of subscripts being consistent with the ordering given in other sections, namely that the array indices come first in mixed types and the ordering assumes the subscripts vary most rapidly from left to right ("last column major order" in some parts of the docs). But it might be good to be explicit.

bob-carpenter commented 10 years ago

Add this discussion, in response to a query from Sergio Pollini on the mailing list. There are several tips that should go in the manual.

The manual should reference Ben's paper on visualizing covariance matrices and should get a whole chapter on covariance and correlation priors. I know it will change when we add Cholesky factor types for correlation matrices, but it'd be good to get it out there now.

> Three models:
> a) same data, parameters, and transformed parameters block:
>
> data {
>    int<lower=0> N;
>    vector[2] y[N];
> }
> parameters {
>    vector[2] mu;
>    vector<lower=0>[2] sigma;
>    corr_matrix[2] R;
> }
> transformed parameters {
>    cov_matrix[2] Sigma;
>    matrix[2,2] D;
>    D <- diag_matrix(sigma);
>    Sigma <- D * R * D;
> }

This is right mathematically, but an explicit loop should be faster in Stan because you don't have to create the diagonal matrix and do all the needless multiplications by its zero off-diagonal entries.

So that'd be:

   Sigma[1,1] <- sigma[1] * sigma[1] * R[1,1];
   Sigma[2,2] <- sigma[2] * sigma[2] * R[2,2];
   Sigma[1,2] <- sigma[1] * sigma[2] * R[1,2];
   Sigma[2,1] <- Sigma[1,2];

The last step is to cut down on allocations of auto-diff variables.
It's valid because R and Sigma are symmetric.  In general, the
loop for size K, arranged to exploit memory locality in column-major
matrix storage would be:

  for (n in 1:K)
    for (m in 1:n)
       Sigma[m,n] <- sigma[m] * sigma[n] * R[m,n];
  for (n in 1:K)
    for (m in (n+1):K)
       Sigma[m,n] <- Sigma[n,m];

It'd be even a hair faster if you split the diagonal out and
use the square() function [may not be available until 2.0],
but the above should be sufficient.
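A sketch of the diagonal-split version (square() assumed available; splitting out the diagonal works because R has a unit diagonal):

  for (k in 1:K)
    Sigma[k,k] <- square(sigma[k]);      // R[k,k] == 1
  for (n in 1:K)
    for (m in 1:(n-1))
      Sigma[m,n] <- sigma[m] * sigma[n] * R[m,n];
  for (n in 1:K)
    for (m in (n+1):K)
      Sigma[m,n] <- Sigma[n,m];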

If I'm not mistaken, it might be even better to rescale
at the Cholesky factor level --- I'm too lazy to try
to figure that out because my matrix algebra's not that great.

> b) three model blocks:
>
> model {                    // 1 : LKJ
>    R ~ lkj_corr(1.0);
>    for (n in 1:N)
>      y[n] ~ multi_normal(mu, Sigma);
> }
>
> model {                    // 2 : LKJ + Cholesky
>    matrix[2,2] L;
>    R ~ lkj_corr(1.0);
>    L <- cholesky_decompose(R);
>    for (n in 1:N)
>      y[n] ~ multi_normal_cholesky(mu, D * L);
> }
>
> model {                    // 3 : lkj_corr_cholesky
>    matrix[2,2] L;
>    L <- cholesky_decompose(R);
>    L ~ lkj_corr_cholesky(1.0);
>    for (n in 1:N)
>      y[n] ~ multi_normal_cholesky(mu, D * L);
> }
>
> Same seed / iter / etc., pretty much same results / ESS / Rhat.
> Relative sampling time: 1 = 100, 2 = ~30, 3 = ~30.

That's not surprising.  These relative timings are likely
to change when we vectorize multi_normal, which will effectively
implement model (1) in the same way as (2).

I would strongly recommend putting that D * L outside of the
loop.  As is, it'll get computed each time.  It may not be too bad
in low dimensions, but it'll be the slowest part of the call in
higher dimensions.
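For instance, a sketch of model (2) with the product hoisted out of the loop (same D, R, mu, and y as above):

model {                    // 2' : LKJ + Cholesky, product hoisted
   matrix[2,2] L;
   matrix[2,2] DL;
   R ~ lkj_corr(1.0);
   L <- cholesky_decompose(R);
   DL <- D * L;
   for (n in 1:N)
     y[n] ~ multi_normal_cholesky(mu, DL);
}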

> (As to scaled inverse Wishart, without Cholesky, ~1000!)

Yikes.  We should make that better.  The time's all going to
be spent in error checking.  We could put in an inverse Wishart
and a Wishart using Cholesky factors, which would make everything
faster.

> LKJ plus Cholesky decomposition looks very appealing! So I'd like to understand better.

> Bob: "I don't think we can just Cholesky factor a correlation matrix, because we won't be accounting for the Jacobian."
> Are you referring to something like: R ~ lkj_corr(1.0); L <- cholesky_decompose(R) ?
> You don't, I hope, because model 3 is going to be ruled out by Stan 2.0.
>
> Ben: "It would also yield the wrong answer if you manually constructed an L such that L * L' is a correlation matrix
> unless you also manually made the Jacobian correction." I'm aware of the Jacobian of the transformation of random
> variables, and there are several examples in (probability textbooks, and in) the manual, but I can't even imagine what
> you are speaking of. I mean that I'd only be able to "construct" such an L by L <- cholesky_decompose(R). But I'm just a
> learner.

There's a pretty thorough description of the transforms we use in
the chapter near the end of the manual on variable transforms.
It's easiest to understand in the 1D case, especially if you draw
out some examples and think about probability masses of intervals.
(Beyond 1D, it's volume, which is why you see the determinant term.)
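The 1D formula in question, for Y = f(X) with f one-to-one and differentiable:

  p_Y(y) = p_X(f^{-1}(y)) * | d/dy f^{-1}(y) |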

> Are there any wrong manual constructions in model 2 or 3?

(2) is fine.  You can always transform a parameter and use it
on the right-side as an argument to a distribution.  The problem
is when you do it on the left side, as in:

parameters {
  real<lower=0> tau;
...
transformed parameters {
  real<lower=0> sigma;
  sigma <- 1 / sqrt(tau);
...
model {
   tau ~ gamma(alpha,beta);
   y ~ normal(mu, sigma);

It doesn't matter if sigma is a transformed parameter or declared
as a local variable, or just used implicitly through its
expression, as in:

  y ~ normal(mu, 1 / sqrt(tau));

What you can't do without the Jacobian adjustment is go the other way
around and sample something that's been transformed:

parameters {
  real<lower=0> sigma;
...
transformed parameters {
  real<lower=0> tau;
  tau <- 1 / (sigma * sigma);
model {
  tau ~ gamma(alpha, beta);
  y ~ normal(mu,sigma);

You need to account for the Jacobian of the transform, which in
the 1D case is just the absolute derivative.  And we work on
the log scale, so there's a log on the outside.
In this case, the transform is 1 / (sigma * sigma), which
is just sigma^{-2}.  The adjustment needed is then:

   log | d/d.sigma sigma^{-2} |

   = log | -2 sigma^{-3} |

   = log 2 sigma^{-3}

   = log 2 + log sigma^{-3}

   = log 2 - 3 * log sigma

We're working on a log scale, so you'd need to add to the model:

  lp__ <- lp__ + log(2) - 3 * log(sigma);

And you can drop the log(2) in Stan, because we only need the
log probability defined up to an additive constant.
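Putting the pieces together, a sketch of the corrected model (alpha, beta, mu, and y assumed declared elsewhere; log(2) dropped):

parameters {
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> tau;
  tau <- 1 / (sigma * sigma);
}
model {
  tau ~ gamma(alpha, beta);
  y ~ normal(mu, sigma);
  lp__ <- lp__ - 3 * log(sigma);
}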

Your model (3) is conceptually wrong because you're sampling a transformed Cholesky
factor L without the change-of-variables adjustment.  But you won't see it in
this example's output because of your choice of prior:

    L ~ lkj_corr_cholesky(1.0);

This is just uniform on L, which makes it effectively a no-op in Stan.
You're left with the uniform distribution on R that is implicit
in its declaration as a correlation matrix.

By the way, this is true of the other models, too.  Correlations
are bounded to (-1,1), so declaring a variable as a correlation matrix
gives it a proper uniform prior over correlation matrices.  You should
read Ben's paper on visualizing covariance priors to see what the implications
are.

Maybe I should put this example in the manual under the transforms
section.  After getting Ben to verify that it's right!

bob-carpenter commented 10 years ago

More from Sergio Pollini:

In general, the loop for size K, arranged to exploit memory locality in column-major matrix storage would be:

    for (n in 1:K)
      for (m in 1:n)
         Sigma[m,n] <- sigma[m] * sigma[n] * R[m,n];
    for (n in 1:K)
      for (m in (n+1):K)
         Sigma[m,n] <- Sigma[n,m];

It'd be even a hair faster if you split the diagonal out and use the square() function [may not be available until 2.0], but the above should be sufficient.

... may I make a request?

There are a few examples in the manual (pages 101, 105, 125-126), but they look strange to an R user. In R, the inner loop of the second for would be wrong, because it would also be executed when n = K: (K+1):K is a valid sequence in R, so m would become K+1 (out of bounds). This is explained in the manual, but only after those examples (pages 169-170). Syntax before examples, or a footnote referring to the syntax, would be helpful. At least, on page 101 (and 105, 125-126) you could write:

for (i in 1:(N-1))
  for (j in (i+1):N)

or:

for (i in 1:N)
  for (j in 1:(i-1))

instead of:

for (i in 1:N)             // i == N looks useless, because
  for (j in (i+1):N)       // for (j in (N+1):N) is not executed

bob-carpenter commented 10 years ago

Use inverse gamma example for Jacobians:

If you do this:

parameters {
  real<lower=0> tau;
}
transformed parameters {
  real<lower=0> phi;
  phi <- 1 / tau;
}
model {
  tau ~ gamma(0.001, 0.001);
}

then phi has an inverse gamma distribution. The other way to do this is as follows:

parameters {
  real<lower=0> phi;
}
transformed parameters {
  real<lower=0> tau;
  tau <- 1 / phi;
}
model {
  tau ~ gamma(0.001, 0.001);
  lp__ <- lp__ - 2 * log(phi);
}

The Jacobian adjustment is for the transform

   f(phi) = 1 / phi = phi^{-1}

   f'(phi) = - phi^{-2}

   log | f'(phi) | = log phi^{-2} = - 2 * log(phi);

The tighter way to do the above would be just this:

parameters {
  real<lower=0> phi;
}
model {
  (1 / phi) ~ gamma(0.001, 0.001);
  lp__ <- lp__ - 2 * log(phi);
}

and it should be clear that the adjustment, when added to log(gamma(1/phi | 0.001, 0.001)), will get you to the right answer.
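In symbols, the two statements together accumulate

  log p(phi) = log gamma(1/phi | 0.001, 0.001) - 2 * log(phi)

which is exactly the log of the inverse gamma density of phi.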

bob-carpenter commented 10 years ago

WAS: mention that Stan isn't compatible with Cygwin and there are no near-term plans for making it so.

bob-carpenter commented 10 years ago

Get rid of all acknowledgments for usual development stuff, and instead point people at GitHub.

bob-carpenter commented 10 years ago

parameters {
  real x;
}
model {
  0 ~ normal(sqrt(x-x), 1);
}

It needs the x - x, because that introduces the 0 in the numerator and the square root introduces it in the denominator: the derivative of sqrt(u) is u' / (2 * sqrt(u)), so at u = x - x = 0 the gradient works out to 0/0 = NaN.

Added fix to language section under expressions, but it should go in a "gotchas" section.

bob-carpenter commented 10 years ago

Look at Aki's comments on Gaussian processes.

bob-carpenter commented 10 years ago

multiply_log(0,0) = 0

Also, I'd like to see them done in formulas, not in words.

(Only added for multiply_log --- need big effort to do all of them, so opened new issue https://github.com/stan-dev/stan/issues/253)
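For the record, the formula-style version for this one (matching the documented definition of multiply_log):

  multiply_log(x, y) = x * log(y),  with multiply_log(0, 0) = 0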

bob-carpenter commented 10 years ago

From Mauricio Garnier-Villarreal on stan-users:

Sigma[m,m] <- sigma[m] * sigma[m] * Omega[m,n];

should be

Sigma[m,m] <- sigma[m] * sigma[m] * Omega[m,m];

bob-carpenter commented 10 years ago

From Dan Lakeland:

Add sampling statement notation to distribution doc.
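A sketch of what that notation should convey, in the pre-2.0 lp__ syntax used elsewhere in this thread:

  // sampling-statement notation
  y ~ normal(mu, sigma);

  // equivalent log-density increment, up to an additive constant
  lp__ <- lp__ + normal_log(y, mu, sigma);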

bob-carpenter commented 10 years ago

moved here from a pull request

bob-carpenter commented 10 years ago

From Marcus:

For reference for the doc the parameters that were added are:

tol_obj - if | lp_i - lp_{i-1} | < tol_obj convergence is detected
tol_param - if || x_i - x_{i-1} || < tol_param convergence is detected
tol_grad - if || df_i || < tol_grad convergence is detected

where at iteration i lp_i is the log prob, x_i is the vector of (unconstrained) parameters and df_i is the gradient.

in addition:

init_alpha - The first step size to try on the initial iteration. If the first iteration takes a long time (and requires a lot of function evaluations), set init_alpha to be roughly equal to the alpha used in that first iteration.

Then from later:

Just one of those conditions being satisfied will result in the algorithm terminating.

init_alpha has a default value (something like 0.0001) which is reasonable for many problems but might be too large or too small depending on the objective function and initialization. Being too big or too small just means that the first iteration will take longer (require more gradient evaluations) before the line search finds a good step length. It's not a critical parameter, but if you're optimizing the same model multiple times (as you tweak things or with different data) being able to change it can save some real time.

syclik commented 10 years ago

I just fixed some typos (stuff like missing commas, spaces, spelling errors).

Bob, can you double check these two changes? They've already been pushed to the branch.

I've also noticed inconsistency in spacing of the models in the manual. Do you want me to go through and make them consistent? If so, should we stick to 2 spaces?

bob-carpenter commented 10 years ago

See: http://en.cppreference.com/w/cpp/types/numeric_limits/epsilon

bob-carpenter commented 10 years ago

Add function doc for

bob-carpenter commented 10 years ago

OK, I think I'm done :-) If someone else wants to give it a quick look, that'd be great. I'd like to leave this merge until the last second so we don't just have to reopen this branch again.

bob-carpenter commented 10 years ago

These are getting pushed to post-2.0 and renamed to distance and squared_distance.
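Assuming these keep the usual Euclidean definitions, the formulas for the doc would be

  squared_distance(x, y) = sum_{k in 1:K} (x[k] - y[k])^2

  distance(x, y) = sqrt(squared_distance(x, y))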

bob-carpenter commented 10 years ago

From Jiqiang:

from

http://code.google.com/p/stan/downloads/list

to

https://github.com/stan-dev/stan/releases