
fixing manual typos/brainos for 2.0 #211

Closed: bob-carpenter closed this issue 10 years ago

bob-carpenter commented 11 years ago
alpha <- sd(y) * (alpha_std + beta_std * mean(x) / sd(x)) + mean(y);

should read:

alpha <- sd(y) * (alpha_std - beta_std * mean(x) / sd(x)) + mean(y);
transformed data {
...
  real<lower=0> max_cov; real<lower=0> min_cov;

fix to be upper=0 for min_cov.

Can min_cov have a lower bound of zero when it is being reassigned to the negative of max_cov in the last line of the transformed data block?
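A minimal sketch of the corrected block (the assignment line is assumed from the description above, not quoted from the manual):

transformed data {
  real<lower=0> max_cov;
  real<upper=0> min_cov;
  ...
  min_cov <- -max_cov;
}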

bob-carpenter commented 11 years ago

Also clarify the cumulative distribution functions to indicate clearly that they use <= (less than or equal) and not just < (less than).
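That is, the definition to state explicitly is

  F_X(x) = Pr[X <= x]

rather than Pr[X < x].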

bob-carpenter commented 10 years ago

From Ross Boylan on the users' list:

There are some points at which the documentation on subscripts and arrays/matrices/vectors is unclear or, possibly, incorrect. This is with the 1.3.0 pdf "Stan Modeling Language" dated 12 April, 2013.

e <- d[1]

e is an array with leading dimension 4 while d, after dropping the first dimension, has leading dimension 3. I don't think they are conformable.

x <- d[4, 3, 5, 2]
x <- e[3, 5, 2]

are said to be the same, but I think the second line is equivalent to d[1, 3, 5, 2] and so they are different. Have I misunderstood?

real a[3, 4];        // 2D array of reals
matrix[3, 4] b;      // matrix with the same dimensions
row_vector[4] r;
b <- a;              // assign 2D array to matrix?
b + a;               // add matrix and array?
r <- a[2];           // assign array row (real[4]) to row_vector?

I would take that as a sign that such operations are illegal, and I think I've seen error messages when I inadvertently tried those operations.

It would be helpful to be explicit.

I think the only reasonable reading is that section 6.3 covers all subscriptable types, with the order of subscripts being consistent with the ordering given in other sections, namely that the array indices come first in mixed types and the ordering assumes the subscripts vary most rapidly from left to right ("last column major order" in some parts of the docs). But it might be good to be explicit.

bob-carpenter commented 10 years ago

Add this discussion, in response to a query from Sergio Pollini on the mailing list. There are several tips that should go in the manual.

The manual should reference Ben's paper on visualizing covariance matrices and should get a whole chapter on covariance and correlation priors. I know it will change when we add Cholesky factor types for correlation matrices, but it'd be good to get it out there now.

> Three models:
> a) same data, parameters, and transformed parameters block:
>
> data {
>    int<lower=0> N;
>    vector[2] y[N];
> }
> parameters {
>    vector[2] mu;
>    vector<lower=0>[2] sigma;
>    corr_matrix[2] R;
> }
> transformed parameters {
>    cov_matrix[2] Sigma;
>    matrix[2,2] D;
>    D <- diag_matrix(sigma);
>    Sigma <- D * R * D;
> }

This is right mathematically, but an explicit loop should be faster in Stan because you don't have to create the diagonal matrix and do all the needless multiplications by its zero off-diagonal entries.

So that'd be:

   Sigma[1,1] <- sigma[1] * sigma[1] * R[1,1];
   Sigma[2,2] <- sigma[2] * sigma[2] * R[2,2];
   Sigma[1,2] <- sigma[1] * sigma[2] * R[1,2];
   Sigma[2,1] <- Sigma[1,2];

The last step is to cut down on allocations of auto-diff variables.
It's valid because R and Sigma are symmetric.  In general, the
loop for size K, arranged to exploit memory locality in column-major
matrix storage would be:

  for (n in 1:K)
    for (m in 1:n)
       Sigma[m,n] <- sigma[m] * sigma[n] * R[m,n];
  for (n in 1:K)
    for (m in (n+1):K)
       Sigma[m,n] <- Sigma[n,m];

It'd be even a hair faster if you split the diagonal out and
use the square() function [may not be available until 2.0],
but the above should be sufficient.
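A sketch of the diagonal-split version (square() assumed available; splitting out the diagonal works because R has a unit diagonal):

  for (k in 1:K)
    Sigma[k,k] <- square(sigma[k]);      // R[k,k] == 1
  for (n in 1:K)
    for (m in 1:(n-1))
      Sigma[m,n] <- sigma[m] * sigma[n] * R[m,n];
  for (n in 1:K)
    for (m in (n+1):K)
      Sigma[m,n] <- Sigma[n,m];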

If I'm not mistaken, it might be even better to rescale
at the Cholesky factor level --- I'm too lazy to try
to figure that out because my matrix algebra's not that great.

> b) three model blocks:
>
> model {                    // 1 : LKJ
>    R ~ lkj_corr(1.0);
>    for (n in 1:N)
>      y[n] ~ multi_normal(mu, Sigma);
> }
>
> model {                    // 2 : LKJ + Cholesky
>    matrix[2,2] L;
>    R ~ lkj_corr(1.0);
>    L <- cholesky_decompose(R);
>    for (n in 1:N)
>      y[n] ~ multi_normal_cholesky(mu, D * L);
> }
>
> model {                    // 3 : lkj_corr_cholesky
>    matrix[2,2] L;
>    L <- cholesky_decompose(R);
>    L ~ lkj_corr_cholesky(1.0);
>    for (n in 1:N)
>      y[n] ~ multi_normal_cholesky(mu, D * L);
> }
>
> Same seed / iter / etc., pretty much same results / ESS / Rhat.
> Relative sampling time: 1 = 100, 2 = ~30, 3 = ~30.

That's not surprising.  These relative timings are likely
to change when we vectorize multi_normal, which will effectively
implement model (1) in the same way as (2).

I would strongly recommend putting that D * L outside of the
loop.  As is, it'll get computed each time.  It may not be too bad
in low dimensions, but it'll be the slowest part of the call in
higher dimensions.
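For instance, a sketch of model (2) with the product hoisted out of the loop (same D, R, mu, and y as above):

model {                    // 2' : LKJ + Cholesky, product hoisted
   matrix[2,2] L;
   matrix[2,2] DL;
   R ~ lkj_corr(1.0);
   L <- cholesky_decompose(R);
   DL <- D * L;
   for (n in 1:N)
     y[n] ~ multi_normal_cholesky(mu, DL);
}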

> (As to scaled inverse Wishart, without Cholesky, ~1000!)

Yikes.  We should make that better.  The time's all going to
be spent in error checking.  We could put in an inverse Wishart
and a Wishart using Cholesky factors, which would make everything
faster.

> LKJ plus Cholesky decomposition looks very appealing! So I'd like to understand better.

> Bob: "I don't think we can just Cholesky factor a correlation matrix, because we won't be accounting for the Jacobian."
> Are you referring to something like: R ~ lkj_corr(1.0); L <- cholesky_decompose(R) ?
> You don't, I hope, because model 3 is going to be ruled out by Stan 2.0.
>
> Ben: "It would also yield the wrong answer if you manually constructed an L such that L * L' is a correlation matrix
> unless you also manually made the Jacobian correction." I'm aware of the Jacobian of the transformation of random
> variables, and there are several examples in (probability textbooks, and in) the manual, but I can't even imagine what
> you are speaking of. I mean that I'd only be able to "construct" such an L by L <- cholesky_decompose(R). But I'm just a
> learner.

There's a pretty thorough description of the transforms we use in
the chapter near the end of the manual on variable transforms.
It's easiest to understand in the 1D case, especially if you draw
out some examples and think about probability masses of intervals.
(Beyond 1D, it's volume, which is why you see the determinant term.)
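The 1D formula in question, for Y = f(X) with f one-to-one and differentiable:

  p_Y(y) = p_X(f^{-1}(y)) * | d/dy f^{-1}(y) |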

> Are there any wrong manual constructions in model 2 or 3?

(2) is fine.  You can always transform a parameter and use it
on the right-side as an argument to a distribution.  The problem
is when you do it on the left side, as in:

parameters {
  real<lower=0> tau;
...
transformed parameters {
  real<lower=0> sigma;
  sigma <- 1 / sqrt(tau);
...
model {
   tau ~ gamma(alpha,beta);
   y ~ normal(mu, sigma);

It doesn't matter if sigma is a transformed parameter or declared
as a local variable, or just used implicitly through its
expression, as in:

  y ~ normal(mu, 1 / sqrt(tau));

What you can't do without the Jacobian adjustment is go the other way
around and sample something that's been transformed:

parameters {
  real<lower=0> sigma;
...
transformed parameters {
  real<lower=0> tau;
  tau <- 1 / (sigma * sigma);
model {
  tau ~ gamma(alpha, beta);
  y ~ normal(mu,sigma);

You need to account for the Jacobian of the transform, which in
the 1D case is just the absolute derivative.  And we work on
the log scale, so there's a log on the outside.
In this case, the transform is 1 / (sigma * sigma), which
is just sigma^{-2}.  The adjustment needed is then:

   log | d/d.sigma sigma^{-2} |

   = log | -2 sigma^{-3} |

   = log 2 sigma^{-3}

   = log 2 + log sigma^{-3}

   = log 2 - 3 * log sigma

We're working on a log scale, so you'd need to add to the model:

  lp__ <- lp__ + log(2) - 3 * log(sigma);

And you can drop the log(2) in Stan, because we only need the
log probability defined up to an additive constant.
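Putting the pieces together, a sketch of the corrected model (alpha, beta, mu, and y assumed declared elsewhere; log(2) dropped):

parameters {
  real<lower=0> sigma;
}
transformed parameters {
  real<lower=0> tau;
  tau <- 1 / (sigma * sigma);
}
model {
  tau ~ gamma(alpha, beta);
  y ~ normal(mu, sigma);
  lp__ <- lp__ - 3 * log(sigma);
}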

Your model (3) is conceptually wrong because you're sampling a transformed Cholesky
factor L without the change-of-variables adjustment.  But you won't see it in
this example's output because of your choice of prior:

    L ~ lkj_corr_cholesky(1.0);

This is just uniform on L, which makes it effectively a no-op in Stan.
You're left with the uniform distribution on R that is implicit
in its declaration as a correlation matrix.

By the way, this is true of the other models, too.  Correlations
are bounded to (-1,1), so declaring a variable as a correlation matrix
gives it a proper uniform prior over correlation matrices.  You should
read Ben's paper on visualizing covariance priors to see what the implications
are.

Maybe I should put this example in the manual under the transforms
section.  After getting Ben to verify that it's right!

bob-carpenter commented 10 years ago

More from Sergio Pollini:

In general, the loop for size K, arranged to exploit memory locality in column-major matrix storage would be:

    for (n in 1:K)
      for (m in 1:n)
         Sigma[m,n] <- sigma[m] * sigma[n] * R[m,n];
    for (n in 1:K)
      for (m in (n+1):K)
         Sigma[m,n] <- Sigma[n,m];

It'd be even a hair faster if you split the diagonal out and use the square() function [may not be available until 2.0], but the above should be sufficient.

... may I make a request?

There are a few examples in the manual (pages 101, 105, 125-126), but they look strange to an R user. In R, the inner loop of the second for would be wrong, because it would also be executed when n = K: (K+1):K is a valid sequence in R, so m would become K+1 (out of bounds). This is explained in the manual, but only after those examples (pages 169-170). Syntax before examples, or a footnote referring to the syntax, would be helpful. At least, on page 101 (and 105, 125-126) you could write:

for (i in 1:(N-1))
  for (j in (i+1):N)

or:

for (i in 1:N)
  for (j in 1:(i-1))

instead of:

for (i in 1:N)             // i == N looks useless, because
  for (j in (i+1):N)       // for (j in (N+1):N) is not executed

bob-carpenter commented 10 years ago

Use inverse gamma example for Jacobians:

If you do this:

parameters {
  real<lower=0> tau;
}
transformed parameters {
  real<lower=0> phi;
  phi <- 1 / tau;
}
model {
  tau ~ gamma(0.001, 0.001);
}

then phi has an inverse gamma distribution. The other way to do this is as follows:

parameters {
  real<lower=0> phi;
}
transformed parameters {
  real<lower=0> tau;
  tau <- 1 / phi;
}
model {
  tau ~ gamma(0.001, 0.001);
  lp__ <- lp__ - 2 * log(phi);
}

The Jacobian adjustment is for the transform

   f(phi) = 1 / phi = phi^{-1}

   f'(phi) = - phi^{-2}

   log | f'(phi) | = log phi^{-2} = - 2 * log(phi);

The tighter way to do the above would be just this:

parameters {
  real<lower=0> phi;
}
model {
  (1 / phi) ~ gamma(0.001, 0.001);
  lp__ <- lp__ - 2 * log(phi);
}

and it should be clear that the adjustment, when added to log(gamma(1/phi | 0.001, 0.001)), will get you to the right answer.
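In symbols, the two statements together accumulate

  log p(phi) = log gamma(1/phi | 0.001, 0.001) - 2 * log(phi)

which is exactly the log of the inverse gamma density of phi.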

bob-carpenter commented 10 years ago

WAS: mention that Stan isn't compatible with Cygwin and there are no near-term plans for making it so.

bob-carpenter commented 10 years ago

Get rid of all acknowledgments for usual development stuff, and instead point people at GitHub.

bob-carpenter commented 10 years ago

parameters {
  real x;
}
model {
  0 ~ normal(sqrt(x-x), 1);
}

It needs the x - x, because that introduces the 0 in the numerator and the square root introduces it in the denominator: the derivative of sqrt(u) is u' / (2 * sqrt(u)), so at u = x - x = 0 the gradient works out to 0/0 = NaN.

Added fix to language section under expressions, but it should go in a "gotchas" section.

bob-carpenter commented 10 years ago

Look at Aki's comments on Gaussian processes.

bob-carpenter commented 10 years ago

multiply_log(0,0) = 0

Also, I'd like to see them done in formulas, not in words.

(Only added for multiply_log --- need big effort to do all of them, so opened new issue https://github.com/stan-dev/stan/issues/253)
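For the record, the formula-style version for this one (matching the documented definition of multiply_log):

  multiply_log(x, y) = x * log(y),  with multiply_log(0, 0) = 0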

bob-carpenter commented 10 years ago

From Mauricio Garnier-Villarreal on stan-users:

Sigma[m,m] <- sigma[m] * sigma[m] * Omega[m,n];

should be

Sigma[m,m] <- sigma[m] * sigma[m] * Omega[m,m];

bob-carpenter commented 10 years ago

From Dan Lakeland:

Add sampling statement notation to distribution doc.
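A sketch of what that notation should convey, in the pre-2.0 lp__ syntax used elsewhere in this thread:

  // sampling-statement notation
  y ~ normal(mu, sigma);

  // equivalent log-density increment, up to an additive constant
  lp__ <- lp__ + normal_log(y, mu, sigma);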

bob-carpenter commented 10 years ago

moved here from a pull request

bob-carpenter commented 10 years ago

From Marcus:

For reference for the doc the parameters that were added are:

tol_obj - if | lp_i - lp_{i-1} | < tol_obj convergence is detected
tol_param - if || x_i - x_{i-1} || < tol_param convergence is detected
tol_grad - if || df_i || < tol_grad convergence is detected

where at iteration i lp_i is the log prob, x_i is the vector of (unconstrained) parameters and df_i is the gradient.

in addition:

init_alpha - The first step size to try on the initial iteration. If the first iteration takes a long time (and requires a lot of function evaluations), set init_alpha to be roughly equal to the alpha used in that first iteration.

Then from later:

Just one of those conditions being satisfied will result in the algorithm terminating.

init_alpha has a default value (something like 0.0001) which is reasonable for many problems but might be too large or too small depending on the objective function and initialization. Being too big or too small just means that the first iteration will take longer (require more gradient evaluations) before the line search finds a good step length. It's not a critical parameter, but if you're optimizing the same model multiple times (as you tweak things or with different data) being able to change it can save some real time.

syclik commented 10 years ago

I just fixed some typos (stuff like missing commas, spaces, spelling errors).

Bob, can you double check these two changes? They've already been pushed to the branch.

I've also noticed inconsistency in spacing of the models in the manual. Do you want me to go through and make them consistent? If so, should we stick to 2 spaces?

bob-carpenter commented 10 years ago

See: http://en.cppreference.com/w/cpp/types/numeric_limits/epsilon

bob-carpenter commented 10 years ago

Add function doc for

bob-carpenter commented 10 years ago

OK, I think I'm done :-) If someone else wants to give it a quick look, that'd be great. I'd like to leave this merge until the last second so we don't just have to reopen this branch again.

bob-carpenter commented 10 years ago

These are getting pushed to post-2.0 and renamed to distance and squared_distance.
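Assuming these keep the usual Euclidean definitions, the formulas for the doc would be

  squared_distance(x, y) = sum_{k in 1:K} (x[k] - y[k])^2

  distance(x, y) = sqrt(squared_distance(x, y))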

bob-carpenter commented 10 years ago

From Jiqiang:

from

http://code.google.com/p/stan/downloads/list

to

https://github.com/stan-dev/stan/releases