stan-dev / docs

Documentation for the Stan language and CmdStan
https://mc-stan.org/docs/

manual longer term #19

Open bob-carpenter opened 9 years ago

bob-carpenter commented 9 years ago

This is the issue for things that would be nice to get into the manual in the future. Rather than closing this issue, we'll just tick things off as part of the regular round of "next manual" issues.

bob-carpenter commented 9 years ago

Originally from Jeffrey Arnold here: https://github.com/stan-dev/stan/issues/1081#issuecomment-63598018

Regarding the Kalman filter examples, I have examples and fully implemented Kalman filtering (of several flavors: with / without missing values, as batch / sequentially) with backward sampling in the generated quantities block here: https://github.com/jrnold/ssmodels-in-stan. I could probably write up that section.
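
In the meantime, here is a minimal sketch (not taken from that repository) of an unmarginalized local-level state-space model, just to indicate the shape such a section would start from; the marginalized filtering and backward-sampling versions are what the section would actually cover:

data {
  int<lower=2> T;
  vector[T] y;
}
parameters {
  vector[T] alpha;              // latent states
  real<lower=0> sigma_state;    // state innovation scale
  real<lower=0> sigma_obs;      // observation noise scale
}
model {
  sigma_state ~ normal(0, 1);   // placeholder half-normal priors
  sigma_obs ~ normal(0, 1);
  alpha[1] ~ normal(0, 10);
  alpha[2:T] ~ normal(alpha[1:(T - 1)], sigma_state);
  y ~ normal(alpha, sigma_obs);
}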

bob-carpenter commented 9 years ago

https://ariddell.org/horseshoe-prior-with-stan.html

bob-carpenter commented 9 years ago

For example, this model is problematic arithmetically:

fit <- stan(model_code = "parameters { real y; } model { y ~ normal(1,1e-20); }")

and produces

          mean  se_mean       sd      2.5%       25% 50% 75% 97.5% n_eff         Rhat
y            1      0.0        0         1         1   1   1     1  4000          Inf
lp__ -15407440 422002.9 26689805 -61629758 -15407440   0   0     0  4000 1.500106e+14

There's really nothing we can do about such models numerically. What you have to do from a user perspective is use the parameterization z = y - 1, which can be handled stably because numbers close to zero can be represented accurately.

parameters { real z; } model { z ~ normal(0,1e-20); }

The problem is that there's still no way to represent y = z + 1 accurately in double-precision floating-point arithmetic (adding 1 just rounds z away), but you can get close to log(y) by using

  log_y <- log1p(z);
bob-carpenter commented 9 years ago

Provide examples of log-mix for the manual mixture section.
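
For instance, a minimal sketch of such an example for a two-component normal mixture (the declarations are placeholders, not taken from the manual):

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real<lower=0, upper=1> lambda;   // mixing proportion
  ordered[2] mu;                   // ordered to break label switching
  vector<lower=0>[2] sigma;
}
model {
  for (n in 1:N)
    target += log_mix(lambda,
                      normal_lpdf(y[n] | mu[1], sigma[1]),
                      normal_lpdf(y[n] | mu[2], sigma[2]));
}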

bob-carpenter commented 9 years ago

From Andrew on stan-dev, an example of a model for the problematic posteriors section:

Next I decided, just for fun, to fit a nonlinear model: y = a*exp(-b*x) + c + error, with N(0,sigma) errors. I simulated some data (x taking on the values between 0 and 23, setting a=.5, b=1/10, c=1, sigma=.2), and I fit it constraining all the parameters to be positive. (In retrospect I’m thinking I should’ve just constrained b and sigma to be positive; constraining a and c made sense in the context of my hypothetical example, but in general it’s poor practice to put in hard constraints unless absolutely necessary.)

I simulated data and fit the model and all was fine. But then I repeated and the fit blew up: I was getting some estimates of b that were either really close to 0 (1e-10, etc.) or really high (1e10, etc.), I can’t remember which. The point is that there can be instability at the extremes, and I realized the model needed a bit of hand-holding. As a kluge I constrained b to be between 1/50 and 1 (which made sense in the context of the example). A hard constraint was ugly but was easy enough to understand. I didn’t want to confuse the class by introducing a normal prior here.

So this second example was a bit of a mess. Which suggests to me that in the manual/book we need to throw in an example to explicitly address instability and constraints. Since, ultimately, this is an inherent difficulty with modeling, it’s not anything to do with Stan in particular.
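
A minimal Stan sketch of the model Andrew describes, including the kluge of a hard interval constraint on b (the variable names are made up; this is only meant to seed the manual example):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real<lower=0> a;
  real<lower=1.0 / 50, upper=1> b;   // the kluge: hard interval constraint
  real<lower=0> c;
  real<lower=0> sigma;
}
model {
  y ~ normal(a * exp(-b * x) + c, sigma);
}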

bob-carpenter commented 9 years ago

Here's what the model should look like, but there's a bug in Stan 2.6 preventing the compilation:

data {
  int<lower=2> K;             // num categories
  int<lower=1> D;             // num predictors (without intercept)
  int<lower=0> N;             // num observations
  vector[D] x[N];              // predictors
  int<lower=1,upper=K> y[N];  // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;          // intercepts
  matrix[K-1,D] beta;         // slopes
}
model {
  // priors
  alpha ~ normal(0,20);
  to_vector(beta) ~ normal(0,10);
  // likelihood
  for (n in 1:N) {
    y[n] ~ categorical_logit(append_row(zero, alpha + beta * x[n]));
  }  
}

Here's a program that will work, but isn't optimally efficient:

data {
 int<lower=2> K;             // num categories
 int<lower=1> D;             // num predictors (without intercept)
 int<lower=0> N;             // num observations
 vector[D] x[N];              // predictors
 int<lower=1,upper=K> y[N];  // observations
}
transformed data {
 vector[1] zero;
 zero[1] <- 0;
}
parameters {
 vector[K-1] alpha;          // intercepts
 matrix[K-1,D] beta;         // slopes
}
model {
 // temps
 vector[K] lambda;
 lambda[1] <- 0;

 // priors
 alpha ~ normal(0,20);
 // likelihood
 for (n in 1:N) {
   for (k in 2:K)
     lambda[k] <- alpha[k-1] + beta[k-1] * x[n];
   y[n] ~ categorical_logit(lambda);
 }
}
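
Once that append_row bug is fixed, a sketch along the following lines should be more efficient: it stores the predictors as a matrix and computes all of the linear predictors with a single matrix-matrix multiply (untested, so treat it as a starting point rather than a drop-in replacement):

data {
  int<lower=2> K;             // num categories
  int<lower=1> D;             // num predictors (without intercept)
  int<lower=0> N;             // num observations
  matrix[N,D] x;              // predictors, one row per observation
  int<lower=1,upper=K> y[N];  // observations
}
transformed data {
  vector[1] zero;
  zero[1] <- 0;
}
parameters {
  vector[K-1] alpha;          // intercepts
  matrix[K-1,D] beta;         // slopes
}
model {
  matrix[N,K-1] eta;
  eta <- x * beta';           // all linear predictors at once

  // priors
  alpha ~ normal(0,20);
  to_vector(beta) ~ normal(0,10);

  // likelihood
  for (n in 1:N)
    y[n] ~ categorical_logit(append_row(zero, alpha + eta[n]'));
}
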
bob-carpenter commented 9 years ago
Computing pi
/**
 * Compute the mathematical constant pi by Monte Carlo integration.
 *
 * First, sample (x,y) uniformly from the square with vertices (-1,-1),
 * (-1,1), (1,-1), and (1,1) by sampling x and y independently from the
 * interval (-1,1).
 *
 * Second, determine whether (x,y) lies within the circle inscribed in
 * the square by testing whether x^2 + y^2 < 1. Then multiply the
 * proportion of points inside the circle by the area of the square,
 * which is 4.
 *
 * The posterior mean of the generated quantity pi will compute pi
 * to within MC error.
 *
 * Run with fixed parameter algorithm and zero warmup iterations.
 */
model {
}
generated quantities {
  real<lower=-1,upper=1> x;
  real<lower=-1,upper=1> y;
  real<lower=0,upper=4> pi;
  x <- uniform_rng(-1,1);
  y <- uniform_rng(-1,1);
  pi <- 4 * (x^2 + y^2 < 1);
}
Generalization to volume of hypersphere
/**
 * Compute the volume of a hypersphere of a given radius in a
 * given number of dimensions by Monte Carlo integration.
 *
 * The volume will be the mean of the sampled sphere_volume values.
 */
data {
  int<lower=1> N;
  real<lower=0> r;
}
transformed data {
  real cube_volume;
  cube_volume <- (2 * r)^N;
}
model {
}
generated quantities {
  vector<lower=-1, upper=1>[N] x;
  real<lower=0,upper=cube_volume> sphere_volume;

  for (n in 1:N)
    x[n] <- uniform_rng(-1,1);

  // indicator that x lies in the unit ball, scaled by the cube volume;
  // the r^N scaling of the ball is absorbed into cube_volume = (2r)^N
  sphere_volume <- cube_volume * (dot_self(x) < 1);
}
bob-carpenter commented 9 years ago

From Aki on stan-users:

I recommend as an easy default option nu ~ gamma(2,0.1);

This was proposed and analysed by Juárez and Steel (2010) (Model-based clustering of non-Gaussian panel data based on skew-t distributions. Journal of Business & Economic Statistics 28, 52–66). Juárez and Steel compare this to the Jeffreys prior and report that the difference is small. Simpson et al. (2014) (arXiv:1403.4630) propose a theoretically well justified "penalised complexity (PC) prior", which they show to have good behavior for the degrees of freedom, too. The PC prior might be the best choice, but it requires numerical computation of the prior (which could be computed on a grid and interpolated, etc.). It would be feasible to implement it in Stan, but it would require some work. Unfortunately no one has compared the PC prior and this gamma prior directly, but based on discussion with Daniel Simpson, although the PC prior would be better, this gamma(2,0.1) prior is not a bad choice. Thus, I would use it until someone implements the PC prior for the degrees of freedom of the Student's t in Stan.
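
As a concrete sketch for the manual, the recommendation would look something like the following; everything other than the line with the gamma prior is placeholder:

data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  real<lower=1> nu;           // Student-t degrees of freedom
}
model {
  nu ~ gamma(2, 0.1);         // Juárez and Steel (2010) default
  mu ~ normal(0, 10);
  sigma ~ normal(0, 5);
  y ~ student_t(nu, mu, sigma);
}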

bob-carpenter commented 9 years ago

I’d suggest an informative prior, for example if you think the elasticity should be between 0 and 1, you could use a normal prior with mean 0.5 and sd 0.5. The point is that you’d like to let the data speak but at the same time you want to control your estimate. If you do get a negative elasticity estimate, so be it (sometimes such things can happen due to unforeseen interactions in the data) but then you’ll probably also have a big standard error. And if the data really “want” to find a negative elasticity with a small s.e., you’d like to know this—as, at the very least, this will tell you that there’s something weird about the data. Better to let the data speak than to get an estimate right on the boundary which still doesn’t fit the data.
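
In Stan terms, the suggestion amounts to something like the following sketch of a log-log elasticity regression; the data names and everything other than the prior on beta are hypothetical:

data {
  int<lower=0> N;
  vector[N] log_price;
  vector[N] log_quantity;
}
parameters {
  real alpha;
  real beta;                  // elasticity: no hard <lower=0> constraint
  real<lower=0> sigma;
}
model {
  beta ~ normal(0.5, 0.5);    // informative prior centered where we expect it
  log_quantity ~ normal(alpha + beta * log_price, sigma);
}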

bob-carpenter commented 9 years ago

Originally issue #1431 (from Titus van der Malsburg):

The term local variable is used a lot in the manual but it is difficult to extract a complete definition. For example, section 25.7 says:

At the beginning of each block, local variables may be declared that are scoped over the rest of the statements in the block.

However, since all variables are defined at the beginning of a block, that would mean that all variables are local, which is clearly not the case. On the other hand, the section "variable scope" explains that variables declared in one program block are visible in subsequent blocks, except for variables declared in the model block. This section, however, fails to mention that this is not the case for variables declared in nested blocks.

I think it would help if there was one place in the manual that gives the whole story. It may also make sense to use this opportunity to clarify why the term global variable isn't used (only variables in the data block are global).
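
A small example such a unified section could build on, showing a block-level local variable and a loop-local variable (the model itself is just filler):

data {
  int<lower=0> N;
  vector[N] x;
}
parameters {
  real mu;
}
model {
  real total;                 // local to the model block
  total = 0;
  for (n in 1:N) {
    real z;                   // local to the loop body; not visible after the }
    z = x[n] - mu;
    total += square(z);
  }
  target += -0.5 * total;     // total is still in scope here; z is not
}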

bob-carpenter commented 9 years ago

General discussion of efficiency of restart so users don't feel like they're losing so much time.

Everyone keeps asking about whether Stan can just "keep going" and do more sampling.

You absolutely cannot just keep going. Nor do you want to. The problem is that if you haven't reached convergence, then you want to do more warmup, and then more sampling. So you'd have to toss out the sampling part of the run anyway.

If you run 1000 warmup iterations and 1000 sampling iterations and haven't converged, then you want to go back and run 2000 warmup iterations and 2000 sampling iterations. If you were to restart, after the original warmup iterations, you'd save around 15% of your work. Here's an example with the default number of iterations (which is probably way too high):

Rerun from beginning, doubling sizes

   1000 + 1000
   2000 + 2000
   -----------
   3000 + 3000

Restart warmup (warmup hasn't converged):

   1000     + 1000
   0 + 1000 + 2000
   -----------
   2000 + 3000

The exception to this is when you have converged during warmup, but just haven't drawn enough iterations for the n_eff you need. In that case, you'd get the following.

Resample (sorry for header --- GitHub flavored markdown glitch in action)

   1000 + 1000
      0 + 1000
   -----------
   3000

Restart sampling (warmup converged)

   1000 + 1000
   1000 + 2000
   -----------
   5000

So you'd save about 40% of your total effort if you could just run more samples.

bob-carpenter commented 9 years ago

Andre Pfeuffer asked about this on stan-users:

if (theta < -5 || theta > 5)
  increment_log_prob(-1.0e10);

I wrote back

Aside from the modeling issue, that statement will, perhaps surprisingly, have absolutely no effect on sampling. The problem is that it's branching on the value of a parameter, then adding a constant to the log density. Constants added to the log density never do anything to sampling. The log density is either lp or lp - 1e10, neither of which is an expression involving theta. It got there via examining theta, but it has no way of exploiting that discontinuity at theta = 5 or theta = -5.
maverickg commented 9 years ago

I don't get the argument for why the if condition has no effect on the posterior. It indeed has no way of exploiting the discontinuity at theta = (-)5, but whether the log posterior is lp or lp - 1e10 does involve the value of theta.

bob-carpenter commented 9 years ago

What I first said was wrong.

I was right that it won't affect the gradient, but wrong in that it will affect the posterior by changing the slice sampling in NUTS (or the Metropolis step in HMC). It'll have a similarly indirect effect on optimization, through the termination criteria.

As to why it doesn't affect the gradients: the generated C++ looks identical to the Stan code, so there's nothing connecting the log probability accumulator back to theta. Theta is used in the conditional, but not tied into the result.

So if you think of the expression tree, when I do this

increment_log_prob(-sigma); increment_log_prob(-square((y - mu) / sigma));

you get an expression that looks like:

lp == 0 - sigma - square((y - mu) / sigma);

So the derivative w.r.t. sigma and y and mu is defined.

But if you

if (theta < -5 || theta > 5) increment_log_prob(-1e10);

then you have

lp == 0 - 1e10

or

lp == 0

and there's no connection back to the parameters. So no gradient.


bob-carpenter commented 9 years ago

Add the advice from Ben Goodrich and Krzysztof Sakrejda on stan-users in response to a query from Jonathan Gilligan:

Jonathan:

There have been some great discussions here about how important it is to use pairs() plots to check for sampling problems (bottlenecks, correlations, etc.), but I would love advice about how to work with big multilevel models where there are a small number of top-level hyperparameters characterizing hyperpriors, but at lower levels have one or more priors (each with one or more parameters) for each of a large number of groups, so pairs() plots of all combinations of parameters would require too many panes to be useful or practical.

Ben:

I would say to always include lp, top-level hyperparameters, and any variance / standard deviation that goes into the likelihood. If that does not point to the source of your problem, then maybe do additional pairs plots with lp and a batch of lower-level parameters. High correlations are not a big deal, per se, but can be problematic when combined with changing variances.

Krzysztof:

When choosing which lower-level parameters to plot it helps to divide them into groups based on the amount of data available.

bob-carpenter commented 9 years ago

Following Ben's advice on stan-users:

For scale parameters, I do

bob-carpenter commented 9 years ago

from #1564:

bob-carpenter commented 8 years ago

From Krzysztof Sakrejda on stan-users:

... slice sampling and MH don't make sense since we have HMC but ADVI and optimization have different enough properties that they're worth including. I just looked at the Stan manual and it doesn't really address the choice of algorithms in any clear section (maybe it's in the intro in story form) so maybe right before section 1.5 which launches into talking about sampling there could be a little section about choice of algorithms.
bob-carpenter commented 8 years ago

Comment from @bgoodri on stan-users:

There are many ways of defining a partial correlation. Some include

bob-carpenter commented 8 years ago

Add missing data imputation for the multivariate normal. Use the bivariate example, modified slightly from the one @betanalpha discussed on the stan-users list. It marginalizes out the missing data. In this particular case, a total of N_miss instances of y have y[n,2] missing:

data {
  int<lower=0> N_nonmiss;
  vector[2] y[N_nonmiss];        // fully observed pairs
  int<lower=0> N_miss;
  real y1_obs[N_miss];           // first component of pairs with y[n,2] missing
}
parameters {
  vector[2] mu;
  cov_matrix[2] Sigma;
}
model {
  // prior on mu
  // prior on Sigma; see the manual for examples

  // fully observed pairs
  y ~ multi_normal(mu, Sigma);
  // pairs with y[n,2] missing: the marginal for the first component
  y1_obs ~ normal(mu[1], sqrt(Sigma[1, 1]));
}

We can also sample y[n,2] using the analytic conditional p(y[n,2] | y[n,1], mu, Sigma) given here:

https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions
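
A sketch of what that could look like in generated quantities, using the standard bivariate normal conditional and assuming the data and parameters blocks of the program above:

generated quantities {
  real y2_miss[N_miss];
  for (n in 1:N_miss) {
    real cond_mean;
    real cond_sd;
    cond_mean = mu[2] + Sigma[2, 1] / Sigma[1, 1] * (y1_obs[n] - mu[1]);
    cond_sd = sqrt(Sigma[2, 2] - square(Sigma[2, 1]) / Sigma[1, 1]);
    y2_miss[n] = normal_rng(cond_mean, cond_sd);
  }
}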

And be sure to emphasize, as Michael says, that the missingness is a modeling problem, not just an afterthought. We can compare this all to the brute-force approach to missing data currently in the manual and talk about how the analytic marginalizations speed things up.

I'll probably also include the brute-force approach and change the prior to a Cholesky factor.

bob-carpenter commented 8 years ago

Add the log logistic as an example of a user-defined distribution:

real loglogistic_log(real y, real a, real b) {
  return log(b) - log(a) + (b - 1) * (log(y) - log(a))
         - 2 * log1p_exp(b * (log(y) - log(a)));
}

from: https://en.wikipedia.org/wiki/Log-logistic_distribution
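
To make the manual example self-contained, the function could be wrapped in a functions block (with the _lpdf suffix that newer Stan uses for the sampling-statement syntax) and given a placeholder likelihood context; a sketch:

functions {
  real loglogistic_lpdf(real y, real a, real b) {
    return log(b) - log(a) + (b - 1) * (log(y) - log(a))
           - 2 * log1p_exp(b * (log(y) - log(a)));
  }
}
data {
  int<lower=0> N;
  real<lower=0> y[N];
}
parameters {
  real<lower=0> alpha;   // scale
  real<lower=0> beta;    // shape
}
model {
  alpha ~ normal(0, 5);  // placeholder half-normal priors
  beta ~ normal(0, 5);
  for (n in 1:N)
    y[n] ~ loglogistic(alpha, beta);
}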

bob-carpenter commented 8 years ago

Include a naive Bayes classifier example to show how to normalize. It can include discrete or continuous outputs. Add continuous HMM example (perhaps movement HMM, because that'll include circular stats, too).

bob-carpenter commented 8 years ago

Have @bgoodri (or someone else who understands it) write about the QR decomposition used in RStanArm.
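
In the meantime, a sketch of the basic QR reparameterization for a linear regression; this is not RStanArm's exact implementation, and it uses the qr_thin_Q / qr_thin_R functions from more recent Stan releases:

data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
transformed data {
  // thin, scaled QR decomposition: X = Q * R
  matrix[N, K] Q = qr_thin_Q(X) * sqrt(N - 1);
  matrix[K, K] R = qr_thin_R(X) / sqrt(N - 1);
  matrix[K, K] R_inv = inverse(R);
}
parameters {
  real alpha;
  vector[K] theta;            // coefficients on the columns of Q
  real<lower=0> sigma;
}
model {
  y ~ normal(Q * theta + alpha, sigma);
}
generated quantities {
  vector[K] beta = R_inv * theta;   // coefficients on the original X
}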

bob-carpenter commented 8 years ago
bob-carpenter commented 7 years ago

From Camelia (last name?)

I think I have come across an untamed part of stan, and was wondering if you had any insights.

I'm trying to understand the idea of using 'scale-free' parameters. Gelman says "Aim to keep all parameters scale-free. [A way to do this is to] scale by some conventional value, for example if a parameter has a "typical" value of 4.5, you could work with log(theta/4.5) ". Full post here: http://stats.stackexchange.com/questions/189970/in-bayesian-linear-regression-why-do-we-assume-parameter-prior-has-zero-mean

My questions are

transformed parameters {
   // signal distribution parameters    
    phi[i]    <- inv_logit(phi_r[r[i]] + phi_d[d[i]]);
    lambda[i] <- 1 + exp(lambda_r[r[i]] + lambda_d[d[i]]);

    // transformed signal distribution parameters
    a <- lambda[i] * phi[i];
    b <- lambda[i] * (1 - phi[i]);

    // implied search and hit rates
    search_rate[i] <- beta_ccdf(t_i[i], a, b);
    hit_rate[i]    <- beta_conditional_mean(t_i[i], a, b);
}

I haven't been able to find much in the way of explanation of "scale-free" parameters online. If you have any insights / any this makes sense, would be super helpful!

From Jim Savage:

I've always understood the quest for scale-free priors as some transformation we want to make to the data or parameters so that (a) the geometry of the posterior (params, not transformed params) is easy for HMC to navigate and (b) the scale of your data doesn't upset the quality of the model fit. If we make transformations to the data, it's important not to implicitly be informing your prior with the data (which is probably the case if you convert your data to z-scores; this is irrelevant to your particular problem, but I found it interesting when I learned it).

This might be helpful: https://groups.google.com/forum/#!searchin/stan-users/scale-free|sort:relevance/stan-users/FjWL2290A5w/X0_tjLFbBgAJ

From Andrew Gelman:

Hi, scale-free is about computation but it's also about interpretability and it's also about being able to do a better job with priors, especially weakly informative priors. The idea is that scale-free parameters should typically (though not necessarily) be in the (0, 1) range in absolute value, suggesting that things like N(0,1) could be weakly informative priors. So I'm not a fan of a prior centered at 4.5. The answer would have to depend on context, but I think maybe you'd want some multiplicative factor here?

From Michael Betancourt:

It’s not really “scale free” but rather “scale unified”. As you noted, a nice way to think about this is in terms of units. Once you’ve identified the appropriate units of your problem, you can redefine your system of units so that all of the expected effects are around 1 in those units.

This has nice computational benefits — if your effects vary on the order of 10^{6} in one direction and 10^{-6} in the other, then it will take 10^{12} steps to explore everything, but if everything is order 1 then it will take order 1 steps to jump to a new point.

But the real benefit is that it facilitates building good priors. We like to use “weakly” informative priors where we don’t know if an effect exists, but if it does we have some idea of its scale (a recent example that has come up lately is in physiological models — however complicated the body is, blood can’t flow faster than a certain amount, so the effects of any drug can’t change blood flow by more than that amount). Ultimately these manifest as priors of the form

x ~ normal(0, x_scale);

or

x ~ cauchy(0, x_scale);

Now if all of the scales are order 1, then all of your priors will look the same and you have much less bookkeeping to do.

Adding a mean to the prior introduces more information, turning the weakly informative prior into something much more informative. Now if that is information that you actually believe then go for it, but always verify that the fit was good and the prior mean wasn’t in tension with the rest of the model.

bob-carpenter commented 7 years ago

Stephen Martin on stan-users suggests adding mean and variance to the univariate distributions.

As an aside, I also really like the plots for various parameter values that you get on Wikipedia.

bob-carpenter commented 7 years ago

Moved from #757

It'd be nice to add some factor models with constraints as examples in the manual.

From Rick Farouni on stan-users:

For a factor model y = Lf + e, where f ~ N(0, Sigma), I think there are three important cases to consider:

  1. The positive lower triangular (PLT) identification scheme mentioned here: http://apps.olin.wustl.edu/faculty/zhou/GZ-RFS96.pdf, where Sigma = I (the identity matrix) and the diagonal entries of L are polarity constrained.
  2. The modified PLT, where the factors f are still orthogonal but the elements of the diagonal of Sigma are as in Aguilar, O. & West, M. (2000), Bayesian dynamic factor models and portfolio allocation.
  3. Sigma is a correlation matrix, but now instead of requiring D(D-1)/2 constraints on L, we need D(D-1) to ensure local uniqueness under rotation, plus the polarity constraint to ensure global rotational uniqueness. This paper lists the conditions: http://link.springer.com/article/10.1007/s11336-012-9259-3

But as this paper shows, imposing constraints is problematic: https://www.stat.washington.edu/research/reports/2011/tr589.pdf

bob-carpenter commented 7 years ago

John Hall suggests referring to section 21.6 of Gelman and Hill (first edition), formula 21.14 to get an estimate of how much pooling would be done by adding a hierarchical component to a model. Add this discussion in the reparameterization of hierarchical models section of the manual.
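
The formula in question is presumably the usual precision-weighted shrinkage estimate, which in generic notation is roughly

\hat{\alpha}_j \approx \frac{(n_j / \sigma_y^2)\,\bar{y}_j + (1 / \sigma_\alpha^2)\,\mu_\alpha}
                            {n_j / \sigma_y^2 + 1 / \sigma_\alpha^2},
\qquad
\omega_j = \frac{\sigma_y^2 / n_j}{\sigma_y^2 / n_j + \sigma_\alpha^2},

where omega_j is the pooling factor for group j (near 1 means nearly complete pooling toward mu_alpha, near 0 means almost no pooling). The reparameterization section could point at this when explaining how much shrinkage to expect.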

bob-carpenter commented 7 years ago

From @jmh530 on the 2.16 manual issue: https://github.com/stan-dev/stan/issues/2265#issuecomment-292336592

In Section 8.2 on QR factorization, I feel like the paragraph that recommends you use it when you do not have an informative prior on the location of beta is mistaken (also mentioned in 8.3).

Consider a probability distribution Y ~ N(XB, sigma) with a prior B ~ N(mu_B, sd_B). Y is then re-written as Y ~ N(Q theta, sigma). The equivalent distribution on the prior of theta would be theta ~ MVN(mu_B R, t(R) diag(sd_B ^ 2) R), assuming mu_B is 1XN.

My point is that you can still include a prior on the mean of beta, you just have to do the appropriate transformation in order to express it in terms of the new beta. This raises some other questions about the reasoning in that section, but I'll leave off here.

@jgabry replied @jmh530 yeah you're right that it's possible to use the QR approach even with a non-zero prior mean on the coefficients, so perhaps the recommendation in the manual is worded too strongly. @bgoodri should comment on this though because he wrote that section and I forget if he had a reason for discouraging this (he's definitely aware that it's possible). One reason might just be that it is strange to have a prior belief that there is a dense covariance matrix for the MVN on theta, since theta represents the coefficients on the columns of the Q matrix and those columns are orthogonal.

@jmh530 replied True. When I was playing around with it, I had extracted the standard deviations from the covariance matrix and just used the univariate normal distribution.

bob-carpenter commented 7 years ago

See https://github.com/stan-dev/stan/issues/2273 for a discussion of numerical difficulties with simplexes used as regression coefficients in 5000 dimensions. I added a note to the simplex section of the manual on possible solutions, but going over the example itself would be a nice case study.

bob-carpenter commented 6 years ago

I like the idea of defining the Bayesian posterior for R2, defined by @bgoodri in a response on StackOverflow: https://stackoverflow.com/questions/44759319/overall-predictive-power-e-g-r2-for-bayesian-linear-mixed-models
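
One common way to compute a posterior R2 in a Stan program, sketched here for a plain linear regression; this may not match the exact definition in Ben's answer:

data {
  int<lower=0> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + X * beta, sigma);
}
generated quantities {
  // per-draw R^2: modeled variance over modeled plus residual variance
  real R2;
  {
    vector[N] mu = alpha + X * beta;
    R2 = variance(mu) / (variance(mu) + square(sigma));
  }
}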

bob-carpenter commented 6 years ago
data {
  real<lower = 0, upper = 1> theta;  // clutter ratio
  int<lower = 0> N;
  vector[N] y;
}
parameters {
  real mu;
}
model {
  for (n in 1:N)
    target += log_mix(theta,
                      normal_lpdf(y[n] | mu, 1),
                      normal_lpdf(y[n] | 0, 10));
}

Simulate data in R and fit:

theta <- 0.5
N <- 200
mu <- 4.3

y <- rep(0, N);
for (n in 1:N) {
  if (rbinom(1, 1, theta)) {
    y[n] <- rnorm(1, mu, 1)
  } else {
    y[n] <- rnorm(1, 0, 10)
  }
}

library(rstan)
fit <- stan("clutter.stan", data = list(theta=theta, N=N, y=y))
bob-carpenter commented 6 years ago

http://discourse.mc-stan.org/t/large-poisson-model-with-individual-effects-is-too-slow/2112/2

I started this in the programming.tex and added the reference, but I can't quite follow what is being reparameterized to what.

bob-carpenter commented 6 years ago

There used to be a section in the manual on error and warning messages. It was very incomplete and was removed for 2.18. Add one back in that properly catalogues all the error and warning messages.

bob-carpenter commented 6 years ago
X == X_miss + X_obs

so

X * beta == X_miss * beta + X_obs * beta

and we can use csr_matrix_times_vector for the multiplies.
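
A sketch of how that might look, assuming the CSR components (values w, column indices v, row pointers u) of X_obs and the sparsity pattern of X_miss are passed in as data, and the missing entries themselves are parameters:

data {
  int<lower=0> N;
  int<lower=1> K;
  vector[N] y;
  // CSR representation of X_obs (observed entries; zeros at missing positions)
  int<lower=0> nz_obs;
  vector[nz_obs] w_obs;
  int v_obs[nz_obs];
  int u_obs[N + 1];
  // CSR pattern of X_miss (locations of the missing entries)
  int<lower=0> nz_miss;
  int v_miss[nz_miss];
  int u_miss[N + 1];
}
parameters {
  vector[nz_miss] x_miss;       // imputed missing predictor values
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu;
  mu = csr_matrix_times_vector(N, K, w_obs, v_obs, u_obs, beta)
     + csr_matrix_times_vector(N, K, x_miss, v_miss, u_miss, beta);
  x_miss ~ normal(0, 1);        // placeholder model for the missing entries
  beta ~ normal(0, 1);
  y ~ normal(mu, sigma);
}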