stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0

posterior distribution of linear predictor #85

Closed: jgabry closed this issue 8 years ago

jgabry commented 8 years ago

Moving from stan-users post:

@tomwallis wrote:

I have a bernoulli GLMM for which I would like to plot predicted probabilities (similar to e.g. the predict method for a normal R glm, type = "response"). Is the only way to approximate these by switching the model formula to binomial (with one trial per row in my original data) and then using posterior_predict to approximate the probabilities? ... I think a user-facing function for this would be really useful (similar to predict(type = "response")).

The reason my data are bernoulli rather than binomial is that I've got two random effects that cross over. What I'd like to do is plot the predicted probabilities marginalising over the levels of the random effects, against aggregated data (also marginalising over the levels of the random effects).

Requiring this to be done by simulating N binomial draws from the model, then averaging those draws, seems unnecessary. How many draws should be used? If one is interested in the plogis, why not allow the uncertainty on this to be visualized directly, rather than adding another source of uncertainty (the number of binomial draws used to approximate it)?

jgabry commented 8 years ago

@tomwallis The new branch https://github.com/stan-dev/rstanarm/tree/feature/linear-predictor has a posterior_linpred function that should give you the posterior distribution of the linear predictor (also allowing newdata). You can then do whatever you want with it (e.g. apply the inverse-link). To install from the new branch you can do:

library("devtools")
install_github("stan-dev/rstanarm", ref = "feature/linear-predictor", args = "--preclean")

Let me know if that works for you.

@bgoodri Do you still prefer not including a function like this in the released version?

bgoodri commented 8 years ago

I don't hate it, but my teaching mantra is to only analyze (functions of) the posterior predictive distribution. What is wrong with doing colMeans(posterior_predict(posterior, newdata)) when posterior is generated by a Bernoulli likelihood?
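As a minimal sketch of this suggestion (assuming `fit` is a hypothetical stan_glm fit with a Bernoulli likelihood and `newdata` is a prediction grid; neither name comes from the thread):

```r
library(rstanarm)

# posterior_predict() returns a draws-by-observations matrix of 0/1
# outcomes for a Bernoulli model, so the column means estimate
# Pr(y = 1) at each row of newdata.
yrep <- posterior_predict(fit, newdata = newdata)
phat <- colMeans(yrep)  # one estimated probability per row of newdata
```

This averages over both parameter uncertainty and the Bernoulli sampling noise, which is the crux of the disagreement that follows.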


paul-buerkner commented 8 years ago

When it's just about getting the predicted values, it of course won't matter which way you go. However, if you want to compute the uncertainty (i.e. CIs), you may either do this for the response or for the "regression line" which yields different results. For marginal plots (as mentioned by the OP), plotting the "regression line" is likely what most users would want to do (at least from my experience) so I think the new function will be a nice addition to rstanarm.
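To illustrate the distinction, here is a hedged sketch of the two kinds of 80% intervals for a logistic model, assuming a fitted model `fit` and a prediction grid `nd` (hypothetical names):

```r
library(rstanarm)

# (a) Interval for the response: includes Bernoulli sampling variability,
# since posterior_predict() draws 0/1 outcomes from the likelihood.
yrep <- posterior_predict(fit, newdata = nd)
ci_response <- apply(yrep, 2, quantile, probs = c(.1, .9))

# (b) Interval for the "regression line": only the uncertainty in
# plogis(eta), i.e. in the inverse-link of the linear predictor.
eta <- posterior_linpred(fit, newdata = nd)
ci_line <- apply(plogis(eta), 2, quantile, probs = c(.1, .9))
```

The two intervals generally differ, and (b) is the one that corresponds to the marginal "regression line" plots discussed here.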

tomwallis commented 8 years ago

@bgoodri: I suppose that's fine, but if one is interested in the shape and uncertainty of the assumed sigmoid (and its fit to the data points), then it would be nice to be able to plot that uncertainty without the additional uncertainty of sampling from the bernoulli / binomial. That is, imagine I only take one trial (i.e. bernoulli): then my "sigmoid" is a series of ones and zeros. Taking 1000 trials makes the mean approximation quite good, but this is arbitrary. Why add this additional uncertainty?

@jgabry: I should get time this afternoon to try the new branch, thanks for making it.

In trying to test out @bgoodri's suggestion, I have come across an error (perhaps it's trivial, but I haven't worked out why it's happening yet):

library(rstanarm)

dat <- mtcars

fit1 <- stan_glm("am ~ wt + cyl", data = dat,
                 family = binomial(link = "logit"))

fit1

# I would like to plot probability of `am` as a function of `wt`:
newdata = expand.grid(am = 1, wt = seq(1, 6, length.out = 10),
                      reps = 1:10,
                      cyl = unique(dat$cyl))

yhat <- posterior_predict(fit1, newdata, draws = 100)

The final line results in Error in formula(object)[[2L]] : subscript out of bounds. Am I doing something stupid here?

tomwallis commented 8 years ago

Here's an example of a plot similar to what I'd like to do (from this paper-- it's not exactly a GLM but serves to illustrate). The data points show proportion correct at some level of an x variable (i.e. the data points are the means of binomial outcomes), and the curves show 100 draws from the posterior distribution of a model. Each curve is plotted with a low alpha level (transparency), giving an idea for where the posterior distribution is more certain.

[image: example figure from the paper: data points with 100 semi-transparent posterior curves]

Unless I'm missing something, to generate a plot like this using the colMeans(posterior_predict(posterior, newdata)) recommendation I would need to sample hundreds of bernoulli trials from the posterior at each x level I want to plot a curve at, for each MCMC sample – and depending how many bernoulli trials I sampled it wouldn't be a smooth curve any more, but vertically jittered according to the binomial uncertainty of the draw I happened to make.

In this case, it seems more straightforward to me to plot samples of the inverse link directly, as the posterior_linpred function should allow.
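A rough sketch of that direct approach, assuming a fitted `fit` and a grid `nd` that varies a single predictor `x` (all names hypothetical):

```r
library(rstanarm)

# Each posterior draw of the linear predictor maps to one smooth
# inverse-logit curve; plot 100 of them with low alpha.
eta <- posterior_linpred(fit, newdata = nd)  # draws x nrow(nd)
p <- plogis(eta)                             # each row is a smooth curve
idx <- sample(nrow(p), 100)                  # 100 posterior draws

matplot(nd$x, t(p[idx, ]), type = "l", lty = 1,
        col = rgb(0, 0, 0, alpha = 0.1),    # transparency shows density
        xlab = "x", ylab = "Pr(y = 1)")
```

No Bernoulli resampling is involved, so each curve is smooth by construction.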

jgabry commented 8 years ago

For marginal plots (as mentioned by the OP), plotting the "regression line" is likely what most users would want to do (at least from my experience) so I think the new function will be a nice addition to rstanarm.

@paul-buerkner I do agree that's what most users would do, but I'm not sure they should do it. For marginal plots, I think it's almost always preferable to plot how the posterior predictive distribution changes as one predictor is varied.

jgabry commented 8 years ago

@tomwallis How about something like this:

nd <- data.frame(x1 = 1, x2 = 0, x3 = seq(0, 10, .5))  # fix x1,x2 and vary x3
pp <- replicate(reps, posterior_predict(fit, newdata = nd)) # maybe reps=50 or 100 or something 
plotdata <- apply(pp, 3, colMeans)
plot(plotdata[, 1], type = "l")  # make whatever kind of plot using columns of plotdata
for (j in 2:ncol(plotdata)) {
  lines(plotdata[, j])
}

tomwallis commented 8 years ago

@jgabry I'm trying to run your code in an example, and I'm still getting the error posted above on the prediction method. I get this for the minimal example posted above and also for an example using my own data.

pp <- replicate(10, posterior_predict(fit5, newdata = nd)) # maybe reps=50 or 100 or something

Error in formula(object)[[2L]] : subscript out of bounds

Any idea what's going wrong here?

bgoodri commented 8 years ago

I think it is because you started with

fit1 <- stan_glm("am ~ wt + cyl", data = dat, ...

R somehow allows a formula in quotation marks, but it seems to confuse posterior_predict. They are unnecessary, so you can just do

fit1 <- stan_glm(am ~ wt + cyl, data = dat, ...

Ben
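For completeness, if the formula really does arrive as a string (e.g. built programmatically), coercing it with as.formula() before fitting sidesteps the issue; this is a small sketch, not something from the thread:

```r
# Convert a character string to a proper formula object before fitting.
f <- as.formula("am ~ wt + cyl")
class(f)  # "formula"

# fit1 <- stan_glm(f, data = mtcars, family = binomial(link = "logit"))
```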


tomwallis commented 8 years ago

Yes, that fixed it. Thanks @bgoodri.

@jgabry 's suggestion produces the following in that minimal example. The first plot shows the code above computed using all 4000 MCMC samples.

[image: plot computed from all 4000 MCMC samples]

The next plot is created by setting draws = 100 in posterior predict:

[image: plot with draws = 100]

The added uncertainty is because the binomial mean is being computed from 100 draws (replicated 100 times) rather than 4000 draws (replicated 100 times). Perhaps this is ok (because in this case the uncertainty in the underlying linear predictor is also reduced with more samples), but I still don't see why one would want to add that additional uncertainty if one is interested in visualising the uncertainty in the linear predictor / inverse link (as in my example). The plots here look jagged because of the binomial variability – each MCMC sample defines one smooth inverse link function.

tomwallis commented 8 years ago

I think @jgabry's plot code that I used above is averaging over the wrong dimension:

nd <- data.frame(x1 = 1, x2 = 0, x3 = seq(0, 10, .5))  # fix x1,x2 and vary x3
pp <- replicate(reps, posterior_predict(fit, newdata = nd)) # maybe reps=50 or 100 or something 
plotdata <- apply(pp, 3, colMeans)

is taking the mean over MCMC samples rather than over replicates. I imagine we want something more like

pp <- replicate(n_reps, posterior_predict(fit1, newdata = nd))
plotdata <- apply(pp, 2, rowMeans)  # average bernoulli trials over reps

One can then e.g. take quantiles over the MCMC estimates for plotting:

quants <- apply(plotdata, 2, quantile, probs = c(.1, .5, .9))  # quantiles over mcmc samples
nd <- cbind(nd, t(quants))

ggplot(dat, aes(x = wt, y = am, colour = factor(cyl), fill = factor(cyl))) +
  geom_line(aes(y=`50%`), data = nd) +
  geom_ribbon(aes(ymax = `90%`, ymin = `10%`, colour = NULL), data = nd, alpha = 0.2) +
  geom_point(position = position_jitter(height = 0.05))

[image: quantile ribbon plot of posterior predictions]

Or by plotting individual MCMC samples:

# plot of individual MCMC sample chains:
samples <- sample(nrow(plotdata), 100)  # plot 100 draws from the posterior

plt <- ggplot(dat, aes(x = wt, y = am, colour = factor(cyl), fill = factor(cyl)))

for (i in 1:length(samples)){
  tmp_dat <- data.frame(wt = nd$wt, am = plotdata[samples[i], ],
                        cyl = nd$cyl)
  plt <- plt + geom_line(data = tmp_dat, alpha = 0.2)
}
plt <- plt + geom_point(position = position_jitter(height = 0.05))
plt

[image: 100 individual posterior sample curves over the data points]

I can see the value of only allowing posterior_predict as a user-facing function (as implied by @bgoodri) because it ensures that you can't get the link function wrong. Offering posterior_linpred creates the chance that someone misspecifies the link function (relative to the stan model) and therefore has errors. On the other hand, the model's link function is smooth, not jagged (as shown in plot 2); sampling from the model will only reveal this visually if you use a lot of samples (the above plots were based on 500 replicates).
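One way to mitigate the link-misspecification risk is to pull the inverse link from the fitted model's family object instead of hardcoding plogis; a sketch, assuming a fitted `fit` and grid `nd` (hypothetical names):

```r
library(rstanarm)

eta <- posterior_linpred(fit, newdata = nd)
# family() returns the model's family object, whose $linkinv is the
# correct inverse link (e.g. plogis for binomial(link = "logit")).
invlink <- family(fit)$linkinv
p <- invlink(eta)
```

This keeps the transformation tied to the model specification rather than the user's memory of it.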

jgabry commented 8 years ago

I think @jgabry's plot code that I used above is averaging over the wrong dimension

That might be true, I'll try to take a look later. I always forget in what order the array dimensions will be after calling replicate and I didn't have R open at the time. It's definitely possible I got the dimension wrong.

jgabry commented 8 years ago

I can see the value of only allowing posterior_predict as a user-facing function (as implied by @bgoodri) because it ensures that you can't get the link function wrong. Offering posterior_linpred creates the chance that someone misspecifies the link function (relative to the stan model) and therefore has errors.

I suppose we could also provide an option so that the inverse-link transformation is done internally before returning the result to the user, which would prevent that kind of error.
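One possible shape for that interface, sketched here as a hypothetical `transform` argument (not something that existed at the time of this thread):

```r
library(rstanarm)

# Untransformed: draws of the linear predictor eta.
eta <- posterior_linpred(fit, newdata = nd)

# With the hypothetical option: inverse link applied internally,
# so the user never touches plogis() and cannot pick the wrong one.
p <- posterior_linpred(fit, newdata = nd, transform = TRUE)
```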

jgabry commented 8 years ago

@tomwallis And yes, I do think my use of apply was with the wrong margin specified. Thanks for catching that.

tomwallis commented 8 years ago

Ok, sorry for the delay – I tested out the posterior_linpred function and that works as intended. You can see the results here:

[image: four-panel comparison of posterior-predictive plots (top row) and inverse-link plots (bottom row)]

The top row shows the plots generated from samples / replicates from the bernoulli posterior. The bottom row shows plots generated from the inverse link function. You can see that the quantile plots are basically indistinguishable, but the "samples from the posterior" plots are quite different. The jaggedness of the top right plot is created because of the replicates, whereas the bottom right plot reflects the smooth underlying link function. The top right plot (posterior samples) will increasingly approximate the bottom right plot as you increase the number of replicates.

I'm still open to having my mind changed, but the bottom right plot makes more sense to me (perhaps imagine a case where one can bin bernoulli trials in the data and thus visually assess fit, as in the example I posted above). Do you imagine that posterior_linpred or, even better, both that and posterior_invlink would be included in future versions of rstanarm?

Code to generate:

# Minimal example
library(rstanarm)
library(ggplot2)

dat <- mtcars

fit1 <- stan_glm(am ~ wt + cyl, data = dat,
                 family = binomial(link = "logit"))

fit1

nd <- data.frame(wt = seq(0, 6, length.out = 50),
                 cyl = 6,
                 am = 0)

n_reps <- 500  # number of reps for bernoulli trials...
pp <- replicate(n_reps, posterior_predict(fit1, newdata = nd))
plotdata <- apply(pp, 2, rowMeans)  # average bernoulli trials over reps
quants <- apply(plotdata, 2, quantile, probs = c(.1, .5, .9))  # quantiles over mcmc samples
nd_samples <- cbind(nd, t(quants))

ggplot(dat, aes(x = wt, y = am, colour = factor(cyl), fill = factor(cyl))) +
  geom_line(aes(y=`50%`), data = nd_samples) +
  geom_ribbon(aes(ymax = `90%`, ymin = `10%`, colour = NULL), data = nd_samples, alpha = 0.2) +
  geom_point(position = position_jitter(height = 0.05)) +
  ggtitle("quantiles of posterior samples")

# plot of individual MCMC sample chains:
samples <- sample(nrow(plotdata), 100)  # plot 100 draws from the posterior

plt <- ggplot(dat, aes(x = wt, y = am, colour = factor(cyl), fill = factor(cyl)))

for (i in 1:length(samples)){
  tmp_dat <- data.frame(wt = nd$wt, am = plotdata[samples[i], ],
                        cyl = nd$cyl)
  plt <- plt + geom_line(data = tmp_dat, alpha = 0.2)
}
plt <- plt + geom_point(position = position_jitter(height = 0.05)) +
  ggtitle("posterior samples with reps")
plt

### using the "linear_predictor" function in the development branch:
# https://github.com/stan-dev/rstanarm/issues/85
# library("devtools")
# install_github("stan-dev/rstanarm", ref = "feature/linear-predictor", args = "--preclean")

# for new data, take samples from the linear predictor:
pp_eta <- posterior_linpred(fit1, newdata = nd)
pp_invlog <- plogis(pp_eta)
quants_invlog <- apply(pp_invlog, 2, quantile, probs = c(.1, .5, .9))  # quantiles over mcmc samples
nd_invlog <- cbind(nd, t(quants_invlog))

ggplot(dat, aes(x = wt, y = am, colour = factor(cyl), fill = factor(cyl))) +
  geom_line(aes(y=`50%`), data = nd_invlog) +
  geom_ribbon(aes(ymax = `90%`, ymin = `10%`, colour = NULL), data = nd_invlog, alpha = 0.2) +
  geom_point(position = position_jitter(height = 0.05)) +
  ggtitle("quantiles of the inverse link")

plt <- ggplot(dat, aes(x = wt, y = am, colour = factor(cyl), fill = factor(cyl)))

for (i in 1:length(samples)){
  tmp_dat <- data.frame(wt = nd$wt, am = pp_invlog[samples[i], ],
                        cyl = nd$cyl)
  plt <- plt + geom_line(data = tmp_dat, alpha = 0.2)
}
plt <- plt + geom_point(position = position_jitter(height = 0.05)) +
  ggtitle("posterior samples of inverse link")
plt

jgabry commented 8 years ago

I'm still open to having my mind changed, but the bottom right plot makes more sense to me (perhaps imagine a case where one can bin bernoulli trials in the data and thus visually assess fit, as in the example I posted above). Do you imagine that posterior_linpred or, even better, both that and posterior_invlink would be included in future versions of rstanarm?

@tomwallis I'm not sure if we'd do it as two separate functions, e.g. posterior_linpred and posterior_linkinv, or as a single function with an argument to do the inverse-link transformation, but either way I'm not opposed to officially having this functionality in rstanarm. It does seem like it would be useful in some cases and we could put a note in the doc about preferring the posterior predictive distribution (in any situation where it makes sense).

@bgoodri I'm also ok with not including this functionality in rstanarm if that's still your preference, although I'm curious what alternative you'd prefer to the plots @tomwallis is proposing.

bgoodri commented 8 years ago

Maybe we can merge that branch but not expose posterior_linpred. That way, people who know what they are doing can make plots like these and people who don't know what they are doing will have a harder time doing naive Bayes classification. We could have posterior_predict call posterior_linpred internally and then draw from the likelihood so there is not much duplicated code.
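A rough sketch of the refactor being described, for a Bernoulli model only (function and argument names are hypothetical, not rstanarm internals):

```r
# posterior_predict builds on posterior_linpred: compute eta once,
# apply the model's inverse link, then draw from the likelihood.
posterior_predict_bernoulli <- function(object, newdata = NULL) {
  eta <- posterior_linpred(object, newdata = newdata)  # shared code path
  p <- family(object)$linkinv(eta)                     # inverse link
  # One Bernoulli draw per (posterior draw, observation) cell.
  matrix(rbinom(length(p), size = 1, prob = p), nrow = nrow(p))
}
```

Structuring it this way means the linear-predictor code exists in one place whether or not posterior_linpred itself is exposed.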

jgabry commented 8 years ago

There's also the option of exposing it but flagging it as internal (easy with roxygen2) so it isn't included in the index of functions in the R help but can still be called without ::: and the doc would still be accessible via help() (if you knew it was there).


tomwallis commented 8 years ago

@bgoodri: It's nice that you're implicitly including me in the people-who-know-what-they're-doing category, but I actually don't understand what you mean by "doing naive Bayes classification". As in, taking the linear predictor (corresponding to log odds) and thresholding it to get discrete event predictions? Aside from discarding most of the nice uncertainty a Bayesian analysis quantifies, why would that be a bad idea?

bgoodri commented 8 years ago

That is the main reason it is a bad idea. But it also fails to utilize the fact that a posterior distribution implies a posterior predictive distribution for the outcome, which in a lot of ways is more useful than the posterior distribution itself. It isn't coherent to update your beliefs about the parameters and then use some other mechanism to predict.

bgoodri commented 8 years ago

Even if it is marked as internal, if it is documented, people will find its help page when they scroll through the documentation. I think for anyone who wants to use it, it ought to be pretty self-explanatory as to what it does.

jgabry commented 8 years ago

I think marking it with @keywords internal should be sufficient. In fact I think this situation (i.e., a function intended only for experienced users) is the primary (maybe only) use of the @keywords tag. From Hadley:

Using the internal keyword removes all functions in the associated .Rd file from the documentation index and disables some of their automated tests. A common use case is to both export a function (using @export) and marking it as internal. That way, advanced users can access a function that new users would be confused about if they were to see it in the index.

(from https://cran.r-project.org/web/packages/roxygen2/vignettes/rd.html)

It won't show up in the pdf manual (but who reads that anyway) and more importantly it won't show up if you use RStudio's Packages/Help interface. For example, this is actually the case currently with our get_x, get_y, get_z functions. They're exported but you really can't find them at all unless you do help("get_x", "rstanarm") and you'd have to know to do that.
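The roxygen2 pattern being discussed looks roughly like this (the function body and signature here are placeholders, not the actual rstanarm implementation):

```r
#' Posterior distribution of the linear predictor
#'
#' Exported and documented, but hidden from the help index via the
#' internal keyword; still reachable with help("posterior_linpred").
#' @keywords internal
#' @export
posterior_linpred <- function(object, newdata = NULL, ...) {
  # ... implementation ...
}
```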

bgoodri commented 8 years ago

What is the status of this? I'm okay with keyword internal.

jgabry commented 8 years ago

Ok cool. I can take care of it. I've been using it on another branch too for pp_check stuff (makes it simpler to do those binned residual plots Andrew likes) so I can include it when I merge that soon.


jgabry commented 8 years ago

Ok, this is in so it will be in the next release. The posterior distribution of the linear predictor (or transformed linear predictor) will be accessible via posterior_linpred.

tomwallis commented 8 years ago

Awesome, thanks Jonah.