stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
379 stars 134 forks source link

not all stanreg point estimates are posterior medians #189

Open CDEager opened 7 years ago

CDEager commented 7 years ago

Summary:

Not all point estimates in stanreg objects are actually posterior median values (notably fitted.values, residuals, and, for mixed effects regressions, estimates returned by coef).

Description:

In the rstanarm:::stanreg function, the median values of the regression coefficients are taken and then used to calculate the fitted.values element of the stanreg object. However, since the median (unlike the mean) is not a linear function, the resulting fitted values are not the medians of the posterior distribution for the fitted values. This then effects the residuals element of the object, and a similar issue exists for the calculation of the estimates returned by coef for mixed effects models, since the median fixed effects values and median random effects values are being added together in rstanarm:::coef.stanreg.

Reproducible Steps:

I will demonstrate with the fitted values for linear mixed effects regressions, because this is a simple case; but the issue generalizes at least to the other cases mentioned above, and for other regression families as well. The following two functions should return TRUE and a message indicating the Mean relative difference, respectively, for any model fit with stan_lmer:

rstanarm_fitted_values <- function(object) {
  if (!inherits(object, "stanreg") || !inherits(object, "lmerMod") ||
  !isTRUE(all.equal(object$family, gaussian()))) {
    stop("Function only tested on linear mixed effects regressions.")
  }

  modmat <- posterior_linpred(object, XZ = TRUE)
  coefmat <- as.matrix(object)[, 1:ncol(modmat), drop = FALSE]
  coefs <- t(apply(coefmat, 2, median))
  point_estimates <- as.vector(coefs %*% t(modmat))

  return(all.equal(unname(point_estimates), unname(object$fitted.values)))
}

median_fitted_values <- function(object) {
  if (!inherits(object, "stanreg") || !inherits(object, "lmerMod") ||
  !isTRUE(all.equal(object$family, gaussian()))) {
    stop("Function only tested on linear mixed effects regressions.")
  }

  modmat <- posterior_linpred(object, XZ = TRUE)
  coefmat <- as.matrix(object)[, 1:ncol(modmat), drop = FALSE]
  mu <- coefmat %*% t(modmat)
  posterior_medians <- apply(mu, 2, median)

  return(all.equal(unname(posterior_medians), unname(object$fitted.values)))
}

RStanARM Version:

2.15.3

R Version:

3.3.3

Operating System:

Windows 10

bgoodri commented 7 years ago

It is true that fitted.values, residuals, and, coef.mer are not posterior medians. They were calculated to be consistent with the way glm or glmer calculates them from point estimates. Also, the stuff in VarCorr() are posterior means. If the documentation is claiming that they are medians, that is a bug.

CDEager commented 7 years ago

For fitted.values (in the help page stanreg-objects), they are referenced as means. For coef (in the help page stanreg-methods) they are referenced as medians. For other functions on the stanreg-methods page (such as VarCorr), it doesn't say how the point estimates are calculated, but in the help page print.stanreg it says "Regardless of the estimation algorithm, point estimates are medians computed from simulations. For models fit using MCMC ("sampling") the posterior sample is used." So there are some mismatches between the documentation and what the stanreg object actually contains/prints.

But regardless of the documentation, wouldn't it be better to get the actual posterior medians rather than using the posterior medians of different parameters to generate point estimates? I don't think all of the point estimates currently returned even have a formal definition (or what that formal definition would look like), whereas obtaining posterior medians (or means or whatever) has a pretty easy-to-understand meaning that is common within Bayesian analysis. This would still be imitating glm and glmer in the relevant sense (a vector of estimates is being returned rather than a matrix), but without losing the interpretability of the results (with MCMC whatever you return is not going to actually be the same thing as what's returned by glm and glmer anyway). I could see how this might lead to confusion for some users if the coefficients element of the stanreg object can't be applied to the data to obtain the linear.predictors element, but if the posterior sample is consistently used in generating estimates, this could be handled with a sentence or two on the stanreg-methods page; or, alternatively, if means are used rather than medians, then the issue goes away with proper documentation adjustments.

Also, as another example, for models fit with stan_lm, if you use predict with se.fit = TRUE and newdata = NULL, the fit element of the returned list contains means that won't match the linear.predictors element that is returned when predict is called with se.fit = FALSE and newdata = NULL; that is:

# if mod is fit with stan_lm, this statement will not be TRUE
all.equal(predict(mod), predict(mod, se.fit = TRUE)$fit)