stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
388 stars 133 forks source link

`stan_glmer()` and argument `offset` #541

Open fweber144 opened 3 years ago

fweber144 commented 3 years ago

Summary:

A stan_glmer() fit with offset seems to be handled incorrectly when posterior_linpred() (for example) is called with newdata but without offset.

Description:

First, unlike a stan_glm() fit, a stan_glmer() fit with offsets specified via argument offset doesn't produce an appropriate warning when posterior_linpred() (for example) is called with newdata but without offset. Secondly (and more importantly), in that posterior_linpred() call, the stan_glm() and the stan_glmer() fit differ in what is added to the linear predictor: For stan_glm(), a vector of zeros (and of appropriate length) is added, while for stan_glmer(), the original offsets are recycled so that they match the number of observations in newdata. The underlying issue might be that .pp_data_mer() lacks a call to .pp_data_offset(), unlike .pp_data() and .pp_data_nlmer().

Reproducible Steps:

library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
data("kidiq")
kidiq_gr <- within(kidiq, {
  agegr <- cut(mom_age,
               breaks = unique(quantile(mom_age)),
               include.lowest = TRUE)
  levels(agegr) <- paste0("lvl", seq_len(nlevels(agegr)))
})
set.seed(3492)
offs_vec <- rnorm(nrow(kidiq))
glmm_fit <- stan_glmer(kid_score ~ mom_iq + (1 | agegr),
                       data = kidiq_gr,
                       offset = offs_vec,
                       seed = 734572)
kidiq_gr_new <- head(kidiq_gr, 3)
glmm_pl <- posterior_linpred(glmm_fit, newdata = kidiq_gr_new)

The last line throws the warning

Warning in sweep(eta, 2L, offset, `+`) :
  STATS is longer than the extent of 'dim(x)[MARGIN]'

In contrast, with stan_glm(), an appropriate warning is thrown:

glm_fit <- stan_glm(kid_score ~ mom_iq + agegr,
                    data = kidiq_gr,
                    offset = offs_vec,
                    seed = 734572)
glm_pl <- posterior_linpred(glm_fit, newdata = kidiq_gr_new)

namely

Warning: 'offset' argument is NULL but it looks like you estimated the model using an offset term.

Calling debug(rstanarm:::linear_predictor) before those posterior_linpred() calls reveals that for the stan_glm() fit, offset is a vector of 3 zeros, while for the stan_glmer() fit, offset is the original offs_vec vector (of length nrow(kidiq) = 434).

RStanARM Version:

2.21.2 (from https://mc-stan.org/r-packages/)

R Version:

4.1.0

Operating System:

Windows 10

fweber144 commented 2 years ago

The reprex above where the offset argument was used doesn't seem to have changed from rstanarm 2.21.2 (from https://mc-stan.org/r-packages/, used above) to the new CRAN version 2.21.3, but the alternative way via an offset() term in the formula seems to have changed between these two versions:

library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
data("kidiq")
kidiq_gr <- within(kidiq, {
  agegr <- cut(mom_age,
               breaks = unique(quantile(mom_age)),
               include.lowest = TRUE)
  levels(agegr) <- paste0("lvl", seq_len(nlevels(agegr)))
})
set.seed(3492)
kidiq_gr$offs_col <- rnorm(nrow(kidiq))
# GLMM:
glmm_fit <- stan_glmer(kid_score ~ mom_iq + (1 | agegr) + offset(offs_col),
                       data = kidiq_gr,
                       seed = 734572)
glmm_drws <- as.matrix(glmm_fit)
kidiq_gr_new <- head(kidiq_gr, 3)
glmm_pl_new <- posterior_linpred(glmm_fit, newdata = kidiq_gr_new)
## Throws:
# Warning message:
# In sweep(eta, 2L, offset, `+`) :
#   STATS is longer than the extent of 'dim(x)[MARGIN]'
##
glmm_pl_new_man <- glmm_drws[, "(Intercept)"] +
  glmm_drws[, "mom_iq", drop = FALSE] %*%
  t(kidiq_gr_new[, "mom_iq", drop = FALSE]) +
  glmm_drws[, paste0("b[(Intercept) agegr:", kidiq_gr_new$agegr, "]"), drop = FALSE]
all.equal(unname(glmm_pl_new), unname(glmm_pl_new_man), tolerance = 1e-15)
## --> With rstanarm v2.21.2: TRUE, but incorrect.
## --> With rstanarm v2.21.3: "Mean relative difference: 0.008601649".
all.equal(unname(glmm_pl_new),
          unname(glmm_pl_new_man +
                   matrix(kidiq_gr$offs_col,
                          nrow = nrow(glmm_drws),
                          ncol = nrow(kidiq_gr_new),
                          byrow = TRUE)),
          tolerance = 1e-15)
## --> With rstanarm v2.21.2: "Mean relative difference: 0.008608396".
## --> With rstanarm v2.21.3: TRUE, but incorrect.
# Desired:
all.equal(unname(glmm_pl_new),
          unname(glmm_pl_new_man +
                   matrix(kidiq_gr_new$offs_col,
                          nrow = nrow(glmm_drws),
                          ncol = nrow(kidiq_gr_new),
                          byrow = TRUE)),
          tolerance = 1e-15)
## --> With rstanarm v2.21.2: "Mean relative difference: 0.009548977".
## --> With rstanarm v2.21.3: "Mean relative difference: 0.01161201"

# GLM:
glm_fit <- stan_glm(kid_score ~ mom_iq + offset(offs_col),
                    data = kidiq_gr,
                    seed = 734572)
glm_drws <- as.matrix(glm_fit)
glm_pl_new <- posterior_linpred(glm_fit, newdata = kidiq_gr_new)
## Throws:
# Warning message:
# 'offset' argument is NULL but it looks like you estimated the model using an offset term.
##
glm_pl_new_man <- glm_drws[, "(Intercept)"] +
  glm_drws[, "mom_iq", drop = FALSE] %*%
  t(kidiq_gr_new[, "mom_iq", drop = FALSE])
all.equal(unname(glm_pl_new), unname(glm_pl_new_man), tolerance = 1e-15)
## --> With rstanarm v2.21.2 and v2.21.3: TRUE (excludes the offsets, but that is acceptable, because
## a warning is thrown).
# Desired:
all.equal(unname(glm_pl_new),
          unname(glm_pl_new_man +
                   matrix(kidiq_gr_new$offs_col,
                          nrow = nrow(glm_drws),
                          ncol = nrow(kidiq_gr_new),
                          byrow = TRUE)),
          tolerance = 1e-15)
## --> With rstanarm v2.21.2 and v2.21.3: "Mean relative difference: 0.009612115"

So now, with rstanarm v2.21.3, the issue described above for the offset argument also occurs for the alternative formula way. (In v2.21.2, no offsets were added at all when using the formula way, which is basically issue #542, but with !is.null(newdata).)