`stan_glmer()` and argument `offset`

stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling

GNU General Public License v3.0

388 stars 133 forks source link

Summary:

A stan_glmer() fit with offset seems to be handled incorrectly when posterior_linpred() (for example) is called with newdata but without offset.

Description:

First, unlike a stan_glm() fit, a stan_glmer() fit with offsets specified via argument offset doesn't produce an appropriate warning when posterior_linpred() (for example) is called with newdata but without offset. Secondly (and more importantly), in that posterior_linpred() call, the stan_glm() and the stan_glmer() fit differ in what is added to the linear predictor: For stan_glm(), a vector of zeros (and of appropriate length) is added, while for stan_glmer(), the original offsets are recycled so that they match the number of observations in newdata. The underlying issue might be that .pp_data_mer() lacks a call to .pp_data_offset(), unlike .pp_data() and .pp_data_nlmer().

Reproducible Steps:

library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
data("kidiq")
kidiq_gr <- within(kidiq, {
  agegr <- cut(mom_age,
               breaks = unique(quantile(mom_age)),
               include.lowest = TRUE)
  levels(agegr) <- paste0("lvl", seq_len(nlevels(agegr)))
})
set.seed(3492)
offs_vec <- rnorm(nrow(kidiq))
glmm_fit <- stan_glmer(kid_score ~ mom_iq + (1 | agegr),
                       data = kidiq_gr,
                       offset = offs_vec,
                       seed = 734572)
kidiq_gr_new <- head(kidiq_gr, 3)
glmm_pl <- posterior_linpred(glmm_fit, newdata = kidiq_gr_new)

The last line throws the warning

Warning in sweep(eta, 2L, offset, `+`) :
  STATS is longer than the extent of 'dim(x)[MARGIN]'

In contrast, with stan_glm(), an appropriate warning is thrown:

glm_fit <- stan_glm(kid_score ~ mom_iq + agegr,
                    data = kidiq_gr,
                    offset = offs_vec,
                    seed = 734572)
glm_pl <- posterior_linpred(glm_fit, newdata = kidiq_gr_new)

namely

Warning: 'offset' argument is NULL but it looks like you estimated the model using an offset term.

Calling debug(rstanarm:::linear_predictor) before those posterior_linpred() calls reveals that for the stan_glm() fit, offset is a vector of 3 zeros, while for the stan_glmer() fit, offset is the original offs_vec vector (of length nrow(kidiq) = 434).

RStanARM Version:

2.21.2 (from https://mc-stan.org/r-packages/)

R Version:

4.1.0

Operating System:

Windows 10

library(rstanarm) options(mc.cores = parallel::detectCores(logical = FALSE)) data("kidiq") kidiq_gr <- within(kidiq, { agegr <- cut(mom_age, breaks = unique(quantile(mom_age)), include.lowest = TRUE) levels(agegr) <- paste0("lvl", seq_len(nlevels(agegr))) }) set.seed(3492) kidiq_gr$offs_col <- rnorm(nrow(kidiq)) # GLMM: glmm_fit <- stan_glmer(kid_score ~ mom_iq + (1 | agegr) + offset(offs_col), data = kidiq_gr, seed = 734572) glmm_drws <- as.matrix(glmm_fit) kidiq_gr_new <- head(kidiq_gr, 3) glmm_pl_new <- posterior_linpred(glmm_fit, newdata = kidiq_gr_new) ## Throws: # Warning message: # In sweep(eta, 2L, offset, `+`) : # STATS is longer than the extent of 'dim(x)[MARGIN]' ## glmm_pl_new_man <- glmm_drws[, "(Intercept)"] + glmm_drws[, "mom_iq", drop = FALSE] %*% t(kidiq_gr_new[, "mom_iq", drop = FALSE]) + glmm_drws[, paste0("b[(Intercept) agegr:", kidiq_gr_new$agegr, "]"), drop = FALSE] all.equal(unname(glmm_pl_new), unname(glmm_pl_new_man), tolerance = 1e-15) ## --> With rstanarm v2.21.2: TRUE, but incorrect. ## --> With rstanarm v2.21.3: "Mean relative difference: 0.008601649". all.equal(unname(glmm_pl_new), unname(glmm_pl_new_man + matrix(kidiq_gr$offs_col, nrow = nrow(glmm_drws), ncol = nrow(kidiq_gr_new), byrow = TRUE)), tolerance = 1e-15) ## --> With rstanarm v2.21.2: "Mean relative difference: 0.008608396". ## --> With rstanarm v2.21.3: TRUE, but incorrect. # Desired: all.equal(unname(glmm_pl_new), unname(glmm_pl_new_man + matrix(kidiq_gr_new$offs_col, nrow = nrow(glmm_drws), ncol = nrow(kidiq_gr_new), byrow = TRUE)), tolerance = 1e-15) ## --> With rstanarm v2.21.2: "Mean relative difference: 0.009548977". ## --> With rstanarm v2.21.3: "Mean relative difference: 0.01161201" # GLM: glm_fit <- stan_glm(kid_score ~ mom_iq + offset(offs_col), data = kidiq_gr, seed = 734572) glm_drws <- as.matrix(glm_fit) glm_pl_new <- posterior_linpred(glm_fit, newdata = kidiq_gr_new) ## Throws: # Warning message: # 'offset' argument is NULL but it looks like you estimated the model using an offset term. ## glm_pl_new_man <- glm_drws[, "(Intercept)"] + glm_drws[, "mom_iq", drop = FALSE] %*% t(kidiq_gr_new[, "mom_iq", drop = FALSE]) all.equal(unname(glm_pl_new), unname(glm_pl_new_man), tolerance = 1e-15) ## --> With rstanarm v2.21.2 and v2.21.3: TRUE (excludes the offsets, but that is acceptable, because ## a warning is thrown). # Desired: all.equal(unname(glm_pl_new), unname(glm_pl_new_man + matrix(kidiq_gr_new$offs_col, nrow = nrow(glm_drws), ncol = nrow(kidiq_gr_new), byrow = TRUE)), tolerance = 1e-15) ## --> With rstanarm v2.21.2 and v2.21.3: "Mean relative difference: 0.009612115"

stan-dev / rstanarm