stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
385 stars 132 forks source link

`stan_gamm4` with `random = NULL` and `offset()` formula term #546

Open fweber144 opened 3 years ago

fweber144 commented 3 years ago

Summary:

A stan_gamm4() fit with an offset() formula term seems to be handled incorrectly when posterior_linpred() (for example) is called with newdata but without offset. This is not a duplicate of #253 since #253 had multilevel terms included (whereas here, random = NULL).

Description:

This is similar to #541, but now for stan_gamm4 with random = NULL (instead of stan_glmer()) and for an offset() formula term (instead of the offset argument). I'll be repeating an adapted issue description from #541 for the sake of completeness:

First, unlike a stan_glm() fit, a stan_gamm4() fit with random = NULL and with offsets specified via an offset() formula term doesn't produce an appropriate warning when posterior_linpred() (for example) is called with newdata but without offset. Secondly (and more importantly), in that posterior_linpred() call, the stan_glm() and the stan_gamm4() fit differ in what is added to the linear predictor: For stan_glm(), a vector of zeros (and of appropriate length) is added, while for stan_gamm4(), the original offsets are recycled so that they match the number of observations in newdata.

Reproducible Steps:

# Original source: Example section of `?rstanarm::stan_gamm4`.
library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
dat <- mgcv::gamSim(1, n = 400, scale = 2) ## simulate 4 term additive truth
## Now add 20 level random effect`fac'...
dat$fac <- fac <- as.factor(sample(1:20, 400, replace = TRUE))
dat$y <- dat$y + model.matrix(~ fac - 1) %*% rnorm(20) * .5
gam_fit <- stan_gamm4(y ~ s(x0) + x1 + s(x2) + offset(x3),
                      data = dat,
                      chains = 1,
                      iter = 500, # for example speed
                      seed = 1140350788)

dat_new <- head(dat, 3)
gam_pl <- posterior_linpred(gam_fit, newdata = dat_new)

The last line throws the warning

In sweep(eta, 2L, offset, `+`) :
  STATS is longer than the extent of 'dim(x)[MARGIN]'

In contrast, with stan_glm(), an appropriate warning is thrown:

glm_fit <- stan_glm(y ~ x0 + x1 + x2 + offset(x3),
                    data = dat,
                    chains = 1,
                    iter = 500, # for example speed
                    seed = 1140350788)
glm_pl <- posterior_linpred(glm_fit, newdata = dat_new)

namely

'offset' argument is NULL but it looks like you estimated the model using an offset term.

Calling debug(rstanarm:::linear_predictor) before those posterior_linpred() calls reveals that for the stan_glm() fit, offset is a vector of 3 zeros, while for the stan_gamm4() fit, offset is the original offset vector (of length nrow(dat) = 400).

RStanARM Version:

2.21.2 (from https://mc-stan.org/r-packages/)

R Version:

4.1.1

Operating System:

Ubuntu 20.04.2