strengejacke / ggeffects

Estimated Marginal Means and Marginal Effects from Regression Models for ggplot2
https://strengejacke.github.io/ggeffects
Other
558 stars 35 forks source link

Population-level predictions for gam/comparable predictions for gam and glmer fits #417

Closed bbolker closed 11 months ago

bbolker commented 11 months ago

(Apologies in advance that this question is not more boiled down/minimal: I think it's mostly likely a documentation issue, but I'm having trouble figuring it out. Happy to close/repost elsewhere if there's a more appropriate venue.)

tl;dr for two equivalent models fitted respectively via (1) lme4::glmer with an intercept RE ((1|id)) and (2) mgcv::gam with s(id, bs = "re"), ggpredict() gives population-level predictions for glmer and district-1-level predictions for gam. I can see how to get predict.gam() to produce population-level predictions in this case (via exclude), but I'm having trouble passing those arguments through ggpredict().

package load/model fits

library(lme4)
library(mgcv)
library(ggeffects)
library(patchwork)
library(broom.mixed)
library(broom)
library(dplyr)

data("Contraception", package = "mlmRev")
Contraception <- transform(Contraception,
                           use_n = as.numeric(use) - 1,
                           age_sc = drop(scale(age)))

m_gam <- gam(use_n ~ poly(age_sc,2) + urban + s(district, bs = "re"),
            family = binomial,  data = Contraception, method = "ML")
m_glmer <- glmer(use_n ~ poly(age_sc,2) + urban + (1|district),
          family = binomial, data = Contraception)

comparisons of coefficients, raw predictions, etc.

## coefficients
t_gam <- tidy(m_gam, parametric = TRUE) |> select(estimate, std.error)
t_glmer <- tidy(m_glmer, effects = "fixed")|> select(estimate, std.error)
all.equal(t_gam, t_glmer, tolerance = 0.015)  ## TRUE
## observation-level predictions
all.equal(as.numeric(predict(m_gam)), unname(predict(m_glmer)), tol = 0.015) ## TRUE
## population-level predictions
nd <- with(Contraception, expand.grid(age_sc = unique(age_sc),  urban = unique(urban)))
pred_glmer <- unname(predict(m_glmer, newdata = nd, re.form = NA,
                             type = "response"))
pred_gam <- as.numeric(predict(m_gam, newdata = nd, exclude = "s(district)",
                               newdata.guaranteed = TRUE,
                               type = "response"))
all.equal(pred_glmer, pred_gam, tol = 0.015) ## TRUE

The help page for ggpredict says that I can pass arguments through to predict():

...: For ‘ggpredict()’, further arguments passed down to predict();

so I would have expected these to give equivalent answers:

p0_gam <- ggpredict(m_gam, terms = c("age_sc [all]", "urban"), exclude = "s(district)",
                                        newdata.guaranteed = TRUE)
p0_glmer <- ggpredict(m_glmer, terms = c("age_sc [all]", "urban"))

... but I can see that the predictions are different, and furthermore that the ggpredict output prints "Adjusted for: district = 0 (population-level)" in the case of the glmer prediction and Adjusted for: district = 1 in the case of the gam prediction ...

strengejacke commented 11 months ago

It's mostly due to some old code, where ... are not passed down to the gam-predict-method. Let's see if it's really that easy and I just need to add ... to the predict() call.

library(lme4)
#> Loading required package: Matrix
library(mgcv)
#> Loading required package: nlme
#> 
#> Attaching package: 'nlme'
#> The following object is masked from 'package:lme4':
#> 
#>     lmList
#> This is mgcv 1.9-0. For overview type 'help("mgcv-package")'.
library(ggeffects)

data("Contraception", package = "mlmRev")
Contraception <- transform(Contraception,
                           use_n = as.numeric(use) - 1,
                           age_sc = drop(scale(age)))

m_gam <- gam(use_n ~ poly(age_sc,2) + urban + s(district, bs = "re"),
            family = binomial,  data = Contraception, method = "ML")
m_glmer <- glmer(use_n ~ poly(age_sc,2) + urban + (1|district),
          family = binomial, data = Contraception)

p0_gam <- ggpredict(m_gam,
  terms = c("age_sc [all]", "urban"), exclude = "s(district)",
  newdata.guaranteed = TRUE
)
p0_glmer <- ggpredict(m_glmer, terms = c("age_sc [all]", "urban"))

p0_gam[1:5, ]
#> # Predicted probabilities of use_n
#> 
#> # urban = N
#> 
#> age_sc | Predicted |       95% CI
#> ---------------------------------
#>  -1.50 |      0.14 | [0.11, 0.19]
#>  -1.39 |      0.17 | [0.13, 0.21]
#>  -1.28 |      0.20 | [0.16, 0.24]
#> 
#> # urban = Y
#> 
#> age_sc | Predicted |       95% CI
#> ---------------------------------
#>  -1.50 |      0.24 | [0.18, 0.31]
#>  -1.39 |      0.28 | [0.22, 0.34]
#> 
#> Adjusted for:
#> * district = 1
p0_glmer[1:5, ]
#> # Predicted probabilities of use_n
#> 
#> # urban = N
#> 
#> age_sc | Predicted |       95% CI
#> ---------------------------------
#>  -1.50 |      0.14 | [0.11, 0.18]
#>  -1.39 |      0.17 | [0.13, 0.21]
#>  -1.28 |      0.19 | [0.16, 0.24]
#> 
#> # urban = Y
#> 
#> age_sc | Predicted |       95% CI
#> ---------------------------------
#>  -1.50 |      0.24 | [0.18, 0.30]
#>  -1.39 |      0.27 | [0.21, 0.34]
#> 
#> Adjusted for:
#> * district = 0 (population-level)

Created on 2023-12-11 with reprex v2.0.2

strengejacke commented 11 months ago

Adjusted for: district = 1 in the case of the gam prediction...

This issue is still left. I'm not very familiar with gam's, so the footnote blindly prints the value that is used when creating the data grid, no matter if that value is actually used or not. Maybe I could check for s(..., bs = "re"), and the adjust the message Adjusted for...?

bbolker commented 11 months ago

It's a bit of a special case. I don't honestly know how often people will want to include "old-style" random effects (sensu Jim Hodges, i.e. bs="re") in GAM(M)s, or more generally to exclude random effects/latent variables. In this case I was doing it to make a point for a class. (Without checking, I don't know how the "adjusted for" message is generated, or what ggpredict does in general for latent variables/smooths that aren't included in terms ...)

strengejacke commented 11 months ago

It basically looks for all variables in the model that were not defined as focal terms, and sets these remaining variables to the value at which these are "held constant" (either selects mean or mode, based on the type of the variable):

model_frame <- insight::get_data(m_gam, source = "mf")
focal_terms <- c("age_sc", "urban")
model_predictors <- colnames(model_frame)
# keep those, which we did not process yet
model_predictors <- model_predictors[!(model_predictors %in% focal_terms)]

constant_values <- lapply(model_predictors, function(x) {
  pred <- model_frame[[x]]
  if (is.factor(pred)) {
    pred <- droplevels(pred)
  }
  ggeffects:::.typical_value(
    pred,
    fun = "mean",
    weights = NULL,
    predictor = x,
    log_terms = NULL
  )
})
names(constant_values) <- model_predictors
constant_values
#> $use_n
#> [1] 0.3924509
#> 
#> $district
#> [1] "1"
bbolker commented 11 months ago

OK, I see by digging a bit that .typical_value() uses the first level of factors as the typical value by default (unless fun = "mode" is specified ...)

To be honest, I'm not sure if there's going to be an easy way to adjust the labeling without getting way into the weeds ... the bug-fix is good enough for me. Maybe (??) it's worth adding a very brief section to the "introduction_randomeffects" vignette to mention that things work slightly differently in the gam(... + s(..., bs = "re")) case?