rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
364 stars 32 forks source link

Error with model-averaged binomial GLMM (with glmmTMB): subscript out of bounds #455

Closed MarcRieraDominguez closed 1 year ago

MarcRieraDominguez commented 1 year ago

Hello Russell! Thank you for maintaining the package!

Describe the bug

emmeans::emmeans() won't work with a model-averaged binomial GLMM (class averaging, MuMIn package), when fitted with glmmTMB::glmmTMB(): Error in ME[, nms, drop = FALSE] : subscript out of bounds

I suspect it has to do with trouble accessing the data, perhaps due to shifting variable names. Unfortunately, don't know how to fix it.

The problem appears to be specific to model-averaged glmmTMB: a model-averaged binomial GLM fitted with glmmTMB throws the same error; but model-averaged lme4::glmer() (binomial GLMM) and stats::glm() (binomial GLM) work fine. I'd rather stay with glmmTMB because it allows parallelization (my data has > 61000 rows, which I have not shared. Instead, I've prepared a reprex that throws the same error).

Expected behavior

The common output of emmeans::emmeans(): estimated marginal means, with their standard errors, and Tukey contrats.

Additional context

I ran into this issue with a model-averaged binomial GLMM that related a proportion to: categorical factors (unordered), numerical variables with second-order polynomials (poly(x, 2, raw = TRUE)), and numerical variables without polynomials. My dataset is large (around 62000 rows, and not yet published), so I've devised a reprex that throws the same error message.

To reproduce

library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.2.3
library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.2.3
#> Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
#> TMB was built with Matrix version 1.6.1
#> Current Matrix version is 1.4.1
#> Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
library(lme4)
#> Loading required package: Matrix
library(MuMIn)
#> Warning: package 'MuMIn' was built under R version 4.2.3
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.2.2
#> Warning: package 'tibble' was built under R version 4.2.3
#> Warning: package 'tidyr' was built under R version 4.2.3
#> Warning: package 'purrr' was built under R version 4.2.3
#> Warning: package 'dplyr' was built under R version 4.2.3
#> Warning: package 'stringr' was built under R version 4.2.3

# Setting for MuMin
options(na.action = "na.fail")

# Create response variable for binomial GLMM
# Inspired from here (page 4): https://glmmtmb.github.io/glmmTMB/articles/glmmTMB.html
Owls <- mutate(glmmTMB::Owls,
               NCalls = if_else(SiblingNegotiation == 0, 0, 1))
summary(Owls)
#>           Nest      FoodTreatment  SexParent    ArrivalTime   
#>  Oleyes     : 52   Deprived:320   Female:245   Min.   :21.71  
#>  Montet     : 41   Satiated:279   Male  :354   1st Qu.:23.11  
#>  Etrabloz   : 34                               Median :24.38  
#>  Yvonnand   : 34                               Mean   :24.76  
#>  Champmartin: 30                               3rd Qu.:26.25  
#>  Lucens     : 29                               Max.   :29.25  
#>  (Other)    :379                                              
#>  SiblingNegotiation   BroodSize      NegPerChick     logBroodSize  
#>  Min.   : 0.00      Min.   :1.000   Min.   :0.000   Min.   :0.000  
#>  1st Qu.: 0.00      1st Qu.:4.000   1st Qu.:0.000   1st Qu.:1.386  
#>  Median : 5.00      Median :4.000   Median :1.200   Median :1.386  
#>  Mean   : 6.72      Mean   :4.392   Mean   :1.564   Mean   :1.439  
#>  3rd Qu.:11.00      3rd Qu.:5.000   3rd Qu.:2.500   3rd Qu.:1.609  
#>  Max.   :32.00      Max.   :7.000   Max.   :8.500   Max.   :1.946  
#>                                                                    
#>      NCalls      
#>  Min.   :0.0000  
#>  1st Qu.:0.0000  
#>  Median :1.0000  
#>  Mean   :0.7396  
#>  3rd Qu.:1.0000  
#>  Max.   :1.0000  
#> 
str(Owls)
#> 'data.frame':    599 obs. of  9 variables:
#>  $ Nest              : Factor w/ 27 levels "AutavauxTV","Bochet",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ FoodTreatment     : Factor w/ 2 levels "Deprived","Satiated": 1 2 1 1 1 1 1 2 1 2 ...
#>  $ SexParent         : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 1 2 1 ...
#>  $ ArrivalTime       : num  22.2 22.4 22.5 22.6 22.6 ...
#>  $ SiblingNegotiation: int  4 0 2 2 2 2 18 4 18 0 ...
#>  $ BroodSize         : int  5 5 5 5 5 5 5 5 5 5 ...
#>  $ NegPerChick       : num  0.8 0 0.4 0.4 0.4 0.4 3.6 0.8 3.6 0 ...
#>  $ logBroodSize      : num  1.61 1.61 1.61 1.61 1.61 ...
#>  $ NCalls            : num  1 0 1 1 1 1 1 1 1 0 ...

##############
### Binomial GLMM with glmmTMB
mixmod.tmb <-  glmmTMB(NCalls ~ SexParent + poly(BroodSize, 2, raw = TRUE) + FoodTreatment + (1|Nest), data=Owls, family=binomial(link = "logit"))

# Model-averaging
mixmod.tmb.d <- MuMIn::dredge(mixmod.tmb, rank = "AICc")
#> Fixed terms are "cond((Int))" and "disp((Int))"
mixmod.tmb.avg <- MuMIn::model.avg(subset(mixmod.tmb.d, delta <= 6), revised.var = TRUE, fit = TRUE)

# The glmmTMB model works fine, no need to indicated cond(variable_name)
MuMIn::getAllTerms(mixmod.tmb)
#> [1] "cond(FoodTreatment)"                 
#> [2] "cond(poly(BroodSize, 2, raw = TRUE))"
#> [3] "cond(SexParent)"                     
#> attr(,"random.terms")
#> [1] "cond(1 | Nest)"
#> attr(,"random")
#> . ~ . + cond(1 | Nest)
#> attr(,"response")
#> NCalls
#> attr(,"order")
#> [1] 3 2 1
#> attr(,"intercept")
#> cond   zi disp 
#>    1    0    1 
#> attr(,"interceptLabel")
#> [1] "cond((Int))" "disp((Int))"
#> attr(,"deps")
#>                                      cond(FoodTreatment)
#> cond(FoodTreatment)                                   NA
#> cond(poly(BroodSize, 2, raw = TRUE))               FALSE
#> cond(SexParent)                                    FALSE
#>                                      cond(poly(BroodSize, 2, raw = TRUE))
#> cond(FoodTreatment)                                                 FALSE
#> cond(poly(BroodSize, 2, raw = TRUE))                                   NA
#> cond(SexParent)                                                     FALSE
#>                                      cond(SexParent)
#> cond(FoodTreatment)                            FALSE
#> cond(poly(BroodSize, 2, raw = TRUE))           FALSE
#> cond(SexParent)                                   NA
summary(model.frame(mixmod.tmb))
#>      NCalls        SexParent  
#>  Min.   :0.0000   Female:245  
#>  1st Qu.:0.0000   Male  :354  
#>  Median :1.0000               
#>  Mean   :0.7396               
#>  3rd Qu.:1.0000               
#>  Max.   :1.0000               
#>                               
#>  poly(BroodSize, 2, raw = TRUE).1  poly(BroodSize, 2, raw = TRUE).2
#>  Min.   :1.000000    Min.   : 1.00000                              
#>  1st Qu.:4.000000    1st Qu.:16.00000                              
#>  Median :4.000000    Median :16.00000                              
#>  Mean   :4.392321    Mean   :20.61603                              
#>  3rd Qu.:5.000000    3rd Qu.:25.00000                              
#>  Max.   :7.000000    Max.   :49.00000                              
#>                                                                    
#>   FoodTreatment          Nest    
#>  Deprived:320   Oleyes     : 52  
#>  Satiated:279   Montet     : 41  
#>                 Etrabloz   : 34  
#>                 Yvonnand   : 34  
#>                 Champmartin: 30  
#>                 Lucens     : 29  
#>                 (Other)    :379
emmeans::emmeans(mixmod.tmb, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response")
#> $emmeans
#>  SexParent  prob     SE  df asymp.LCL asymp.UCL
#>  Female    0.684 0.0485 Inf     0.589     0.779
#>  Male      0.758 0.0394 Inf     0.681     0.835
#> 
#> Results are averaged over the levels of: FoodTreatment 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast      estimate     SE  df z.ratio p.value
#>  Female - Male  -0.0741 0.0413 Inf  -1.794  0.0728
#> 
#> Results are averaged over the levels of: FoodTreatment
emmeans::emmeans(mixmod.tmb, specs = pairwise ~ cond(SexParent), adjust = "Tukey", regrid = "response")
#> $emmeans
#>  SexParent  prob     SE  df asymp.LCL asymp.UCL
#>  Female    0.684 0.0485 Inf     0.589     0.779
#>  Male      0.758 0.0394 Inf     0.681     0.835
#> 
#> Results are averaged over the levels of: FoodTreatment 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast      estimate     SE  df z.ratio p.value
#>  Female - Male  -0.0741 0.0413 Inf  -1.794  0.0728
#> 
#> Results are averaged over the levels of: FoodTreatment

# For the "averaging" object, things are more tricky. For instance, model.frame() won't work
MuMIn::getAllTerms(mixmod.tmb.avg)
#> [1] "FoodTreatment"                  "poly(BroodSize, 2, raw = TRUE)"
#> [3] "SexParent"                     
#> attr(,"intercept")
#> [1] 0
#> attr(,"interceptLabel")
#> [1] "(Intercept)"
#> attr(,"response")
#> NCalls
#> attr(,"order")
#> [1] 1 2 3
#> attr(,"deps")
#>                                FoodTreatment poly(BroodSize, 2, raw = TRUE)
#> FoodTreatment                             NA                          FALSE
#> poly(BroodSize, 2, raw = TRUE)         FALSE                             NA
#> SexParent                              FALSE                          FALSE
#>                                SexParent
#> FoodTreatment                      FALSE
#> poly(BroodSize, 2, raw = TRUE)     FALSE
#> SexParent                             NA
model.frame(mixmod.tmb.avg)
#> Error in eval(predvars, data, env): objeto 'NCalls' no encontrado

# Data needs to be provided explicitly
emmeans::emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response")
#> Error in model.frame.default(formula, data = data, ...) : 
#>   'data' must be a data.frame, environment, or list
#> Error in (function (object, at, cov.reduce = mean, cov.keep = get_emm_option("cov.keep"), : Perhaps a 'data' or 'params' argument is needed

# The first impulse is to feed it the raw data, but it doesn't work
emmeans::emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls)
#> Error in ME[, nms, drop = FALSE]: subíndice fuera de  los límites

# I've tried multiple ways to feed the data to emmeans(), perhaps some are nonsense
emmeans::emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = model.matrix(mixmod.tmb.avg))
#> Error in eval(predvars, data, env): objeto 'FoodTreatment' no encontrado
emmeans::emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = model.frame(mixmod.tmb))
#> Error in poly(BroodSize, 2, raw = TRUE): objeto 'BroodSize' no encontrado
emmeans::emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = mixmod.tmb.avg$x)
#> Error in eval(predvars, data, env): objeto 'FoodTreatment' no encontrado
emmeans::emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = mixmod.tmb$frame)
#> Error in poly(BroodSize, 2, raw = TRUE): objeto 'BroodSize' no encontrado

##############
### Binomial GLM fitted with glmmTMB
mod.tmb <-  glmmTMB(NCalls ~ SexParent + poly(BroodSize, 2, raw = TRUE) + FoodTreatment, data=Owls, family=binomial(link = "logit"))

# Model-averaging
mod.tmb.d <- MuMIn::dredge(mod.tmb, rank = "AICc")
#> Fixed terms are "cond((Int))" and "disp((Int))"
mod.tmb.avg <- MuMIn::model.avg(subset(mod.tmb.d, delta <= 6), revised.var = TRUE, fit = TRUE)

# emmeans
emmeans::emmeans(mod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls)
#> Error in ME[, nms, drop = FALSE]: subíndice fuera de  los límites

##############
### A mixed-effects binomial GLMM declared with lme4::glmer works as expected
# We get warnings about convergence, but emmeans performs as expected
glmer <-  glmer(NCalls ~ SexParent + poly(BroodSize, 2, raw = TRUE) + FoodTreatment + (1|Nest), data=Owls, family=binomial(link = "logit"))
glmer.d <- MuMIn::dredge(glmer, rank = "AICc")
#> Warning in formals(fun): argument is not a function
#> Fixed term is "(Intercept)"
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.0035348 (tol = 0.002, component 1)
glmer.avg <- MuMIn::model.avg(subset(glmer.d, delta <= 6), revised.var = TRUE, fit = TRUE)
#> Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
#> Model failed to converge with max|grad| = 0.0035348 (tol = 0.002, component 1)

emmeans::emmeans(glmer, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls)
#> $emmeans
#>  SexParent  prob     SE  df asymp.LCL asymp.UCL
#>  Female    0.684 0.0483 Inf     0.589     0.779
#>  Male      0.758 0.0393 Inf     0.681     0.835
#> 
#> Results are averaged over the levels of: FoodTreatment 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast      estimate     SE  df z.ratio p.value
#>  Female - Male   -0.074 0.0411 Inf  -1.802  0.0715
#> 
#> Results are averaged over the levels of: FoodTreatment
emmeans::emmeans(glmer.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls)
#> $emmeans
#>  SexParent  prob     SE  df lower.CL upper.CL
#>  Female    0.700 0.0505 593    0.601    0.800
#>  Male      0.749 0.0417 593    0.667    0.831
#> 
#> Results are averaged over the levels of: FoodTreatment 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast      estimate     SE  df t.ratio p.value
#>  Female - Male  -0.0489 0.0481 593  -1.016  0.3101
#> 
#> Results are averaged over the levels of: FoodTreatment

##############
### A binomial GLM declared with stats::glm works as expected
glm <-  glm(NCalls ~ SexParent + poly(BroodSize, 2, raw = TRUE) + FoodTreatment, data=Owls, family=binomial(link = "logit"))
glm.d <- MuMIn::dredge(glm, rank = "AICc")
#> Warning in formals(fun): argument is not a function
#> Fixed term is "(Intercept)"
glm.avg <- MuMIn::model.avg(subset(glm.d, delta <= 6), revised.var = TRUE, fit = TRUE)

emmeans::emmeans(glm, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls)
#> $emmeans
#>  SexParent  prob     SE  df asymp.LCL asymp.UCL
#>  Female    0.668 0.0321 Inf     0.605     0.731
#>  Male      0.760 0.0246 Inf     0.711     0.808
#> 
#> Results are averaged over the levels of: FoodTreatment 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast      estimate     SE  df z.ratio p.value
#>  Female - Male  -0.0913 0.0372 Inf  -2.454  0.0141
#> 
#> Results are averaged over the levels of: FoodTreatment
emmeans::emmeans(glm.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls)
#> $emmeans
#>  SexParent  prob     SE  df lower.CL upper.CL
#>  Female    0.675 0.0359 594    0.604    0.745
#>  Male      0.756 0.0268 594    0.703    0.808
#> 
#> Results are averaged over the levels of: FoodTreatment 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast      estimate     SE  df t.ratio p.value
#>  Female - Male  -0.0808 0.0456 594  -1.772  0.0769
#> 
#> Results are averaged over the levels of: FoodTreatment

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_0.5.1   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1    
#>  [5] readr_2.1.2     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.0  
#>  [9] tidyverse_1.3.1 MuMIn_1.47.5    lme4_1.1-29     Matrix_1.4-1   
#> [13] glmmTMB_1.1.7   emmeans_1.8.9  
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr_1.4.3          jsonlite_1.8.0      splines_4.2.0      
#>  [4] modelr_0.1.8        assertthat_0.2.1    highr_0.9          
#>  [7] stats4_4.2.0        cellranger_1.1.0    yaml_2.3.5         
#> [10] numDeriv_2016.8-1.1 pillar_1.9.0        backports_1.4.1    
#> [13] lattice_0.20-45     glue_1.6.2          digest_0.6.33      
#> [16] rvest_1.0.2         minqa_1.2.4         colorspace_2.0-3   
#> [19] sandwich_3.0-1      htmltools_0.5.6     pkgconfig_2.0.3    
#> [22] broom_0.8.0         haven_2.5.0         xtable_1.8-4       
#> [25] mvtnorm_1.1-3       scales_1.2.1        tzdb_0.3.0         
#> [28] generics_0.1.2      ellipsis_0.3.2      TH.data_1.1-1      
#> [31] withr_2.5.0         TMB_1.9.6           cli_3.6.1          
#> [34] crayon_1.5.1        readxl_1.4.0        survival_3.3-1     
#> [37] magrittr_2.0.3      estimability_1.4.1  evaluate_0.15      
#> [40] fs_1.5.2            fansi_1.0.3         nlme_3.1-157       
#> [43] MASS_7.3-57         xml2_1.3.3          tools_4.2.0        
#> [46] hms_1.1.1           lifecycle_1.0.3     multcomp_1.4-19    
#> [49] munsell_0.5.0       reprex_2.0.2        compiler_4.2.0     
#> [52] rlang_1.1.1         grid_4.2.0          nloptr_2.0.1       
#> [55] rstudioapi_0.13     rmarkdown_2.14      boot_1.3-28        
#> [58] gtable_0.3.0        codetools_0.2-18    DBI_1.1.2          
#> [61] R6_2.5.1            zoo_1.8-10          lubridate_1.8.0    
#> [64] knitr_1.39          fastmap_1.1.0       utf8_1.2.2         
#> [67] stringi_1.7.6       Rcpp_1.0.11         vctrs_0.6.3        
#> [70] dbplyr_2.1.1        tidyselect_1.2.0    xfun_0.40          
#> [73] coda_0.19-4

Created on 2023-11-14 with reprex v2.0.2

Many thanks in advance!

rvlenth commented 1 year ago

This is due to the fact that some models have multiple portions and the coefficient names are typically prefixed or wrapped. In this case, we have:

> coef(mixmod.tmb.avg)
                          cond((Int))           cond(FoodTreatmentSatiated) 
                            1.7410094                            -1.8463991 
cond(poly(BroodSize, 2, raw = TRUE)1) cond(poly(BroodSize, 2, raw = TRUE)2) 
                           -0.5129341                             0.1280694 
                  cond(SexParentMale) 
                            0.4261344

i.e., the coefficient names are wrapped with cond(). This is provided for via the optional subset argument -- see the [documentation] (https://cran.r-project.org/web/packages/emmeans/vignettes/models.html#I) for the models vignette for averaging objects. Using this we have

> emmeans(mixmod.tmb.avg, specs = pairwise ~ SexParent, adjust = "Tukey", regrid = "response", data = Owls, 
+     subset = "wrap:cond")
$emmeans
 SexParent  prob     SE  df lower.CL upper.CL
 Female    0.701 0.0506 593    0.601    0.800
 Male      0.749 0.0418 593    0.667    0.832

Results are averaged over the levels of: FoodTreatment 
Confidence level used: 0.95 

$contrasts
 contrast      estimate     SE  df t.ratio p.value
 Female - Male  -0.0489 0.0483 593  -1.014  0.3112

Results are averaged over the levels of: FoodTreatment 

That said, there was a glitch in the subset code, so if using version 1.8.9 or earlier, you need to install emmeans from GitHub.

MarcRieraDominguez commented 1 year ago

Hello! You are right. I just downloaded version 1.8.9-900001 from GitHub, and emmeans() and emtrends() work fine with the subset argument you provided. I'll be closing this. Thank you very much!