Error with model-averaged hurdle models fitted through pscl

MarcRieraDominguez commented 1 year ago

Describe the bug

emmeans won't work with model-averaged hurdle models, fitted through pscl::hurdle(). The error is Error in update.default(object$formula, . ~ . + 1): need an object with call component

Expected behavior

I would expect to get the estimated marginal means. emmeans() will not complain with used on a hurdle model fitted through glmmTMB, alhtough the probabilities are not in the 0-1 interval (see reproducible example below).

Additional context

Model-averaging was done with MuMIn::model.avg(). I get error messages if I don't specify fit = TRUE inside MuMIn::model.avg(), or use emmeans() without an explicit data argument, which is already anticipated by the emmeans documentation. In my research, I would store models in a list, and use lapply() to calculate model averaging and emmeans. However, since the error could be reproduced with model outside a list, I've opted to keep it simple, and provide a reproducible example without storing models in lists. Note that while glmmTMB can fit hurdle models, pscl is substantially faster. (besides, glmmTMB binomial component calculates the probability of observing a zero, which I don't find very inutitive).

Thanks in advance!

To reproduce

library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.2.3
library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.2.3
#> Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
#> TMB was built with Matrix version 1.6.1
#> Current Matrix version is 1.4.1
#> Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
library(MuMIn)
#> Warning: package 'MuMIn' was built under R version 4.2.3
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis

# Create datasets: number of papers during a Phd
data("bioChemists", package = "pscl")

# Setting for the dredge() function
options(na.action = na.fail)

## emmeans fails when applied to hurdle models fitted through glmmTMB
mod.pscl <- pscl::hurdle(art ~ fem + mar + kid5, data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
dredge.pscl <- MuMIn::dredge(mod.pscl, rank = "AICc")
#> Fixed terms are "count_(Intercept)" and "zero_(Intercept)"
avg.pscl <- MuMIn::model.avg(dredge.pscl, revised.var = TRUE, fit = TRUE)
avg.pscl
#> 
#> Call:
#> model.avg(object = get.models(object = dredge.pscl, subset = NA), 
#>     revised.var = TRUE)
#> 
#> Component models: 
#> '1245'   '123456' '14'     '1346'   '2356'   '(Null)' '25'     '36'    
#> 
#> Coefficients: 
#>        count_(Intercept) count_femWomen count_kid5 zero_(Intercept)
#> full           0.8885741     -0.2699105 -0.0948322         1.044665
#> subset         0.8885741     -0.2699486 -0.1007132         1.044665
#>        zero_femWomen  zero_kid5 count_marMarried zero_marMarried
#> full      -0.3377742 -0.2156843       0.03171487       0.1065742
#> subset    -0.3378219 -0.2290600       0.07760844       0.2607944
avg.pscl$call
#> model.avg(object = get.models(object = dredge.pscl, subset = NA), 
#>     revised.var = TRUE)
emmeans::emmeans(avg.pscl, specs = pairwise ~ mar, adjust = "Tukey", mode = "prob0", data = bioChemists)
#> Error in update.default(object$formula, . ~ . + 1): need an object with call component

## emmeans does not fail when applied to hurdle models fitted through glmmTMB
mod.tmb <- glmmTMB::glmmTMB(art ~ fem + mar + kid5, data = bioChemists, family = truncated_poisson(link = "log"), zi = ~ fem + mar + kid5)
dredge.tmb <- MuMIn::dredge(mod.tmb, rank = "AICc")
#> Fixed terms are "cond((Int))", "zi((Int))" and "disp((Int))"
avg.tmb <- MuMIn::model.avg(dredge.tmb, revised.var = TRUE, fit = TRUE)
avg.tmb
#> 
#> Call:
#> model.avg(object = get.models(object = dredge.tmb, subset = NA), 
#>     revised.var = TRUE)
#> 
#> Component models: 
#> '12456'  '1245'   '123456' '12345'  '124'    '1256'   '1456'   '145'    
#> '1234'   '12356'  '12'     '125'    '1246'   '123'    '1235'   '13456'  
#> '126'    '1345'   '12346'  '14'     '156'    '1236'   '1'      '15'     
#> '146'    '134'    '1356'   '16'     '13'     '135'    '1346'   '136'    
#> '456'    '45'     '23456'  '2345'   '3456'   '2456'   '345'    '245'    
#> '4'      '56'     '234'    '2356'   '34'     '24'     '356'    '256'    
#> '(Null)' '5'      '23'     '235'    '46'     '2346'   '3'      '2'      
#> '35'     '25'     '6'      '236'    '346'    '246'    '36'     '26'    
#> 
#> Coefficients: 
#>        cond((Int)) cond(femWomen)  cond(kid5)  zi((Int)) zi(femWomen)  zi(kid5)
#> full     0.8849057     -0.2659466 -0.08203013 -0.9690547    0.2472904 0.1744610
#> subset   0.8849057     -0.2661091 -0.10012955 -0.9690547    0.3237021 0.2268707
#>        zi(marMarried) cond(marMarried)
#> full       -0.1245142       0.02614982
#> subset     -0.2501199       0.06987478
avg.tmb$call
#> model.avg(object = get.models(object = dredge.tmb, subset = NA), 
#>     revised.var = TRUE)
emmeans::emmeans(avg.tmb, specs = pairwise ~ mar, adjust = "Tukey", comp = "zi", type = "response", data = bioChemists)
#> $emmeans
#>  mar     rate    SE  df lower.CL upper.CL
#>  Single  2.04 0.203 907     1.67     2.48
#>  Married 1.80 0.107 907     1.60     2.02
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the log scale 
#> 
#> $contrasts
#>  contrast         ratio    SE  df null t.ratio p.value
#>  Single / Married  1.13 0.164 907    1   0.860  0.3900
#> 
#> Results are averaged over the levels of: fem 
#> Tests are performed on the log scale

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pscl_1.5.5.1  MuMIn_1.47.5  glmmTMB_1.1.7 emmeans_1.8.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10         nloptr_2.0.1        compiler_4.2.0     
#>  [4] highr_0.9           TMB_1.9.6           tools_4.2.0        
#>  [7] boot_1.3-28         lme4_1.1-29         digest_0.6.29      
#> [10] nlme_3.1-157        evaluate_0.15       lifecycle_1.0.3    
#> [13] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
#> [16] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
#> [19] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
#> [22] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
#> [25] stringr_1.5.0       knitr_1.39          fs_1.5.2           
#> [28] vctrs_0.6.3         stats4_4.2.0        grid_4.2.0         
#> [31] glue_1.6.2          survival_3.3-1      rmarkdown_2.14     
#> [34] multcomp_1.4-19     minqa_1.2.4         TH.data_1.1-1      
#> [37] magrittr_2.0.3      codetools_0.2-18    htmltools_0.5.6    
#> [40] splines_4.2.0       MASS_7.3-57         xtable_1.8-4       
#> [43] numDeriv_2016.8-1.1 sandwich_3.0-1      estimability_1.4.1 
#> [46] stringi_1.7.6       zoo_1.8-10

^{Created on 2023-09-14 with reprex v2.0.2}

rvlenth commented 1 year ago

The problem seems to be that the formula gets distorted and its class is call instead of formula:

> formula(avg.pscl)
art ~ fem + kid5 + mar | fem + kid5 + mar
> class(.Last.value)
[1] "call"

> formula(avg.tmb)
art ~ 0 + fem + kid5 + mar
> class(.Last.value)
[1] "formula"

In addition, we have

> formula(mod.pscl)
art ~ fem + mar + kid5

which is of class formula. So where the heck does that divided model formula come from? If I try to hack around it, I get a different error:

> tmp = avg.pscl
> tmp$formula = formula(mod.pscl)
> emmeans(tmp, specs = pairwise ~ mar, adjust = "Tukey", mode = "prob0", data = bioChemists)
Error in count_(Intercept) : could not find function "count_"

> coef(tmp)
count_(Intercept)    count_femWomen        count_kid5  zero_(Intercept)     zero_femWomen 
       0.88857411       -0.26994856       -0.10071323        1.04466537       -0.33782187 
        zero_kid5  count_marMarried   zero_marMarried 
      -0.22905995        0.07760844        0.26079442

where it appears that it's confused thinking that count_(Intercept) is a function. But OTOH, it's not confused with the non-averaged model

> emmeans(mod.pscl, specs = pairwise ~ mar, adjust = "Tukey", mode = "prob0", data = bioChemists)
$emmeans
 mar     emmean     SE  df lower.CL upper.CL
 Single   0.340 0.0297 907    0.282    0.399
 Married  0.284 0.0196 907    0.245    0.322

Results are averaged over the levels of: fem 
Confidence level used: 0.95 

$contrasts
 contrast         estimate     SE  df t.ratio p.value
 Single - Married   0.0563 0.0379 907   1.486  0.1376

Results are averaged over the levels of: fem 

> coef(mod.pscl)
count_(Intercept)    count_femWomen  count_marMarried        count_kid5  zero_(Intercept) 
        0.8630115        -0.2664203         0.0789141        -0.1123130         0.9631991 
    zero_femWomen   zero_marMarried         zero_kid5 
       -0.3258336         0.2642771        -0.2696642

At this time, I don't really see a way to anticipate or work around this. It seems like a glitch in model.avg

MarcRieraDominguez commented 1 year ago

Thank you for the detailed response! I'll get in touch with the developper of MuMIn, let's see.

rvlenth commented 1 year ago

The problem was so obvious I couldn't see it. We have two models here, one for the counts and one for the zeros. So we have to have a way of specifying which. So I devised a subset option (and also a formula option):

> emmeans(avg.pscl, "mar", formula = art ~ fem + kid5 + mar, subset = "count_", tran = "log", type = "response", data = bioChemists)
 mar     response     SE  df lower.CL upper.CL
 Single      2.03 0.1054 907     1.83     2.25
 Married     2.09 0.0768 907     1.95     2.25

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

> emmeans(avg.pscl, "mar", formula = art ~ fem + kid5 + mar, subset = "zero_", tran = "logit", type = "response", data = bioChemists)
 mar     response     SE  df lower.CL upper.CL
 Single     0.683 0.0285 907    0.625    0.736
 Married    0.706 0.0196 907    0.666    0.743

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the logit scale

Having subset = "cond_" or subset = "prefix:cond_"says to use the coefficients prefixed by "cond_"

And the same for avg.tmb. Here, the coeffiicient names are wrapped in cond() or zi()

> emmeans(avg.tmb, "mar", subset = "wrap:cond", tran = "log", type = "response", data = bioChemists)
 mar     rate     SE  df lower.CL upper.CL
 Single  2.04 0.1031 907     1.84     2.25
 Married 2.09 0.0756 907     1.95     2.24

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

> emmeans(avg.tmb, "mar", subset = "wrap:zi", tran = "logit", type = "response", data = bioChemists)
 mar      rate      SE  df lower.CL upper.CL
 Single  0.319 0.01867 907    0.283    0.357
 Married 0.292 0.00977 907    0.274    0.312

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the logit scale

Interesting discrepancy with the zero part. One is more or less 1 minus the other. I need to investigate that.

We can also specify subset to be a named list of indexes of the coefficients to use. I'll eventually push this up.

rvlenth commented 1 year ago

Please note in all of this, we do not have available any of the options for hurdle objects, because we are working directly with coefficients and covariances of the coefficients in the averaging object, not paying attention to what class those models are.

Another approach would be to create a fake hurdle object with the coefficients and covariances from the averaging model. We have to take pains to align the coefficients correctly.

fakeavg.pscl = mod.pscl

bhat = coef(avg.pscl, full = TRUE)
V = vcov(avg.pscl, full = TRUE)

fakeavg.pscl$coefficients = list(
    count = bhat[c(1,2,7,3)], zero = bhat[c(4,5,8,6)])
fakeavg.pscl$vcov = V[c(1,2,7,3,4,5,8,6), c(1,2,7,3,4,5,8,6)]

With this we have available all the options for regular hurdle models:

> emmeans(fakeavg.pscl, "mar", mode = "prob0")
 mar     emmean     SE  df lower.CL upper.CL
 Single   0.318 0.0284 907    0.262    0.373
 Married  0.295 0.0197 907    0.257    0.334

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
> emmeans(fakeavg.pscl, "mar", mode = "count")
 mar     emmean     SE  df lower.CL upper.CL
 Single    2.05 0.1059 907     1.84     2.25
 Married   2.11 0.0753 907     1.96     2.26

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
> emmeans(fakeavg.pscl, "mar", mode = "response")
 mar     emmean     SE  df lower.CL upper.CL
 Single    1.61 0.1021 907     1.41     1.81
 Married   1.70 0.0682 907     1.57     1.84

Results are averaged over the levels of: fem 
Confidence level used: 0.95

It seems to me that this is more fruitful and hardly more tedious. The estimates are slightly off from what I showed in the previous answer because they are averaged after re-gridding. We can get identical results if we insist on averaging on the link scale then summarizing with type = "response"

> emmeans(fakeavg.pscl, "mar", mode = "count", lin.pred = TRUE, type = "response")
 mar     count     SE  df lower.CL upper.CL
 Single   2.03 0.1054 907     1.83     2.25
 Married  2.09 0.0768 907     1.95     2.25

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the log scale

MarcRieraDominguez commented 1 year ago

Thank you very much Russell! Looking forward to these new functionalities in the next release! About the discrepancy between pscl and glmmTMB in the zero part: while pscl estimates the probability of a non-zero, glmmTMB estimates the probability of a 0 (https://github.com/glmmTMB/glmmTMB/issues/470).

rvlenth commented 1 year ago

I think this is resolved, so am closing

rvlenth / emmeans