rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
364 stars 32 forks source link

Hurdle models in pscl package: how to get truncated counts and binomial probabilities? #440

Closed MarcRieraDominguez closed 1 year ago

MarcRieraDominguez commented 1 year ago

Hi! Congratulations on the great package!

My goal is to interpret the two components of a hurdle model: the binomial hurdle (probability of observation > 0 ) and the truncated count, at the untrasformed scale (probabilities rather than logits, etc).

Given the complexity of predicting from such models, I compared results from pscl::hurdle() to glmmTMB() to make sure I was getting the desired result. Unfortunately, the model coefficient's differed slightly between packages (5th decimal position), which makes direct comparison a bit more tricky. However, the estimated means were quite different, more than coefficient difference would allow (I think). (see below for a reproducible example).

To get truncated counts, I thought the appropiate code would be emmeans(pscl::hurdle(), mode = "count"), based om the emmeans documentation, while glmmTMB documentation indicated that an appropiate code would be: emmeans(glmmTMB::glmmTMB(), mode = "cmean").

However, emmeans(pscl::hurdle(), mode = "count") was not similar to emmeans(glmmTMB::glmmTMB(), mode = "cmean"). Instead, it was similar to: emmeans(glmmTMB::glmmTMB(), component = "cond", type = "response"), which the glmmTMB documentation indicates that yields untruncated counts (page 7). Using emmeans(glmmTMB::glmmTMB(), component = "cond", regrid = "response") gave the same results as emmeans(pscl::hurdle(), mode = "count").

Regarding the obtention of binomial probabilities, I did not find guidance in package documentation. Apologies if I have misread the documentation! (this question is in fact an expansion on a previous question I asked over at Stack Overflow)

After my exploration, my question is: Is the following code appropiate to obtain estimated means at the scale of the response, when fitting hurdle models thourgh the pscl package?

As a side note, while the binomial and truncated count models could be fitted separately, I prefer to fit a hurdle model since it provides a pseudo-r2 for the two processes combined (since it produces residuals etc), and it is more compact code-wise than fitting two models.

See below a reproducible example of the behaviour of the different functions, let me know if you need further information. Many thanks for all the work you put in the package!

library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.2.3
library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.2.3
#> Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
#> TMB was built with Matrix version 1.6.1
#> Current Matrix version is 1.4.1
#> Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis

# Create datasets: number of papers during a Phd
data("bioChemists", package = "pscl")

# Declare models
pscl.hurdle <- pscl::hurdle(art ~ fem + mar, data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glmmtmb.hurdle <- glmmTMB::glmmTMB(art ~ fem + mar, data = bioChemists, family = truncated_poisson(link = "log"), zi = ~ fem + mar)

# Coefficients are similar but not identical (besides the reverse sign for the binomial part in glmmTMB)
coef(summary(pscl.hurdle))
#> $count
#>                 Estimate Std. Error    z value     Pr(>|z|)
#> (Intercept)  0.847597547 0.06408409 13.2263338 6.183060e-40
#> femWomen    -0.237351249 0.06419884 -3.6971267 2.180535e-04
#> marMarried   0.008846489 0.06694361  0.1321484 8.948669e-01
#> 
#> $zero
#>                Estimate Std. Error   z value     Pr(>|z|)
#> (Intercept)  0.90794254  0.1563982  5.805326 6.424077e-09
#> femWomen    -0.24195114  0.1492268 -1.621365 1.049394e-01
#> marMarried   0.07802262  0.1563561  0.499006 6.177752e-01
coef(summary(glmmtmb.hurdle))
#> $cond
#>                Estimate Std. Error    z value     Pr(>|z|)
#> (Intercept)  0.84758505 0.06408433 13.2260887 6.203247e-40
#> femWomen    -0.23733876 0.06419862 -3.6969449 2.182097e-04
#> marMarried   0.00886199 0.06694368  0.1323798 8.946839e-01
#> 
#> $zi
#>               Estimate Std. Error    z value     Pr(>|z|)
#> (Intercept) -0.9079429  0.1563982 -5.8053282 6.424006e-09
#> femWomen     0.2419518  0.1492268  1.6213692 1.049385e-01
#> marMarried  -0.0780230  0.1563561 -0.4990084 6.177734e-01
#> 
#> $disp
#> NULL

# However, the predicted and estimated marginal means are quite different

#### Means for truncated count: hurdle model fitted with pscl hurdle

summary(predict(pscl.hurdle, type = "count"))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.841   1.857   2.334   2.120   2.355   2.355
emmeans(pscl.hurdle, pairwise ~ mar, mode = "count")
#> $emmeans
#>  mar     emmean     SE  df lower.CL upper.CL
#>  Single    2.09 0.1136 909     1.86     2.31
#>  Married   2.11 0.0789 909     1.95     2.26
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate   SE  df t.ratio p.value
#>  Single - Married  -0.0185 0.14 909  -0.132  0.8947
#> 
#> Results are averaged over the levels of: fem

summary(predict(glmmtmb.hurdle, type = "conditional"))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   2.188   2.201   2.584   2.412   2.602   2.602
emmeans::emmeans(glmmtmb.hurdle, specs = pairwise ~ mar, component = "cmean")
#> $emmeans
#>  mar     emmean     SE  df asymp.LCL asymp.UCL
#>  Single    2.39 0.0918 Inf      2.21      2.57
#>  Married   2.40 0.0633 Inf      2.28      2.53
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df z.ratio p.value
#>  Single - Married   -0.015 0.113 Inf  -0.133  0.8945
#> 
#> Results are averaged over the levels of: fem

# Howver, the pscl estimates can also be obtained from a hurdle fitted in glmmTMB,
# the glmmTMB documentation indicates that this is a truncated count
emmeans::emmeans(glmmtmb.hurdle, specs = pairwise ~ mar, component = "cond", type = "response")
#> $emmeans
#>  mar     rate     SE  df asymp.LCL asymp.UCL
#>  Single  2.07 0.1122 Inf      1.86      2.30
#>  Married 2.09 0.0807 Inf      1.94      2.26
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the log scale 
#> 
#> $contrasts
#>  contrast         ratio     SE  df null z.ratio p.value
#>  Single / Married 0.991 0.0664 Inf    1  -0.132  0.8947
#> 
#> Results are averaged over the levels of: fem 
#> Tests are performed on the log scale

# This is even more similar than the pscl results
emmeans::emmeans(glmmtmb.hurdle, specs = pairwise ~ mar, component = "cond", regrid = "response")
#> $emmeans
#>  mar     rate     SE  df asymp.LCL asymp.UCL
#>  Single  2.09 0.1136 Inf      1.86      2.31
#>  Married 2.11 0.0789 Inf      1.95      2.26
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate   SE  df z.ratio p.value
#>  Single - Married  -0.0186 0.14 Inf  -0.133  0.8945
#> 
#> Results are averaged over the levels of: fem

#### Means for binomial

emmeans::emmeans(pscl.hurdle, specs = pairwise ~ mar, mode = "prob0", regrid = "response")
#> $emmeans
#>  mar     emmean     SE  df lower.CL upper.CL
#>  Single   0.313 0.0265 909    0.261    0.365
#>  Married  0.297 0.0191 909    0.259    0.334
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate     SE  df t.ratio p.value
#>  Single - Married   0.0165 0.0332 909   0.497  0.6195
#> 
#> Results are averaged over the levels of: fem
emmeans::emmeans(glmmtmb.hurdle, specs = pairwise ~ mar, component = "zi", regrid = "response")
#> $emmeans
#>  mar     response     SE  df asymp.LCL asymp.UCL
#>  Single     0.313 0.0265 Inf     0.261     0.365
#>  Married    0.297 0.0191 Inf     0.260     0.334
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate     SE  df z.ratio p.value
#>  Single - Married   0.0165 0.0332 Inf   0.497  0.6194
#> 
#> Results are averaged over the levels of: fem

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pscl_1.5.5.1  glmmTMB_1.1.7 emmeans_1.8.8
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10         nloptr_2.0.1        compiler_4.2.0     
#>  [4] highr_0.9           TMB_1.9.6           tools_4.2.0        
#>  [7] boot_1.3-28         lme4_1.1-29         digest_0.6.29      
#> [10] nlme_3.1-157        evaluate_0.15       lifecycle_1.0.3    
#> [13] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
#> [16] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
#> [19] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
#> [22] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
#> [25] stringr_1.5.0       knitr_1.39          fs_1.5.2           
#> [28] vctrs_0.6.3         grid_4.2.0          glue_1.6.2         
#> [31] survival_3.3-1      rmarkdown_2.14      multcomp_1.4-19    
#> [34] minqa_1.2.4         TH.data_1.1-1       magrittr_2.0.3     
#> [37] codetools_0.2-18    htmltools_0.5.6     splines_4.2.0      
#> [40] MASS_7.3-57         xtable_1.8-4        numDeriv_2016.8-1.1
#> [43] sandwich_3.0-1      estimability_1.4.1  stringi_1.7.6      
#> [46] zoo_1.8-10

Created on 2023-09-13 with reprex v2.0.2

rvlenth commented 1 year ago

It's easy indeed to get confused by all of this. But let's look at where we can get comparable results: One is by looking only at the count component:

> emmeans(pscl.hurdle, ~ mar, mode = "count", lin.pred = TRUE, type = "response")
 mar     count     SE  df lower.CL upper.CL
 Single   2.07 0.1122 909     1.86     2.31
 Married  2.09 0.0807 909     1.94     2.26

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

> emmeans(glmmtmb.hurdle, ~ mar, comp = "cond", type = "response")
 mar     rate     SE  df asymp.LCL asymp.UCL
 Single  2.07 0.1122 Inf      1.86      2.30
 Married 2.09 0.0807 Inf      1.94      2.26

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

... or at just the zero component:

> emmeans(pscl.hurdle, ~ mar, mode = "prob0")
 mar     emmean     SE  df lower.CL upper.CL
 Single   0.313 0.0265 909    0.261    0.365
 Married  0.297 0.0191 909    0.259    0.334

Results are averaged over the levels of: fem 
Confidence level used: 0.95 

> emmeans(glmmtmb.hurdle, ~ mar, comp = "zi", type = "response")
 mar     response     SE  df asymp.LCL asymp.UCL
 Single     0.313 0.0267 Inf     0.263     0.367
 Married    0.296 0.0190 Inf     0.261     0.335

Results are averaged over the levels of: fem 
Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

... or by combining the two components into the overall response mean:

> emmeans(pscl.hurdle, ~ mar, mode = "response")
 mar     emmean     SE  df lower.CL upper.CL
 Single    1.64 0.0900 909     1.47     1.82
 Married   1.69 0.0633 909     1.57     1.82

Results are averaged over the levels of: fem 
Confidence level used: 0.95 

> emmeans(glmmtmb.hurdle, ~ mar, comp = "response")
 mar     emmean     SE  df asymp.LCL asymp.UCL
 Single    1.64 0.0900 Inf      1.47      1.82
 Married   1.69 0.0633 Inf      1.57      1.82

Results are averaged over the levels of: fem 
Confidence level used: 0.95 

So I hope this will give you more faith that there is some measure of consistency.

The main thing to know is that results comparable to comp = "cmean" in the glmmTMB support is simply not available for models fitted via pscl::hurdle(). You can compute the estimates (but not their SEs) manually via the count mean divided by (1 - P(0)) based on that same count distribution (not the zi part):

> untrunc = predict(emmeans(glmmtmb.hurdle, ~ mar, comp = "cond", type = "response"))
> untrunc
[1] 2.072832 2.091283
> p0 = exp(-untrunc)
> untrunc / (1 - p0)
[1] 2.371197 2.386025

Finally, your result when you re-gridded is slightly different than without because with the regridding, the results are converted to the response scale before averaging over fem.

Actually, the above manual calculations differ slightly from what you got with comp = "cmean" for exactly the opposite reason. In the glmmTMB support, the cmean results were computed for all combinations of mar and fem, then averaged over fem:

> untrunc = predict(emmeans(glmmtmb.hurdle, ~ mar*fem, comp = "cond", type = "response"))
> (trunc = matrix(untrunc / (1 - exp(-untrunc)), nrow = 2))
         [,1]     [,2]
[1,] 2.584455 2.188083
[2,] 2.601720 2.200814
> apply(.Last.value, 1, mean)
[1] 2.386269 2.401267
MarcRieraDominguez commented 1 year ago

Thank you for the response, my faith in consistency has been restored :)