vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com

Likely disagreement between plot_predictions() and avg_comparisons() with two categorical variables in an interaction term #738

Closed tonisoto closed 1 year ago

tonisoto commented 1 year ago

I found this summary of marginaleffects::avg_comparisons() by @PhDemetri very clear and useful, and in line with the by argument example at the bottom of the reference page:

However, if I translate @PhDemetri's example, substituting categorical predictors for numerical ones, the results are weird.

If I fit a brms beta-zoi mixed model including an interaction term, GENDER (F/M) * STEM_CHOICE (4 levels), and plot predictions (as @PhDemetri did) with:

plot_predictions(fit, by = c("GENDER", "STEM_CHOICE"))

.. I get the figure I've just shared in this tweet (I could not upload it here), with a large (and significant) difference between SCIENCE and NoSTEM in both boys and girls:

  GENDER STEM_CHOICE  estimate  conf.low conf.high
1      F   UNDECIDED 0.7953682 0.7250717 0.8550434
2      F      NoSTEM 0.7807114 0.7461759 0.8138231
3      F     SCIENCE 0.8725138 0.8553751 0.8887944
4      F        TECH 0.8364045 0.8183127 0.8530912
5      M   UNDECIDED 0.7616019 0.7163880 0.8036248
6      M      NoSTEM 0.7769111 0.7468556 0.8065620
7      M     SCIENCE 0.8546280 0.8340399 0.8739704
8      M        TECH 0.8387133 0.8219018 0.8540801

Following @PhDemetri's explanation, the following code should give exactly the difference between SCIENCE and NoSTEM for females:

avg_comparisons(fit, newdata = datagrid(GENDER = "F"), variables = list(STEM_CHOICE = c("NoSTEM", "SCIENCE")))

.. which should be a difference of 0.0918024, obtained by subtracting 0.7807114 (blue dashed line in the plot) from 0.8725138 (dark green dashed line in the plot). However, avg_comparisons() gives 0.04, 95% CI [-0.02, 0.10], which does not correspond to plot_predictions().

       Term         Contrast Estimate   2.5 % 97.5 %
 STEM_CHOICE SCIENCE - NoSTEM   0.0403 -0.0176  0.101
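Just to show where my expectation comes from, the gap can be computed directly from the two plotted point estimates in the predictions table above:

```r
# Hand-computed gap between the two plotted point estimates
# (SCIENCE vs NoSTEM among females), taken from the predictions table above
science_f <- 0.8725138
nostem_f  <- 0.7807114
science_f - nostem_f
#> [1] 0.0918024
```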

In sum, unless I have missed something important (e.g., that I'm fitting a Bayesian mixed model, a particular brms family, or that I'm working with an interaction between two categorical variables), I would expect a larger estimate (0.09; see the distance between the dashed lines in the figure) and an interval that excludes zero for that comparison.

I repeated the same test after removing the other predictors, fitting the same model with a simpler formula, OUTCOME ~ STEM_CHOICE * GENDER + (1 | FACTOR1) + (1 | IDnumber), and it yielded the same disagreement between plot_predictions() and avg_comparisons().

I hope I'm not missing something important in my attempt to describe the issue concisely. Unfortunately, the mtcars dataset is not well suited to reproducing my model, but if needed I can send my own dataset to reproduce this issue. I updated yesterday to marginaleffects 0.11.1, and my sessionInfo() output is below:

Thanks in advance :-)

Toni Soto (@ToniSoto_Vigo)

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister
 Normal:  Inversion
 Sample:  Rounding

locale:
[1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8    LC_MONETARY=Spanish_Spain.utf8
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bayestestR_0.13.0 modelsummary_1.3.0 marginaleffects_0.11.1 broom.mixed_0.2.9.4
[5] broom_1.0.4 tidybayes_3.0.4 brms_2.19.0 Rcpp_1.0.10
[9] lubridate_1.9.1 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.0
[13] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.1.8
[17] ggplot2_3.4.1 tidyverse_2.0.0 skimr_2.1.5 lmerTest_3.1-3
[21] lme4_1.1-31 BayesFactor_0.9.12-4.4 Matrix_1.5-1 coda_0.19-4

loaded via a namespace (and not attached):
[1] Hmisc_4.7-2 svglite_2.1.1 mclogit_0.9.6 class_7.3-20 ps_1.7.2
[6] foreach_1.5.2 crayon_1.5.2 MASS_7.3-58.1 nlme_3.1-160 backports_1.4.1
[11] posterior_1.4.0 bench_1.1.2 colourpicker_1.2.0 rlang_1.0.6 ROCR_1.0-11
[16] performance_0.10.2 SparseM_1.81 nloptr_2.0.3 extrafontdb_1.0 callr_3.7.3
[21] collapse_1.9.2 extrafont_0.19 glue_1.6.2 loo_2.5.1 bayesrules_0.0.2
[26] sjPlot_2.8.13 pbkrtest_0.5.2 rstan_2.21.8 parallel_4.2.2 processx_3.8.0
[31] tidyselect_1.2.0 usethis_2.1.6 interactions_1.1.5 zoo_1.8-11 sysfonts_0.8.8
[36] ggpubr_0.6.0 sjmisc_2.8.9 distributional_0.3.1 xtable_1.8-4 MatrixModels_0.5-1
[41] magrittr_2.0.3 evaluate_0.20 gdtools_0.3.0 cli_3.6.0 rstudioapi_0.14
[46] miniUI_0.1.1.1 furrr_0.3.1 rpart_4.1.19 jtools_2.2.1 sjlabelled_1.2.0
[51] EnvStats_2.7.0 effsize_0.8.1 svUnit_1.0.6 shinystan_2.6.0 shiny_1.7.4
[56] xfun_0.36 parameters_0.20.2 inline_0.3.19 pkgbuild_1.4.0 cluster_2.1.4
[61] caTools_1.18.2 bridgesampling_1.1-2 gfonts_0.2.0 Brobdingnag_1.2-9 quantreg_5.94
[66] threejs_0.3.3 listenv_0.9.0 png_0.1-8 reshape_0.8.9 future_1.30.0
[71] withr_2.5.0 showtextdb_3.0 bitops_1.0-7 ranger_0.14.1 plyr_1.8.8
[76] e1071_1.7-13 pillar_1.8.1 RcppParallel_5.1.7 gplots_3.1.3 cachem_1.0.6
[81] multcomp_1.4-20 fs_1.6.0 partykit_1.2-16 xts_0.12.2 vctrs_0.5.2
[86] ellipsis_0.3.2 generics_0.1.3 dygraphs_1.1.1.6 devtools_2.4.5 outliers_0.15
[91] tools_4.2.2 lsr_0.5.2 foreign_0.8-83 entropy_1.3.1 funModeling_1.9.4
[96] munsell_0.5.0 emmeans_1.8.4-1 proxy_0.4-27 pkgload_1.3.2 fastmap_1.1.0
[101] compiler_4.2.2 abind_1.4-5 httpuv_1.6.8 DataCombine_0.2.21 sessioninfo_1.2.2
[106] gridExtra_2.3 ppcor_1.1 ggpp_0.5.0 arrayhelpers_1.1-0 lattice_0.20-45
[111] deldir_1.0-6 utf8_1.2.3 later_1.3.0 jsonlite_1.8.4 arm_1.13-1
[116] GGally_2.1.2 scales_1.2.1 lsmeans_2.30-0 hrbrthemes_0.8.0 pbapply_1.7-0
[121] carData_3.0-5 estimability_1.4.1 lazyeval_0.2.2 promises_1.2.0.1 doParallel_1.0.17
[126] car_3.1-1 latticeExtra_0.6-30 effectsize_0.8.3 checkmate_2.1.0 rmarkdown_2.20
[131] sandwich_3.0-2 blme_1.0-5 moments_0.14.1 webshot_0.5.4 pander_0.6.5
[136] igraph_1.3.5 survival_3.4-0 numDeriv_2016.8-1.1 yaml_2.3.6 plotrix_3.8-2
[141] systemfonts_1.0.4 extraDistr_1.9.1 bmlm_1.3.12 bayesplot_1.10.0 htmltools_0.5.4
[146] rstantools_2.3.1 memoise_2.0.1 dlookr_0.6.1 fastGHQuad_1.0.1 profvis_0.3.7
[151] viridisLite_0.4.1 digest_0.6.31 mime_0.12 repr_1.1.5 DHARMa_0.4.6
[156] Rttf2pt1_1.3.12 remotes_2.4.2 data.table_1.14.6 urlchecker_1.0.1 labeling_0.4.2
[161] shinythemes_1.2.0 splines_4.2.2 Formula_1.2-4 reactable_0.4.3 showtext_0.9-5
[166] report_0.5.5 memisc_0.99.31.3 hms_1.1.2 modelr_0.1.10 colorspace_2.0-3
[171] base64enc_0.1-3 mnormt_2.1.1 libcoin_1.0-9 inum_1.0-4 nnet_7.3-18
[176] groupdata2_2.0.2 binom_1.1-1.1 mvtnorm_1.1-3 fansi_1.0.3 tzdb_0.3.0
[181] tables_0.9.10 parallelly_1.34.0 R6_2.5.1 grid_4.2.2 crul_1.3
[186] lifecycle_1.0.3 StanHeaders_2.21.0-7 datawizard_0.7.0 EMAtools_0.1.4 curl_5.0.0
[191] ggsignif_0.6.4 minqa_1.2.5 snakecase_0.11.0 robustbase_0.95-0 TH.data_1.1-1
[196] RColorBrewer_1.1-3 iterators_1.0.14 htmlwidgets_1.6.1 markdown_1.5 crosstalk_1.2.0
[201] timechange_0.2.0 robustlmm_3.1-2 rvest_1.0.3 mgcv_1.8-41 rstanarm_2.21.3
[206] globals_0.16.2 pagedown_0.20 insight_0.19.1 htmlTable_2.4.1 tensorA_0.36.2
[211] codetools_0.2-18 matrixStats_0.63.0 phia_0.2-1 gtools_3.9.4 prettyunits_1.1.1
[216] psych_2.2.9 ggdist_3.2.1 correlation_0.8.3 gtable_0.3.1 DBI_1.1.3
[221] ggpmisc_0.5.2 stats4_4.2.2 KernSmooth_2.23-20 httr_1.4.4 stringi_1.7.12
[226] reshape2_1.4.4 farver_2.1.1 simr_1.0.6 see_0.7.4 DT_0.27
[231] xml2_1.3.3 boot_1.3-28 shinyjs_2.1.0 kableExtra_1.3.4 ggeffects_1.2.0
[236] interp_1.1-3 DEoptimR_1.0-11 equatiomatic_0.3.1 sjstats_0.18.2 jpeg_0.1-10
[241] janitor_2.2.0 pkgconfig_2.0.3 rstatix_0.7.2 RLRsim_3.1-8 cmdstanr_0.5.3
[246] knitr_1.42 merTools_0.6.1 httpcode_0.3.0

vincentarelbundock commented 1 year ago

Can you supply an ULTRA-MINIMALIST reproducible example using public data? Maybe one of the datasets on the Rdatasets repository: https://vincentarelbundock.github.io/Rdatasets/articles/data.html

vincentarelbundock commented 1 year ago

Also can you show me the output of avg_comparisons() by gender?

tonisoto commented 1 year ago

Hi again Vincent,

First, the output of my avg_comparisons() by gender (after removing other unrelated predictors):

avg_comparisons(fit, by="GENDER")

        Term                        Contrast GENDER Estimate    2.5 %  97.5 %
 STEM_CHOICE  mean(NoSTEM) - mean(UNDECIDED)      F -0.01376 -0.08422 0.06994
 STEM_CHOICE  mean(NoSTEM) - mean(UNDECIDED)      M -0.01079 -0.06863 0.05034
 STEM_CHOICE mean(SCIENCE) - mean(UNDECIDED)      F  0.01799 -0.04528 0.09889
 STEM_CHOICE mean(SCIENCE) - mean(UNDECIDED)      M  0.02284 -0.02860 0.08018
 STEM_CHOICE    mean(TECH) - mean(UNDECIDED)      F  0.01604 -0.04451 0.09353
 STEM_CHOICE    mean(TECH) - mean(UNDECIDED)      M  0.02526 -0.02418 0.08017
      GENDER               mean(M) - mean(F)      F  0.00795 -0.00884 0.02415
      GENDER               mean(M) - mean(F)      M  0.00969 -0.00945 0.02872

Second, the reproducible example using public data: I managed to reproduce the issue with data("Fatalities"):

Fatalities$state <- as.factor(Fatalities$state)
Fatalities$year <- as.factor(Fatalities$year)
Fatalities$breath <- as.factor(Fatalities$breath)

fit <- brm(income ~ year * breath + (1 | state), data = Fatalities,
           control = list(adapt_delta = 0.95), cores = 4, backend = "cmdstanr")

plot_predictions(fit, by = c("year", "breath"), draw = TRUE)
plot_predictions(fit, by = c("year", "breath"), draw = FALSE)

   year breath estimate conf.low conf.high
1  1982     no 12973.66 12785.77  13162.95
2  1982    yes 13034.28 12786.65  13266.36
3  1983     no 12975.01 12781.18  13170.85
4  1983    yes 13260.92 13062.47  13462.56
5  1984     no 13465.54 13268.05  13659.19
6  1984    yes 13720.12 13508.11  13935.58
7  1985     no 13709.81 13510.15  13908.61
8  1985    yes 14003.13 13781.59  14216.47
9  1986     no 13952.47 13747.88  14151.80
10 1986    yes 14441.48 14227.07  14641.88
11 1987     no 14179.86 13976.33  14379.25
12 1987    yes 14920.59 14721.66  15126.62
13 1988     no 14561.94 14357.52  14769.16
14 1988    yes 15226.32 15026.13  15431.05

avg_comparisons(fit, newdata = datagrid(year = "1988"), variables = list(breath = c("no", "yes")))

   Term Contrast Estimate 2.5 % 97.5 %
 breath yes - no      220  -185    634

If I'm not wrong, the estimate should be 664.38 (15226.32 - 14561.94 = 664.38) and statistically significant according to plot_predictions().

[figure: plot of plot_predictions(fit, by = c("year", "breath"))]

Thank you so much for your time Vincent!

vincentarelbundock commented 1 year ago

I don't see a bug in your reproducible example. The commands you are comparing are simply estimating different quantities, intentionally and in a well-documented way.

Also note that in Bayesian models, some minor differences can arise between posterior means and medians. See the Bayesian vignette for details.
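One way to see the mean-versus-median point concretely is to work with the posterior draws of the contrast directly. A sketch, assuming the brms fit from your reproducible example, and that marginaleffects summarizes Bayesian estimates with the posterior median by default (as described in the Bayesian vignette):

```r
library(marginaleffects)

# Contrast evaluated at the representative 1988 grid row
cmp <- avg_comparisons(fit,
    newdata = datagrid(year = "1988"),
    variables = list(breath = c("no", "yes")))

# posterior_draws() returns one row per posterior draw of the contrast,
# so the two summaries can be compared by hand
draws <- posterior_draws(cmp)
c(mean = mean(draws$draw), median = median(draws$draw))
```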

Here are some equivalences to get you started as you explore what is possible with the package:

library(AER)
library(brms)
library(marginaleffects)
data(Fatalities)
Fatalities$state <- as.factor(Fatalities$state)
Fatalities$year <- as.factor(Fatalities$year)
Fatalities$breath <- as.factor(Fatalities$breath)

fit <- brm(income ~ year * breath + (1 | state),
    data = Fatalities, control = list(adapt_delta = 0.95), cores = 4)

# equivalence 1
nd = datagrid(model = fit, year = "1988")
avg_comparisons(fit,
    newdata = nd,
    variables = list(breath = c("no", "yes")))

avg_predictions(fit, variables = "breath", newdata = nd)

avg_predictions(fit, variables = "breath", newdata = nd)$estimate |>
    diff()

# equivalence 2
plot_predictions(
    fit,
    by = c("year", "breath"),
    draw = FALSE)

predictions(
    fit,
    by = c("year", "breath"))

# equivalence 3
avg_comparisons(fit, variables = "breath", by = "year")

avg_predictions(fit, variables = "breath", by = c("breath", "year")) |> 
    dplyr::group_by(year) |>
    dplyr::summarize(estimate = diff(estimate))

tonisoto commented 1 year ago

Ok, I will review again all my notes and the Bayesian vignette one more time.

However, to close this issue, how should I read the plot? Can I safely argue that since 1986 the difference in income between breath = "yes" and breath = "no" is larger and significant? And, if so, which difference in income between the two conditions should I report for 1988: 664.38, as shown in the plot, or 220, 95% CI [-185, 634], as reported by avg_comparisons()?

vincentarelbundock commented 1 year ago

No. I think that if you want to make a claim about the "difference in income", you should use the *_comparisons() family of functions. Then, the next question is: at what covariate values are we estimating the contrast?
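That question has several reasonable answers, and each one targets a different quantity. A sketch of three common choices, assuming the Fatalities fit from the reproducible example above:

```r
# 1) At a single representative grid row: covariates other than `year` are
#    held at their mean or mode, and the contrast is evaluated there.
avg_comparisons(fit,
    variables = "breath",
    newdata = datagrid(year = "1988"))

# 2) Unit-level contrasts averaged within each year
#    ("equivalence 3" above).
avg_comparisons(fit, variables = "breath", by = "year")

# 3) Unit-level contrasts averaged over the observed 1988 rows only.
avg_comparisons(fit,
    variables = "breath",
    newdata = subset(Fatalities, year == "1988"))
```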