vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com
Other
435 stars 45 forks source link

AMEs for logistic model and binomial model don't match #813

Closed alesvomacka closed 1 year ago

alesvomacka commented 1 year ago

Hi, I'm not sure if this is the right place to ask - if not, I'd appreciate being pointed in the right direction.

I'm trying to match results of logistic regression based on "expanded" data with binomial regression based on aggregated data. The idea is that working with aggregate data is much more computationally efficient, especially with big samples or bayesian models, while the results should be identical, because the underlying data generating process is the same.

The problem: While both models give the same predictions, their AMEs don't match.

Models

I've estimated two models using the Palmer penguins dataset. The first model is classical binary logistic regression predicting sex using body mass. The second is a binomial model predicting number of "success", i.e. penguin occurrence based on sex and body mass:

library(tidyverse)
library(palmerpenguins)
library(marginaleffects)

m_logit <- glm(sex ~ body_mass_g, data = penguins, family = binomial())

m_binom <- penguins |> 
  mutate(body_mass_g = as.factor(body_mass_g)) |> 
  count(sex, body_mass_g, name = "success", .drop = FALSE) |> 
  mutate(body_mass_g = as.numeric(as.character(body_mass_g))) |> 
  filter(!is.na(sex)) |> 
  mutate(total = sum(success),
         fail = total - success,
         .by = body_mass_g) |> 
  glm(cbind(success, fail) ~ sex * body_mass_g, data = _, family = binomial())

These two models give the same predicted values. However, I've hit a roadblock when trying to compute average marginal effects.

AMEs don't match

The binary logistic model gives correct AME for body mass:

avg_slopes(m_logit, variables = "body_mass_g")
 #       Term Estimate Std. Error    z Pr(>|z|)    2.5 % 97.5 %
 #body_mass_g 0.000253   2.36e-05 10.7   <0.001 0.000207  3e-04

However, AME for the second model is incorrect, because the individual level ME for males and females cancel each other out:

avg_slopes(m_binom, variables = "body_mass_g")
 #       Term  Estimate Std. Error         z Pr(>|z|)     2.5 %   97.5 %
 #body_mass_g -9.69e-19   1.44e-05 -6.75e-14        1 -2.82e-05 2.82e-05

Group level AMEs match

I can get (almost exactly) matching results by computing AME for a specific sex:

avg_slopes(m_logit,
           newdata = datagrid(sex = "male",
                              body_mass_g = unique))
 #       Term Estimate Std. Error    z Pr(>|z|)    2.5 %   97.5 %
 #body_mass_g 0.000243   2.05e-05 11.8   <0.001 0.000202 0.000283

avg_slopes(m_binom,
           newdata = datagrid(sex = "male",
                              body_mass_g = unique),
           variables = "body_mass_g")
 #       Term Estimate Std. Error    z Pr(>|z|)    2.5 %   97.5 %
# body_mass_g 0.000242   2.03e-05 11.9   <0.001 0.000202 0.000282

Unfortunately, while the group level AMEs match across models, they don't match the "global" AME (AME = 0.000253 vs G-AME = 0.000242).

How to get correct "global" AME from the binomial model?

Is there a way to get correct "global" AME from the binomial model? I'm honestly not sure if the mismatch is due to me using this package incorrectly or if it's an unintuitive quirk of binomial models.

Thanks for help!

Technical info

R version 4.3.0 Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Ventura 13.4

Packages:

vincentarelbundock commented 1 year ago

You need to specify weights (counts/N) with the wts argument.

alesvomacka commented 1 year ago

@vincentarelbundock thank you! Using weights worked. For future reference, here is the code:

n_obs <- nrow(penguins[!is.na(penguins$sex), ]) #total number of observation in the model

penguins_agg <- penguins |> 
  filter(!is.na(sex)) |> 
  mutate(body_mass_g = as.factor(body_mass_g)) |> 
  count(sex, body_mass_g, name = "success", .drop = FALSE) |> 
  mutate(body_mass_g = as.numeric(as.character(body_mass_g))) |> 
  mutate(total = sum(success),
         fail = total - success,
         .by = body_mass_g) 

m_binom <- glm(cbind(success, fail) ~ sex * body_mass_g, data = penguins_agg, family = binomial())

w <- penguins_agg |> 
  mutate(weight = sum(success) / n_obs,
         .by = body_mass_g) |> 
  pull()

avg_slopes(m_binom, variables = "body_mass_g", wts = w,
           by = "sex")
 #      Term    Contrast    sex  Estimate Std. Error     z Pr(>|z|)     2.5 %    97.5 %
 #body_mass_g mean(dY/dX) female -0.000253   2.37e-05 -10.7   <0.001 -0.000300 -0.000207
 #body_mass_g mean(dY/dX) male    0.000253   2.36e-05  10.7   <0.001  0.000207  0.000300