AMEs for logistic model and binomial model don't match

alesvomacka commented 1 year ago

Hi, I'm not sure if this is the right place to ask - if not, I'd appreciate being pointed in the right direction.

I'm trying to match results of logistic regression based on "expanded" data with binomial regression based on aggregated data. The idea is that working with aggregate data is much more computationally efficient, especially with big samples or bayesian models, while the results should be identical, because the underlying data generating process is the same.

The problem: While both models give the same predictions, their AMEs don't match.

Models

I've estimated two models using the Palmer penguins dataset. The first model is classical binary logistic regression predicting sex using body mass. The second is a binomial model predicting number of "success", i.e. penguin occurrence based on sex and body mass:

library(tidyverse)
library(palmerpenguins)
library(marginaleffects)

m_logit <- glm(sex ~ body_mass_g, data = penguins, family = binomial())

m_binom <- penguins |> 
  mutate(body_mass_g = as.factor(body_mass_g)) |> 
  count(sex, body_mass_g, name = "success", .drop = FALSE) |> 
  mutate(body_mass_g = as.numeric(as.character(body_mass_g))) |> 
  filter(!is.na(sex)) |> 
  mutate(total = sum(success),
         fail = total - success,
         .by = body_mass_g) |> 
  glm(cbind(success, fail) ~ sex * body_mass_g, data = _, family = binomial())

These two models give the same predicted values. However, I've hit a roadblock when trying to compute average marginal effects.

AMEs don't match

The binary logistic model gives correct AME for body mass:

avg_slopes(m_logit, variables = "body_mass_g")
 #       Term Estimate Std. Error    z Pr(>|z|)    2.5 % 97.5 %
 #body_mass_g 0.000253   2.36e-05 10.7   <0.001 0.000207  3e-04

However, AME for the second model is incorrect, because the individual level ME for males and females cancel each other out:

avg_slopes(m_binom, variables = "body_mass_g")
 #       Term  Estimate Std. Error         z Pr(>|z|)     2.5 %   97.5 %
 #body_mass_g -9.69e-19   1.44e-05 -6.75e-14        1 -2.82e-05 2.82e-05

Group level AMEs match

I can get (almost exactly) matching results by computing AME for a specific sex:

avg_slopes(m_logit,
           newdata = datagrid(sex = "male",
                              body_mass_g = unique))
 #       Term Estimate Std. Error    z Pr(>|z|)    2.5 %   97.5 %
 #body_mass_g 0.000243   2.05e-05 11.8   <0.001 0.000202 0.000283

avg_slopes(m_binom,
           newdata = datagrid(sex = "male",
                              body_mass_g = unique),
           variables = "body_mass_g")
 #       Term Estimate Std. Error    z Pr(>|z|)    2.5 %   97.5 %
# body_mass_g 0.000242   2.03e-05 11.9   <0.001 0.000202 0.000282

Unfortunately, while the group level AMEs match across models, they don't match the "global" AME (AME = 0.000253 vs G-AME = 0.000242).

How to get correct "global" AME from the binomial model?

Is there a way to get correct "global" AME from the binomial model? I'm honestly not sure if the mismatch is due to me using this package incorrectly or if it's an unintuitive quirk of binomial models.

Thanks for help!

Technical info

R version 4.3.0 Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Ventura 13.4

Packages:

tidyverse v2.0.
dplyr v1.1.2
marginaleffects v0.11.1
palmerpenguins v0.1.1

vincentarelbundock commented 1 year ago

You need to specify weights (counts/N) with the wts argument.

alesvomacka commented 1 year ago

@vincentarelbundock thank you! Using weights worked. For future reference, here is the code:

n_obs <- nrow(penguins[!is.na(penguins$sex), ]) #total number of observation in the model

penguins_agg <- penguins |> 
  filter(!is.na(sex)) |> 
  mutate(body_mass_g = as.factor(body_mass_g)) |> 
  count(sex, body_mass_g, name = "success", .drop = FALSE) |> 
  mutate(body_mass_g = as.numeric(as.character(body_mass_g))) |> 
  mutate(total = sum(success),
         fail = total - success,
         .by = body_mass_g) 

m_binom <- glm(cbind(success, fail) ~ sex * body_mass_g, data = penguins_agg, family = binomial())

w <- penguins_agg |> 
  mutate(weight = sum(success) / n_obs,
         .by = body_mass_g) |> 
  pull()

avg_slopes(m_binom, variables = "body_mass_g", wts = w,
           by = "sex")
 #      Term    Contrast    sex  Estimate Std. Error     z Pr(>|z|)     2.5 %    97.5 %
 #body_mass_g mean(dY/dX) female -0.000253   2.37e-05 -10.7   <0.001 -0.000300 -0.000207
 #body_mass_g mean(dY/dX) male    0.000253   2.36e-05  10.7   <0.001  0.000207  0.000300

vincentarelbundock / marginaleffects