vincentarelbundock / pymarginaleffects

GNU General Public License v3.0
47 stars 8 forks source link

avg_comparisons and avg_predictions not returning expected answers #106

Closed Dpananos closed 1 week ago

Dpananos commented 1 week ago

I've simulated some data for a talk I'm giving. The data is from an experiment with a binary exposure and a binary outcome. My goal is to estimate the "lift" (risk difference between treatment and control, divided by risk in control) using marginaleffects.

However, I'm not getting the answer I expect and was hoping to determine if it is a user error or not.

Below, I set up some data and create my sample. I provide the sampling weights weights in my dataframe d. Using these weights, I can construct the expected outcome under treatment or control using weighted.avg as follows

library(tidyverse)
library(marginaleffects)

set.seed(0)
# Set up some data
d <- tibble::tribble(
  ~device,         ~lang, ~pop_weight, ~weight, ~p_control, ~p_treat,
  "Desktop",     "English",         0.4,     0.8,        0.1,     0.15,
  "Desktop", "Non-English",         0.2,     0.1,       0.04,     0.02,
  "Mobile",     "English",         0.3,    0.05,       0.08,     0.03,
  "Mobile", "Non-English",         0.1,    0.05,       0.01,     0.01
)

# Create a sample from a hypothetical experiment
exp_d <- d %>% 
  sample_n(1000000, replace=T, weight = weight) %>% 
  mutate(
    trt = sample(c('treat','control'), size = n(), replace = T),
    p = case_match(trt, 'treat' ~ p_treat, .default = p_control),
    y = rbinom(n(), 1, p)
  )

# Based on the true weights and probabilities in each group, the risk difference should be

with(d, weighted.mean(p_treat - p_control, w=weight))
#> [1] 0.0355

# We get an answer similar to this with a linear model, so I'm confident in my math above
lm(y ~ trt, data=exp_d)
#> 
#> Call:
#> lm(formula = y ~ trt, data = exp_d)
#> 
#> Coefficients:
#> (Intercept)     trttreat  
#>     0.08855      0.03670

Using weighted_avg, the "lift" should be around 40%.


# Using the same logic, the lift should be 40%
with(d, weighted.mean(p_treat - p_control, w=weight) / weighted.mean(p_control, w=weight))
#> [1] 0.4011299

Fitting a logistic regression with treatment as the only covariate, I can also recover the lift.


# Now, fit a logistic regression and estimate the lift with a function

fit <- glm(y ~ trt, data=exp_d, family = binomial())

avg_comparisons(
  fit,
  variables = 'trt',
  comparison = 'lift',
 newdata = datagrid(trt = c('treat', 'control'), grid_type = 'counterfactual') )

#> 
#>  Term Contrast Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
#>   trt     lift    0.414    0.00831 49.9   <0.001 Inf 0.398  0.431
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

But as soon as I add interactions into my model, the lift is completely wrong.


# Now, include interactions in the model

fit <- glm(y ~ trt*lang*device, data=exp_d, family = binomial())

# However, now the lift is very far off from what it was
avg_comparisons(
  fit,
  variables = 'trt',
  comparison = 'lift',
  newdata = datagrid(trt = c('treat', 'control'), grid_type = 'counterfactual')
)
#> 
#>  Term Contrast Estimate Std. Error    z Pr(>|z|)      S 2.5 % 97.5 %
#>   trt     lift    0.327    0.00872 37.5   <0.001 1020.7  0.31  0.344
#> 
#> Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

Indeed, the avg_predictions don't match what the actual average predictions should be


# Take a look at the average predictions.
# Don't match what they should be
avg_predictions(
  fit,
  by = 'trt',
  newdata = datagrid(trt = c('treat', 'control'), grid_type = 'counterfactual')
)
#> 
#>      trt Estimate Pr(>|z|)   S  2.5 % 97.5 %
#>  treat     0.1021   <0.001 Inf 0.1010 0.1032
#>  control   0.0812   <0.001 Inf 0.0804 0.0821
#> 
#> Columns: trt, estimate, p.value, s.value, conf.low, conf.high 
#> Type:  invlink(link)

d %>% 
  summarise_at(vars(starts_with('p')), ~weighted.mean(., w=weight))
#> # A tibble: 1 × 3
#>   pop_weight p_control p_treat
#>        <dbl>     <dbl>   <dbl>
#> 1       0.36    0.0885   0.124

Created on 2024-06-29 with reprex v2.1.0

While the avg_predictions depend on the estimated probability in each strata, it looks like fit accurately estimates the risk in strata defined by device and lang, so I don't think that is the issue.

Dpananos commented 1 week ago

If I use comparison="lnratioavg" and transform=exp I get an answer closer to what I expect


# However, now the lift is very far off from what it was
avg_comparisons(
  fit,
  variables = 'trt',
  comparison = 'lnratioavg',
  transform = exp,
  newdata = datagrid(trt = c('treat', 'control'), grid_type = 'counterfactual')
)

 Term                        Contrast Estimate Pr(>|z|)   S 2.5 % 97.5 %
  trt ln(mean(treat) / mean(control))     1.41   <0.001 Inf   1.4   1.43

Columns: term, contrast, estimate, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
Type:  response 
Dpananos commented 1 week ago

These approaches also give the right answer

> datagrid(newdata=exp_d, trt = c('treat', 'control'), grid_type = 'counterfactual') %>% 
+   modelr::add_predictions(fit, type='response') %>% 
+   group_by(trt) %>% 
+   summarise(mean(pred))
# A tibble: 2 × 2
  trt     `mean(pred)`
  <chr>          <dbl>
1 control       0.0886
2 treat         0.125 
> datagrid(model=fit, trt = c('treat', 'control'), grid_type = 'counterfactual') %>% 
+   modelr::add_predictions(fit, type='response') %>% 
+   group_by(trt) %>% 
+   summarise(mean(pred))
# A tibble: 2 × 2
  trt     `mean(pred)`
  <chr>          <dbl>
1 control       0.0886
2 treat         0.125