vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com

pooled avg_slopes (after mice) vs. avg_slopes on the "mira" object - small discrepancy. Is it possible to align them? #850

Closed · Generalized closed this issue 1 year ago

Generalized commented 1 year ago

Dear @vincentarelbundock, now I've run into another problem. Let me briefly explain what I'm doing:

1) I perform multiple imputation on a dataset with several columns that are used to derive a new outcome called "overall success" (it's a clinical study), which will then be compared across the 2 treatment arms.

2) the derivation (creation) of this new variable is done after the imputation, based on the completed data. Because the "post-processing" syntax of mice was too difficult for me, I did it explicitly after the imputation, iterating through the list of imputed datasets and doing some dplyr mutate() calls (a rough, anonymized sketch of this helper is shown below).

3) having created the new variable, I fit the logistic regression followed by AME to obtain the difference between the two treatment arms. Then I pool the results. Done.

So far I have done it this way. I'm sorry for the "pseudo code", but the data itself is too big to bring here... But I provide all steps and results.
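
For context, infer_overall_success is my own helper; it looks roughly like this (the component columns and the success rule are anonymized placeholders, not the real study definition):

library(dplyr)

# Sketch of my helper: combine several component columns into a single
# binary "overall success" indicator (placeholder rule shown)
infer_overall_success <- function(dat) {
  dat %>%
    mutate(
      OverallSuccess_calc_num = as.integer(
        component_a_num == 1 & component_b_num == 1  # placeholder criteria
      )
    )
}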

library(mice)            # imputation and pooling
library(marginaleffects) # avg_slopes()
library(miceafter)       # offers the pooled Wald's difference - just to validate the results

# turn the "mids" class (resulting from the imputation by mice) into just a list of complete datasets
imp_dat <- mids2milist(my_imputed_data)

# go through the list and apply the function that creates the new variables
imp_dat_success <- lapply(imp_dat, infer_overall_success)

# restore the lost class name "milist"
class(imp_dat_success) <- "milist"

# define the "worker function"

fit_logistic_overall_success <- function(dat) {
  # logistic model for overall success by treatment arm
  mod <- glm(OverallSuccess_calc_num ~ Arm_num, family = binomial(link = "logit"), data = dat)
  # average marginal effect (risk difference between the arms)
  out <- avg_slopes(mod, newdata = dat, conf_level = 0.9, numderiv = list("fdcenter", eps = 1e-10))
  return(out)
}

# run it over the imputed list...
mod_imputation <- lapply(imp_dat_success, fit_logistic_overall_success)

# Finally, pool the results.
# dfcom comes from my knowledge of how many degrees of freedom there should be (the pooled inference uses the t statistic).
summary(pool(mod_imputation, dfcom = 110), conf.int = TRUE, conf.level = 0.9)

    term contrast    estimate  std.error  statistic      df   p.value         5 %       95 %
1 Arm_num    1 - 0 -0.01730519 0.02059518 -0.8402544 87.8888 0.4030458 -0.05154214 0.01693175
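
As an extra cross-check, Rubin's rules can also be applied by hand to the per-imputation results; a minimal sketch, assuming each element of mod_imputation has a single row (as here):

# Manual Rubin's rules as a cross-check of mice::pool()
# (standard formulas; one estimate per imputed dataset)
est <- sapply(mod_imputation, function(x) x$estimate)
se  <- sapply(mod_imputation, function(x) x$std.error)
m   <- length(est)

qbar <- mean(est)               # pooled point estimate
ubar <- mean(se^2)              # within-imputation variance
b    <- var(est)                # between-imputation variance
tvar <- ubar + (1 + 1 / m) * b  # total variance

c(estimate = qbar, std.error = sqrt(tvar))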

OK! Now let's check against a dedicated function from the miceafter package (Wald's difference in two proportions, with non-pooled SEs):

# It uses the same dataset as above - with the inferred variable
ov_succ_wald <- with(imp_dat_success,
                     expr = propdiff_wald(OverallSuccess_calc_num ~ Arm_num, strata = FALSE))

# And pool the results (it cares about the df itself)
pool_propdiff_wald(ov_succ_wald, conf.level = 0.9)

     Prop diff Wald     SE       t 90 CI low 90 CI high
[1,]       -0.01731 0.0206 1.66238  -0.05154    0.01693
attr(,"class")
[1] "mipool"

Just a quick sanity check on the rounding...

# Sadly, pool_propdiff_wald rounds to 5 decimal digits. I temporarily increased this to 10 by editing the function body.

pool_propdiff_wald: -0.05154278
pooled avg_slopes:  -0.05154214

The agreement is fine.

Then I realized I could do it a bit more simply:

# Stack all complete datasets into long format
my_imputed_long <- complete(my_imputed_data, "long", include = TRUE)

# Derive the success-related variables
my_imputed_data_complete <- infer_overall_success(my_imputed_long)

# Turn it back into "mids"
my_imputed_data_complete <- as.mids(my_imputed_data_complete)

# Fit the model to each dataset
m <- with(my_imputed_data_complete, glm(OverallSuccess_calc_num ~ Arm_num, family = binomial(link = 'logit')))

# And use the new feature - work directly on the "mira"
avg_slopes(m, conf_level = 0.9, numderiv = list("fdcenter", eps = 1e-10))

   Term Contrast Estimate Std. Error   S   Df     t Pr(>|t|)   5.0 % 95.0 %
 Arm_num    1 - 0  -0.0173     0.0206 1.3 1233 -0.84    0.401 -0.0512 0.0166

Columns: term, contrast, estimate, std.error, s.value, df, statistic, p.value, conf.low, conf.high 

Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.

Now the lower CI is: -0.05120674 vs -0.05154214

I noted the message "Infinite sample size assumed". OK, so the normal approximation was used. But the Df confused me: it is 1233 rather than Inf.
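
My guess (from reading about mice::pool(), so take it with a grain of salt) is that the df comes from the Barnard-Rubin (1999) adjustment, and that "infinite sample size" is implemented as a very large dfcom rather than a literal Inf, which still yields a finite adjusted df like 1233. A sketch of the standard formula:

# Barnard-Rubin (1999) adjusted df as I understand mice applies it:
# m = number of imputations, b = between-imputation variance,
# t = total variance; a huge dfcom stands in for "infinite" sample size
barnard_rubin_df <- function(m, b, t, dfcom = 999999) {
  lambda <- (1 + 1 / m) * b / t    # fraction of missing information
  dfold  <- (m - 1) / lambda^2     # classic Rubin (1987) df
  dfobs  <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)
  dfold * dfobs / (dfold + dfobs)  # adjusted df (never exceeds dfcom)
}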

OK, let's try providing the DF manually:

avg_slopes(m, df = 110, conf_level = 0.9, numderiv = list("fdcenter", eps = 1e-10))

    Term Contrast Estimate Std. Error   S   Df     t Pr(>|t|)   5.0 % 95.0 %
 Arm_num    1 - 0  -0.0173     0.0206 1.3 1233 -0.84    0.401 -0.0512 0.0166

Columns: term, contrast, estimate, std.error, s.value, df, statistic, p.value, conf.low, conf.high 

Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.

But nothing happened: the warning remains and the results don't change.

I'm wondering whether it's possible to obtain the same results regardless of the approach: pooling the AMEs over the GLMs explicitly vs. running AME on the "mira" object holding the imputed GLMs.

I'm using the development version: ‘0.13.0.9009’

vincentarelbundock commented 1 year ago

Hi,

Unfortunately, I don’t have time to figure out this complicated situation and give support.

But the warning doesn't come from the marginaleffects package itself. Maybe it comes from mice? You may want to set options() to turn warnings into errors, and then use traceback() to track down where the problem originates.
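
For example, something along these lines (standard base R debugging, nothing specific to marginaleffects):

# Promote warnings to errors so execution stops where the warning is raised
options(warn = 2)

# Re-run the offending call; it should now stop at the warning site
avg_slopes(m, conf_level = 0.9)

# Inspect the call stack to find which package raised it
traceback()

# Restore the default behaviour
options(warn = 0)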