vincentarelbundock / marginaleffects

R package to compute and plot predictions, slopes, marginal means, and comparisons (contrasts, risk ratios, odds, etc.) for over 100 classes of statistical and ML models. Conduct linear and non-linear hypothesis tests, or equivalence tests. Calculate uncertainty estimates using the delta method, bootstrapping, or simulation-based inference
https://marginaleffects.com

pooled avg_slopes (after mice) vs. avg_slopes on the "mira" object - small discrepancy. Is it possible to align them? #850

Closed · Generalized closed this issue 1 year ago

Generalized commented 1 year ago

Dear @vincentarelbundock, now I've run into another problem. Let me briefly explain what I'm doing:

1) I perform multiple imputation on a dataset with several columns that are used to derive a new outcome called "overall success" (it's a clinical study), which will then be compared across the 2 treatment arms.

2) the derivation (creation) of this new variable is done after the imputation, based on the completed data. Because the "post-processing" syntax of mice was too difficult for me, I did it explicitly after the imputation, iterating through the list of imputed datasets and doing some dplyr mutate() calls (a rough, anonymized sketch of this helper is shown below).

3) having created the new variable, I fit the logistic regression followed by AME to obtain the difference between the two treatment arms. Then I pool the results. Done.

So far I have done it this way. I'm sorry for the "pseudo code", but the data itself is too big to bring here... But I provide all steps and results.
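
For context, infer_overall_success is my own helper; it looks roughly like this (the component columns and the success rule are anonymized placeholders, not the real study definition):

library(dplyr)

# Sketch of my helper: combine several component columns into a single
# binary "overall success" indicator (placeholder rule shown)
infer_overall_success <- function(dat) {
  dat %>%
    mutate(
      OverallSuccess_calc_num = as.integer(
        component_a_num == 1 & component_b_num == 1  # placeholder criteria
      )
    )
}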

library(mice)            # imputation and pooling
library(marginaleffects) # avg_slopes()
library(miceafter)       # offers the pooled Wald's difference - just to validate the results

# turn the "mids" class (resulting from the imputation by mice) into just a list of complete datasets
imp_dat <- mids2milist(my_imputed_data)

# go through the list and apply the function that creates the new variables
imp_dat_success <- lapply(imp_dat, infer_overall_success)

# restore the lost class name "milist"
class(imp_dat_success) <- "milist"

# define the "worker function"

fit_logistic_overall_success <- function(dat) {
  # logistic model for overall success by treatment arm
  mod <- glm(OverallSuccess_calc_num ~ Arm_num, family = binomial(link = "logit"), data = dat)
  # average marginal effect (risk difference between the arms)
  out <- avg_slopes(mod, newdata = dat, conf_level = 0.9, numderiv = list("fdcenter", eps = 1e-10))
  return(out)
}

# run it over the imputed list...
mod_imputation <- lapply(imp_dat_success, fit_logistic_overall_success)

# Finally, pool the results.
# dfcom comes from my knowledge of how many degrees of freedom there should be (the pooled inference uses the t statistic).
summary(pool(mod_imputation, dfcom = 110), conf.int = TRUE, conf.level = 0.9)

    term contrast    estimate  std.error  statistic      df   p.value         5 %       95 %
1 Arm_num    1 - 0 -0.01730519 0.02059518 -0.8402544 87.8888 0.4030458 -0.05154214 0.01693175
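
As an extra cross-check, Rubin's rules can also be applied by hand to the per-imputation results; a minimal sketch, assuming each element of mod_imputation has a single row (as here):

# Manual Rubin's rules as a cross-check of mice::pool()
# (standard formulas; one estimate per imputed dataset)
est <- sapply(mod_imputation, function(x) x$estimate)
se  <- sapply(mod_imputation, function(x) x$std.error)
m   <- length(est)

qbar <- mean(est)               # pooled point estimate
ubar <- mean(se^2)              # within-imputation variance
b    <- var(est)                # between-imputation variance
tvar <- ubar + (1 + 1 / m) * b  # total variance

c(estimate = qbar, std.error = sqrt(tvar))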

OK! Now let's check against a dedicated function from the miceafter package (Wald's difference in two proportions, with non-pooled SEs):

# It uses the same dataset as above - with the inferred variable
ov_succ_wald <- with(imp_dat_success,
                     expr = propdiff_wald(OverallSuccess_calc_num ~ Arm_num, strata = FALSE))

# And pool the results (it cares about the df itself)
pool_propdiff_wald(ov_succ_wald, conf.level = 0.9)

     Prop diff Wald     SE       t 90 CI low 90 CI high
[1,]       -0.01731 0.0206 1.66238  -0.05154    0.01693
attr(,"class")
[1] "mipool"

Just a quick sanity check on the rounding...

# Sadly, pool_propdiff_wald rounds to 5 decimal digits. I temporarily increased this to 10 by editing the function body.

pool_propdiff_wald: -0.05154278
pooled avg_slopes:  -0.05154214

The agreement is fine.

Then I realized I could do it a bit more simply:

# Stack all complete datasets into long format
my_imputed_long <- complete(my_imputed_data, "long", include = TRUE)

# Derive the success-related variables
my_imputed_data_complete <- infer_overall_success(my_imputed_long)

# Turn it back into "mids"
my_imputed_data_complete <- as.mids(my_imputed_data_complete)

# Fit the model to each dataset
m <- with(my_imputed_data_complete, glm(OverallSuccess_calc_num ~ Arm_num, family = binomial(link = 'logit')))

# And use the new feature - work directly on the "mira"
avg_slopes(m, conf_level = 0.9, numderiv = list("fdcenter", eps = 1e-10))

   Term Contrast Estimate Std. Error   S   Df     t Pr(>|t|)   5.0 % 95.0 %
 Arm_num    1 - 0  -0.0173     0.0206 1.3 1233 -0.84    0.401 -0.0512 0.0166

Columns: term, contrast, estimate, std.error, s.value, df, statistic, p.value, conf.low, conf.high 

Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.

Now the lower CI is: -0.05120674 vs -0.05154214

I noted the message "Infinite sample size assumed". OK, so the normal approximation was used. But the Df confused me: it is 1233 rather than Inf.
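
My guess (from reading about mice::pool(), so take it with a grain of salt) is that the df comes from the Barnard-Rubin (1999) adjustment, and that "infinite sample size" is implemented as a very large dfcom rather than a literal Inf, which still yields a finite adjusted df like 1233. A sketch of the standard formula:

# Barnard-Rubin (1999) adjusted df as I understand mice applies it:
# m = number of imputations, b = between-imputation variance,
# t = total variance; a huge dfcom stands in for "infinite" sample size
barnard_rubin_df <- function(m, b, t, dfcom = 999999) {
  lambda <- (1 + 1 / m) * b / t    # fraction of missing information
  dfold  <- (m - 1) / lambda^2     # classic Rubin (1987) df
  dfobs  <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)
  dfold * dfobs / (dfold + dfobs)  # adjusted df (never exceeds dfcom)
}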

OK, let's try providing the DF manually:

avg_slopes(m, df = 110, conf_level = 0.9, numderiv = list("fdcenter", eps = 1e-10))

    Term Contrast Estimate Std. Error   S   Df     t Pr(>|t|)   5.0 % 95.0 %
 Arm_num    1 - 0  -0.0173     0.0206 1.3 1233 -0.84    0.401 -0.0512 0.0166

Columns: term, contrast, estimate, std.error, s.value, df, statistic, p.value, conf.low, conf.high 

Warning message:
In get.dfcom(object, dfcom) : Infinite sample size assumed.

But nothing happened: the warning remains and the results don't change.

I'm wondering whether it's possible to obtain the same results regardless of the approach: pooling the AMEs over the GLMs explicitly vs. running AME on the "mira" object holding the imputed GLMs.

I'm using the development version: ‘0.13.0.9009’

vincentarelbundock commented 1 year ago

Hi,

Unfortunately, I don’t have time to figure out this complicated situation and give support.

But the warning doesn't come from the marginaleffects package itself. Maybe it comes from mice? You may want to set options() to turn warnings into errors, and then use traceback() to track down where the problem originates.
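
For example, something along these lines (standard base R debugging, nothing specific to marginaleffects):

# Promote warnings to errors so execution stops where the warning is raised
options(warn = 2)

# Re-run the offending call; it should now stop at the warning site
avg_slopes(m, conf_level = 0.9)

# Inspect the call stack to find which package raised it
traceback()

# Restore the default behaviour
options(warn = 0)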