Closed maxmahlke closed 3 years ago
Hi @maxmahlke
Apologies - just getting to your issue now.
I appreciate the thoughtful comments. We recently added the MiceImputer, and your suggestions would be an important next step to improve the package. I think all three of your suggestions are possible and would work nicely together. I've provided some feedback below in response to your ideas. I don't have a ton of bandwidth right now, but I could potentially work on some of this in the upcoming weeks, and I'm more than happy to advise / review pull requests should you want to contribute. Either way, let me know what you think!
The alternative you provide is the easiest to implement. This would probably be the quickest win to start. One thing we'd want to consider is returning k*n dataframes in the most efficient way.
The two solutions seem to go hand in hand. A plot would be a nice layer on top of a solution that allows you to set convergence params or rely on some sort of auto feature that handles optimal stopping. I'd have to go back and review the literature to refresh my memory here, but it's definitely something we could (and really should) have in the package.
Best, Joe
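For illustration only, the "auto" stopping idea mentioned above could be sketched as a simple stability check on a per-iteration summary, such as the mean of the imputed values of one variable. This is a naive heuristic under stated assumptions, not a rule taken from the MICE literature and not part of autoimpute: the function name first_stable_iteration and the parameters tol and patience are hypothetical.

```python
import numpy as np


def first_stable_iteration(trace, tol=0.01, patience=3):
    """Return the 1-based iteration at which `trace` has moved by less than
    `tol` (absolute change) for `patience` consecutive steps, else None.

    `trace` is assumed to hold one summary value per iteration, for example
    the mean of the imputed values of a single variable.
    """
    trace = np.asarray(trace, dtype=float)
    changes = np.abs(np.diff(trace))  # changes[0] is iteration 1 -> 2
    stable = 0
    for iteration, change in enumerate(changes, start=2):
        # Count consecutive small steps; any large step resets the counter.
        stable = stable + 1 if change < tol else 0
        if stable >= patience:
            return iteration
    return None


# Hypothetical trace that settles after a few iterations.
example_trace = [1.0, 1.5, 1.6, 1.601, 1.602, 1.6021]
stop_at = first_stable_iteration(example_trace, tol=0.01, patience=3)
print(stop_at)  # 6
```

Because imputed values are drawn stochastically, a single chain keeps fluctuating, so in practice tol would have to be set relative to that noise, or the check would have to compare several chains against each other.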
Hi Joe,
apologies for the delay. For context, I'm interested in imputation as part of my PhD research. This week I further looked into the Maximum Likelihood + EM algorithm approach as an alternative, and I will explore that direction for now as the MiceImputer in combination with Bayesian Regression is quite expensive computationally (dataframe is 280,000 x 75 with missing values in each feature). If I get back to imputation, I will gladly contribute to this project!
Cheers Max
Hi Max,
Sounds good, and good luck with your research! Would be interested to hear what you find with ML + EM. I wanted to add those approaches to autoimpute but didn't have time in the first major release. Would love to collaborate on that as well if you'd like to contribute in that way at any point in the future.
Best, Joe
Problem
The MiceImputer accepts a keyword k for the number of iterations. There is no easy way of determining the optimal k depending on the data at hand.
Currently, I am fixing the imputation with a seed while increasing k to monitor the convergence of the imputed variables. This is code-wise ok but computationally very inefficient.
Possible Solutions
I see two solutions which could both be implemented:
- MiceImputer constructor
- plot_convergence function which displays the change of imputed variables over the iterations, as in Figures 7-10 in Buuren+ 2011, or here.
Alternative to new feature
A less user-friendly but more exhaustive solution would be to optionally return all iteration results for each imputation run, not just the final one. Assuming I'm instantiating MiceImputer with k=10, n=5, and return_list=True, I would then be getting a list with 5 elements (just as now), where each element is itself a list with 10 elements, each element being the data matrix at various imputation steps.
I have not seen this being addressed in any branch of the code. I'm also not very familiar with MICE, so maybe there is a better way of determining k that I'm not aware of. If you think this feature could be valuable, I'd be happy to contribute code in any part you see fit.
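To make the nested return structure concrete, here is a minimal sketch in plain NumPy. It does not use autoimpute: toy_chained_imputation, impute_with_history and imputed_means are hypothetical helpers, and the linear regression with added noise only stands in for whatever strategy MiceImputer would actually run. The sketch assumes every column has at least one observed and one missing value.

```python
import numpy as np


def toy_chained_imputation(X, k=10, seed=None):
    """Toy chained-equations imputer that keeps the matrix after each iteration."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Start each missing entry from a randomly drawn observed value of its column.
    for j in range(X.shape[1]):
        rows = missing[:, j]
        if rows.any():
            filled[rows, j] = rng.choice(X[~rows, j], size=rows.sum())
    history = []
    for _ in range(k):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Regress column j on the other columns, fitted on rows where j is observed.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            resid = filled[~rows, j] - design[~rows] @ coef
            # Predict the missing entries and add residual-scale noise.
            noise = rng.normal(0.0, resid.std(), size=rows.sum())
            filled[rows, j] = design[rows] @ coef + noise
        history.append(filled.copy())  # snapshot after this iteration
    return history


def impute_with_history(X, k=10, n=5, seed=0):
    """Return n runs, each a list of k data matrices (one per iteration)."""
    return [toy_chained_imputation(X, k=k, seed=seed + i) for i in range(n)]


def imputed_means(runs, missing):
    """Mean of the imputed values per run, iteration and column: shape (n, k, p)."""
    p = missing.shape[1]
    return np.array(
        [[[m[missing[:, j], j].mean() for j in range(p)] for m in run] for run in runs]
    )


# Synthetic demo data with roughly 20% of entries missing at random.
rng = np.random.default_rng(42)
cov = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.5], [0.3, 0.5, 1.0]]
data = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200)
mask = rng.random(data.shape) < 0.2
data_missing = np.where(mask, np.nan, data)

runs = impute_with_history(data_missing, k=10, n=5)
traces = imputed_means(runs, mask)
print(len(runs), len(runs[0]), traces.shape)
```

With k=10 and n=5 this holds 50 full copies of the data, which is the memory cost behind the efficiency concern. Storing only per-iteration summaries such as traces would be much lighter for a 280,000 x 75 dataframe, and traces[:, :, j] holds one line per run for variable j, which is the kind of quantity a convergence plot would display.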