kearnz / autoimpute

Python package for Imputation Methods
MIT License
237 stars, 19 forks

Adding convergence criterion and monitoring to MiceImputer #55

Closed: maxmahlke closed this issue 3 years ago

maxmahlke commented 3 years ago

Problem

The MiceImputer accepts a keyword k for the number of iterations. There is no easy way of determining the optimal k depending on the data at hand.

Currently, I fix the random seed and re-run the imputation with increasing k to monitor the convergence of the imputed variables. This works code-wise, but it is computationally very inefficient, since every run repeats all previous iterations from scratch.
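To illustrate the workaround (and why it is wasteful), here is a minimal sketch. `chained_impute` is a hypothetical stand-in for a MICE-style imputer on a single column, not the autoimpute API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two correlated columns, with ~20% of column 1 missing.
X = rng.normal(size=(200, 2))
X[:, 1] += X[:, 0]
mask = rng.random(200) < 0.2
X_miss = X.copy()
X_miss[mask, 1] = np.nan


def chained_impute(X, k, seed):
    """Hypothetical stand-in for a MICE-style imputer (NOT the autoimpute
    API): k rounds of regression imputation on a single column."""
    rng = np.random.default_rng(seed)
    Z = X.copy()
    miss = np.isnan(X[:, 1])
    Z[miss, 1] = np.nanmean(X[:, 1])  # initialize with the column mean
    for _ in range(k):
        # refit the regression using the current fills, then re-impute
        beta = np.polyfit(Z[:, 0], Z[:, 1], 1)
        Z[miss, 1] = np.polyval(beta, Z[miss, 0]) + rng.normal(scale=0.1, size=miss.sum())
    return Z


# The workaround: re-run from scratch with the same seed for k = 1..5 and
# watch the mean of the imputed values stabilize.  Total cost is
# 1 + 2 + ... + K iterations instead of the K a single monitored run needs.
means = [chained_impute(X_miss, k, seed=42)[mask, 1].mean() for k in range(1, 6)]
print(means)
```

Fixing the seed makes successive runs comparable, but the quadratic total cost is exactly the inefficiency described above.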

Possible Solutions

I see two solutions, which could both be implemented:

1. A convergence criterion, e.g. a tolerance on the change of the imputed values between iterations, so the chained equations can stop automatically once k is large enough.
2. A way to monitor convergence, e.g. a plot of the imputed values over the iterations.

Alternative to new feature

A less user-friendly but more exhaustive solution would be to optionally return all iteration results for each imputation run, not just the final one. Assuming I'm instantiating MiceImputer with k=10, n=5, and return_list=True, I would then get a list with 5 elements (just as now), where each element is itself a list with 10 elements, each inner element being the data matrix after the corresponding iteration.
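If such a nested return format existed, diagnosing convergence from it would be straightforward. The sketch below fabricates `results` with the proposed shape (n outer runs, k inner per-iteration matrices; all names are illustrative, not autoimpute API) and computes the change between successive iterations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fabricated stand-in for the proposed return format: results[i][j] is the
# completed data matrix after iteration j of imputation run i.  The shrinking
# noise scale mimics chains that settle down as j grows.
n, k, shape = 5, 10, (50, 3)
results = [
    [rng.normal(scale=1.0 / (j + 1), size=shape) for j in range(k)]
    for _ in range(n)
]


def iteration_deltas(run):
    """Max absolute change between successive iteration matrices of one run."""
    return [np.abs(b - a).max() for a, b in zip(run, run[1:])]


for i, run in enumerate(results):
    deltas = iteration_deltas(run)
    print(f"run {i}: first delta {deltas[0]:.3f}, last delta {deltas[-1]:.3f}")
```

A flat delta series near zero would indicate that the chosen k was large enough; a series that is still shrinking would suggest increasing it.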


I have not seen this being addressed in any branch of the code. I'm also not very familiar with MICE, so maybe there is a better way of determining k that I'm not aware of. If you think this feature could be valuable, I'd be happy to contribute code in any part you see fit.

kearnz commented 3 years ago

Hi @maxmahlke

Apologies - just getting to your issue now.

I appreciate the thoughtful comments. We recently added the MiceImputer, and your suggestions would be an important next step to improve the package. I think all three of your suggestions are possible and would work nicely together. I've provided some feedback below in response to your ideas. I don't have a ton of bandwidth right now, but I could potentially work on some of this in the upcoming weeks, and I'm more than happy to advise / review pull requests should you want to contribute. Either way, let me know what you think!

The alternative you provide is the easiest to implement. This would probably be the quickest win to start. One thing we'd want to consider is returning k*n dataframes in the most efficient way.

The two solutions seem to go hand in hand. A plot would be a nice layer on top of a solution that allows you to set convergence params or rely on some sort of auto feature that handles optimal stopping. I'd have to go back and review the literature to refresh my memory here, but it's definitely something we could (and really should) have in the package.
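As a concrete sketch of what a tolerance-based stopping rule could look like (a minimal single-column toy, not autoimpute code; the `tol` and `k_max` parameter names are made up), the loop below records the per-iteration change, which is exactly the series a convergence plot would display, and breaks once it falls below the tolerance:

```python
import numpy as np


def mice_with_stopping(X, k_max=50, tol=1e-4):
    """Toy chained-equations loop with early stopping: iterate until the
    largest change in the imputed values drops below `tol`."""
    Z = X.copy()
    miss = np.isnan(Z[:, 1])
    Z[miss, 1] = np.nanmean(X[:, 1])  # initialize with the column mean
    history = []
    for it in range(1, k_max + 1):
        prev = Z[miss, 1].copy()
        beta = np.polyfit(Z[:, 0], Z[:, 1], 1)  # refit using current fills
        Z[miss, 1] = np.polyval(beta, Z[miss, 0])
        history.append(np.abs(Z[miss, 1] - prev).max())
        if history[-1] < tol:  # convergence criterion reached
            break
    return Z, it, history


rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
X[:, 1] += 2 * X[:, 0]
X[rng.random(300) < 0.3, 1] = np.nan

Z, iters, history = mice_with_stopping(X)
print(f"stopped after {iters} iterations, last change {history[-1]:.2e}")
```

Plotting `history` against the iteration index would give the monitoring plot discussed above.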

Best, Joe

maxmahlke commented 3 years ago

Hi Joe,

Apologies for the delay. For context, I'm interested in imputation as part of my PhD research. This week I looked further into the Maximum Likelihood + EM algorithm approach as an alternative, and I will explore that direction for now, as the MiceImputer combined with bayesian regression is computationally quite expensive (my dataframe is 280,000 x 75, with missing values in every feature). If I get back to imputation, I will gladly contribute to this project!
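For anyone landing here later, the ML + EM direction mentioned above can be sketched in a few lines of numpy, assuming a multivariate normal model with values missing at random: fill each row's missing entries with their conditional expectation given the observed entries (E-step), then re-estimate the mean and covariance with the standard conditional-covariance correction (M-step). A minimal illustration, not autoimpute code:

```python
import numpy as np


def em_mvn_impute(X, n_iter=50):
    """EM for a multivariate normal with missing data.  E-step: fill each
    row's missing entries with their conditional mean given the observed
    ones.  M-step: re-estimate mu and Sigma, adding back the conditional
    covariance of the filled entries."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    Z = np.where(miss, mu, X)
    S = np.cov(Z, rowvar=False, bias=True)
    for _ in range(n_iter):
        C = np.zeros((p, p))  # accumulated conditional covariances
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            reg = S[np.ix_(m, o)] @ np.linalg.inv(S[np.ix_(o, o)])
            Z[i, m] = mu[m] + reg @ (Z[i, o] - mu[o])
            C[np.ix_(m, m)] += S[np.ix_(m, m)] - reg @ S[np.ix_(o, m)]
        mu = Z.mean(axis=0)
        S = np.cov(Z, rowvar=False, bias=True) + C / n
    return Z, mu, S


rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.7, 0.3], [0.7, 1.0, 0.5], [0.3, 0.5, 1.0]])
X_full = rng.multivariate_normal(np.zeros(3), cov, size=400)
X_obs = X_full.copy()
X_obs[rng.random(400) < 0.25, 1] = np.nan  # column 0 stays fully observed,
X_obs[rng.random(400) < 0.25, 2] = np.nan  # so every row has observed data
Z, mu_hat, S_hat = em_mvn_impute(X_obs)
print(np.round(mu_hat, 2))
```

Since this yields a single maximum-likelihood completion rather than multiple imputations, it trades MICE's uncertainty estimates for a much lower computational cost, which matches the trade-off described above.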

Cheers Max

kearnz commented 3 years ago

Hi Max,

Sounds good, and good luck with your research! Would be interested to hear what you find with ML + EM. I wanted to add those approaches to autoimpute but didn't have time in the first major release. Would love to collaborate on that as well if you'd like to contribute in that way at any point in the future.

Best, Joe