kearnz / autoimpute

Python package for Imputation Methods
MIT License

multiple_imputer not doing real multiple imputation #42

Closed · gjdv closed 4 years ago

gjdv commented 4 years ago

While debugging some weird behavior (huge imputed values), I found that the multiple_imputer skips several important steps of multiple imputation (MI) as defined by Rubin and as implemented in Stef van Buuren's MICE R package. MI should do the following:

repeat n times:
  identify missingness in dataframe
  initialize an imputed dataframe by inserting e.g., mean values per column where data is missing
  while not stable (or for set k number of iterations):
    for each column with missingness:
      create a single imputer using the current column as output and the other columns as input to the model
      update the imputed dataframe with imputed values where data was originally missing
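
In Python terms, roughly (my own illustrative sketch, not autoimpute code; it assumes numeric columns and a fixed number of sweeps instead of a convergence check):

from sklearn.linear_model import LinearRegression

def mice_once(df, k=10):
    mask = df.isna()                        # where data was originally missing
    imputed = df.fillna(df.mean())          # initialize with column means
    for _ in range(k):                      # "while not stable" -> k sweeps
        for col in df.columns[mask.any()]:  # each column with missingness
            obs = ~mask[col]
            X = imputed.drop(columns=col)   # other columns (current imputations) as input
            model = LinearRegression().fit(X[obs], imputed.loc[obs, col])
            # overwrite only the originally-missing entries
            imputed.loc[mask[col], col] = model.predict(X[mask[col]])
    return imputed

For real MI, the per-column model should draw imputations stochastically (e.g., a Bayesian variant or pmm), so that repeating this n times yields n different completed datasets.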

Instead, multiple_imputer.py does not iteratively improve the imputed values, but rather runs the while-not-stable loop only once. Also, it does not use the initialized imputed dataframe, but rather selects only complete rows in the imputation (single_imputer calls _get_observed). This results in selection bias in the imputation.

Finally, the imputation strategies default to regular (deterministic) regression methods rather than their Bayesian variants. As a consequence, the results of the n imputations are all the same, e.g.:

imp = MultipleImputer(return_list=True, n=3)
train_imp = imp.fit_transform(train)      # train: a DataFrame with missing values
train_imp[0][1].equals(train_imp[1][1])   # gives True

If I find some time in the near future, I might be able to help fix this in the implementation.

kearnz commented 4 years ago

Hi, thanks for taking the time to go through the package.

A couple of points in response:

  1. We view multiple imputation as a framework in which you impute a dataset multiple times. MICE is an approach within that framework that can iteratively improve imputed values (i.e., the "while not stable" or "set k iterations" loop). For example, one could create a ChainedEquationImputer that works within our framework. Also, can you point me to the source you're referring to that outlines the algorithm you've written?

  2. The _get_observed() call is by design. Our package assumes that the missing data is drawn from the same distribution as the observed data (both univariate and joint). Can you be more specific about what you would expect for initialization? And can you provide a source regarding the selection bias you refer to?

  3. The default implementation is pmm, and the imputed datasets will not be the same after imputation. I can't reproduce what you describe from the code you provided, so a complete, runnable example would help.

I'll leave this open for now, but will likely close it within 24 hours, as the issue is too broad and it's not clear which items to address or how to address them. For future issues, please raise one item per issue, name the issue after the item raised, include relevant documentation in support of it, provide complete code, and label the issue accordingly.

gjdv commented 4 years ago

Thanks for your response and the constructive discussion.

ad 1. I wasn't aware of the ChainedEquationImputer; thanks for the pointer. I will look into it and see if it serves my needs.

ad 2. I understand this way of working for single imputation; for multiple imputation, however, it is disadvantageous not to use incomplete samples as a source for the imputation. I'm sorry for not providing exact pointers; it has been 5 years since I did a deep dive into the matter, and unfortunately I could not track down all my notes and sources from back then. Nevertheless, in e.g., this tech note, page 10 details Gibbs sampling as the way to include columns that themselves have missing data as input to the regression methods, iteratively improving the imputed values. W.r.t. selection bias, I should have mentioned that I was referring to cases where Missing At Random (MAR) does not completely hold (as is typical in real-world data); there, filtering out incomplete samples by definition leads to selection bias.
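
From memory, the per-variable Gibbs step there looks like this (iteration t, variables y_1, ..., y_p, with theta_j the parameters of the j-th conditional model):

\dot{\theta}_j^{(t)} \sim P\big(\theta_j \mid y_j^{\mathrm{obs}},\ y_1^{(t)}, \ldots, y_{j-1}^{(t)},\ y_{j+1}^{(t-1)}, \ldots, y_p^{(t-1)}\big)
\dot{y}_j^{(t)} \sim P\big(y_j^{\mathrm{mis}} \mid y_j^{\mathrm{obs}},\ y_1^{(t)}, \ldots, y_{j-1}^{(t)},\ y_{j+1}^{(t-1)}, \ldots, y_p^{(t-1)},\ \dot{\theta}_j^{(t)}\big)

So the parameters are drawn conditional on the observed part of y_j and the current imputations of all other columns; incomplete rows thus contribute to every conditional model instead of being filtered out.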

ad 3. The constructor of MultipleImputer defaults to strategy="default predictive", which BaseImputer translates to pmm for numerical columns and logistic for categorical ones. I should have mentioned that I was referring to the categorical case here. W.r.t. the code example, I see that I mixed up various tries; I'll come back with a reproducible example when I find some more time.
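
(If I read the docs correctly, that categorical default can be overridden per column with a strategy dict; the column names here are made up:)

imp = MultipleImputer(n=3, return_list=True,
                      strategy={"age": "pmm", "gender": "multinomial logistic"})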

kearnz commented 4 years ago

Hi - thank you for the detailed response, and I appreciate the time you're taking on this!

Overall, your comments are very helpful. I have a couple clarifications in response:

  1. There isn't a ChainedEquationImputer yet. I meant to say that there could be, and that it could then be integrated into the MultipleImputer. The MultipleImputer shouldn't need any refactoring, as the ChainedEquationImputer could be responsible for the additional loop (i.e., "while not stable"). I'd have to look at the potential code refactoring in more detail, but these are my initial thoughts; see the sketch after this list.

  2. Thanks for providing a source (and location within it!). I'll look into this, but your summary makes sense and I completely agree. For context, I cut the initial scope for v1 of this package at MCAR and MAR, as I developed the package as part of a Master's degree program, so I had a deadline. As you noted, MAR often doesn't hold in real world cases, so a major enhancement of this package would be imputation methods that account for this, as well as sensitivity analysis to evaluate. This point (along with 1.) will be prioritized for any further development now that there's no time restriction on delivery.

  3. I implemented a Bayesian classifier, but it only handles the binary case right now, so I set the default classification strategy to multinomial. Classification in general could be improved in this package.
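
To make the division of labor in point 1 concrete, a hypothetical sketch (the names and API are made up, and I'm reusing the mice_once loop from your first comment as a stand-in for the chained-equations sweep):

class ChainedEquationImputer:
    # Hypothetical -- does not exist in autoimpute today.
    # Owns the iterative "while not stable" loop for one completed dataset.
    def __init__(self, k=10):
        self.k = k

    def fit_transform(self, df):
        return mice_once(df, k=self.k)  # per-column sweeps, repeated k times

class MultipleImputerSketch:
    # The multiple-imputation layer stays a thin "impute n times" wrapper.
    def __init__(self, n=5, k=10):
        self.n, self.k = n, k

    def fit_transform(self, df):
        return [(i + 1, ChainedEquationImputer(self.k).fit_transform(df))
                for i in range(self.n)]

With a stochastic model inside the sweep, the n returned datasets would then differ, as multiple imputation requires.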

Over the weekend, I'll likely create 3 new issues - one for each of these. That way they can be separately tracked and addressed, since each one has its own action items.

gjdv commented 4 years ago

Thanks. Sorry for misreading the first point; I agree that seems a good approach, and I'd be happy to think (and code) along if you'd like. It looks to me like the main changes in a ChainedEquationImputer would be (a) retaining the positions of the missing elements (e.g., as a mask) so that only those elements are updated in each iteration while the originally complete values remain untouched, and (b) not applying the subselection done by the call to _get_observed().

kearnz commented 4 years ago

No worries at all! I'm making the separate issues now and closing this one. Happy for you to work on any of them; just keep me posted. I appreciate you taking the time to help and go through the package.