kearnz / autoimpute

Python package for Imputation Methods
MIT License

Add Chained Equations to Multiple Imputation #43

Closed kearnz closed 4 years ago

kearnz commented 4 years ago

Right now, the MultipleImputer creates multiple samples of the same dataset and imputes each one independently. This is the "repeat n times" logic. However, the MultipleImputer does not iteratively improve the imputed datasets: each imputer runs only once on each column of each sample. The ChainedEquationsImputer (implementation TBD) would handle iterative improvements to each imputation. The pseudocode is as follows (provided by @gjdv):

repeat n times:
  identify missingness in dataframe
  initialize an imputed dataframe by inserting e.g., mean values per column where data is missing
  while not stable (or for a fixed number of iterations k):
    for each column with missingness:
      create a single imputer using the current column as output and the other columns as input to the model
      update the imputed dataframe with imputed values where originally data was missing

The initial plan is to implement this as a NEW SeriesImputer. It may require some changes to the MultipleImputer, although that is TBD.

kearnz commented 4 years ago

Implemented in v0.12.0.