kearnz / autoimpute

Python package for Imputation Methods
MIT License

Predictor Placeholder Imputation within PredictiveImputer #15

Closed by kearnz 5 years ago

kearnz commented 5 years ago

PredictiveImputer supports machine learning models for imputation. Currently, it fits a model (linear or logistic) on the observed X and y, where X and y depend on the predictors the user specifies when creating an instance of the class. The fit on the observed data yields coefficients, which we then use for imputation during the transform step.
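To make the fit/transform split concrete, here is a minimal sketch of the idea in plain NumPy. It is not the PredictiveImputer code: the function names `fit_observed` and `transform_missing` are hypothetical, the model is ordinary least squares only, and the predictors are assumed to be fully observed.

```python
import numpy as np

def fit_observed(X, y):
    """Fit ordinary least squares using only rows where y is observed."""
    observed = ~np.isnan(y)
    # Prepend an intercept column to the observed predictors.
    design = np.column_stack([np.ones(observed.sum()), X[observed]])
    coefs, *_ = np.linalg.lstsq(design, y[observed], rcond=None)
    return coefs

def transform_missing(X, y, coefs):
    """Fill missing y with predictions from the stored coefficients."""
    missing = np.isnan(y)
    design = np.column_stack([np.ones(missing.sum()), X[missing]])
    filled = y.copy()  # leave the caller's array untouched
    filled[missing] = design @ coefs
    return filled

# Toy data (assumption): y = 2x + 1 on the observed rows.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, np.nan, 9.0])

coefs = fit_observed(X, y)
y_imputed = transform_missing(X, y, coefs)
```

Keeping the coefficients from the fit step separate from the transform step means the same fitted model can be applied to new data without refitting.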

During the transform step, we predict values for Y_mis from the covariates X_mis, i.e. the predictor values in rows where Y is missing. However, nothing guarantees that X_mis is fully observed. If covariates have missing values, we must impute them with something so that we can generate a prediction for each y in Y_mis. Currently, we impute the predictors in the transform step with the default methods from the SingleImputer.
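As an illustration of the placeholder step, the sketch below fills missing numeric covariates with their column means before prediction. This is an assumption about what a default placeholder looks like for numeric data, not the SingleImputer implementation, and `fill_placeholders` is a hypothetical helper name.

```python
import numpy as np

def fill_placeholders(X):
    """Replace NaN in each predictor column with that column's observed mean."""
    X_filled = X.copy()
    col_means = np.nanmean(X, axis=0)  # means computed over observed values only
    rows, cols = np.where(np.isnan(X))
    X_filled[rows, cols] = col_means[cols]
    return X_filled

# Toy predictors (assumption): the second column has one missing value.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

X_ready = fill_placeholders(X)  # no NaN left, so every row can be scored
```

A categorical covariate would need a mode placeholder instead of a mean; that case is not covered here.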

This raises two questions we may want to address: 1) Should the SingleImputer be customizable as well? 2) Once a column is imputed, should its imputed values be used in subsequent predictions?

The first question is just a matter of writing extra code and validation. The second is more theoretical. It matters most for multiple imputation, where columns are imputed using something like a visit sequence or a random selection process.
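For question 2, a rough sketch of what "use imputed values in subsequent predictions" could look like with a fixed visit sequence. Everything here is an assumption for illustration: `impute_sequentially` is a hypothetical function, the model is least squares on all other columns, and not-yet-visited columns start from mean placeholders.

```python
import numpy as np

def impute_sequentially(data, visit_order):
    """Impute columns one at a time, reusing earlier imputations as predictors."""
    missing_mask = np.isnan(data)
    # Start from mean placeholders so every predictor is usable.
    working = np.where(missing_mask, np.nanmean(data, axis=0), data)
    for col in visit_order:
        miss = missing_mask[:, col]
        if not miss.any():
            continue
        # Predictors are all other columns in their current (possibly imputed) state.
        others = np.delete(working, col, axis=1)
        design = np.column_stack([np.ones(len(working)), others])
        # Fit only on rows where the target column was originally observed.
        coefs, *_ = np.linalg.lstsq(design[~miss], data[~miss, col], rcond=None)
        # Write predictions back so columns visited later see them.
        working[miss, col] = design[miss] @ coefs
    return working

# Toy data (assumption): column 1 = 2 * column 0, column 2 = 2 * column 0 + 1.
data = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 5.0],
    [3.0, 6.0, 7.0],
    [4.0, 8.0, 9.0],
    [5.0, 10.0, 11.0],
])

result = impute_sequentially(data, visit_order=[1, 2])
```

When column 2 is visited, the model sees the value already imputed for column 1 rather than its mean placeholder, which is the behavior the question is asking about.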

kearnz commented 5 years ago

For 1), we are not supporting a customizable SingleImputer. It uses the default placeholders, which apply only to imputations with missing covariates. That is already an edge case, and most of the literature uses mean/mode placeholders as well.

For 2), imputed values are now used in subsequent predictions. This was fixed in commit 9b09076. That said, the fix is a bit of a hack for now: it calls the _update_dataframe method, which does additional work. There is no memory footprint issue, and the additional time taken is negligible.