kearnz / autoimpute

Python package for Imputation Methods
MIT License
241 stars 19 forks

Can't use date or time columns as predictors for imputation #21

Open kearnz opened 5 years ago

kearnz commented 5 years ago

Right now, time columns must be fully observed and are never imputed. Even so, they can still be passed as predictors to multivariate predictive imputation methods.

Unfortunately, Autoimpute requires the features of multivariate predictive imputation methods to be numerical. As a result, when a time-based column reaches this check (specifically _not_num_matrix), the following error is thrown:

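from autoimpute.imputations import SingleImputer

# df_with_ts_column: a DataFrame that includes a datetime column
si = SingleImputer()
si.fit_transform(df_with_ts_column)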

TypeError: default predictive not appropriate for Matrix with non-numerical columns.
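To illustrate why a datetime column trips this check, here is a minimal sketch of a numeric-only guard. It is not Autoimpute's actual `_not_num_matrix` implementation; the function names `non_numeric_columns` and `check_numeric_matrix` and the sample DataFrame are hypothetical, and only pandas' public dtype helpers are assumed.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def non_numeric_columns(df):
    """Return the names of columns whose dtype is not numeric."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]


def check_numeric_matrix(df, strategy="default predictive"):
    """Hypothetical stand-in for a numeric-only predictor check."""
    bad = non_numeric_columns(df)
    if bad:
        raise TypeError(
            f"{strategy} not appropriate for Matrix with "
            f"non-numerical columns: {bad}"
        )


# Hypothetical data: a fully observed time column plus two numeric
# columns that contain missing values.
df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=4, freq="D"),
    "x": [1.0, None, 3.0, 4.0],
    "y": [10.0, 20.0, None, 40.0],
})

try:
    check_numeric_matrix(df)
except TypeError as err:
    print(err)  # the datetime column "ts" is reported as non-numerical
```

A datetime64 column is not a numeric dtype as far as pandas is concerned, so any check of this shape rejects it unless the column has been removed or encoded beforehand.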

The error itself is not the bug. It says that the default predictive imputation strategy is not appropriate for fitting a matrix with non-numerical columns, so _not_num_matrix does its job as expected. The issue is that a time-series column can reach this stage without:

  1. Being removed from the predictor set
  2. Being encoded and stored in the predictor set in its encoded form
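The two options can be sketched with plain pandas. This is only an illustration under stated assumptions: the helper names are hypothetical and not part of Autoimpute, the time columns are assumed to be tz-naive and fully observed, and "seconds since the Unix epoch" is just one possible encoding.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def datetime_columns(df):
    """Return the names of all datetime columns in the DataFrame."""
    return [col for col in df.columns if is_datetime64_any_dtype(df[col])]


def drop_datetime_predictors(df):
    """Option 1: remove datetime columns from the predictor set."""
    return df.drop(columns=datetime_columns(df))


def encode_datetime_predictors(df):
    """Option 2: replace each datetime column with seconds since the epoch."""
    out = df.copy()
    epoch = pd.Timestamp("1970-01-01")
    for col in datetime_columns(df):
        # Timedelta / Timedelta yields a float column (assumes tz-naive values)
        out[col] = (out[col] - epoch) / pd.Timedelta(seconds=1)
    return out


# Hypothetical data with one time column and one numeric column
df = pd.DataFrame({
    "ts": pd.date_range("2020-01-01", periods=3, freq="D"),
    "x": [1.0, None, 3.0],
})

print(drop_datetime_predictors(df).dtypes)    # only "x" remains
print(encode_datetime_predictors(df).dtypes)  # "ts" is now float64
```

Either result contains only numeric columns, so it would pass a numeric-only check. Option 1 throws away the time information, while option 2 keeps it in a form a regression-based imputer can use.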

Ideally, we want to incorporate time-series columns as predictors by encoding them (option 2), although the work to do so is non-trivial. We have to decide whether Autoimpute will do the dirty work itself (encoding, scaling, etc.) or leave that responsibility to the user.
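To give a sense of what that dirty work might involve, here is a rough sketch of user-side preprocessing that expands a time column into numeric calendar features and scales one of them. Every name here (`expand_datetime`, `standardize`, `prepare_predictors`) is hypothetical, the feature choices are assumptions rather than a proposal for Autoimpute's design, and the time column is assumed to be tz-naive and fully observed.

```python
import numpy as np
import pandas as pd


def expand_datetime(series, prefix):
    """Turn a datetime Series into numeric calendar features."""
    dt = series.dt
    feats = pd.DataFrame(index=series.index)
    feats[f"{prefix}_year"] = dt.year.astype(float)
    # Cyclical encoding keeps December close to January
    # and Sunday close to Monday.
    feats[f"{prefix}_month_sin"] = np.sin(2 * np.pi * (dt.month - 1) / 12)
    feats[f"{prefix}_month_cos"] = np.cos(2 * np.pi * (dt.month - 1) / 12)
    feats[f"{prefix}_dow_sin"] = np.sin(2 * np.pi * dt.dayofweek / 7)
    feats[f"{prefix}_dow_cos"] = np.cos(2 * np.pi * dt.dayofweek / 7)
    return feats


def standardize(df):
    """Scale columns to zero mean and unit variance; constant columns become 0."""
    std = df.std(ddof=0).replace(0, 1.0)
    return (df - df.mean()) / std


def prepare_predictors(df, time_col):
    """Replace one time column with encoded, partially scaled features."""
    feats = expand_datetime(df[time_col], time_col)
    year = f"{time_col}_year"
    # The sine/cosine features already lie in [-1, 1]; only the year is scaled.
    feats[[year]] = standardize(feats[[year]])
    return pd.concat([df.drop(columns=[time_col]), feats], axis=1)


# Hypothetical data: weekly observations with gaps in the numeric column
df = pd.DataFrame({
    "ts": pd.date_range("2019-12-01", periods=6, freq="W"),
    "x": [1.0, None, 3.0, 4.0, None, 6.0],
})
prepared = prepare_predictors(df, "ts")
print(prepared.dtypes)  # every column is float64
```

The resulting frame is fully numeric, so it could be passed to an imputer in place of the original. Whether this kind of expansion belongs inside the library or in user code is exactly the open question above.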