Imputation Fails in the presence of categorical variables as predictors

AlexisMignon commented 3 years ago

In some cases, imputation fails when there are categorical predictors. I give an example here with the stochastic imputer, but my guess is that it comes from an improper encoding of categorical variables and probably affects all predictive imputers. The most probable cause is that the category encoding is not robust to unseen categories.

While it might be a deliberate choice not to handle this directly, it would be nice to detect the case and give users a clear error message.
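A minimal sketch of what such a check could look like. It assumes the predictors are dummy-encoded separately at fit and at transform time with `pd.get_dummies(..., drop_first=True)`; that is a guess at the mechanism, not a confirmed description of autoimpute's internals. The helpers `encode` and `check_encoded_columns` are hypothetical names used only for illustration.

```python
import pandas as pd


def encode(predictors: pd.DataFrame) -> pd.DataFrame:
    """Dummy-encode categorical predictors (assumed encoding, for illustration)."""
    return pd.get_dummies(predictors, drop_first=True)


def check_encoded_columns(fit_columns, new_data: pd.DataFrame) -> None:
    """Hypothetical guard: fail early with a readable message."""
    new_columns = list(new_data.columns)
    missing = [c for c in fit_columns if c not in new_columns]
    unseen = [c for c in new_columns if c not in fit_columns]
    if missing or unseen:
        raise ValueError(
            "Encoded predictors differ between fit and transform. "
            f"Missing at transform: {missing}. Unseen at fit: {unseen}."
        )


# Rows used to fit the model contain both categories of "x".
fit_rows = pd.DataFrame({"x": ["0", "1", "1", "0"], "y": [0.1, 0.4, 0.3, 0.9]})
# Rows to impute contain a single category, so its dummy column is dropped.
impute_rows = pd.DataFrame({"x": ["1", "1"], "y": [0.5, 0.2]})

fit_columns = list(encode(fit_rows).columns)

try:
    check_encoded_columns(fit_columns, encode(impute_rows))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

With a guard like this, the user would see which encoded columns are missing or new instead of a low-level shape error from the regression model.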

This is probably linked to: https://github.com/kearnz/autoimpute/issues/11#issue-411732513

The example below triggers a ValueError:

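import numpy as np
import pandas as pd
from autoimpute.imputations import SingleImputer

n_samples = 10

np.random.seed(1)
df = pd.DataFrame({
    # "x" is a categorical predictor stored as the strings "0" and "1"
    "x": list(map(str, np.random.randint(2, size=n_samples))),
    "y": np.random.rand(n_samples),
    "z": np.random.rand(n_samples)
})

# set roughly 20% of the values in "z" to NaN
df["z"] = np.where(np.random.rand(*df["z"].shape) < 0.8, df["z"], np.nan)

si = SingleImputer(strategy={"z": "stochastic"})
si.fit(df)
si.transform(df)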

ValueError                                Traceback (most recent call last)
<ipython-input-102-6f3ad0393ae5> in <module>
     12 si = SingleImputer(strategy={"z": "stochastic"})
     13 si.fit(df)
---> 14 si.transform(df)

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
     59             err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60             raise TypeError(err)
---> 61         return func(d, *args, **kwargs)
     62     return wrapper
     63 

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
    124 
    125         # return func if no missingness violations detected, then return wrap
--> 126         return func(d, *args, **kwargs)
    127     return wrapper
    128 

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/autoimpute/utils/checks.py in wrapper(d, *args, **kwargs)
    171             err = f"All values missing in column(s) {nc}. Should be removed."
    172             raise ValueError(err)
--> 173         return func(d, *args, **kwargs)
    174     return wrapper
    175 

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/autoimpute/imputations/dataframe/single_imputer.py in transform(self, X, imp_ixs, **trans_kwargs)
    296                 X.loc[imp_ix, column] = imputer.impute(x_, k=k)
    297             else:
--> 298                 X.loc[imp_ix, column] = imputer.impute(x_)
    299         return X
    300 

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/autoimpute/imputations/series/linear_regression.py in impute(self, X)
    156         check_is_fitted(self, "statistics_")
    157         mse = self.statistics_["param"]
--> 158         preds = self.lm.predict(X)
    159 
    160         # add random draw from normal dist w/ mean squared error

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/sklearn/linear_model/_base.py in predict(self, X)
    236             Returns predicted values.
    237         """
--> 238         return self._decision_function(X)
    239 
    240     _preprocess_data = staticmethod(_preprocess_data)

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/sklearn/linear_model/_base.py in _decision_function(self, X)
    219 
    220         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
--> 221         return safe_sparse_dot(X, self.coef_.T,
    222                                dense_output=True) + self.intercept_
    223 

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/Projets/anaconda3/envs/machine-learning-level-1/lib/python3.8/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    150             ret = np.dot(a, b)
    151     else:
--> 152         ret = a @ b
    153 
    154     if (sparse.issparse(a) and sparse.issparse(b)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 2 is different from 1)
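Reading the message: operand 0 is the predictor matrix passed to `predict` and operand 1 is the fitted coefficient vector. "size 2 is different from 1" is consistent with a model fitted on two predictor columns receiving only one column at prediction time. The sketch below reproduces that kind of mismatch with plain NumPy; the shapes and values are assumptions inferred from the error message alone, not taken from autoimpute.

```python
import numpy as np

# Assumed shapes, inferred only from the error message above.
coef = np.array([0.5, -0.2])   # model fitted on 2 encoded predictor columns
X_fit = np.ones((8, 2))        # rows seen at fit time: 2 columns
X_new = np.ones((2, 1))        # rows to impute: 1 column after encoding

# Matching inner dimensions: works and returns one prediction per row.
preds_ok = X_fit @ coef

# Mismatched inner dimensions: NumPy raises a ValueError from matmul.
try:
    X_new @ coef
    matmul_error = None
except ValueError as exc:
    matmul_error = str(exc)

print(matmul_error)
```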

kearnz commented 3 years ago

Hi @AlexisMignon

Great write-up, and thank you for the example. I'll find some time this upcoming week to run your code and reproduce the error to make sure I understand.

I agree with your suggestions. The short-term band-aid here is better error handling, so the user has a better idea of what went wrong. The longer-term fix would be to support unseen categories and adapt. Let me know if you have any ideas there. Otherwise I'll look into it!

Thanks, Joe
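One possible direction for the longer-term fix, sketched under assumptions: remember the encoded columns seen at fit time and align every later encoding to them with `DataFrame.reindex`. The function `encode_like_fit` is a hypothetical helper, not part of autoimpute, and the dummy encoding via `pd.get_dummies` is assumed rather than confirmed.

```python
import pandas as pd


def encode_like_fit(predictors: pd.DataFrame, fit_columns) -> pd.DataFrame:
    """Hypothetical helper: encode new rows, then align to the fit-time columns."""
    # Encode every category present, then keep only the columns known at fit
    # time; columns missing from the new data are added and filled with 0.
    encoded = pd.get_dummies(predictors, drop_first=False)
    return encoded.reindex(columns=fit_columns, fill_value=0)


# Columns are learned once, on the rows used for fitting.
fit_rows = pd.DataFrame({"x": ["0", "1", "1", "0"], "y": [0.1, 0.4, 0.3, 0.9]})
fit_columns = list(pd.get_dummies(fit_rows, drop_first=True).columns)

# New rows: one known category ("1") and one never seen at fit time ("2").
impute_rows = pd.DataFrame({"x": ["1", "2"], "y": [0.5, 0.2]})
aligned = encode_like_fit(impute_rows, fit_columns)

print(aligned)
```

The trade-off in this sketch is that an unseen category is silently treated like the reference level (all of its dummies are 0). Whether that is acceptable, or whether a warning or error is preferable, is a design decision for the library.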

filthysocks commented 1 year ago

I'm having issues with this as well. Missing categories in the validation set make it impossible for me to use this library at the moment.

kearnz / autoimpute

Imputation Fails in the presence of categorical variables as predictors #66