kearnz / autoimpute

Python package for Imputation Methods
MIT License

Problem with one-hot-encoding missing categories: ValueError: X has 1 features, but LinearRegression is expecting 2 features as input. #83

Closed. virtualphoton closed this 2 years ago.

virtualphoton commented 2 years ago

If a feature is categorical, then during transform it's possible that the selected rows of X contain only some of the categories. For example:

```
import numpy as np
import pandas as pd
from autoimpute.imputations import SingleImputer

df = pd.DataFrame({'cat': list('abcab'), 'num': [1, 2, 3, np.nan, np.nan]})
mice = SingleImputer(strategy={'cat': 'multinomial logistic', 'num': 'least squares'})
mice.fit_transform(df)
```

raises an exception:

Traceback:

```
ValueError                                Traceback (most recent call last)
Input In [9], in ()
      1 mice = SingleImputer(strategy={'cat':'multinomial logistic', 'num':'least squares'})
----> 2 mice.fit_transform(df)

File ~\_notebooks\data_science\autoimp\autoimpute\imputations\dataframe\single_imputer.py:313, in SingleImputer.fit_transform(self, X, y, **trans_kwargs)
    301 def fit_transform(self, X, y=None, **trans_kwargs):
    302     """Convenience method to fit then transform the same dataset.
    303
    304     Args:
   (...)
    311         X (pd.DataFrame): imputed in place or copy of original.
    312     """
--> 313     return self.fit(X, y).transform(X, **trans_kwargs)

File ~\_notebooks\data_science\autoimp\autoimpute\utils\checks.py:61, in check_data_structure.<locals>.wrapper(d, *args, **kwargs)
     59     err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60     raise TypeError(err)
---> 61 return func(d, *args, **kwargs)

File ~\_notebooks\data_science\autoimp\autoimpute\utils\checks.py:126, in check_missingness.<locals>.wrapper(d, *args, **kwargs)
    123     raise ValueError("Time series columns must be fully complete.")
    125 # return func if no missingness violations detected, then return wrap
--> 126 return func(d, *args, **kwargs)

File ~\_notebooks\data_science\autoimp\autoimpute\utils\checks.py:173, in check_nan_columns.<locals>.wrapper(d, *args, **kwargs)
    171     err = f"All values missing in column(s) {nc}. Should be removed."
    172     raise ValueError(err)
--> 173 return func(d, *args, **kwargs)

File ~\_notebooks\data_science\autoimp\autoimpute\imputations\dataframe\single_imputer.py:298, in SingleImputer.transform(self, X, imp_ixs, **trans_kwargs)
    296     X.loc[imp_ix, column] = imputer.impute(x_, k=k)
    297 else:
--> 298     X.loc[imp_ix, column] = imputer.impute(x_)
    299 return X

File ~\_notebooks\data_science\autoimp\autoimpute\imputations\series\linear_regression.py:79, in LeastSquaresImputer.impute(self, X)
     77 # check if fitted then predict with least squares
     78 check_is_fitted(self, "statistics_")
---> 79 imp = self.lm.predict(X)
     80 return imp

File ~\anaconda3\envs\data_science\lib\site-packages\sklearn\linear_model\_base.py:362, in LinearModel.predict(self, X)
    348 def predict(self, X):
    349     """
    350     Predict using the linear model.
   (...)
    360         Returns predicted values.
    361     """
--> 362     return self._decision_function(X)

File ~\anaconda3\envs\data_science\lib\site-packages\sklearn\linear_model\_base.py:345, in LinearModel._decision_function(self, X)
    342 def _decision_function(self, X):
    343     check_is_fitted(self)
--> 345     X = self._validate_data(X, accept_sparse=["csr", "csc", "coo"], reset=False)
    346     return safe_sparse_dot(X, self.coef_.T, dense_output=True) + self.intercept_

File ~\anaconda3\envs\data_science\lib\site-packages\sklearn\base.py:585, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    582     out = X, y
    584 if not no_val_X and check_params.get("ensure_2d", True):
--> 585     self._check_n_features(X, reset=reset)
    587 return out

File ~\anaconda3\envs\data_science\lib\site-packages\sklearn\base.py:400, in BaseEstimator._check_n_features(self, X, reset)
    397     return
    399 if n_features != self.n_features_in_:
--> 400     raise ValueError(
    401         f"X has {n_features} features, but {self.__class__.__name__} "
    402         f"is expecting {self.n_features_in_} features as input."
    403     )

ValueError: X has 1 features, but LinearRegression is expecting 2 features as input.
```

(The problem originally emerged on the test set from this competition.)
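For what it's worth, the shape mismatch can be reproduced with `pd.get_dummies` alone: the rows selected at transform time contain only a subset of the categories seen at fit time, so the encoded matrix comes out one column short. Here is a minimal sketch of the mechanism; the `drop_first` choice and the reindex-based alignment are illustrative, not necessarily autoimpute's internals or the exact contents of the commit:

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'cat': list('abcab'), 'num': [1, 2, 3, np.nan, np.nan]})

# fit-time rows: 'num' observed, categories {'a', 'b', 'c'} -> two dummy columns with drop_first
fit_X = pd.get_dummies(df.loc[df['num'].notna(), 'cat'], drop_first=True)
print(fit_X.columns.tolist())   # ['b', 'c']

# transform-time rows: 'num' missing, categories {'a', 'b'} -> only one dummy column
imp_X = pd.get_dummies(df.loc[df['num'].isna(), 'cat'], drop_first=True)
print(imp_X.columns.tolist())   # ['b']  -> 1 feature instead of the expected 2

# one way to keep shapes consistent: align to the fit-time columns
imp_X_aligned = imp_X.reindex(columns=fit_X.columns, fill_value=0)
print(imp_X_aligned.shape)      # (2, 2)
```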

I made a fix in this commit:

It also probably would have been easier to use sklearn's one-hot encoder.
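For example, `OneHotEncoder` fixes the category set at fit time, so transforming a subset of rows still produces the full column count, and `handle_unknown='ignore'` additionally covers categories never seen at fit. A rough sketch of that idea on the same toy frame (not the actual change made in the commit):

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'cat': list('abcab'), 'num': [1, 2, 3, np.nan, np.nan]})

# fit on the rows where 'num' is observed; the encoder remembers all three categories
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df.loc[df['num'].notna(), ['cat']])

# transforming only the rows with missing 'num' still yields three columns,
# because the output width is fixed by the categories learned at fit time
encoded = enc.transform(df.loc[df['num'].isna(), ['cat']]).toarray()
print(encoded.shape)            # (2, 3)
```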

kearnz commented 2 years ago

Hi @virtualphoton, thanks for your submission! I always appreciate when someone identifies a bug or improvement.

Would you be able to branch this and submit a pull request? Then I can merge to master, which will run the CI tests in GitHub Actions. I'm going to close this in anticipation of a new pull request. Let me know if you need any help with that or want to discuss further. Happy to open another issue if needed.