scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

SimpleImputer fails for columns with pandas string type #27349

Open tvdboom opened 9 months ago

tvdboom commented 9 months ago

Describe the bug

The SimpleImputer class with strategy="most_frequent" fails during fit when one of the columns of the input dataframe has the pandas string dtype.

The error originates here: https://github.com/scikit-learn/scikit-learn/blob/d8e131c1640f13954b2dd5f0cc3b50b82a05a554/sklearn/utils/fixes.py#L58-L59

With X from the example below, np.array(X.iloc[:, 0]) == np.array(X.iloc[:, 0]) returns a single bool (False) instead of an array of bools.

Steps/Code to Reproduce

from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

X = pd.DataFrame([['A'], ['B'], [np.nan]], dtype="string")
print(X.dtypes)

SimpleImputer(strategy="most_frequent").fit(X)

Expected Results

No errors thrown.

Actual Results

0    string[python]
dtype: object

Traceback (most recent call last):
  File "C:\Users\Mavs\Documents\Python\ATOM\venv310\lib\site-packages\IPython\core\interactiveshell.py", line 3460, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-7b70f7e70352>", line 8, in <module>
    SimpleImputer(strategy="most_frequent").fit(X)
  File "C:\Users\Mavs\Documents\Python\ATOM\venv310\lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "C:\Users\Mavs\Documents\Python\ATOM\venv310\lib\site-packages\sklearn\impute\_base.py", line 405, in fit
    self.statistics_ = self._dense_fit(
  File "C:\Users\Mavs\Documents\Python\ATOM\venv310\lib\site-packages\sklearn\impute\_base.py", line 488, in _dense_fit
    mask = missing_mask.transpose()
AttributeError: 'bool' object has no attribute 'transpose'

Versions

System:
    python: 3.10.2 (tags/v3.10.2:a58ebcc, Jan 17 2022, 14:12:15) [MSC v.1929 64 bit (AMD64)]
executable: C:\Users\Mavs\Documents\Python\ATOM\venv310\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0
Python dependencies:
      sklearn: 1.3.0
          pip: 23.2.1
   setuptools: 59.8.0
        numpy: 1.23.5
        scipy: 1.10.1
       Cython: 0.29.30
       pandas: 2.0.3
   matplotlib: 3.7.2
       joblib: 1.3.1
threadpoolctl: 3.1.0
Built with OpenMP: True
threadpoolctl info:
       user_api: openmp
   internal_api: openmp
         prefix: vcomp
       filepath: C:\Users\Mavs\Documents\Python\ATOM\venv310\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None
    num_threads: 16
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\Mavs\Documents\Python\ATOM\venv310\Lib\site-packages\numpy\.libs\libopenblas.FB5AE2TYXYH2IJRDKGDGQ3XBKLKTF43H.gfortran-win_amd64.dll
        version: 0.3.20
threading_layer: pthreads
   architecture: Zen
    num_threads: 16
       user_api: blas
   internal_api: openblas
         prefix: libopenblas
       filepath: C:\Users\Mavs\Documents\Python\ATOM\venv310\Lib\site-packages\scipy.libs\libopenblas-802f9ed1179cb9c9b03d67ff79f48187.dll
        version: 0.3.18
threading_layer: pthreads
   architecture: Zen
    num_threads: 16

glemaitre commented 9 months ago

You need to pass missing_values=pd.NA, since the pandas string dtype uses pd.NA as its missing-value marker.

With the code snippet, I am getting:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [1], line 8
      5 X = pd.DataFrame([['A'], ['B'], [np.nan]], dtype="string")
      6 print(X.dtypes)
----> 8 SimpleImputer(strategy="most_frequent").fit(X)

File ~/Documents/packages/scikit-learn/sklearn/base.py:1215, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1208     estimator._validate_params()
   1210 with config_context(
   1211     skip_parameter_validation=(
   1212         prefer_skip_nested_validation or global_skip_validation
   1213     )
   1214 ):
-> 1215     return fit_method(estimator, *args, **kwargs)

File ~/Documents/packages/scikit-learn/sklearn/impute/_base.py:408, in SimpleImputer.fit(self, X, y)
    403         self.statistics_ = self._sparse_fit(
    404             X, self.strategy, self.missing_values, fill_value
    405         )
    407 else:
--> 408     self.statistics_ = self._dense_fit(
    409         X, self.strategy, self.missing_values, fill_value
    410     )
    412 return self

File ~/Documents/packages/scikit-learn/sklearn/impute/_base.py:458, in SimpleImputer._dense_fit(self, X, strategy, missing_values, fill_value)
    456 def _dense_fit(self, X, strategy, missing_values, fill_value):
    457     """Fit the transformer on dense data."""
--> 458     missing_mask = _get_mask(X, missing_values)
    459     masked_X = ma.masked_array(X, mask=missing_mask)
    461     super()._fit_indicator(missing_mask)

File ~/Documents/packages/scikit-learn/sklearn/utils/_mask.py:54, in _get_mask(X, value_to_mask)
     35 """Compute the boolean mask X == value_to_mask.
     36 
     37 Parameters
   (...)
     49     Missing mask.
     50 """
     51 if not sp.issparse(X):
     52     # For all cases apart of a sparse input where we need to reconstruct
     53     # a sparse output
---> 54     return _get_dense_mask(X, value_to_mask)
     56 Xt = _get_dense_mask(X.data, value_to_mask)
     58 sparse_constructor = sp.csr_matrix if X.format == "csr" else sp.csc_matrix

File ~/Documents/packages/scikit-learn/sklearn/utils/_mask.py:27, in _get_dense_mask(X, value_to_mask)
     24         Xt = np.zeros(X.shape, dtype=bool)
     25     else:
     26         # np.isnan does not work on object dtypes.
---> 27         Xt = _object_dtype_isnan(X)
     28 else:
     29     Xt = X == value_to_mask

File ~/Documents/packages/scikit-learn/sklearn/utils/fixes.py:60, in _object_dtype_isnan(X)
     59 def _object_dtype_isnan(X):
---> 60     return X != X

File ~/mambaforge/envs/dev/lib/python3.10/site-packages/pandas/_libs/missing.pyx:388, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

This message is too ambiguous; we should instead re-raise the error with a hint to set missing_values=pd.NA.
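A rough sketch of what such a re-raise could look like. The function name and the wording of the message are hypothetical and are not taken from any scikit-learn patch; the sketch only wraps the X != X comparison shown in the traceback above.

```python
import numpy as np
import pandas as pd

# Hypothetical hint text, not the wording used by scikit-learn.
HINT = (
    "Could not compute a missing-value mask for object data. If the input "
    "uses pd.NA as its missing marker, pass missing_values=pd.NA."
)


def object_isnan_with_hint(X):
    """Hypothetical variant of the X != X check that re-raises with a hint."""
    try:
        mask = X != X
    except TypeError as exc:
        # Raised when the comparison hits pd.NA ("boolean value of NA is ambiguous").
        raise TypeError(HINT) from exc
    if not isinstance(mask, np.ndarray):
        # Some NumPy versions return a single bool instead, as in the original report.
        raise TypeError(HINT)
    return mask


# np.nan in an object array is detected element-wise.
print(object_isnan_with_hint(np.array(["A", np.nan], dtype=object)))

# pd.NA in an object array triggers the hinted error.
try:
    object_isnan_with_hint(np.array(["A", pd.NA], dtype=object))
except TypeError as exc:
    print(exc)
```

Chaining with "from exc" keeps the original pandas error visible in the traceback while putting the actionable hint first.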

tvdboom commented 9 months ago

Thanks, that worked. A better exception message would be helpful indeed. The issue can be closed.

glevv commented 9 months ago

Should there be a separate issue regarding the warning? Or maybe a PR could go from this one?

glemaitre commented 9 months ago

> Should there be a separate issue regarding the warning? Or maybe a PR could go from this one?

Actually, there is only a single issue here, so a single PR that re-raises the error should be fine. But I think we should look more closely at all transformers in scikit-learn and check how they deal with pd.NA, because I am not really certain about the current state.
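One possible starting point for that investigation, sketched under assumptions: the three transformers below are hand-picked for illustration, the data is hypothetical, and the outcome of each fit depends on the installed scikit-learn, NumPy, and pandas versions, so no results are claimed here.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical string-dtype column with one pd.NA entry.
X = pd.DataFrame([["A"], ["A"], ["B"], [pd.NA]], dtype="string")

# Hand-picked candidates, all with default missing-value handling.
candidates = {
    "SimpleImputer": SimpleImputer(strategy="most_frequent"),
    "OrdinalEncoder": OrdinalEncoder(),
    "OneHotEncoder": OneHotEncoder(),
}

# Record whether fit succeeds or which exception it raises.
results = {}
for name, transformer in candidates.items():
    try:
        transformer.fit(X)
        results[name] = "ok"
    except Exception as exc:
        results[name] = f"{type(exc).__name__}: {exc}"

for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

A fuller survey could iterate over every transformer instead of a fixed list, but many estimators need constructor arguments or numeric input, so a small explicit list is easier to reason about.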

tuhinsharma121 commented 3 months ago

Can I work on this?