Open tvdboom opened 9 months ago
You need to pass missing_values=pd.NA, since that is the missing-value marker your data uses.
With the code snippet, I am getting:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [1], line 8
5 X = pd.DataFrame([['A'], ['B'], [np.nan]], dtype="string")
6 print(X.dtypes)
----> 8 SimpleImputer(strategy="most_frequent").fit(X)
File ~/Documents/packages/scikit-learn/sklearn/base.py:1215, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1208 estimator._validate_params()
1210 with config_context(
1211 skip_parameter_validation=(
1212 prefer_skip_nested_validation or global_skip_validation
1213 )
1214 ):
-> 1215 return fit_method(estimator, *args, **kwargs)
File ~/Documents/packages/scikit-learn/sklearn/impute/_base.py:408, in SimpleImputer.fit(self, X, y)
403 self.statistics_ = self._sparse_fit(
404 X, self.strategy, self.missing_values, fill_value
405 )
407 else:
--> 408 self.statistics_ = self._dense_fit(
409 X, self.strategy, self.missing_values, fill_value
410 )
412 return self
File ~/Documents/packages/scikit-learn/sklearn/impute/_base.py:458, in SimpleImputer._dense_fit(self, X, strategy, missing_values, fill_value)
456 def _dense_fit(self, X, strategy, missing_values, fill_value):
457 """Fit the transformer on dense data."""
--> 458 missing_mask = _get_mask(X, missing_values)
459 masked_X = ma.masked_array(X, mask=missing_mask)
461 super()._fit_indicator(missing_mask)
File ~/Documents/packages/scikit-learn/sklearn/utils/_mask.py:54, in _get_mask(X, value_to_mask)
35 """Compute the boolean mask X == value_to_mask.
36
37 Parameters
(...)
49 Missing mask.
50 """
51 if not sp.issparse(X):
52 # For all cases apart of a sparse input where we need to reconstruct
53 # a sparse output
---> 54 return _get_dense_mask(X, value_to_mask)
56 Xt = _get_dense_mask(X.data, value_to_mask)
58 sparse_constructor = sp.csr_matrix if X.format == "csr" else sp.csc_matrix
File ~/Documents/packages/scikit-learn/sklearn/utils/_mask.py:27, in _get_dense_mask(X, value_to_mask)
24 Xt = np.zeros(X.shape, dtype=bool)
25 else:
26 # np.isnan does not work on object dtypes.
---> 27 Xt = _object_dtype_isnan(X)
28 else:
29 Xt = X == value_to_mask
File ~/Documents/packages/scikit-learn/sklearn/utils/fixes.py:60, in _object_dtype_isnan(X)
59 def _object_dtype_isnan(X):
---> 60 return X != X
File ~/mambaforge/envs/dev/lib/python3.10/site-packages/pandas/_libs/missing.pyx:388, in pandas._libs.missing.NAType.__bool__()
TypeError: boolean value of NA is ambiguous
The message is too ambiguous; we should instead re-raise the error with a hint.
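A rough sketch of what re-raising with a hint could look like. This is not scikit-learn code: `AmbiguousNA` is a hypothetical stand-in that mimics how pd.NA behaves in comparisons, `isnan_with_hint` is a hypothetical helper built around the same `X != X` trick as `_object_dtype_isnan`, and the error message wording is only a suggestion.

```python
class AmbiguousNA:
    """Hypothetical stand-in for pd.NA: comparisons return NA, bool() raises."""

    def __eq__(self, other):
        return self

    def __ne__(self, other):
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


def isnan_with_hint(values):
    """Hypothetical helper: flag NaN-like entries via the `v != v` trick."""
    try:
        return [bool(v != v) for v in values]
    except TypeError as exc:
        # Re-raise with a message that tells the user what to change,
        # keeping the original error attached as the cause.
        raise ValueError(
            "Input contains a missing-value marker whose truth value is "
            "ambiguous (for example pd.NA). Consider passing "
            "missing_values=pd.NA to the imputer."
        ) from exc


# NaN is the only value that differs from itself, so only it is flagged.
print(isnan_with_hint(["A", "B", float("nan")]))

# An NA-like marker now produces an actionable message instead of a bare TypeError.
try:
    isnan_with_hint(["A", "B", AmbiguousNA()])
except ValueError as err:
    print(err)
```

Chaining with `from exc` keeps the original TypeError visible in the traceback, so no debugging information is lost.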
Thanks, that worked. A better exception message would be helpful indeed. The issue can be closed.
Should there be a separate issue regarding the warning? Or could a PR be opened directly from this one?
Actually, there is only a single issue here, so a single PR that re-raises the error should be fine.
But I think we should look more closely at all the transformers in scikit-learn and check how they deal with pd.NA,
because I am not really certain about the current state.
Can I work on this?
Describe the bug
The SimpleImputer class with strategy="most_frequent" fails during fit when one of the columns of the input dataframe is of type string.
The error happens here https://github.com/scikit-learn/scikit-learn/blob/d8e131c1640f13954b2dd5f0cc3b50b82a05a554/sklearn/utils/fixes.py#L58-L59
because (use X from the example below)
np.array(X.iloc[:, 0]) == np.array(X.iloc[:, 0])
returns a bool (False) instead of an array of bools.
Steps/Code to Reproduce
Expected Results
No errors thrown.
Actual Results
Versions