scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

FutureWarning in ordinal encoder when downcasting objects #441

Closed eangius closed 2 months ago

eangius commented 5 months ago

Expected Behavior

No FutureWarning is thrown.

Actual Behavior

Currently the following warning is thrown.

category_encoders/ordinal.py:198: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  X[column] = X[column].astype("object").fillna(np.nan).map(col_mapping)

Suppressing warnings, setting the pandas option, or changing the dtypes on the caller side: none of these is sufficient for correctness.

Steps to Reproduce the Problem

  1. Create a data frame with an object-dtype column.
  2. Fit a CountEncoder (or similar) on the data frame.
  3. Notice the warning.
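
For readers without the encoder at hand, here is a pandas-only sketch of the expression that the warning points at (ordinal.py:198). The sample column is an assumption, since the issue does not say what data was used. Whether a FutureWarning is actually captured depends on the installed pandas version, so the snippet prints what it observes instead of expecting a particular result.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical data: an object-dtype column whose non-missing values are numeric,
# so that pandas could downcast the result of fillna.
column = pd.Series([1, None, 2], dtype="object")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning, including repeats
    # Same call chain as the flagged line, minus the final .map(col_mapping)
    filled = column.astype("object").fillna(np.nan)

future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]

# Report the pandas version, the resulting dtype, and any captured messages.
print(pd.__version__, filled.dtype)
for w in future_warnings:
    print(w.message)
```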

Specifications

eangius commented 5 months ago

For what it's worth, these local changes fixed things for me & kept tests passing. If anyone is willing to make this official, it'll be much appreciated.

diff --git a/category_encoders/ordinal.py b/category_encoders/ordinal.py
index 45d333e..94804c0 100644
--- a/category_encoders/ordinal.py
+++ b/category_encoders/ordinal.py
@@ -195,7 +195,7 @@ class OrdinalEncoder(util.BaseEncoder, util.UnsupervisedTransformerMixin):

                 # Convert to object to accept np.nan (dtype string doesn't)
                 # fillna changes None and pd.NA to np.nan
-                X[column] = X[column].astype("object").fillna(np.nan).map(col_mapping)
+                X[column] = X[column].astype("object").infer_objects(copy=False).fillna(np.nan).map(col_mapping)
                 if util.is_category(X[column].dtype):
                     nan_identity = col_mapping.loc[col_mapping.index.isna()].array[0]
                     X[column] = X[column].cat.add_categories(nan_identity)
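
To illustrate what the two code comments in the diff describe, here is a small pandas-only sketch of the cast, fill, and map steps. The column values and the mapping (including -2 as the code for missing values) are made up for illustration and are not taken from the library's tests.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two different missing-value markers
column = pd.Series(["a", None, "b", pd.NA], dtype="object")

# Hypothetical mapping from category to ordinal code; the NaN entry
# is the code used for missing values.
col_mapping = pd.Series(
    [1, 2, -2],
    index=pd.Index(["a", "b", np.nan], dtype="object"),
)

# fillna(np.nan) replaces None and pd.NA with np.nan, so every missing
# entry can be looked up through the single NaN key of the mapping.
normalized = column.astype("object").fillna(np.nan)
encoded = normalized.map(col_mapping)

print(normalized.tolist())
print(encoded.tolist())
```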

bmreiniger commented 5 months ago

Thanks for reporting!

Your proposed fix seems fine, but I wonder whether something else might be better. The cast to object is just there (according to the comment) to accommodate np.nan as the fill, and we're about to map to numeric, so the dtype itself isn't critical information, and downcasting in particular isn't needed. Should we just opt in to the future behavior?
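
A minimal sketch of what opting in could look like, scoped to the single call instead of set globally. The helper name `no_silent_downcasting`, the sample column, and the mapping are hypothetical; only the option name comes from the warning text. The helper falls back to a no-op on pandas versions where the option is not registered. This is one possible shape, not the change that was merged.

```python
from contextlib import nullcontext

import numpy as np
import pandas as pd

OPTION = "future.no_silent_downcasting"  # option name quoted in the warning


def no_silent_downcasting():
    """Return a context manager that opts in to the future behavior if possible."""
    try:
        pd.get_option(OPTION)
    except KeyError:  # pandas' OptionError is a KeyError subclass
        return nullcontext()  # option not registered in this pandas version
    return pd.option_context(OPTION, True)


# Hypothetical data: numeric-looking objects, the case where downcasting applies
column = pd.Series([1, None, 2], dtype="object")
col_mapping = pd.Series(
    [1, 2, -2],
    index=pd.Index([1, 2, np.nan], dtype="object"),
)

with no_silent_downcasting():
    # The dtype after fillna does not matter here, since map converts
    # the values to the numeric codes right away.
    encoded = column.astype("object").fillna(np.nan).map(col_mapping)

print(encoded.tolist())
```

Using a context manager keeps the user's global pandas options untouched, which matters for a library that runs inside other people's code.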