scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

FutureWarning in ordinal encoder when downcasting objects #441

Open eangius opened 1 month ago

eangius commented 1 month ago

Expected Behavior

No FutureWarning is thrown.

Actual Behavior

Currently, the following warning is thrown:

category_encoders/ordinal.py:198: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
  X[column] = X[column].astype("object").fillna(np.nan).map(col_mapping)

Suppressing the warning, setting the pandas option, and changing the dtypes on the caller's side are all insufficient for correctness.

Steps to Reproduce the Problem

  1. Create a data frame with object dtype columns.
  2. Fit the data frame with CountEncoder (or a similar encoder).
  3. Notice the warning.
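
The steps above can be approximated without the encoder by running the flagged call chain directly in pandas. This is only a sketch: the column contents are an assumption (an object column whose values pandas could infer as numeric), and whether a FutureWarning is recorded depends on the installed pandas version.

```python
import warnings

import numpy as np
import pandas as pd

# Assumption: an object-dtype column whose values could be inferred as numeric.
X = pd.DataFrame({"col": pd.Series([1, 2, None], dtype="object")})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Same call chain as the flagged line in ordinal.py, minus the final .map(col_mapping).
    filled = X["col"].astype("object").fillna(np.nan)

# Keep only FutureWarnings; the list may be empty on pandas versions
# that do not emit this deprecation.
future_warnings = [w for w in caught if issubclass(w.category, FutureWarning)]
print(filled.dtype, len(future_warnings))
```

On a pandas version that still emits the deprecation, `future_warnings` should hold the message quoted above.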

Specifications

eangius commented 1 month ago

For what it's worth, these local changes fixed things for me & kept tests passing. If anyone is willing to officialize this it'll be much appreciated.

diff --git a/category_encoders/ordinal.py b/category_encoders/ordinal.py
index 45d333e..94804c0 100644
--- a/category_encoders/ordinal.py
+++ b/category_encoders/ordinal.py
@@ -195,7 +195,7 @@ class OrdinalEncoder(util.BaseEncoder, util.UnsupervisedTransformerMixin):

                 # Convert to object to accept np.nan (dtype string doesn't)
                 # fillna changes None and pd.NA to np.nan
-                X[column] = X[column].astype("object").fillna(np.nan).map(col_mapping)
+                X[column] = X[column].astype("object").infer_objects(copy=False).fillna(np.nan).map(col_mapping)
                 if util.is_category(X[column].dtype):
                     nan_identity = col_mapping.loc[col_mapping.index.isna()].array[0]
                     X[column] = X[column].cat.add_categories(nan_identity)
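
To illustrate what the added call changes, here is a small pandas-only sketch, not the library code. The sample data is an assumption, and `copy=False` from the diff is left out because it only controls copying. As far as I can tell, `infer_objects` converts the column before `fillna` runs, so `fillna` is no longer working on an object array.

```python
import numpy as np
import pandas as pd

# Assumption: sample object column with a missing value.
col = pd.Series([1, 2, None], dtype="object")

# Let pandas pick a better dtype first, then normalise missing values.
inferred = col.astype("object").infer_objects()
filled = inferred.fillna(np.nan)

# Print the dtypes before and after inference for comparison.
print(col.dtype, inferred.dtype, filled.dtype)
```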

bmreiniger commented 1 month ago

Thanks for reporting!

Your proposed fix seems fine, but I wonder whether something else might be better. The cast to object is just there (according to the comment) to accommodate np.nan as the fill, and we're about to map to numeric, so the dtype itself isn't critical information, and downcasting in particular isn't needed. Should we just opt in to the future behavior?
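
A minimal sketch of what opting in could look like, scoped to the one call so the caller's global options stay untouched. `fill_missing_as_object` is a hypothetical helper, not something in category_encoders, and the fallback branch assumes that pandas versions without the option do not emit this warning either.

```python
import numpy as np
import pandas as pd

OPTION = "future.no_silent_downcasting"


def fill_missing_as_object(col: pd.Series) -> pd.Series:
    """Hypothetical helper: turn None / pd.NA into np.nan via an object cast."""
    as_object = col.astype("object")
    try:
        # Opt in to the future behavior only for the duration of this call.
        with pd.option_context(OPTION, True):
            return as_object.fillna(np.nan)
    except pd.errors.OptionError:
        # Assumption: this pandas version does not know the option at all.
        return as_object.fillna(np.nan)


filled = fill_missing_as_object(pd.Series([1, None, 3], dtype="object"))
print(filled.dtype)
```

The result would then go through `.map(col_mapping)` as before; since that step produces the numeric codes, the intermediate dtype should not matter.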