scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Pandas' string columns are not recognized #421

Closed tvdboom closed 10 months ago

tvdboom commented 10 months ago

Expected Behavior

Category encoders should recognize pandas string and string[pyarrow] types.

Actual Behavior

The column isn't recognized as categorical, and the dataframe is returned as is.

Steps to Reproduce the Problem

import pandas as pd
from category_encoders.target_encoder import TargetEncoder

X = pd.DataFrame([['a'], ['b']], dtype="string")
y = [0, 1]
print(X.dtypes)

print(TargetEncoder().fit_transform(X, y))

produces output:

0    string[python]
dtype: object

Warning: No categorical columns found. Calling 'transform' will only return input data.

   0
0  a
1  b

PaulWestenthanner commented 10 months ago

I agree that string and arrow string should be recognized as categorical.
Even the categorical type itself is currently not recognized as such. https://github.com/scikit-learn-contrib/category_encoders/blob/80b4a9b9d85ac449fcb7a3543c6ee4353013f41f/category_encoders/utils.py#L35

That's the function that needs to be adjusted (and renamed).
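
For illustration only, a broader dtype check might look roughly like the sketch below. The names `is_category_like` and `get_categorical_cols` are hypothetical and are not the library's actual functions; the sketch assumes that object, categorical, and pandas string dtypes (both `string[python]` and `string[pyarrow]`, which share `pd.StringDtype`) should all count as categorical.

```python
import pandas as pd


def is_category_like(dtype) -> bool:
    """Hypothetical helper: True for object, categorical, and pandas string dtypes."""
    return (
        pd.api.types.is_object_dtype(dtype)
        or isinstance(dtype, pd.CategoricalDtype)
        # StringDtype covers both the python and the pyarrow storage backends.
        or isinstance(dtype, pd.StringDtype)
    )


def get_categorical_cols(df: pd.DataFrame) -> list:
    """Hypothetical helper: names of the columns whose dtype is category-like."""
    return [col for col, dtype in df.dtypes.items() if is_category_like(dtype)]


X = pd.DataFrame(
    {
        "s": pd.array(["a", "b"], dtype="string"),  # pandas string dtype
        "c": pd.Categorical(["x", "y"]),            # categorical dtype
        "o": ["p", "q"],                            # plain Python strings
        "n": [1.0, 2.0],                            # numeric, should be skipped
    }
)
cols = get_categorical_cols(X)
print(cols)
```

With a check like this, the `string` column from the reproduction above would be picked up instead of triggering the "No categorical columns found" warning.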

tvdboom commented 10 months ago

Alright, I'll make a PR.

PaulWestenthanner commented 10 months ago

thanks