scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.4k stars 393 forks

pd.NA should behave as np.nan #424

Closed: tvdboom closed this issue 12 months ago

tvdboom commented 1 year ago

Expected Behavior

pd.NA should be treated as a missing value, the same as np.nan, so a missing value should be returned for it when handle_missing="return_nan".

Actual Behavior

pd.NA is treated like any other category.

Steps to Reproduce the Problem

import pandas as pd
from category_encoders.target_encoder import TargetEncoder

TargetEncoder(handle_missing="return_nan").fit_transform([["a"], ["b"], [pd.NA]], y=[0, 1, 1])

returns

          0
0  0.579928
1  0.710036
2  0.666667

instead of

          0
0  0.579928
1  0.710036
2       <NA>

SimonD7 commented 1 year ago

You just need to add the argument handle_unknown="return_nan":

TargetEncoder(handle_missing="return_nan", handle_unknown="return_nan").fit_transform([["a"], ["b"], [pd.NA]], y=[0, 1, 1])

tvdboom commented 1 year ago

That's not the same. I want unknown values to return the target mean, as handle_unknown="value" does, and missing values to return missing. Also, your code returns np.nan, not pd.NA. It would be better if the returned NA type were the same as the input one.
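To make the "same NA type as the input" request concrete, here is a minimal sketch of how one might detect which missing-value marker a column uses. The helper name `missing_sentinel` is hypothetical and not part of category_encoders; it only relies on pandas and NumPy behavior.

```python
import numpy as np
import pandas as pd


def missing_sentinel(column: pd.Series):
    """Return pd.NA if the column contains pd.NA, otherwise np.nan.

    Hypothetical helper for illustration; not a category_encoders API.
    """
    uses_pd_na = any(value is pd.NA for value in column)
    return pd.NA if uses_pd_na else np.nan


# Two object columns that differ only in their missing-value marker.
col_na = pd.Series(["a", "b", pd.NA], dtype=object)
col_nan = pd.Series(["a", "b", np.nan], dtype=object)

# pd.isna() flags both markers as missing, although they are distinct objects.
mask_na = pd.isna(col_na).tolist()
mask_nan = pd.isna(col_nan).tolist()
```

An encoder could use a check like this to decide which marker to write into its output, while still treating both markers as missing during fitting.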

SimonD7 commented 1 year ago

You can use NumPy:

Your data:

import numpy as np
import pandas as pd
from category_encoders.target_encoder import TargetEncoder

data = [["a"], ["b"], [pd.NA]]
y = [0, 1, 1]

Replace pd.NA with np.nan:

data = [[val if not pd.isna(val) else np.nan for val in row] for row in data]

Apply TargetEncoder:

encoder = TargetEncoder(handle_missing="return_nan")
encoded_data = encoder.fit_transform(data, y)

Convert the result back to pd.NA where np.nan is present:

encoded_data = pd.DataFrame([[pd.NA if pd.isna(val) else val for val in row] for row in encoded_data.values], columns=encoded_data.columns)

print(encoded_data)

I hope this helps.
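The workaround steps above can be folded into one reusable function. This is only a sketch: `fit_transform_keep_na` is a hypothetical name, and it assumes the encoder returns one output column per input column in the same row order (which is what the TargetEncoder output shown earlier in this thread looks like).

```python
import numpy as np
import pandas as pd


def fit_transform_keep_na(encoder, X, y):
    """Encode X while mapping missing inputs to pd.NA in the output.

    Hypothetical wrapper, not part of category_encoders. Assumes
    encoder.fit_transform returns an output with the same shape as X.
    """
    X = pd.DataFrame(X)
    # Remember where the input was missing (pd.NA, np.nan and None all count).
    missing = X.isna().to_numpy()
    # Hand the encoder np.nan instead of pd.NA.
    X_nan = X.astype(object).mask(missing, np.nan)
    encoded = pd.DataFrame(encoder.fit_transform(X_nan, y))
    # Restore pd.NA at the positions that were missing in the input.
    return encoded.astype(object).mask(missing, pd.NA)


# Possible usage (requires category_encoders to be installed):
# from category_encoders.target_encoder import TargetEncoder
# encoder = TargetEncoder(handle_missing="return_nan")
# fit_transform_keep_na(encoder, [["a"], ["b"], [pd.NA]], [0, 1, 1])
```

Note that the result has object dtype, because pd.NA cannot be stored in a plain float64 column. It also writes pd.NA for every missing input position regardless of the encoder's handle_missing setting, so it is only appropriate together with handle_missing="return_nan".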

tvdboom commented 1 year ago

Thanks, but what I am looking for is a change in the library itself: a structural implementation, not an ad hoc solution.

PaulWestenthanner commented 1 year ago

Agreed! This should be changed. Do you want to create a PR?