scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.39k stars 393 forks source link

Handle missing in one hot encoder #400

Open PaulWestenthanner opened 1 year ago

PaulWestenthanner commented 1 year ago

Expected Behavior

Currently, handle_missing=value adds a new column although the documentation says 'value' will encode a new value as 0 in every dummy column. Furthermore, we need a test for this

Actual Behavior

adds a column instead of using all 0

Steps to Reproduce the Problem

from category_encoders import OneHotEncoder
import pandas as pd

he = OneHotEncoder(handle_missing="value")

data = [("foo", 1), ("bar", 2), (None, 6)]
data = pd.DataFrame(data, columns=["c1", "c2"])
print(he.fit_transform(data))

Specifications

bmreiniger commented 1 year ago

Would this replace the new "ignore" from #396?

I would expect this to be the correct behavior; is the added column a longstanding behavior, or perhaps a regression that wasn't caught in testing?

PaulWestenthanner commented 1 year ago

Oh you're right. I missed this when adding the ignore option. Thanks for pointing out.
not sure about the naming though... we have the option value to put in "some value that makes sense" in most encoders. So it makes sense for people familiar with the library, ignore on the other hand is more telling

lazarust commented 9 months ago

@PaulWestenthanner I can take this if no one else has!