scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

OneHotEncoder produces a NaN field (nan as a suffix to the name) even though there is no missing data #295

Closed princyok closed 3 years ago

princyok commented 3 years ago

Description

OneHotEncoder produces a field with nan as a suffix to its name when handle_missing is set to "error", even though there is no missing data. Leaving the handle_missing argument at its default value works fine.

Code to reproduce

Assume data is a dataframe with a categorical column named "main_field", and there is no missing data in this column.

from category_encoders import one_hot

encoder = one_hot.OneHotEncoder(cols=["main_field"], handle_missing="error")

enc_data = encoder.fit_transform(data)

enc_data will have a column named "main_field_nan" that contains only zeros. The result is the same if handle_unknown="error" is passed along with handle_missing="error".
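To check for this symptom on your own data, a small pandas helper along the following lines may be useful. This is only a sketch: the function name spurious_nan_columns is hypothetical, and the sample enc_data frame is a hand-built stand-in for the encoder output described above (its column names other than "main_field_nan" are assumptions), not actual output from the library.

```python
from typing import List

import pandas as pd


def spurious_nan_columns(source: pd.DataFrame, encoded: pd.DataFrame, col: str) -> List[str]:
    """Return the encoded "<col>_nan" column if it is all zeros even though
    the source column contains no missing values (hypothetical helper)."""
    if source[col].isna().any():
        # Missing data is present, so a nan indicator column is legitimate.
        return []
    name = f"{col}_nan"
    if name in encoded.columns and (encoded[name] == 0).all():
        return [name]
    return []


# Hand-built stand-in for the situation in the report: no missing values
# in "main_field", yet the encoded frame carries an all-zero nan column.
data = pd.DataFrame({"main_field": ["a", "b", "a"]})
enc_data = pd.DataFrame({
    "main_field_a": [1, 0, 1],
    "main_field_b": [0, 1, 0],
    "main_field_nan": [0, 0, 0],
})

print(spurious_nan_columns(data, enc_data, "main_field"))
```

Running the same function on the real data and enc_data from the snippet above should show whether the extra column is present in a given installed version.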

Specifications