scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

OneHotEncoder: handle_missing = 'ignore' would be very useful #386

Closed: woodly0 closed this issue 1 year ago

woodly0 commented 1 year ago

Expected Behavior

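It would be nice to be able to ignore missing values instead of creating new columns with a "_nan" suffix. pandas already allows this via pd.get_dummies(df, dummy_na=False), where a row with a missing value simply gets zeros in all dummy columns of that feature. What do you think?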

Actual Behavior

This option doesn't exist in the latest version (to my knowledge).

Steps to Reproduce

import pandas as pd
import numpy as np
from category_encoders import OneHotEncoder

encoder = OneHotEncoder(
    cols=None,  # all non-numeric
    return_df=True,
    handle_missing="value",  # would be nice to have the option 'ignore'
    use_cat_names=True,
)
df = pd.DataFrame(
    {"this": ["GREEN", "GREEN", "YELLOW", "YELLOW"], "that": ["A", "B", "A", np.nan]}
)

encoder.fit_transform(df) # unwanted result
pd.get_dummies(df, dummy_na=False) # wanted result
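
Until such an option exists, one possible workaround is to drop the indicator columns after encoding. The following is only a minimal sketch using pandas alone: the helper name drop_nan_indicator_columns is hypothetical, and it assumes the indicator columns can be recognised by the "_nan" suffix described above. Here pd.get_dummies(df, dummy_na=True) stands in for an encoder that emits such columns.

```python
import numpy as np
import pandas as pd


def drop_nan_indicator_columns(encoded: pd.DataFrame, suffix: str = "_nan") -> pd.DataFrame:
    """Return a copy of `encoded` without columns whose name ends in `suffix`.

    Hypothetical helper, not part of category_encoders or pandas.
    Assumption: no real category is literally named "nan", otherwise its
    column would be dropped as well.
    """
    nan_cols = [c for c in encoded.columns if str(c).endswith(suffix)]
    return encoded.drop(columns=nan_cols)


df = pd.DataFrame(
    {"this": ["GREEN", "GREEN", "YELLOW", "YELLOW"], "that": ["A", "B", "A", np.nan]}
)

# Stand-in for an encoder that adds one "<column>_nan" indicator per feature.
with_indicator = pd.get_dummies(df, dummy_na=True)

# Drop the indicators; the row with the missing value keeps zeros
# in every remaining "that_*" column.
ignored = drop_nan_indicator_columns(with_indicator)

# Reference: the result requested in this issue.
wanted = pd.get_dummies(df, dummy_na=False)
print(ignored.equals(wanted))
```

The suffix match is deliberately simple, so it should be adapted if column names can legitimately end in "_nan".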

Specifications

PaulWestenthanner commented 1 year ago

I agree this would be useful. Do you want to create a pull request for it?

woodly0 commented 1 year ago

Hey Paul. Thanks for your reply. I will try to implement it.