scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.41k stars 396 forks source link

Fix ohe nan col #322

Closed bmreiniger closed 3 years ago

bmreiniger commented 3 years ago

Fixes #295

Proposed Changes

Prevents a column for missing values from being added in OneHotEncoder when handle_missing="error". Does this by preventing the underlying OrdinalEncoder from producing the mapping NaN->-2, by setting its handle_missing to "error" as well.

Also patches an incidental bug in OneHotEncoder.transform when handle_missing="error", when the input is not a frame.

Hacktoberfest?

wdm0006 commented 3 years ago

This makes sense to me but will let @PaulWestenthanner take a look as well.

bmreiniger commented 3 years ago

Thanks @wdm0006; for Hacktoberfest, I need (eventually) the label "hacktoberfest-accepted" for credit (also in #320).

wdm0006 commented 3 years ago

As far as I can tell, any merged pull request should do the same automatically, but if not we can add the label.

bmreiniger commented 3 years ago

I had some difficulties with the other handle_missing options, largely around the underlying OrdinalEncoder. I think I have everything working, but the code isn't particularly elegant.

Aside from my suggestion above to let indicator continue to always produce the column, I think the actual best behavior would be to

  1. add a new column if the fit data has any missings,
  2. don't add a new column if the fit data has no missings; if the transform data has missings, treat them as any other unknown category (using whatever handle_unknown option is provided).

Another question: when use_cat_names=True, should the unknown column have a more descriptive name than -1? I thought about using "unknown", but that's probably too likely to be an actual category value?

For Hacktoberfest, if I'm reading the FAQ and my progress page correctly, I need the accepted tag by tomorrow, but then we'll have 14 days to complete the review, and if it's decided that this PR doesn't meet the requirements you can invalidate it during that time.

PaulWestenthanner commented 3 years ago

LGTM thanks! Merging