scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Allow a function to be passed as the handle_missing argument so the user can define which group to default to #344

Open Fish-Soup opened 2 years ago

Fish-Soup commented 2 years ago

Feature Enhancement

For encoders that have the handle_missing argument, allow a function to be passed that takes the missing value and computes an encoding for it. This lets the user choose which encoding value best matches a given missing label.

from typing import Any, Dict

def get_best_match(missing_value, available_values: Dict[Any, float]) -> float:
    """Choose which value best represents the missing value."""
    return best_match

encoder = OrdinalEncoder(handle_missing=get_best_match)

Example

At train time we have the categories Nokia 2.1, Nokia 2.2, Samsung A52 and Samsung S10. At predict time we also see Nokia 2.3 and Samsung A52s.

from typing import Any, Dict
from thefuzz import process

def get_best_match(missing_value, available_values: Dict[Any, float]) -> float:
    """Perform string matching with thefuzz to get the closest matching string."""
    # process.extractOne returns a (label, score) tuple, not just the label
    most_similar_label, _score = process.extractOne(missing_value, list(available_values.keys()))
    return available_values[most_similar_label]

encoder = OrdinalEncoder(handle_missing=get_best_match)

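A runnable version of the same idea using only the standard library (difflib stands in for thefuzz so nothing third-party is needed; the callback signature is the proposed, hypothetical API, so only the matching function itself is shown here):

```python
import difflib
from typing import Any, Dict

def get_best_match(missing_value, available_values: Dict[Any, float]) -> float:
    """Return the encoding of the seen label most similar to the unseen one.

    difflib.get_close_matches is a stdlib stand-in for thefuzz; cutoff=0.0
    means the closest label is always accepted, however poor the match.
    """
    labels = [str(k) for k in available_values]
    matches = difflib.get_close_matches(str(missing_value), labels, n=1, cutoff=0.0)
    return available_values[matches[0]]

available = {"Nokia 2.1": 1.0, "Nokia 2.2": 2.0, "Samsung A52": 3.0, "Samsung S10": 4.0}
print(get_best_match("Samsung A52s", available))  # matches Samsung A52
```
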
PaulWestenthanner commented 2 years ago

Hi @Fish-Soup
I like this proposal. It's somewhat similar to interpolation for continuous data.
A lot of encoders internally have an ordinal encoder that first encodes data to ordinal integers before applying more fancy techniques. This would come in handy here as well since we pretty much only need the feature for the ordinal encoder (which also has the string-labels).
One thing I'm thinking about is whether this should be part of this library or rather of a data cleansing library. We should then probably implement it for handle_unknown as well. The case where the best match is far away from all seen values also bothers me: in your example, if we got a new phone maker nqwerty 1.2, it would probably say Nokia is closest. Would a user really want that output?
Opinions are welcome!

Fish-Soup commented 2 years ago

Hey, thanks for the response; happy to extend to missing as well. My intention was that the user defines the function, so it would be their responsibility to handle this case. If the match was poor they could raise an error, or return np.nan or -1/-2 if they wanted to mimic the current logic.
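Concretely, such a guard could live inside the user-supplied callback. A sketch (difflib stands in for thefuzz; the 0.6 cutoff and the -1.0 sentinel are illustrative choices, not library defaults):

```python
import difflib
from typing import Any, Dict

def guarded_best_match(missing_value, available_values: Dict[Any, float],
                       cutoff: float = 0.6) -> float:
    """Fuzzy-match, but fall back to a sentinel when nothing is close enough.

    The cutoff and the -1.0 sentinel are illustrative; -1.0 mimics the
    encoder's existing integer code for unseen values.
    """
    labels = [str(k) for k in available_values]
    matches = difflib.get_close_matches(str(missing_value), labels, n=1, cutoff=cutoff)
    if not matches:
        return -1.0  # no seen label is similar enough
    return available_values[matches[0]]

available = {"Nokia 2.1": 1, "Nokia 2.2": 2, "Samsung A52": 3, "Samsung S10": 4}
print(guarded_best_match("nqwerty 1.2", available))  # falls back to -1.0
```
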

PaulWestenthanner commented 2 years ago

I guess you couldn't just return np.nan: if your function is supposed to handle NaN values, it should not return NaN, or you'd need a second-order NaN handling strategy. You'd also need to make sure the output is always numeric (though this could be enforced easily).
Since this is basically an imputation strategy for categorical data, maybe it should be implemented as an sklearn imputer directly (either SimpleImputer https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer or a new class like CategoricalImputer which implements this strategy). What do you think about that?
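To make the suggestion concrete, such a CategoricalImputer might look like this (the class name and API are hypothetical, no such class exists in scikit-learn or category_encoders; difflib stands in for thefuzz, and the cutoff/fill_value defaults are illustrative):

```python
import difflib
from typing import List

class CategoricalImputer:
    """Hypothetical imputer: replace categories unseen at fit time
    with the closest seen category, or a fill value when nothing is close."""

    def __init__(self, cutoff: float = 0.6, fill_value: str = "missing"):
        self.cutoff = cutoff          # minimum similarity to accept a match
        self.fill_value = fill_value  # used when no seen category is close

    def fit(self, values: List[str]) -> "CategoricalImputer":
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values: List[str]) -> List[str]:
        out = []
        for v in values:
            if v in self.categories_:
                out.append(v)
                continue
            match = difflib.get_close_matches(v, self.categories_,
                                              n=1, cutoff=self.cutoff)
            out.append(match[0] if match else self.fill_value)
        return out

imp = CategoricalImputer().fit(["Nokia 2.1", "Nokia 2.2", "Samsung A52", "Samsung S10"])
print(imp.transform(["Samsung A52s", "nqwerty 1.2"]))
```

This would keep the fuzzy matching as a plain preprocessing step, so any downstream encoder sees only known categories.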

Fish-Soup commented 2 years ago

Sorry for the late response, I was on holiday. The reason I'd rather do this inside the encoder is that it allows us to also consider the values of the encoding. For example, you could return the median encoded value. Or, in my example, thefuzz can return how similar the string match is. So imagine it matched 90% with the label encoded as 1, and 89% with the labels encoded as 90, 91 and 92. In that case you might be better off matching 91.

My intention was giving the user maximum flexibility in their choices.
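The near-tie idea above could be sketched like this (difflib similarity scores stand in for thefuzz's percentages; the tolerance and the median rule are illustrative choices, not a proposed default):

```python
import difflib
import statistics
from typing import Any, Dict

def score_aware_match(missing_value, available_values: Dict[Any, float],
                      tolerance: float = 0.02) -> float:
    """Among labels whose similarity is within `tolerance` of the best score,
    return the median of their encodings rather than blindly taking the top hit."""
    scores = {
        label: difflib.SequenceMatcher(None, str(missing_value), str(label)).ratio()
        for label in available_values
    }
    best = max(scores.values())
    near_best = [available_values[lbl] for lbl, s in scores.items()
                 if best - s <= tolerance]
    return statistics.median(near_best)

vals = {"cat": 1.0, "car": 90.0, "can": 91.0, "cap": 92.0}
print(score_aware_match("caX", vals))  # all four labels tie, so the median wins
```
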

bmreiniger commented 1 year ago

This is interesting.

Your examples sound like you mean handle_unknown. If you wanted it to work for missing values, how would that work? (If not, update the title and first post.)

From the examples, I assume available_values is supposed to be some mapping stored in the encoder? Would that have to be custom to each encoder, or just the internal label-encoding? The label-encoder is generally enforcing some order that the final encoder won't care about, so e.g. the 1, 90, 91, 92 example doesn't really mean you should prefer 90, depending on which encoder is being used.