scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Feature Request: Count-Based Target Encoder (Dracula)? #420

Open bking124 opened 1 year ago

bking124 commented 1 year ago

I recently stumbled upon a categorical encoding idea dubbed "Distributed Robust Algorithm for Count-based Learning" (aka Dracula), described in this Microsoft blog as well as this talk. It seems to mix ideas from CountEncoder and TargetEncoder. Has anybody heard of this approach before, and has there been any thought of introducing such an encoder into the package? I'd be interested in comparing this approach with the typical TargetEncoder.

Thanks for the wonderful package!

PaulWestenthanner commented 1 year ago

Hi @bking124

I haven't heard of the approach before. Searching for "Dracula Encoder" or "CTR encoder" (as mentioned in the talk) doesn't turn up much either. Since the talk and blog post are already 8 years old and the idea hasn't gained much traction since, I'd be surprised if it yields great results.
On the other hand, we could include it in the package. I think it should be rather straightforward to implement. From what I understood, the encoded value is calculated as follows:

  1. Calculate the counts for each (category, label) pair: df.groupBy(col, label).count(). This can be restricted to the top N categories, with all remaining categories pooled into a single "rest" category.
  2. For a category x, use as the encoded value: counts[x, target=0], counts[x, target=1], ..., log-odds, flag_is_rest
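
The two steps above can be sketched in plain Python for the binary case. This is only an illustration of my reading of the description, not an existing category_encoders API: the names `fit_count_target_encoding`, `transform`, `REST`, `top_n` and `prior` are all hypothetical, and the additive `prior` pseudo-count is an assumption added here so that the log-odds stay finite when a class count is zero.

```python
import math
from collections import Counter

REST = "__rest__"  # hypothetical name for the pooled "rest" category


def fit_count_target_encoding(categories, targets, top_n=2, prior=1.0):
    """Build a lookup: category -> (count_target_0, count_target_1, log_odds, flag_is_rest).

    Only the `top_n` most frequent categories keep their own row; all
    others are pooled into REST. `prior` is an assumed pseudo-count that
    avoids log(0) and division by zero.
    """
    freq = Counter(categories)
    top = {cat for cat, _ in freq.most_common(top_n)}

    # Step 1: counts per (category, label), with rare categories pooled.
    counts = Counter()
    for cat, y in zip(categories, targets):
        key = cat if cat in top else REST
        counts[(key, y)] += 1

    # Step 2: assemble the encoded values for each kept category.
    table = {}
    for key in top | {REST}:
        n0 = counts[(key, 0)]
        n1 = counts[(key, 1)]
        log_odds = math.log((n1 + prior) / (n0 + prior))
        table[key] = (n0, n1, log_odds, int(key == REST))
    return table


def transform(table, categories):
    """Encode categories; anything not in the table falls back to REST."""
    return [table.get(cat, table[REST]) for cat in categories]


categories = ["a", "a", "a", "b", "b", "c", "d"]
targets = [1, 1, 0, 0, 0, 1, 0]
table = fit_count_target_encoding(categories, targets, top_n=2)
encoded = transform(table, ["a", "c", "never_seen"])
```

Here "c" and "d" are pooled into the rest category, and a category never seen during fitting is encoded with the rest row as well.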

I'm not quite sure how to handle the regression case. We'd probably need some binning of the target variable there? Also, small categories might result in overfitting if the classifier basically ignores the counts and just uses the log-odds (which it will). This is a potential issue, just as in target encoding with too little regularization. In fact, this is pretty much what you'd get by encoding a variable with both a count encoder and a target encoder (with no regularization).
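
For the regression case, one possible form of the binning idea is to cut the continuous target into quantile bins and then count per (category, bin) exactly as in the classification case. The sketch below is an assumption about how that might look, not something taken from the blog post or the talk; `bin_targets`, `counts_per_bin` and `n_bins` are hypothetical names.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def bin_targets(targets, n_bins=3):
    """Assign each continuous target to one of `n_bins` quantile bins (0-based)."""
    cuts = quantiles(targets, n=n_bins)  # returns n_bins - 1 cut points
    return [bisect_right(cuts, y) for y in targets]


def counts_per_bin(categories, binned, n_bins=3):
    """For each category, count how many of its rows fall into each target bin."""
    counts = Counter(zip(categories, binned))
    return {
        cat: [counts[(cat, b)] for b in range(n_bins)]
        for cat in set(categories)
    }


categories = ["a", "a", "a", "b", "b", "b"]
targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
binned = bin_targets(targets, n_bins=3)
bin_counts = counts_per_bin(categories, binned, n_bins=3)
```

The per-bin counts then play the role of counts[x, target=0], counts[x, target=1], ... from the list above. The overfitting concern for small categories applies here too, since a category with very few rows gets very noisy per-bin counts.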