scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License
2.4k stars 393 forks source link

Target encoding hierarchical #366

Closed PaulWestenthanner closed 2 years ago

PaulWestenthanner commented 2 years ago

Fixes #136

Proposed Changes

This pull request implements feature hierarchies in Target Encoders.
Author: @nercisla Current status: Work in Progress

PaulWestenthanner commented 2 years ago

Hi @nercisla

Thanks for your work. I created a draft pull request just because it is easier and more transparent to make comments and have the tests run. I hope that's fine with you. Also feel free to change the description if you want to

PaulWestenthanner commented 2 years ago

Hi @nercisla thanks for updating the PR. I think it looks really good now. The only open issue I see is the test that checks if a trivial hierarchy yields the same encoding as no hierarchy.
Do you see any other open topics that need to be fixed before merging? Really looking good now and I think we can merge next week. Thank you!

nercisla commented 2 years ago

I think the trivial hierarchy issue is tested by two tests:

nercisla commented 2 years ago

The features we want to move on to are (in order of priority in our opinion):

  1. User gives a column that contains the mapped higher level (i.e. they have already applied a custom function to a column to get the next level up)
  2. Multi-level hierarchies
  3. Typical hierarchical functions (such as Zipcode/postcode etc)

We think these are "nice to have" extras, which don't need to be included just now.

nercisla commented 2 years ago

For info, I have a concern over the mapping. The current code works and gives the right answer, but the mapping is reversed if we want to consider multi-level mapping. So it might be good to consider larger hierarchies and a solution before we merge...

Detail: we currently have a mapping dictionary that goes from the existing column (key) to the higher level mapping (value). But for multi-level, it makes more sense to have the highest levels first in the dictionary. So, we need to consider reversing the map. The mapping won't be so trivial for multi-level in that scenario, but makes more sense for the user.

nercisla commented 2 years ago

Joe and I have modified the code to "reverse" the mapping dictionary. This should now follow a more user-friendly version of mapping. Multi-level mapping is also complete and we've included a test.

PaulWestenthanner commented 2 years ago

Great! Thanks for implementing the multilevel hierarchy. Maybe it would be nice to give an example of the hierarchy parameter in the docstrings since this is getting a little complicated (especially with multilevel)

nercisla commented 2 years ago

Have added a simple example, since more complex ones might take even more space.

PaulWestenthanner commented 2 years ago

Great!
Thank you very much @nercisla and @joeybricks
The tests are succeeding and I'd consider it ready for the merge. Is this fine with you or is there anything missing?

nercisla commented 2 years ago

Very happy to merge!