scikit-learn-contrib / category_encoders

A library of sklearn compatible categorical variable encoders
http://contrib.scikit-learn.org/category_encoders/
BSD 3-Clause "New" or "Revised" License

Continuous distribution based on probabilities #60

Open antonkw opened 6 years ago

antonkw commented 6 years ago

I found an interesting approach in the paper "The Synthetic Data Vault: Generative Modeling for Relational Databases". As far as I can tell, there are no implementations of it in popular libraries.

Steps:

  1. Sort the categories from most frequently occurring to least.
  2. Split the interval [0, 1] into sections based on the cumulative probability for each category.
  3. To convert a category, find the interval [𝑎, 𝑏] ⊆ [0, 1] that corresponds to the category.
  4. Choose a value between 𝑎 and 𝑏 by sampling from a truncated Gaussian distribution with 𝜇 at the center of the interval and 𝜎 = (𝑏−𝑎) / 6.
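
The steps above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the paper's reference code and not part of category_encoders; the function names (`category_intervals`, `sample_in_interval`, `encode`) are hypothetical, and the truncation is done by simple rejection sampling, which is an assumption of this sketch.

```python
import random
from collections import Counter


def category_intervals(values):
    """Map each category to its (a, b) slice of [0, 1], most frequent first."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    lower = 0.0
    # most_common() orders categories by descending frequency (steps 1 and 2).
    for category, count in counts.most_common():
        upper = lower + count / total
        intervals[category] = (lower, upper)
        lower = upper
    return intervals


def sample_in_interval(a, b, rng):
    """Draw from a Gaussian with mu at the interval center and
    sigma = (b - a) / 6, truncated to [a, b] (step 4)."""
    mu = (a + b) / 2
    sigma = (b - a) / 6
    # Rejection sampling: [a, b] covers mu +/- 3 sigma, so rejections are rare.
    while True:
        x = rng.gauss(mu, sigma)
        if a <= x <= b:
            return x


def encode(values, seed=None):
    """Replace every categorical value with a float sampled from its interval."""
    rng = random.Random(seed)
    intervals = category_intervals(values)
    encoded = [sample_in_interval(*intervals[v], rng) for v in values]
    return encoded, intervals


if __name__ == "__main__":
    data = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
    encoded, intervals = encode(data, seed=0)
    for category, (a, b) in intervals.items():
        print(f"{category}: [{a:.2f}, {b:.2f}]")
```

With five "a", three "b" and two "c" values, the intervals come out as approximately [0, 0.5], [0.5, 0.8] and [0.8, 1.0], and every encoded value falls inside the interval of its category. Keeping the returned `intervals` mapping around would also allow mapping a float back to its category.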

Visualisation: [image]

Does it seem reasonable to implement this? I'm ready to contribute this part.

wdm0006 commented 6 years ago

I think this would be an interesting thing to add, if you'd like to work on it.

wdm0006 commented 6 years ago

@antonkw just checking in, are you working on this?

antonkw commented 6 years ago

@wdm0006 you can find a demo of a draft implementation here: Collaboratory Notebook

Things to be done:

wdm0006 commented 6 years ago

@antonkw happy to review a PR if you'd like to work on one. If you aren't planning on that, let me know and I can unassign this issue.