Continuous distribution based on probabilities

antonkw commented 6 years ago

I found an interesting approach in the paper "The Synthetic Data Vault: Generative Modeling for Relational Databases". It seems there are no implementations in popular libraries.

Steps:

Sort the categories from most frequently occurring to least.
Split the interval [0, 1] into sections based on the cumulative probability for each category.
To convert a category, find the interval [𝑎, 𝑏] ∈ [0, 1] that corresponds to the category.
Choose a value between 𝑎 and 𝑏 by sampling from a truncated Gaussian distribution with 𝜇 at the center of the interval and 𝜎 = (𝑏−𝑎) / 6.
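
For illustration, the steps above could be sketched roughly as follows. This is a minimal standard-library sketch, not the paper's reference code; the function names (`build_intervals`, `sample_truncated_gauss`, `encode`) are made up for this example, and the truncation is done by simple rejection sampling, which is an assumption on my side.

```python
import random
from collections import Counter


def build_intervals(values):
    """Map each category to its [a, b] slice of [0, 1], most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    intervals = {}
    start = 0.0
    # most_common() orders categories by descending frequency.
    for category, count in counts.most_common():
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


def sample_truncated_gauss(a, b, rng):
    """Draw from a Gaussian centred on [a, b], rejecting draws outside it."""
    mu = (a + b) / 2
    sigma = (b - a) / 6  # the interval spans mu +/- 3 sigma
    while True:
        x = rng.gauss(mu, sigma)
        if a <= x <= b:
            return x


def encode(values, seed=0):
    """Replace every category with a number sampled from its interval."""
    rng = random.Random(seed)
    intervals = build_intervals(values)
    encoded = [sample_truncated_gauss(*intervals[v], rng) for v in values]
    return encoded, intervals


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
encoded, intervals = encode(values)
```

Since the interval covers three standard deviations on each side of 𝜇, only a small fraction of draws gets rejected, so the loop is cheap in practice.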

Visualisation.

Does it seem reasonable to implement this? I'm ready to contribute this part.

wdm0006 commented 6 years ago

I think this would be an interesting thing to add, if you'd like to work on it.

wdm0006 commented 6 years ago

@antonkw just checking in, are you working on this?

antonkw commented 6 years ago

@wdm0006 you can find a demo of the draft implementation here: Colaboratory Notebook

Things to be done:

adapt it to the general sklearn API
implement indexing for decoding.
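
One possible shape for these two items, as a rough sketch only: the class name `GaussianIntervalEncoder` is hypothetical and this is not the category_encoders API. It only mirrors the `fit` / `transform` / `inverse_transform` convention without importing sklearn, and it assumes decoding can be done with a binary search over the cumulative upper bounds.

```python
import bisect
import random
from collections import Counter


class GaussianIntervalEncoder:
    """Hypothetical encoder for a single categorical column."""

    def __init__(self, seed=None):
        self.seed = seed

    def fit(self, values):
        counts = Counter(values)
        total = sum(counts.values())
        self.categories_ = []
        self.upper_bounds_ = []
        cumulative = 0
        # Most frequent category first; store each interval's upper bound.
        for category, count in counts.most_common():
            cumulative += count
            self.categories_.append(category)
            self.upper_bounds_.append(cumulative / total)
        return self

    def _interval(self, category):
        # Raises ValueError for a category that was not seen in fit().
        i = self.categories_.index(category)
        lower = self.upper_bounds_[i - 1] if i > 0 else 0.0
        return lower, self.upper_bounds_[i]

    def transform(self, values):
        rng = random.Random(self.seed)
        encoded = []
        for value in values:
            a, b = self._interval(value)
            mu, sigma = (a + b) / 2, (b - a) / 6
            x = rng.gauss(mu, sigma)
            # Keep samples strictly inside (a, b) so decoding is unambiguous.
            while not a < x < b:
                x = rng.gauss(mu, sigma)
            encoded.append(x)
        return encoded

    def inverse_transform(self, encoded):
        last = len(self.categories_) - 1
        # The first upper bound >= x identifies the interval that contains x.
        return [
            self.categories_[min(bisect.bisect_left(self.upper_bounds_, x), last)]
            for x in encoded
        ]


values = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
encoder = GaussianIntervalEncoder(seed=0).fit(values)
encoded = encoder.transform(values)
decoded = encoder.inverse_transform(encoded)
```

Keeping only the sorted upper bounds means decoding is a single `bisect` lookup per value; handling of unseen categories and multi-column input is left open here.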

wdm0006 commented 6 years ago

@antonkw happy to review a PR if you'd like to work on one. If you aren't planning on that, let me know and I can unassign this issue.

scikit-learn-contrib / category_encoders

Continuous distribution based on probabilities #60