sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Option to treat Categorical data as stratum in stratified sampling #195

Open LihuaXiong2020 opened 3 years ago

LihuaXiong2020 commented 3 years ago

Description

Right now, the sampling of categorical data is based on a Gaussian distribution. However, there is a use case (especially in tabular data modeling) where users want to treat a categorical column as the stratum in stratified sampling, so that the ratio of each stratum's population to the total population in the synthetic data matches the ratio in the original data exactly. Can the current version of SDV support this (perhaps with some transformation of the dataset), or could it be extended to cover this case?
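To make "match exactly" concrete, here is a minimal sketch of how per-stratum row counts could be derived from the original column. It is plain Python and not part of SDV; the function name `stratum_quotas` is hypothetical, and the largest-remainder rounding is an assumption about how fractional counts should be resolved.

```python
from collections import Counter


def stratum_quotas(categories, n_rows):
    """Return how many of n_rows each category should get so that the
    proportions match the input as closely as whole rows allow.

    Hypothetical helper, not an SDV function.
    """
    counts = Counter(categories)
    total = sum(counts.values())

    # Ideal (possibly fractional) number of rows per category.
    exact = {cat: n_rows * c / total for cat, c in counts.items()}

    # Start from the floor of each ideal count.
    quotas = {cat: int(value) for cat, value in exact.items()}

    # Hand the leftover rows to the categories with the largest remainders.
    leftover = n_rows - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda cat: exact[cat] - quotas[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        quotas[cat] += 1

    return quotas


original = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
quotas_20 = stratum_quotas(original, 20)
quotas_7 = stratum_quotas(original, 7)
```

When the requested size is a multiple of the original size, the ratios are reproduced exactly; otherwise the counts are the closest whole-row approximation that still sums to the requested size.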

csala commented 3 years ago

I'm afraid this is not offered as a feature exactly the way you described it, but it is possible to achieve this behavior by combining constraints with the reject sampling strategy.

For example, one option would be to define an is_valid function that receives the relative category frequencies as hyperparameters and randomly discards rows of the overpopulated categories until enough rows have been sampled for each category.
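The idea can be sketched outside of SDV as a plain rejection loop. This is only an illustration under stated assumptions: `sample_rows` is a hypothetical stand-in for whatever produces synthetic rows (here a toy random generator), `sample_stratified` is a made-up name, and the quotas are passed as absolute row counts rather than relative frequencies. It does not use the SDV constraint API, and it drops every surplus row of an already-full category instead of discarding at random.

```python
import random
from collections import Counter


def sample_stratified(sample_rows, column, quotas, batch_size=100, max_batches=1000):
    """Draw batches from sample_rows and keep a row only while its
    category still needs rows; stop once every quota is filled.

    sample_rows(n) must return a list of n dict-like rows (assumption).
    """
    remaining = dict(quotas)
    kept = []
    batches = 0
    while any(remaining.values()):
        if batches >= max_batches:
            raise RuntimeError("could not fill every stratum; unmet quotas: %r" % remaining)
        batches += 1
        for row in sample_rows(batch_size):
            category = row[column]
            # Reject rows whose category is already full or not requested.
            if remaining.get(category, 0) > 0:
                kept.append(row)
                remaining[category] -= 1
    return kept


# Toy stand-in for a fitted model's sampler (hypothetical).
_rng = random.Random(0)


def toy_sampler(n):
    return [
        {"color": _rng.choice(["red", "green", "blue"]), "value": _rng.random()}
        for _ in range(n)
    ]


target = {"red": 5, "green": 3, "blue": 2}
rows = sample_stratified(toy_sampler, "color", target)
observed = Counter(row["color"] for row in rows)
```

Because rows are only rejected on the category value, the other columns keep whatever joint distribution the sampler produces within each category. The cost is wasted samples, which grows when a requested category is rare in the sampler's output.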