Open LihuaXiong2020 opened 3 years ago
I'm afraid this is not offered as a feature in exactly the form you describe, but the behavior can be achieved by combining constraints with the reject-sampling strategy. For example, one option would be to define an `is_valid` function that receives the relative category frequencies as hyperparameters and randomly discards rows of the overpopulated categories until enough rows have been sampled for each category.
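To make the idea concrete, here is a minimal sketch of that kind of rejection step, written against plain pandas rather than any SDV constraint API. It filters whole batches instead of using a per-row `is_valid` function, and every name in it (`target_counts`, `sample_with_exact_ratios`, `sample_fn`, `fake_sample`) is hypothetical. `sample_fn` is assumed to be any callable that returns a DataFrame with the requested number of rows, for instance a thin wrapper around a fitted model's sampling method.

```python
import numpy as np
import pandas as pd


def target_counts(frequencies, num_rows):
    """Turn relative frequencies into integer row counts that sum to num_rows."""
    total = sum(frequencies.values())
    exact = {k: num_rows * v / total for k, v in frequencies.items()}
    counts = {k: int(np.floor(v)) for k, v in exact.items()}
    # Give the leftover rows to the categories with the largest remainders.
    leftover = num_rows - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


def sample_with_exact_ratios(sample_fn, column, frequencies, num_rows,
                             batch_size=500, max_batches=100):
    """Draw batches from sample_fn and reject rows of already-full categories."""
    remaining = target_counts(frequencies, num_rows)
    kept = []
    for _ in range(max_batches):
        if not any(remaining.values()):
            break
        batch = sample_fn(batch_size)
        for category in remaining:
            needed = remaining[category]
            # Keep at most `needed` rows of this category; the surplus is discarded.
            rows = batch[batch[column] == category].head(needed)
            kept.append(rows)
            remaining[category] -= len(rows)
    if any(remaining.values()):
        raise RuntimeError(f"Could not fill all categories: {remaining}")
    return pd.concat(kept, ignore_index=True)


# Stand-in for a fitted model (an assumption for this demo): skewed towards "red".
rng = np.random.default_rng(0)


def fake_sample(n):
    return pd.DataFrame({
        "color": rng.choice(["red", "green", "blue"], size=n, p=[0.7, 0.2, 0.1]),
        "value": rng.normal(size=n),
    })


result = sample_with_exact_ratios(
    fake_sample, "color", {"red": 0.5, "green": 0.25, "blue": 0.25}, num_rows=20
)
print(result["color"].value_counts().to_dict())
```

The returned rows are grouped by category, so shuffle them with `result.sample(frac=1)` if the order matters. Rare categories may need many batches, which is why the sketch gives up after `max_batches` instead of looping forever.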
Description
Right now, the sampling of categorical data is based on a Gaussian distribution. However, there is a use case (especially in tabular data modeling) where users want to treat the categories as strata in stratified sampling, where the ratio of each stratum's population to the total population is expected to match exactly. Can the current version of SDV support that (perhaps with some transformation of the dataset), or could it be extended to cover this case?