sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

For categorical columns, highly imbalanced categories are being lost in the generated data. #248

Closed abhisheknagar1983 closed 1 year ago

abhisheknagar1983 commented 3 years ago

Description

In one of our scenarios for synthetic data generation using SDV, we have a dataset in which a column has some highly imbalanced categories (approximately 1%).

In the SDV-generated dataset, those highly imbalanced categories are ignored completely. Hence the generated data loses its business purpose (due to the lack of variation).
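
One way to confirm the loss is to compare category coverage between the real and synthetic columns. The sketch below is a minimal, standard-library illustration; the helper names and the toy `real` and `synthetic` columns are hypothetical and do not come from the original report.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the column as a fraction of all rows."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def missing_categories(real_values, synthetic_values):
    """List categories present in the real column but absent from the synthetic one."""
    return sorted(set(real_values) - set(synthetic_values))


# Hypothetical toy columns: "fraud" makes up 1% of the real data
# and has disappeared from the synthetic data.
real = ["normal"] * 99 + ["fraud"]
synthetic = ["normal"] * 100

print(category_shares(real))
print(missing_categories(real, synthetic))  # ['fraud']
```

Running the same two helpers on an actual real and sampled column would show which rare categories were dropped and how far the remaining shares drifted.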

npatki commented 2 years ago

Hi @abhisheknagar1983, thanks for filing. The team is actively thinking about improvements to the modeling process.

There are two possible ways you can fix this:

  1. Try different categorical transformers for the model. E.g., the GaussianCopula model has four options for the categorical_transformer. It may be best to use categorical_fuzzy, as it creates continuous distributions that are better suited to copula-based modeling.
  2. Use conditional sampling to get the exact proportions of the categories you need.
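
The idea behind option 2 can be sketched with plain rejection sampling: keep drawing rows and retain only those that match the required category until enough have been collected. This is an illustration of the concept, not SDV's conditional sampling API; `sample_with_condition` and `toy_sampler` are hypothetical names, and `toy_sampler` merely stands in for a fitted model that rarely produces the minority category.

```python
import random


def sample_with_condition(sample_rows, column, value, num_rows,
                          batch_size=1000, max_batches=100):
    """Collect num_rows rows where row[column] == value by rejection sampling.

    sample_rows is any callable that takes a row count and returns a list of
    dict rows (a stand-in for a fitted synthetic data model).
    """
    if num_rows <= 0:
        return []
    kept = []
    for _ in range(max_batches):
        for row in sample_rows(batch_size):
            if row[column] == value:
                kept.append(row)
                if len(kept) == num_rows:
                    return kept
    raise RuntimeError(
        f"Found only {len(kept)} of {num_rows} rows with {column}={value!r}"
    )


# Hypothetical sampler: emits the rare "fraud" label about 1% of the time.
_rng = random.Random(0)


def toy_sampler(n):
    return [{"label": "fraud" if _rng.random() < 0.01 else "normal"}
            for _ in range(n)]


rare_rows = sample_with_condition(toy_sampler, "label", "fraud", num_rows=5)
print(len(rare_rows))  # 5
```

Rejection sampling gets slower as the category gets rarer, and it cannot help at all if the model never emits the category, which is why the `max_batches` cap raises an error instead of looping forever.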

BTW, as hinted by 1, the quality of categorical data depends on the transformers. You'll likely see more improvements in this area after we integrate with the new RDT library. I'll keep this issue open until then.
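
To see why the transformer choice matters for rare categories, here is a simplified sketch of a frequency-based encoding: each category gets a slice of [0, 1] as wide as its share of the data, and a "fuzzy" variant places each value at a random point inside its slice. This is an assumption-laden illustration, not RDT's code; the function names are hypothetical and uniform noise is used only for simplicity.

```python
import random
from collections import Counter


def build_intervals(values):
    """Assign each category a slice of [0, 1] whose width equals its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    start = 0.0
    # Most frequent categories first; ties are broken by name for determinism.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], str(item[0])))
    for category, count in ordered:
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(category, intervals, rng):
    """Map a category to a random point inside its slice (the 'fuzzy' part)."""
    start, end = intervals[category]
    return start + rng.random() * (end - start)


def decode(number, intervals):
    """Map a number back to the category whose slice contains it."""
    number = min(max(number, 0.0), 1.0)  # clamp out-of-range model output
    for category, (start, end) in intervals.items():
        if start <= number < end:
            return category
    # A value at the upper edge (or past it due to rounding) goes to the last slice.
    return list(intervals)[-1]


# Hypothetical column in which "fraud" is 1% of the rows.
intervals = build_intervals(["normal"] * 99 + ["fraud"])
rng = random.Random(0)
print(intervals["normal"])  # (0.0, 0.99)
print(decode(encode("fraud", intervals, rng), intervals))  # fraud
```

In this toy encoding the rare category occupies only the top 1% of the numeric range, so a model that reproduces the numeric distribution slightly inaccurately near that edge can easily stop emitting the category.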

npatki commented 1 year ago

Hi everyone, I'm closing off this issue because we now have RDT transformers, such as the LabelEncoder, that are better able to handle imbalanced categories.
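
For readers unfamiliar with the term, label encoding in general replaces each category with an integer code and maps numbers back to the nearest valid code. The sketch below shows only that general idea with hypothetical helper names; it is not RDT's LabelEncoder implementation and makes no claim about how RDT orders or reverses its codes.

```python
def fit_labels(values):
    """Assign an integer code to each distinct category, in order of first appearance."""
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes))
    return codes


def encode_labels(values, codes):
    """Replace each category with its integer code."""
    return [codes[value] for value in values]


def decode_labels(numbers, codes):
    """Round each number to the nearest valid code and map it back to a category."""
    categories = list(codes)  # position i holds the category with code i
    last = len(categories) - 1
    return [categories[min(max(round(n), 0), last)] for n in numbers]


# Hypothetical column in which "fraud" is 1% of the rows.
codes = fit_labels(["normal"] * 99 + ["fraud"])
print(codes)  # {'normal': 0, 'fraud': 1}
print(decode_labels([0.2, 0.9, 5.0], codes))  # ['normal', 'fraud', 'fraud']
```

In this sketch every category gets its own code regardless of how often it occurs, so a rare category's representation does not shrink with its frequency.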

We will also soon be including additional options for licensed SDV users.