Closed abhisheknagar1983 closed 1 year ago
Hi @abhisheknagar1983, thanks for filing. The team is actively thinking about improvements to the modeling process.
There are two possible ways you can fix this:
`categorical_transformer`. It may be best to use `categorical_fuzzy`, as it creates continuous distributions that are better suited to copula-based modeling.

BTW, as hinted by 1, the quality of categorical data depends on the transformers. You'll likely see more improvements in this area after we integrate with the new RDT library. I'll keep this issue open until then.
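To illustrate why a fuzzy categorical encoding produces a continuous distribution, here is a rough sketch of the general idea. It is not the RDT or SDV implementation: the function names (`fit_intervals`, `encode`, `decode`) are hypothetical, and using uniform noise inside each interval is an assumption made for simplicity. Each category gets a slice of [0, 1) as wide as its share of the data, so a 1% category still owns a small but non-empty region.

```python
import random
from collections import Counter


def fit_intervals(values):
    """Give each category a slice of [0, 1) whose width equals its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    start = 0.0
    # Most frequent first; ties broken by name so the layout is deterministic.
    for category, count in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


def encode(values, intervals, rng):
    """Replace each category with a random point inside its interval (the fuzzy part)."""
    return [rng.uniform(*intervals[v]) for v in values]


def decode(numbers, intervals):
    """Map each number back to the category whose interval contains it."""
    ordered = list(intervals.items())
    decoded = []
    for x in numbers:
        chosen = ordered[-1][0]  # default for values at or above the top edge
        for category, (_start, end) in ordered:
            if x < end:
                chosen = category
                break
        decoded.append(chosen)
    return decoded


if __name__ == "__main__":
    # Made-up column: one rare category at 1%.
    column = ["common"] * 99 + ["rare"]
    intervals = fit_intervals(column)
    encoded = encode(column, intervals, random.Random(0))
    print(intervals)
    print(decode(encoded, intervals) == column)
```

A model fitted on the encoded numbers only reproduces the rare category if it places some probability mass in that narrow slice, which is why the choice of transformer matters for imbalanced columns.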
Hi everyone, I'm closing off this issue because we now have RDT transformers, such as the LabelEncoder, that are able to better handle imbalanced categories.
We will also soon be including additional options for licensed SDV users.
Description
In one of our scenarios for synthetic data generation using SDV, we have a dataset in which some columns have categories that are highly imbalanced (approx. 1%).
In the SDV-generated dataset, those highly imbalanced categories are ignored completely. Hence the generated data loses its business purpose (due to lack of variation).
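One way to confirm the symptom described above is to compare category shares between the real and the synthetic column. The snippet below is only a sketch using the standard library: the helper names (`category_shares`, `missing_categories`) and the sample values are made up for illustration, and it assumes the column values are strings.

```python
from collections import Counter


def category_shares(values):
    """Return the fraction of rows that belongs to each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def missing_categories(real_values, synthetic_values):
    """Categories present in the real column but absent from the synthetic one."""
    return sorted(set(real_values) - set(synthetic_values))


if __name__ == "__main__":
    # Made-up data: the real column has a 1% category that the
    # synthetic column never produces.
    real = ["approved"] * 99 + ["fraud"]
    synthetic = ["approved"] * 100
    print(category_shares(real))
    print(missing_categories(real, synthetic))
```

If `missing_categories` returns a non-empty list for a column, the rare categories were dropped during generation rather than merely under-sampled.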
What I Did