sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Question regarding CTGAN for data synthesis and classification tasks #306

Closed danielemolino closed 1 year ago

danielemolino commented 1 year ago

I am currently using CTGAN to synthesize data and evaluate its utility for classification tasks. I have a question regarding the observed performance when training classifiers on the generated data.

According to the quality report provided by CTGAN, the overall quality score is around 91%, so the synthesized data should be suitable for the classification task. However, when I train a classifier on the real data only (with cross-validation), I achieve about 70% accuracy, whereas when I train the classifier on the synthetic data generated by CTGAN and then classify the test set, the accuracy drops to around 50%.

I also experimented by combining the real and synthetic data in the training set, but the performance remains similar to training solely on real data.
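For reference, the three setups described above (real only, synthetic only, real plus synthetic) can be compared on one held-out test set alongside a majority-class baseline. The following is only a minimal sketch using the standard library: the function names, the toy 1-nearest-neighbour model, and the toy rows are all hypothetical stand-ins, not my actual data or anything from CTGAN. Any real classifier could be wrapped in place of `one_nn_fit_predict`.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(train_y, test_y):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return accuracy(test_y, [majority] * len(test_y))


def one_nn_fit_predict(train_X, train_y, test_X):
    """Toy 1-nearest-neighbour classifier (hypothetical stand-in model)."""
    def predict_one(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        return train_y[dists.index(min(dists))]
    return [predict_one(x) for x in test_X]


def compare_training_sets(training_sets, test_X, test_y, fit_predict):
    """Score every candidate training set on the same held-out test set."""
    return {
        name: accuracy(test_y, fit_predict(X, y, test_X))
        for name, (X, y) in training_sets.items()
    }


# Toy "real" data: the label follows the single feature.
real_X, real_y = [[0.0], [0.1], [0.9], [1.0]], [0, 0, 1, 1]
test_X, test_y = [[0.2], [0.8]], [0, 1]

# Toy "synthetic" data (made up): plausible feature values, but the
# labels do not follow the feature-to-label relationship.
synth_X, synth_y = [[0.18], [0.4], [0.6], [0.82]], [1, 0, 0, 1]

report = compare_training_sets(
    {
        "real": (real_X, real_y),
        "synthetic": (synth_X, synth_y),
        "real+synthetic": (real_X + synth_X, real_y + synth_y),
    },
    test_X,
    test_y,
    one_nn_fit_predict,
)
baseline = majority_baseline(real_y, test_y)
print(report, "baseline:", baseline)
```

Comparing each score against the baseline shows whether a model trained on the synthetic rows is doing any better than always guessing the most common class.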

I would appreciate any insights or suggestions to better understand these observations and improve the classification performance. Thank you for your attention to this matter.

Best regards, Daniele

Zhangyao09274103 commented 4 days ago

Hi, I would like to follow up: have you made any progress on this problem? I recently got the same results using SDV CTGAN, and I am seeing very bad overfitting.