sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Question regarding CTGAN for data synthesis and classification tasks #306

Closed danielemolino closed 1 year ago

danielemolino commented 1 year ago

I am currently using CTGAN to synthesize data and evaluate its utility for classification tasks. I have a question regarding the observed performance when training classifiers on the generated data.

According to the quality report provided by CTGAN, the overall quality score is around 91%, so I expected the synthetic data to work well for the classification task. However, when I train a classifier on the real data only (with cross-validation), I get about 70% accuracy, whereas when I train the same classifier on the synthetic data generated by CTGAN and evaluate it on the real test set, the accuracy drops to around 50%.

I also experimented by combining the real and synthetic data in the training set, but the performance remains similar to training solely on real data.
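
For reference, the three setups described above (train on real, train on synthetic, train on both, always testing on held-out real rows) can be sketched as follows. This is only a minimal, dependency-free illustration under stated assumptions: the toy data, the nearest-centroid classifier, and the helper names (`make_real`, `make_synthetic_stand_in`, `compare_training_sets`) are all hypothetical and are not the actual dataset, model, or CTGAN output from this issue. In particular, the "synthetic" rows here are just the real training rows with shuffled labels, a deliberately degraded stand-in for rows that would normally come from the fitted model's `sample` call.

```python
import random
from statistics import mean


def make_real(n, rng):
    """Toy stand-in for the real dataset: two numeric features, binary label."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        # Class 1 is shifted by +1.5 on both features, so the label is learnable.
        rows.append(([rng.gauss(1.5 * y, 1.0), rng.gauss(1.5 * y, 1.0)], y))
    return rows


def make_synthetic_stand_in(real_rows, rng):
    """Hypothetical 'synthetic' data (NOT produced by CTGAN).

    Labels are shuffled across rows: every column keeps its own distribution,
    but the link between features and label is lost.
    """
    labels = [y for _, y in real_rows]
    rng.shuffle(labels)
    return [(x, y) for (x, _), y in zip(real_rows, labels)]


def fit_centroids(rows):
    """Nearest-centroid classifier: store one mean feature vector per label."""
    centroids = {}
    for label in sorted({y for _, y in rows}):
        points = [x for x, y in rows if y == label]
        centroids[label] = [mean(col) for col in zip(*points)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return mean([1.0 if predict(centroids, x) == y else 0.0 for x, y in rows])


def compare_training_sets(seed=0):
    """Train on three different training sets, always test on real held-out rows."""
    rng = random.Random(seed)
    real = make_real(600, rng)
    real_train, real_test = real[:400], real[400:]
    synthetic = make_synthetic_stand_in(real_train, rng)
    training_sets = {
        "real only": real_train,
        "synthetic only": synthetic,
        "real + synthetic": real_train + synthetic,
    }
    return {
        name: accuracy(fit_centroids(rows), real_test)
        for name, rows in training_sets.items()
    }


if __name__ == "__main__":
    for name, acc in compare_training_sets().items():
        print(f"{name}: {acc:.2f}")
```

The point of the design is that the real test split is fixed before any training set is built, so the three numbers are directly comparable. The shuffled-label stand-in is just one hypothetical way a dataset can look similar column by column while carrying little information about the target; it is not a claim about what happened with the data in this issue.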

I would appreciate any insights or suggestions to better understand these observations and improve the classification performance. Thank you for your attention to this matter.

Best regards, Daniele