I am currently using CTGAN to synthesize data and evaluate its utility for classification tasks. I have a question regarding the observed performance when training classifiers on the generated data.
According to the quality report provided by CTGAN, the overall quality score is around 91%, so the synthesized data should be suitable for the classification task. However, when I train a classifier on the real data only (with cross-validation), I achieve about 70% accuracy, whereas when I train the classifier on the synthetic data generated by CTGAN and then classify the test set, the accuracy drops to around 50%.
I also experimented by combining the real and synthetic data in the training set, but the performance remains similar to training solely on real data.
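A minimal sketch of the three-way comparison described above, assuming scikit-learn. The function name `compare_training_sets`, the random forest classifier, and the toy data are illustrative assumptions. In particular, the noisy copy of the training rows is only a placeholder for CTGAN samples so that the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_training_sets(X_train, y_train, X_synth, y_synth, X_test, y_test, seed=0):
    """Fit the same classifier on three training sets and score each one
    on the same held-out test set. (Hypothetical helper for illustration.)"""
    training_sets = {
        "real": (X_train, y_train),
        "synthetic": (X_synth, y_synth),
        "real+synthetic": (
            np.vstack([X_train, X_synth]),
            np.concatenate([y_train, y_synth]),
        ),
    }
    scores = {}
    for name, (X, y) in training_sets.items():
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X, y)
        scores[name] = accuracy_score(y_test, clf.predict(X_test))
    return scores


# Toy stand-in for the real dataset (assumption: not the data from this post).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Split off the test set first, so that any synthesizer would only ever
# see the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Placeholder for CTGAN output: a noisy copy of the real training rows.
# With CTGAN, these arrays would instead come from a model fitted on
# the training portion only.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.5, size=X_train.shape)
y_synth = y_train.copy()

scores = compare_training_sets(X_train, y_train, X_synth, y_synth, X_test, y_test)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Scoring all three variants on one shared test split keeps the accuracies directly comparable, which a cross-validated score on real data and a single test-set score on synthetic data are not.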
I would appreciate any insights or suggestions to better understand these observations and improve the classification performance. Thank you for your attention to this matter.
Best regards, Daniele