ilkayyuksel opened this issue 1 month ago
Hi there @ilkayyuksel 👋
Do you mind sharing some visualizations of what your marginal distributions look like? This would help us understand whether they're bimodal, skewed, etc.
In general, the loss chart looks good, and that can correlate with high-quality synthetic data, but that's not always the case with CTGAN. GANs in general can be cumbersome to tweak (which is often why we point people to Gaussian Copulas instead!), but it seems like this is the approach you'll need to take.
Some potential avenues to consider:
Pre-process the data more thoroughly to make it easier for CTGAN to capture the patterns. If you're able to use SDV instead of CTGAN directly, we do some pre-processing for you based on the metadata, and we make it easy for you to tweak the data transformations. If that's interesting to you, check out CTGANSynthesizer from SDV (see the sketch after this list).
Tune the hyperparameters using an external library. BTB is one that comes to mind (but we aren't experts in this ourselves, so we can't provide specific support). You can read our FAQ article here on tuning hyperparameters.
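For reference, here's a minimal sketch of the SDV route, assuming a recent SDV 1.x and a pandas DataFrame `data` holding your real samples (the epoch count is a placeholder to tune):

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Detect column types from the real data; review the result and
# correct it if any column is mis-detected.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# SDV pre-processes the data for CTGAN based on the metadata.
synthesizer = CTGANSynthesizer(metadata, epochs=300, verbose=True)
synthesizer.fit(data)

synthetic = synthesizer.sample(num_rows=1_000)
```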
Hi there @ilkayyuksel I'm closing this issue out for now since I haven't heard from you in a while. But comment here and we can re-open if you still need guidance!
I'd also encourage you to join our Slack community if you aren't there already :)
Hi @srinify
Thank you for your response! I didn't see it earlier, my bad! I am still struggling to generate good fake samples. I have tried the Gaussian Copulas, but again, the quality is bad. I have tried the CTGAN synthesizer, but the quality is no different.
I have also tested all of this with another dataset, the CIC collection (https://www.kaggle.com/datasets/dhoogla/cicidscollection), where all features are numerical!
Some of the distributions are attached to this message.
What do you think? If you need any other information, please ask! What should I do? I have sent a message on Slack as well...
Hi @ilkayyuksel, synthetic data modeling is definitely data dependent. Another area to explore is data transformation, to better prepare the data for SDV. Our RDT library has many transformers to explore: https://docs.sdv.dev/rdt/transformers-glossary/numerical/clusterbasednormalizer
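For example, here's a sketch of swapping in a ClusterBasedNormalizer for one column (assuming SDV 1.x; 'flow_duration' is just a placeholder column name):

```python
from rdt.transformers.numerical import ClusterBasedNormalizer
from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)

# Let SDV propose default transformers first, then override one of them.
synthesizer.auto_assign_transformers(data)
synthesizer.update_transformers(column_name_to_transformer={
    'flow_duration': ClusterBasedNormalizer(max_clusters=10),
})

synthesizer.fit(data)
```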
I'd also recommend incorporating SDMetrics into your pipeline if you aren't already so you can understand how each iteration is improving the most relevant scores: https://docs.sdv.dev/sdmetrics
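A minimal sketch of generating the report, assuming `real_data` and `synthetic_data` DataFrames plus the metadata object from above:

```python
from sdmetrics.reports.single_table import QualityReport

report = QualityReport()
report.generate(real_data, synthetic_data, metadata.to_dict())

print(report.get_score())  # overall quality score between 0 and 1
print(report.get_details(property_name='Column Shapes'))  # per-column scores
```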
Besides my existing advice, I unfortunately don't think there's much more I can offer here. For some datasets, it just takes lots of iteration.
@srinify Thanks a lot for your response and help. I will try these suggestions. Another question: how can I set restrictions on my model? I want to loop over my features and set restrictions, e.g. boundaries, or columns that shouldn't change...
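Something like this sketch is what I have in mind, assuming SDV's add_constraints with ScalarRange constraints is the right mechanism (the column list is a placeholder):

```python
# Keep every numerical feature inside the range seen in the real data.
constraints = []
for column in numerical_columns:  # placeholder list of my feature names
    constraints.append({
        'constraint_class': 'ScalarRange',
        'constraint_parameters': {
            'column_name': column,
            'low_value': float(data[column].min()),
            'high_value': float(data[column].max()),
            'strict_boundaries': False,
        },
    })

synthesizer.add_constraints(constraints=constraints)
synthesizer.fit(data)
```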
I am using the CTGAN model for my master's thesis. I want to generate synthetic data using the UNSW_NB15 dataset (an intrusion detection system dataset, so it contains attacks). I want to generate synthetic data for 'Generic' attacks, which have 58,871 real samples to train with.
I have trained my CTGAN model with the following code:
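(Roughly along these lines; the file path, epoch count, and column lists below are placeholders:)

```python
import pandas as pd
from ctgan import CTGAN

# Load UNSW_NB15 and keep only the 'Generic' attack category.
data = pd.read_csv('UNSW_NB15.csv')  # placeholder path
generic = data[data['attack_cat'] == 'Generic']

# Train CTGAN directly on the 'Generic' samples; list the
# categorical columns so CTGAN treats them as discrete.
ctgan = CTGAN(epochs=300, verbose=True)
ctgan.fit(generic, discrete_columns=['proto', 'service', 'state'])

samples = ctgan.sample(10_000)
```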
Loss values:
Those are the loss values for my generator and discriminator. If you look at the discussion https://github.com/sdv-dev/SDV/discussions/980, you would expect really good synthetic data from the CTGAN model.
But if I use the metrics from SDV to compare the real data with the synthetic data, the scores are bad:
KSComplement:
TVComplement:
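(I computed these per column along these lines; 'dur' and 'proto' are example feature names:)

```python
from sdmetrics.single_column import KSComplement, TVComplement

# KSComplement compares numerical distributions,
# TVComplement compares categorical ones.
ks = KSComplement.compute(real_data['dur'], synthetic_data['dur'])
tv = TVComplement.compute(real_data['proto'], synthetic_data['proto'])
```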
The visual distributions of each feature are also bad.
Can you help me? What did I do wrong? Why do the fake samples have bad quality?
P.S. If I use SMOTE, the SDV metric scores are better. But I have to use a GAN model...