sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Loss values are good, but the quality of the synthetic data is bad #2010

Open ilkayyuksel opened 1 month ago

ilkayyuksel commented 1 month ago

I am using the CTGAN model for my master's thesis. I want to generate synthetic data from the UNSW_NB15 dataset (an intrusion detection system dataset, so it contains attacks). Specifically, I want to generate synthetic 'Generic attacks', for which there are 58,871 real samples to train on.

I have trained my CTGAN model with the following code:

from ctgan import CTGAN

ctgan = CTGAN(
    epochs=600, verbose=True, generator_lr=1e-5, discriminator_lr=1e-6,
    batch_size=128, pac=2, generator_decay=1e-6, discriminator_decay=1e-6,
    discriminator_steps=1,
)
ctgan.fit(real_data, discrete_columns)

Loss values:

[image: generator and discriminator loss curves]

Those are the loss values for my generator and discriminator. Based on the discussion in https://github.com/sdv-dev/SDV/discussions/980 , you would expect the CTGAN model to generate really good synthetic data.

But when I use the SDV metrics to compare the real data with the synthetic data, the scores are bad:

KS_complement:

Column: dur , Score:  0.47134738665896614
Column: spkts , Score:  0.6197188938526609
Column: dpkts , Score:  0.723784647789234
Column: sbytes , Score:  0.30066847853781997
Column: dbytes , Score:  0.39178464778923405
Column: rate , Score:  0.5549714460430433
Column: sload , Score:  0.6335265580676395
Column: dload , Score:  0.3017846477892341
Column: sloss , Score:  0.777054237230555
Column: dloss , Score:  0.7870712235226173
Column: sinpkt , Score:  0.47327176368670476
Column: dinpkt , Score:  0.36778464778923403
Column: sjit , Score:  0.30070190756059856
Column: djit , Score:  0.4202410864432403
Column: swin , Score:  0.7760882098146795
Column: stcpb , Score:  0.3720882098146795
Column: dtcpb , Score:  0.44599999999999995
Column: dwin , Score:  0.7760882098146795
Column: tcprtt , Score:  0.37802026464643035
Column: synack , Score:  0.483
Column: ackdat , Score:  0.485
Column: smean , Score:  0.3137999864109664
Column: dmean , Score:  0.713784647789234
Column: trans_depth , Score:  0.8535419136756637
Column: response_body_len , Score:  0.4398115031169847
Column: ct_src_dport_ltm , Score:  0.36141310662295534
Column: ct_dst_sport_ltm , Score:  0.3924831920640043
Column: ct_flw_http_mthd , Score:  0.8455249273836014

Average:  0.5271555622826664

TV_complement:

Column: proto , Score:  0.42350683698255553
Column: service , Score:  0.28905916325525305
Column: state , Score:  0.6670394421701688

Average:  0.45986848080265913
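For context on what these scores mean: KSComplement is 1 minus the two-sample Kolmogorov–Smirnov statistic (the maximum gap between the two empirical CDFs), and TVComplement is 1 minus the total variation distance between category frequencies, so 1.0 is a perfect match and values around 0.5 indicate a large mismatch. A minimal pure-Python sketch, equivalent in spirit to (but not the actual) SDMetrics implementation:

```python
import bisect
from collections import Counter


def ks_complement(real, synthetic):
    """1 - max ECDF gap between two numerical samples (higher is better)."""
    sr, ss = sorted(real), sorted(synthetic)
    points = sorted(set(sr) | set(ss))
    gap = max(
        abs(bisect.bisect_right(sr, x) / len(sr)
            - bisect.bisect_right(ss, x) / len(ss))
        for x in points
    )
    return 1 - gap


def tv_complement(real, synthetic):
    """1 - total variation distance between category frequencies."""
    cr, cs = Counter(real), Counter(synthetic)
    categories = set(cr) | set(cs)
    tvd = 0.5 * sum(
        abs(cr[c] / len(real) - cs[c] / len(synthetic)) for c in categories
    )
    return 1 - tvd


print(ks_complement([1, 2, 3, 4], [1, 2, 3, 4]))  # identical samples -> 1.0
print(tv_complement(['tcp'] * 3, ['udp'] * 3))    # disjoint categories -> 0.0
```

So an average KSComplement of ~0.53 means that, on a typical column, the real and synthetic CDFs are almost half the probability mass apart.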

The plotted distributions of each feature also match poorly.

Can you help me? What did I do wrong? Why is the quality of the fake samples bad?

PS. If I use SMOTE, the SDV metric scores are better. But I have to use a GAN model...

srinify commented 1 month ago

Hi there @ilkayyuksel 👋

Do you mind sharing some visualizations of what your marginal distributions look like? This would help us understand whether they're bimodal, skewed, etc.

In general, the loss chart looks good, and that can correlate with high-quality synthetic data, but it's not always the case with CTGAN. GANs in general can be cumbersome to tune (which is often why we point people to Gaussian Copulas instead!), but it seems like this is the approach you'll need to take.

Some potential avenues to consider:

srinify commented 1 month ago

Hi there @ilkayyuksel I'm closing this issue out for now since I haven't heard from you in a while. But comment here and we can re-open if you still need guidance!

I'd also encourage you to join our Slack community if you aren't there already :)

ilkayyuksel commented 6 days ago

Hi @srinify

Thank you for your response! I didn't see it earlier, my bad! I am still struggling to generate good fake samples. I have tried the Gaussian Copulas, but again the quality is bad. I have also tried the CTGAN synthesizer, but the quality is no different.

I have also tested all of this with another dataset, the CIC collection (https://www.kaggle.com/datasets/dhoogla/cicidscollection), where all features are numerical!

Some distributions are added in this message.

[images: distribution plots for Total Fwd Packets, Total Backward Packets, Fwd Packets Length Total, Fwd Packet Length Std, Fwd Packet Length Mean, Fwd Packet Length Max, Flow Duration, Bwd Packets Length Total, Bwd Packet Length Std, Bwd Packet Length Mean, Bwd Packet Length Max]

What do you think? If you need any other information, please ask! What should I do? I have sent a message on Slack as well...

srinify commented 1 day ago

Hi @ilkayyuksel synthetic data modeling is definitely data dependent. Another area to explore is data transformation, to better prepare the data for SDV. Our RDT library has many transformers to explore: https://docs.sdv.dev/rdt/transformers-glossary/numerical/clusterbasednormalizer

I'd also recommend incorporating SDMetrics into your pipeline if you aren't already so you can understand how each iteration is improving the most relevant scores: https://docs.sdv.dev/sdmetrics

Besides my existing advice, I unfortunately don't think there's much more I can offer here. For some datasets, it just takes lots of iteration.

ilkayyuksel commented 22 hours ago

@srinify Thanks a lot for your response and help. I will try these suggestions. Another question: how can I set restrictions on my model? I want to loop over my features and set restrictions, e.g. boundaries, columns that shouldn't change, ...