sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Loss values are good, but the quality of the synthetic data is bad... How?? HELP WANTED #391

Closed ilkayyuksel closed 2 months ago

ilkayyuksel commented 2 months ago

Hi, I would like to ask a question.

I am using the CTGAN model for my master's thesis. I want to generate synthetic data from the CIC Collection dataset (https://www.kaggle.com/datasets/dhoogla/cicidscollection), an intrusion detection system dataset that contains attacks and has only numerical features. I want to generate synthetic data for a single attack; it doesn't matter which one, so I chose 'Infiltration', which has 94857 real samples to train on. I trained my CTGAN model with the following code:

from ctgan import CTGAN

ctgan = CTGAN(
    epochs=600, verbose=True, batch_size=128, pac=2,
    generator_lr=1e-5, generator_decay=1e-6,
    discriminator_lr=1e-6, discriminator_decay=1e-6, discriminator_steps=1,
)
ctgan.fit(real_data, discrete_columns)

Loss values: [image]

metrics from SDV:

KS complement Average: 0.3587
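For readers unfamiliar with the metric: the KS complement of a numerical column is one minus the two-sample Kolmogorov-Smirnov statistic, i.e. one minus the largest gap between the empirical CDFs of the real and synthetic values, so 1.0 means identical distributions and 0.0 means no overlap. The sketch below is a plain-Python illustration of that definition, not SDV's own implementation. The function names and the assumption that the reported "Average" is a simple mean over columns are mine.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - (two-sample KS statistic) for two lists of numbers."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    # The largest gap between two empirical CDFs is attained at an observed value,
    # so it is enough to evaluate both CDFs at every distinct data point.
    max_gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def average_ks_complement(real_columns, synthetic_columns):
    """Score every column and return (mean score, per-column scores).

    Both arguments are dicts mapping a column name to a list of values.
    Assumption: the overall score is a simple unweighted mean over columns.
    """
    scores = {
        name: ks_complement(real_columns[name], synthetic_columns[name])
        for name in real_columns
    }
    return sum(scores.values()) / len(scores), scores
```

Looking at the per-column scores rather than only the average shows which features drag the number down. Under this definition, an average of 0.3587 would mean that in a typical column the real and synthetic CDFs differ by more than 0.6 at some point.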

Result: Although the generator and discriminator losses are stabilizing, the quality of my fake samples is bad.

Then I trained the CTGAN synthesizer from SDV, which applies some additional preprocessing, but the results were no different.

Why is this happening? My loss values are perfectly shaped according to https://github.com/sdv-dev/SDV/discussions/980

If you need any other information, please ask! Can you help me? I have been struggling with this for a while.

You can see some of the distributions (see images)

[Images: distribution plots for Total Fwd Packets, Total Backward Packets, Fwd Packets Length Total, Fwd Packet Length Std, Fwd Packet Length Mean, Fwd Packet Length Max, Flow Duration, Bwd Packets Length Total, Bwd Packet Length Std, Bwd Packet Length Mean, Bwd Packet Length Max]

srinify commented 2 months ago

Hi @ilkayyuksel, I'll close this issue out since we already have this thread: https://github.com/sdv-dev/SDV/issues/2010