Hi, I would like to ask a question.
I am using the CTGAN model for my master's thesis. I want to generate synthetic data from the CIC Collection dataset (https://www.kaggle.com/datasets/dhoogla/cicidscollection). It is an intrusion detection dataset, so it contains attacks, and it has only numerical features. I want to generate synthetic samples for a single attack class. Which one does not matter, so I chose 'Infiltration', which has 94857 real samples to train on. I trained my CTGAN model with the following code:
loss values:
metrics from SDV:
KS complement Average: 0.3587
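For context on how I read this number: as far as I understand, the KS complement of a column is one minus the two-sample Kolmogorov-Smirnov statistic between the real and the synthetic values, so 1.0 means the two marginal distributions match and 0.0 means they do not overlap at all. If the average is taken over columns, 0.3587 would correspond to an average maximum CDF gap of about 0.64. Below is a minimal pure-Python sketch of that idea. The function name `ks_complement` and the toy values are my own illustration, not SDV's implementation and not data from the CIC dataset.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - KS statistic for two non-empty numeric samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)

    # Both empirical CDFs are step functions, so the largest gap
    # between them is reached at one of the observed values.
    max_gap = 0.0
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Toy columns (made-up values, for illustration only).
real_col = [1.0, 2.0, 3.0, 4.0]
same_synth = [1.0, 2.0, 3.0, 4.0]
shifted_synth = [3.0, 4.0, 5.0, 6.0]
disjoint_synth = [10.0, 11.0, 12.0, 13.0]

for name, col in [("same", same_synth),
                  ("shifted", shifted_synth),
                  ("disjoint", disjoint_synth)]:
    print(name, ks_complement(real_col, col))
```

Running a check like this per column would show which features pull the average down, instead of looking only at the single aggregated score.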
Result: although the generator and discriminator losses stabilize, the quality of my synthetic samples is poor.
I then trained the CTGAN synthesizer, which applies some additional preprocessing internally, but the results were no different.
Why is this happening? My loss curves have the expected shape according to https://github.com/sdv-dev/SDV/discussions/980
If you need any other information, please ask. Can anyone help? I have been struggling with this for a while.
You can see some of the distributions in the attached images.