sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Feature Request: More verbose logging #354

Closed: ingmarfjolla closed this issue 5 months ago

ingmarfjolla commented 6 months ago

Problem Description

Hello! I've been getting started with this library and noticed that even with verbose set to True, the logging is sparse before the generator and discriminator begin training. I'm working with a pretty large dataset, and while stepping through with the debugger I found that the code was hanging somewhere near here: https://github.com/sdv-dev/CTGAN/blob/main/ctgan/data_transformer.py#L54

That makes sense, since the dataset is pretty big and those operations can take a while to work through.

However, it would be nice if there were a way to know that those operations were happening; without any logging I wasn't sure whether something was stuck or whether the discriminator/generator were just taking a while to start.

Expected behavior

I'm not sure of the best way to handle this in Python since I come from a Java background (where I would do something like add a debug profile with extra logging), but I would assume the verbose flag could be passed here: https://github.com/sdv-dev/CTGAN/blob/main/ctgan/synthesizers/ctgan.py#L304 as an extra parameter, and maybe the DataTransformer operations could get some added print or logging statements to indicate that pre-processing is happening, if a user wants it.
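For illustration, here's a rough sketch of the kind of change I mean, modeled on the per-column loop in DataTransformer.fit; the verbose parameter is hypothetical and not part of the current API:

```python
import logging

logger = logging.getLogger('ctgan.data_transformer')

class DataTransformer:
    # ... existing helpers (_fit_discrete, _fit_continuous, etc.) unchanged ...

    def fit(self, raw_data, discrete_columns=(), verbose=False):
        # Hypothetical `verbose` flag: log each column as its transformer is
        # fit, so long pre-processing runs visibly make progress.
        for column_name in raw_data.columns:
            if verbose:
                logger.info('Fitting transformer for column %r', column_name)
            if column_name in discrete_columns:
                self._fit_discrete(raw_data[[column_name]])
            else:
                self._fit_continuous(raw_data[[column_name]])
```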

Thanks for taking a look and for maintaining this library!

srinify commented 5 months ago

Hi there @ingmarfjolla, I empathize with your pain! Out of curiosity, can you share more about your dataset and how long it's taking? It would be great to know how large the dataset is, what the columns look like, and where the time is going.

I have some immediate suggestions as well, but those details would help me tailor them.

ingmarfjolla commented 5 months ago

Hey @srinify, thanks for your reply!

Sure thing, I was working with this dataset: https://www.unb.ca/cic/datasets/iotdataset-2023.html

TL;DR: it has 47 feature columns plus the label column. At the moment we mapped 15 columns as discrete and treated the rest as continuous, though we're still experimenting with that split. All columns are kept as float32 to minimize the dataframe's footprint in memory.

I think there are around 44 million data points in total, roughly 14 GB.

We trained on three subsets of roughly 10%, 50%, and 80% of the original data. The 10% subset took about 8 hours, the 50% subset took about 3 days, and the 80% run never finished because it ran out of memory. There's definitely room on our part to avoid loading the original data and to train on one class label at a time, but we did find that our 50% subset wasn't producing the best data.
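For context, loading and subsetting looked roughly like this on our end (the path is a placeholder, and dtype=np.float32 assumes every column is numeric):

```python
import numpy as np
import pandas as pd

# Placeholder path; the real dataset is spread across several CSV files.
df = pd.read_csv('ciciot2023.csv', dtype=np.float32)

# Fit on a random 10% subset first to gauge runtime and memory.
subset = df.sample(frac=0.1, random_state=42)
```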

We originally started with the CTGANSynthesizer from SDV, but we didn't see any logs, and once I set up my debugger and profiler it was easier to work with CTGAN directly instead of going through CTGANSynthesizer.
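For reference, by "working with CTGAN directly" I mean something like this (the discrete column names are placeholders, and `subset` is the sampled dataframe from above):

```python
from ctgan import CTGAN

# Placeholder names standing in for the 15 columns we mapped as discrete.
discrete_columns = ['label', 'protocol']

model = CTGAN(epochs=300, verbose=True)  # verbose=True only logs the training epochs
model.fit(subset, discrete_columns)      # pre-processing happens silently inside fit()

synthetic = model.sample(1000)
```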

Like I said, just for usability it would have been nice to know it was in the pre-processing stage rather than seeing no logging at all. I'm also still learning, so thank you for the suggestions and feedback!

I will note that we saw memory balloon hugely during CTGAN pre-processing, directly after the continuous columns were fit. Typically when we trained a neural network (with all the data loaded in memory) we were looking at ~40 GB, but here usage ballooned to ~158 GB before the CTGAN training loop even started. So any pointers there would be appreciated too. Thanks again!
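In case it's useful, here's a minimal sketch of how we watched resident memory during fit; psutil and the background thread are our own addition, not something CTGAN provides:

```python
import os
import threading
import time

import psutil

def log_memory(interval_seconds=30):
    """Print this process's resident memory periodically."""
    process = psutil.Process(os.getpid())
    while True:
        rss_gb = process.memory_info().rss / 1024 ** 3
        print(f'RSS: {rss_gb:.1f} GiB')
        time.sleep(interval_seconds)

# Daemon thread so it won't block interpreter shutdown.
threading.Thread(target=log_memory, daemon=True).start()

# `model`, `subset`, and `discrete_columns` as in the snippet above; the memory
# spike showed up right after the continuous columns finished fitting.
model.fit(subset, discrete_columns)
```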

srinify commented 5 months ago

Hi there @ingmarfjolla, thanks for all this additional context!

Re: Progress bar & verbosity

Ah, I see what you're seeing! CTGANSynthesizer does additional data pre-processing, and we currently don't have a progress bar for that stage. That's a good idea, so I've opened an issue for it: https://github.com/sdv-dev/SDV/issues/1983
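For what it's worth, a progress bar there could be as simple as wrapping the column loop in DataTransformer.fit with tqdm; this is a sketch of the idea, not current behavior:

```python
from tqdm import tqdm

# Hypothetical: inside DataTransformer.fit
for column_name in tqdm(raw_data.columns, desc='Fitting column transformers'):
    ...  # existing per-column fit logic
```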


Alternative approaches

Do you mind sharing more about your use case and goals? If your primary goal is to generate synthetic data, then I'd encourage you to consider SDV's other synthesizers as well.

GANs in general take a long time to train on large datasets and can also be difficult to tune for synthetic data quality. I would encourage you to try GaussianCopulaSynthesizer, which is significantly faster and offers more knobs (like setting column-level distributions) to influence the modeling process.
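A minimal sketch of that approach with SDV's current API; the file path and column name are placeholders:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

data = pd.read_csv('ciciot2023.csv')  # placeholder path

# Auto-detect column types, then override anything that looks wrong.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

# numerical_distributions pins a distribution per column (one of the knobs mentioned above).
synthesizer = GaussianCopulaSynthesizer(
    metadata,
    numerical_distributions={'flow_duration': 'gamma'},  # placeholder column
)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=1_000)
```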

With the alternative approach (GaussianCopulaSynthesizer) suggested and the feature request opened, I'll close out this specific issue thread. But let me know if these two don't really address your workflow challenges and I can always re-open!