sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Question about large amount of training dataset in TVAE -- is there max? #299

Closed koseoyoung closed 8 months ago

koseoyoung commented 1 year ago

Problem description

Is there a maximum size for the training dataset when fitting TVAE? I tried a large dataset (around 80 MB) on a CPU, and fitting takes a very long time even though epochs is set to 1: it has been running for more than an hour without producing any logs. Is this expected behavior? Since there is no verbose option, it is hard to tell whether training is progressing or has hit an error.

Thank you! : )

npatki commented 1 year ago

Hi @koseoyoung, nice to meet you.

As long as your code isn't crashing, it should still be running as intended. Even with only 1 epoch specified, TVAE still splits your data into mini-batches of batch_size rows and processes them one after another, so a single epoch over a large dataset can take some time.
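To get a rough feel for how much work one epoch involves, the number of mini-batches can be estimated from the row count. This is only a sketch: the helper name `steps_per_epoch` and the row count are hypothetical, the default of 500 is assumed to match TVAE's default batch_size, and it assumes the final partial batch is kept rather than dropped.

```python
import math


def steps_per_epoch(n_rows: int, batch_size: int = 500) -> int:
    """Return how many mini-batches are needed to see every row once.

    Assumes the last, possibly smaller, batch is still processed.
    """
    if n_rows <= 0 or batch_size <= 0:
        raise ValueError("n_rows and batch_size must be positive")
    return math.ceil(n_rows / batch_size)


# Hypothetical row count for illustration; use len(your_dataframe) instead.
rows = 1_000_000
print(steps_per_epoch(rows))  # → 2000
```

Each of those steps is a forward and backward pass, so on a CPU the time for one epoch grows roughly in proportion to this number.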

I understand the frustration of not having a verbose option, so I've added #300 as a proposed feature request.

> I'm wondering if there is any max length of the training dataset for TVAE (for dataset fitting).

While there is no theoretical max length, you may find certain dataset sizes infeasible for the computational power that you have. For GAN-based synthesizers, many users report needing a few hours.

> The dataset size was around 80 MB, and I was running the code with CPU.

If possible, running on a GPU might be a good option. Alternatively, you can subsample your data for training purposes. The important thing is to make sure your subsample contains the patterns you are trying to learn: for example, all the possible categories and a wide range of numerical values.
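One possible way to subsample while keeping every category is sketched below with pandas. The function name `subsample_keep_categories`, its parameters, and the toy data are hypothetical and not part of CTGAN. The sketch only guards categorical coverage; the numerical range of the subsample would still need to be checked separately, for instance by comparing `describe()` output.

```python
import pandas as pd


def subsample_keep_categories(df, n_rows, discrete_columns, seed=0):
    """Randomly sample about n_rows rows, then add back one row for each
    category value that the random sample happened to miss."""
    if n_rows >= len(df):
        return df.copy()
    sample = df.sample(n=n_rows, random_state=seed)
    for col in discrete_columns:
        seen = set(sample[col].dropna().unique())
        missing = set(df[col].dropna().unique()) - seen
        if missing:
            # Take one representative row per missing category value.
            extra = df[df[col].isin(missing)].drop_duplicates(subset=col)
            sample = pd.concat([sample, extra])
    return sample


# Toy data: two rare categories that a small random sample could easily miss.
df = pd.DataFrame({
    "color": ["red"] * 98 + ["green", "blue"],
    "value": range(100),
})
small = subsample_keep_categories(df, n_rows=10, discrete_columns=["color"])
print(sorted(small["color"].unique()))  # → ['blue', 'green', 'red']
```

The result can be slightly larger than `n_rows` because of the rows added back, which is usually an acceptable trade for not losing rare categories.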

npatki commented 8 months ago

Marking this issue as resolved since it has been inactive for some time. The good news is that the feature in #300 has been added, so you can now view the progress bar to track estimated time.

If you have additional questions, please feel free to file a new issue. Thanks.