sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Allow model to train in batches (helps with: memory usage, large dataset handling, progress tracking) #805

Open halvorot opened 2 years ago

halvorot commented 2 years ago

Environment details

Question

Is it possible to train a relational HMA model in batches, to avoid having to load all the data into memory at once? When working with large amounts of data, it may not be possible to load all rows of all tables into memory. Is it possible to fit the model on part of the data, save it to a pkl file, load it back, and continue training on the rest of the data? Ideally this would yield the same result as training on all the data at once.
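
For concreteness, a minimal sketch of that workflow against the present-day SDV multi-table API (paths and table names are illustrative): fitting, saving, and loading a synthesizer exist today, while the commented-out second fit is the incremental piece this issue requests.

```python
import pandas as pd
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

# Illustrative first chunk of each table (file paths are placeholders).
first_chunk = {
    'employees': pd.read_csv('employees_part1.csv'),
    'salaries': pd.read_csv('salaries_part1.csv'),
}

metadata = MultiTableMetadata.load_from_json('metadata.json')

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(first_chunk)
synthesizer.save('hma_checkpoint.pkl')      # supported today

# Later: reload and, ideally, keep training on the next chunk.
synthesizer = HMASynthesizer.load('hma_checkpoint.pkl')   # supported today
# synthesizer.fit(second_chunk)  # NOT incremental today: fit() retrains from
#                                # scratch, which is the gap described here.
```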

npatki commented 2 years ago

Hi @halvorot, this feature is not currently supported in the SDV. Let me mark this issue as a feature request rather than a question.

To help us prioritize, it would be helpful if you could provide more details about your use case. How large is your dataset? How do you plan to use the synthetic data?

halvorot commented 2 years ago

Thank you.

I am trying to build on top of the SDV framework: an app that connects to an arbitrary database with SQLAlchemy, generates the metadata, and collects the data from it. I don't yet know how large future datasets will be, but it is important that this scales. An example database I am working with now is the MySQL Employee database (https://dev.mysql.com/doc/employee/en/). SDV struggles with it, but I did get it to fit the data in about 2.5 hours.

Employee Database Tables:

+--------------+---------+
| table_name   | records |
+--------------+---------+
| employees    |  300024 |
| departments  |       9 |
| dept_manager |      24 |
| dept_emp     |  331603 |
| titles       |  443308 |
| salaries     | 2844047 |
+--------------+---------+

The plan is then to insert the synthetic data into another copy of the database schema, creating a "synthetic copy" of the original database. A realistic database size would be around 10 tables with 500,000 to 5,000,000 rows per table. Is it unrealistic to expect SDV to model that much data?
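
For reference, the collection step described above might look roughly like this with SQLAlchemy and pandas (the connection string is a placeholder); it is exactly this load-everything-at-once step that stops scaling.

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

# Placeholder connection string for the MySQL Employee database.
engine = create_engine('mysql+pymysql://user:password@localhost/employees')

# Read every table fully into memory; this is the step that becomes
# infeasible once tables grow to millions of rows.
data = {
    name: pd.read_sql_table(name, engine)
    for name in inspect(engine).get_table_names()
}
```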

npatki commented 2 years ago

Hi @halvorot, thanks for the details!

There are certainly ways to optimize the algorithm for a speed-up. But they come with tradeoffs, so there is no one-size-fits-all solution: it depends heavily on the dataset, what you're planning to accomplish with the synthetic data, etc.

Happy to chat when you have a few concrete use cases. In the meantime, you can experiment with the model_kwargs to figure out what suits your needs, or use smaller datasets for exploration.

BTW the SDV team is actively thinking about optimizations and we may make some general updates in future releases.

halvorot commented 2 years ago

Thank you! I understand that speed-ups and optimizations are very dataset dependent, so those would just be an added bonus. I am thinking more about the memory issue: having to load the entire database into memory in order to train a model may not be feasible when there is a lot of data. So my feature request is really for continued training of a previously trained model on additional data (for example, loading half the data, all tables but half the rows, into memory, training, then loading the other half and continuing training).

Sorry for the poor description; is it clear what my problem is and what I am suggesting?
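
To make the request concrete, a rough sketch of what such an API could look like; fit_incremental is hypothetical and does not exist in SDV, and the chunked read is just one way to avoid loading a table whole.

```python
import pandas as pd
from sqlalchemy import create_engine
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

engine = create_engine('mysql+pymysql://user:password@localhost/employees')
metadata = MultiTableMetadata.load_from_json('metadata.json')
synthesizer = HMASynthesizer(metadata)

# Stream the largest table in row chunks instead of loading it all at once.
for chunk in pd.read_sql_query('SELECT * FROM salaries', engine, chunksize=500_000):
    # Hypothetical method: SDV only offers an all-at-once fit() today.
    # This loop is the behaviour the feature request describes.
    synthesizer.fit_incremental({'salaries': chunk})

synthesizer.save('hma_checkpoint.pkl')
```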

I would just like to mention that I think SDV is an awesome library and you are doing some great work here, so happy to be able to use it. Keep it up :)

anyweez commented 2 years ago

@halvorot if I understand correctly, you want to train an SDV model on a potentially very large input dataset (larger than memory). I believe this is typically referred to as "out of core training." This is a use case I'm interested in as well.

Example: I have a 2TB dataset that I'd like to train on, but I don't have a machine w/ 2TB of memory. Out of core training would allow me to train 30GB at a time (for example) and eventually achieve the same results as if I'd trained w/ all 2TB.

I haven't tried this in SDV, but I have done some exploration and I think it's possible with PyTorch. Here's what I intend to try, though I haven't done it yet: https://discuss.pytorch.org/t/loading-huge-data-functionality/346
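
For anyone exploring the same route, a minimal sketch of the chunk-loading pattern discussed in that thread (file name, chunk size, and the all-numeric-columns assumption are illustrative); it only covers loading, not how an SDV model would consume the batches.

```python
import pandas as pd
import torch
from torch.utils.data import DataLoader, IterableDataset

class ChunkedCSVDataset(IterableDataset):
    """Stream rows from a CSV that is too large to hold in memory."""

    def __init__(self, path, chunksize=100_000):
        self.path = path
        self.chunksize = chunksize

    def __iter__(self):
        # pandas reads the file lazily, one chunk at a time.
        for chunk in pd.read_csv(self.path, chunksize=self.chunksize):
            # Assumes all-numeric columns for simplicity.
            for row in chunk.itertuples(index=False):
                yield torch.tensor(row, dtype=torch.float32)

loader = DataLoader(ChunkedCSVDataset('salaries.csv'), batch_size=1024)
for batch in loader:
    pass  # feed batches to a model that supports incremental updates
```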

I'll report back if I learn anything.

npatki commented 2 years ago

> So my feature request is really for continued training of a previously trained model on additional data

Yup! Let's focus this issue on the memory issues of loading large data.

@anyweez nice to meet you! I'm interested to see what you discover. Note that the SDV models do not currently support training in batches, but understanding how the data can be batch loaded is a good first step.

npatki commented 2 years ago

As we see more issues related to this, it's becoming clear that batch training is useful for more than just memory usage. I'll update the title to reflect this.

DamianUS commented 2 years ago

+1! Following the progress on this =)

vinay-k12 commented 1 year ago

I'm looking forward to this feature as well. Besides helping with large datasets, it would also enable transfer learning: if we can load pre-trained models, we can continue training them on new problems and bridge the gap when the available data is insufficient.

Any tentative timelines on when this feature would be rolled out?

ardulat commented 4 months ago

Hi @npatki! I see that this issue has been referenced in many batch-training-related issues over the last two years. Are there any updates on batch training? I am also unable to fit millions of rows into memory to train the sequential model on large data. Batch training would hugely help with training on large amounts of data, though I understand that SDV can perform well with little data.
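
Not a fix for the memory problem itself, but until batch training exists, one workaround in the spirit of "SDV can perform well with little data" is to fit on a subset of whole sequences. A rough sketch, assuming PARSynthesizer as the sequential model (file, column names, and the 10% rate are illustrative; the data still has to be loadable once to draw the sample).

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

df = pd.read_csv('sequences.csv')  # placeholder sequential dataset

# Sample 10% of sequence ids (not 10% of rows) so whole sequences stay intact.
sampled_ids = df['sequence_id'].drop_duplicates().sample(frac=0.10, random_state=0)
subset = df[df['sequence_id'].isin(sampled_ids)]

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(subset)
metadata.update_column('sequence_id', sdtype='id')
metadata.set_sequence_key('sequence_id')

synthesizer = PARSynthesizer(metadata)
synthesizer.fit(subset)
```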