sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Scalability of the SDV tool and GPU support for Multi Table Data #2110

Closed manurajmr1 closed 2 months ago

manurajmr1 commented 3 months ago

Environment details

Problem description

Hi, I wanted to use SDV to generate multi-table data. For that I used HMASynthesizer, which is free, and I have a few questions about it.

The free HMASynthesizer felt slow while testing. How much scalability can it provide? For example, 1M rows of data with constraints took around 3 days for me. The docs say: "The HMA Synthesizer uses hierarchical ML algorithm to learn from real data and generate synthetic data. The algorithm uses classical statistics." That means it doesn't leverage the GPU, right, since it's not a neural network?

I also found other, paid synthesizers such as HSASynthesizer and IndependentSynthesizer. Do those leverage the GPU if I use the ones that support a neural network synthesizer? And how long would it take to generate around 1M rows of synthetic data for 2 tables with 5 columns each, maintaining a primary key/foreign key relation between the tables and a date constraint such as hotel booking date < hotel checkout date?

Is a trial of the paid version available, so I can check whether it supports neural network training on GPU, and whether it supports parallelism such as distributing the load across multiple GPUs for faster performance (PyTorch supports this by default)?

What I already tried

I tried the Multi table Data use case and saw the process is slow, but the quality of data generated is good.
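For anyone trying to reproduce this scenario, here is a minimal standard-library sketch of the data shape described above: two tables with 5 columns each, a primary key/foreign key link, and a booking date that must fall before the checkout date. It does not use the SDV API. The table names, column names, and helper functions (`make_hotel_data`, `check_tables`) are hypothetical and only illustrate how the key and date rules could be checked on any generated output.

```python
import random
from datetime import date, timedelta


def make_hotel_data(n_guests, n_bookings, seed=0):
    """Build a toy two-table dataset: guests (parent) and bookings (child)."""
    rng = random.Random(seed)
    guests = [
        {
            "guest_id": i,  # primary key of the parent table
            "age": rng.randint(18, 90),
            "country": rng.choice(["US", "IN", "DE", "BR"]),
            "loyalty_tier": rng.choice(["none", "silver", "gold"]),
            "has_newsletter": rng.random() < 0.5,
        }
        for i in range(n_guests)
    ]
    bookings = []
    for j in range(n_bookings):
        booked = date(2024, 1, 1) + timedelta(days=rng.randint(0, 364))
        bookings.append(
            {
                "booking_id": j,  # primary key of the child table
                "guest_id": rng.randrange(n_guests),  # foreign key to guests
                "booking_date": booked,
                # Checkout is always at least one day after the booking date.
                "checkout_date": booked + timedelta(days=rng.randint(1, 14)),
                "room_rate": round(rng.uniform(50, 500), 2),
            }
        )
    return guests, bookings


def check_tables(guests, bookings):
    """Count violations of the key rules and the date-order rule."""
    guest_ids = [g["guest_id"] for g in guests]
    booking_ids = [b["booking_id"] for b in bookings]
    known = set(guest_ids)
    return {
        "duplicate_guest_ids": len(guest_ids) - len(known),
        "duplicate_booking_ids": len(booking_ids) - len(set(booking_ids)),
        "orphan_bookings": sum(b["guest_id"] not in known for b in bookings),
        "bad_date_order": sum(
            b["booking_date"] >= b["checkout_date"] for b in bookings
        ),
    }


guests, bookings = make_hotel_data(n_guests=1_000, n_bookings=5_000)
report = check_tables(guests, bookings)
print(report)
```

Every count in the printed report should be zero for this toy data. The same `check_tables` function could be applied to synthetic rows once they are converted to lists of dicts, which gives a quick check that the relationship and the date constraint survived generation.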

npatki commented 3 months ago

Hi @manurajmr1 nice to meet you.

Since your questions are related to our free vs. paid plans, I would encourage you to reach out here for more clarity. This GitHub is primarily meant for tracking bugs and troubleshooting code in the free version.

To answer your Qs briefly:

npatki commented 2 months ago

Hi @manurajmr1, I'm closing off this issue since we've pointed you to the right venue for this type of Q.

If you need help troubleshooting any code when using SDV, please feel free to file a new issue here with any related code snippet(s).

deepakbhavsar123 commented 3 weeks ago

Does SDV Enterprise solve the performance issue? The free SDV SDK takes too much time to generate data.