sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

SDV support for Ray? #2030

Closed raywang1021 closed 2 weeks ago

raywang1021 commented 1 month ago

I'm not sure how SDV can generate data at large scale, such as a TB-level dataset. Does SDV support a framework like Ray (https://www.ray.io/) for scaling workloads effortlessly? Many thanks.

srinify commented 4 weeks ago

Hi there @raywang1021 👋 we currently have a bug that might make SDV problematic to use in a distributed way: https://github.com/sdv-dev/SDV/issues/2042

Specifically, another community member pointed out in that issue that SDV generates temporary CSV files to store intermediate progress during model training, which makes it problematic to use with Ray (which apparently tries to serialize data across multiple processes).
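To illustrate the general pattern at play, here is a minimal, standard-library-only sketch. It is not SDV's actual code and does not use Ray: `train_worker` and `run_workers` are hypothetical names, and the "loss" values are made up. It only shows one way a worker could keep its progress CSV in a private scratch directory and hand back plain, picklable rows, so that nothing tied to a temp file has to cross a process boundary.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def train_worker(worker_id, n_epochs=3):
    """Hypothetical stand-in for a training loop that logs progress to a CSV."""
    # Each call gets its own scratch directory, so parallel workers never
    # share a single fixed temp file path.
    with tempfile.TemporaryDirectory(prefix=f"worker{worker_id}_") as scratch:
        log_path = os.path.join(scratch, "progress.csv")
        with open(log_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["epoch", "loss"])
            for epoch in range(n_epochs):
                writer.writerow([epoch, 1.0 / (epoch + 1)])  # made-up loss value
        # Read the log back into plain Python objects before the directory
        # is deleted; only these picklable rows leave the worker.
        with open(log_path, newline="") as f:
            rows = list(csv.DictReader(f))
    return worker_id, rows


def run_workers(n_workers=4):
    # Threads keep this sketch dependency-free; the assumption is that a
    # distributed task runner could call train_worker in a similar way.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(train_worker, range(n_workers)))
```

Calling `run_workers(3)` returns a dict keyed by worker id, each value being that worker's list of progress rows. Whether this pattern would resolve the linked bug is an open question; it is only meant to make the temp-file concern concrete.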

Have you had a chance to use SDV with Ray? How has your experience been, and have you run into the same issue with the temp CSV file? I'd love to learn more!

srinify commented 2 weeks ago

Hi there @raywang1021, hopefully this answer was helpful! I haven't heard from you in 2 weeks, so I'm going to go ahead and close this issue out. Feel free to comment and tag me or open a new issue if you have more questions on this front!