sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

feat: generate SDV synthetic data at scale #1350

Closed by jaidisido 1 year ago

jaidisido commented 1 year ago

Problem Description

I discovered SDV this week and I am impressed by its capabilities, well done!

One thing I couldn't find in the docs is how to scale the sampling process. As you know, pandas DataFrames are limited to a single machine and run into out-of-memory (OOM) issues pretty easily.

Expected behavior

Ray and Modin are two frameworks that can help alleviate these issues.

Here is a quick and dirty example I have put together to showcase what I mean

Additional context

Some context:

My team looks after a popular Python library named AWS SDK for pandas (20M+ downloads/month). Our focus has been on pandas, but lately we have been working on a new version (currently a release candidate) which scales beyond pandas thanks to Ray and Modin.

Some links for more details:

jaidisido commented 1 year ago

Side note: one thing that has caused me trouble with the synthesizer is the output_file_path argument. My understanding is that if it is left as None, it will "sneakily" create a temporary .csv file locally.

This is fine as is, but in my example I am running multiple synthesizer instances at once. As a result, they concurrently try to write to the same file. As a workaround, I pass a random filename in the argument so that each instance writes to its own file, but at that point I am responsible for cleaning the files up once the sampling is done.
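To illustrate the workaround described above, here is a minimal sketch using only the standard library. It is not SDV code: `fake_sample` is a hypothetical stand-in for a sampling call that accepts an output file path, and `sample_worker` and `sample_in_parallel` are illustrative helper names. The idea is simply that each concurrent worker gets its own unique file name and removes the file when it finishes.

```python
import csv
import os
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor


def fake_sample(num_rows, output_file_path):
    """Hypothetical stand-in for a sampling call that writes rows to a CSV file."""
    with open(output_file_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        for i in range(num_rows):
            writer.writerow([i, i * 2])
    return num_rows


def sample_worker(num_rows, work_dir):
    # A random name per worker, so concurrent workers never share a file.
    path = os.path.join(work_dir, f"sample_{uuid.uuid4().hex}.csv")
    try:
        return fake_sample(num_rows, path)
    finally:
        # The caller owns the file, so the caller cleans it up.
        if os.path.exists(path):
            os.remove(path)


def sample_in_parallel(num_workers, rows_per_worker):
    # The temporary directory is deleted on exit, as a second safety net.
    with tempfile.TemporaryDirectory() as work_dir:
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            counts = list(
                pool.map(
                    lambda n: sample_worker(n, work_dir),
                    [rows_per_worker] * num_workers,
                )
            )
        leftover = os.listdir(work_dir)
    return sum(counts), leftover
```

Calling `sample_in_parallel(4, 10)` runs four workers, each writing to a separate file, and leaves no files behind. With a real synthesizer, the stand-in would be replaced by the actual sampling call with the unique path passed as output_file_path.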

Is there a particular reason why the file must be created at all? I appreciate this is a separate topic, though, so I can create a separate issue in the repo.

sdv-team commented 1 year ago

Hi @jaidisido! It’s great to see your interest in the SDV ecosystem. This comment is a reminder to consult your legal team before adopting the SDV into your project, as recent versions of SDV have a new source-available license.

For more information, you can read through our license FAQs (not legal advice). For any other questions, you can reach us at info@sdv.dev. You can also inquire about a commercial license to allow additional use.

jaidisido commented 1 year ago

Thank you for sharing, I'll keep that in mind.