sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Pyspark backend option? #573

Open tomrod opened 3 years ago

tomrod commented 3 years ago

Problem Description

SDV is AWESOME! And one of the very few players in this space able to handle multi-table data.

However, it is quite limited with sklearn as a backend. What would it take to add PySpark as a backend? This would integrate nicely into modern MLOps pipelines, allowing training to happen across the cluster rather than on the main node.
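For context, here is a minimal sketch of what the single-node pattern looks like today. It uses the SDV 1.x single-table API for illustration; `spark` and `spark_df` are assumed to already exist in the session:

```python
# A minimal sketch, assuming a live SparkSession named `spark` and an input
# Spark DataFrame `spark_df`. Today, everything below runs on the driver node:
# the data is collected locally, and fitting/sampling are single-machine.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

pdf = spark_df.toPandas()  # collects the full table onto the driver

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=pdf)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(pdf)       # training happens on the driver, not the cluster

# Push the synthetic rows back out as a distributed DataFrame.
synthetic = spark.createDataFrame(synthesizer.sample(num_rows=100_000))
```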

npatki commented 2 years ago

Hi @tomrod, thanks for the kind words and the feedback.

We'll keep this feature request open for tracking and updates. To help us prioritize, it would be great if you can share a bit more about your use case. What kind of data are you working with and how do you plan to use the synthetic data?

Z3r0Coo1 commented 2 years ago

I would second this. I am using Databricks, and we are experimenting with SDV to create synthetic data we can use to POC ideas when we don't necessarily have the data to back them up. When working with time series data, for example, it can take quite a large cluster just to fit a model to a small subset of the data (think 2,000 indexes across 365 days).
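One hedged workaround sketch, not SDV functionality: if the series for different entities can be modeled independently, Spark can at least fan the fitting out across executors with `applyInPandas`. The grouping column `entity_id` is a placeholder, and a single-table synthesizer stands in where a sequence-aware model would likely fit time series better:

```python
# Hypothetical workaround, assuming per-entity independence. Each group is
# handed to an executor as a pandas DataFrame; a small model is fit there
# and the same number of synthetic rows is emitted. `entity_id` is a
# placeholder column name, not anything SDV defines.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

def fit_and_sample(pdf):
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data=pdf)
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(pdf)
    return synthesizer.sample(num_rows=len(pdf))

synthetic_df = (
    spark_df.groupBy("entity_id")
            .applyInPandas(fit_and_sample, schema=spark_df.schema)
)
```

This parallelizes the fitting, but each per-group model still sees only its own slice of the data, so it is no substitute for a real distributed backend.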

HistoryBeginsAtSumer commented 6 months ago

I would pile on and say that using pandas.DataFrame objects with MultiTableMetadata and HMASynthesizer raises real difficulties at scale.

It is my humble opinion that this must surely be a universal problem, in some sense. The decision to handle it with Spark/PySpark may not be shared by all, but the issues associated with "big data" are, and with the rising popularity of Databricks, it's bound to become more and more common.

Respectfully, this is "barely a new feature" in that, yes, it requires porting the code to handle the PySpark backend, an admittedly nontrivial task, yet one through which the model's underlying mechanisms should remain unchanged (but then again, what do I know...).

Glossing over the codebase "much too quickly", it would appear that multi_table.py, single_table.py, and _utils.py would at the very least be affected; it probably runs much deeper than this. A rough illustration of the kind of seam such a port might need is sketched below.
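For what it's worth, a purely hypothetical sketch: isolate the dataframe operations the synthesizers rely on behind a small interface, then provide pandas and PySpark implementations. None of these names exist in SDV today:

```python
# Hypothetical only: a thin dataframe-agnostic interface that the
# synthesizers could call instead of using pandas directly. SDV has no
# such abstraction at present.
from abc import ABC, abstractmethod

class DataBackend(ABC):
    @abstractmethod
    def columns(self, df):
        """Return the column names of `df`."""

    @abstractmethod
    def cast_numeric(self, df, column):
        """Return `df` with `column` cast to a numeric type."""

class PandasBackend(DataBackend):
    def columns(self, df):
        return list(df.columns)

    def cast_numeric(self, df, column):
        import pandas as pd
        return df.assign(**{column: pd.to_numeric(df[column])})

class SparkBackend(DataBackend):
    def columns(self, df):
        return df.columns

    def cast_numeric(self, df, column):
        from pyspark.sql import functions as F
        return df.withColumn(column, F.col(column).cast("double"))
```

The synthesizers would then go through the backend rather than pandas, which is roughly where multi_table.py and single_table.py would be touched.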

Cheers to the dev team.