sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Pyspark backend option? #573

Open tomrod opened 3 years ago

tomrod commented 3 years ago

Problem Description

SDV is AWESOME! And one of the very few players in this space able to handle multi-table data.

However, it is quite limited with sklearn as a backend. What would it take to add PySpark as a backend? This would integrate nicely into modern MLOps pipelines, allowing training to happen across the cluster rather than on the main node.
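For context, here is a minimal sketch of what the single-node pattern looks like today. It uses the SDV 1.x single-table API for illustration; `spark` and `spark_df` are assumed to already exist in the session:

```python
# A minimal sketch, assuming a live SparkSession named `spark` and an input
# Spark DataFrame `spark_df`. Today, everything below runs on the driver node:
# the data is collected locally, and fitting/sampling are single-machine.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

pdf = spark_df.toPandas()  # collects the full table onto the driver

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=pdf)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(pdf)       # training happens on the driver, not the cluster

# Push the synthetic rows back out as a distributed DataFrame.
synthetic = spark.createDataFrame(synthesizer.sample(num_rows=100_000))
```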

npatki commented 2 years ago

Hi @tomrod, thanks for the kind words and the feedback.

We'll keep this feature request open for tracking and updates. To help us prioritize, it would be great if you can share a bit more about your use case. What kind of data are you working with and how do you plan to use the synthetic data?

Z3r0Coo1 commented 2 years ago

I would second this. I am using Databricks, and we are experimenting with SDV to create synthetic data we can use to POC ideas when we don't necessarily have the data to back them up. When working with time series data, for example, it can take quite a large cluster just to fit a model to a small subset of the data (think 2,000 indexes across 365 days).
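One hedged workaround sketch, not SDV functionality: if the series for different entities can be modeled independently, Spark can at least fan the fitting out across executors with `applyInPandas`. The grouping column `entity_id` is a placeholder, and a single-table synthesizer stands in where a sequence-aware model would likely fit time series better:

```python
# Hypothetical workaround, assuming per-entity independence. Each group is
# handed to an executor as a pandas DataFrame; a small model is fit there
# and the same number of synthetic rows is emitted. `entity_id` is a
# placeholder column name, not anything SDV defines.
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

def fit_and_sample(pdf):
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data=pdf)
    synthesizer = GaussianCopulaSynthesizer(metadata)
    synthesizer.fit(pdf)
    return synthesizer.sample(num_rows=len(pdf))

synthetic_df = (
    spark_df.groupBy("entity_id")
            .applyInPandas(fit_and_sample, schema=spark_df.schema)
)
```

This parallelizes the fitting, but each per-group model still sees only its own slice of the data, so it is no substitute for a real distributed backend.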

HistoryBeginsAtSumer commented 6 months ago

I would pile on and say that using pandas.DataFrame objects with MultiTableMetadata and HMASynthesizer raises real difficulties at scale.

It is my humble opinion that this must surely be a universal problem, in some sense. The decision to handle it with Spark/PySpark may not be shared by all, but the issues associated with "big data" are, and with the rising popularity of Databricks, it's bound to become more and more common.

Respectfully, this is "barely a new feature" in that, yes, it requires porting the code to handle the PySpark backend, an admittedly nontrivial task, yet one through which the model's underlying mechanisms should remain unchanged (but then again, what do I know...).

Glossing over the codebase "much too quickly", it would appear that multi_table.py, single_table.py, and _utils.py would at the very least be affected; it probably runs much deeper than this. A rough illustration of the kind of seam such a port might need is sketched below.
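For what it's worth, a purely hypothetical sketch: isolate the dataframe operations the synthesizers rely on behind a small interface, then provide pandas and PySpark implementations. None of these names exist in SDV today:

```python
# Hypothetical only: a thin dataframe-agnostic interface that the
# synthesizers could call instead of using pandas directly. SDV has no
# such abstraction at present.
from abc import ABC, abstractmethod

class DataBackend(ABC):
    @abstractmethod
    def columns(self, df):
        """Return the column names of `df`."""

    @abstractmethod
    def cast_numeric(self, df, column):
        """Return `df` with `column` cast to a numeric type."""

class PandasBackend(DataBackend):
    def columns(self, df):
        return list(df.columns)

    def cast_numeric(self, df, column):
        import pandas as pd
        return df.assign(**{column: pd.to_numeric(df[column])})

class SparkBackend(DataBackend):
    def columns(self, df):
        return df.columns

    def cast_numeric(self, df, column):
        from pyspark.sql import functions as F
        return df.withColumn(column, F.col(column).cast("double"))
```

The synthesizers would then go through the backend rather than pandas, which is roughly where multi_table.py and single_table.py would be touched.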

Cheers to the dev team.