Open · tomrod opened 3 years ago

Problem Description

SDV is AWESOME! And one of the very few players in this space able to handle multi-table data.

However, it is quite limited with sklearn as a backend. What would it take to add pyspark as a backend? This would integrate quite nicely into modern MLOps pipelines, allowing the training to occur on the cluster rather than on the main node.
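For concreteness, here is the kind of driver-bound pattern I mean today; paths and table names are made up, but only documented PySpark/SDV calls are used:

```python
# A minimal sketch of the current pattern: everything after .toPandas() lives
# in the driver's memory, so the fit happens on the main node, not the cluster.
from pyspark.sql import SparkSession
from sdv.metadata import MultiTableMetadata
from sdv.multi_table import HMASynthesizer

spark = SparkSession.builder.getOrCreate()

# Collecting each table to the driver as pandas is the step that does not scale.
tables = {
    name: spark.read.parquet(f"/warehouse/{name}").toPandas()
    for name in ("users", "transactions")
}

metadata = MultiTableMetadata()
metadata.detect_from_dataframes(tables)

synthesizer = HMASynthesizer(metadata)
synthesizer.fit(tables)                 # runs entirely on the driver
synthetic_tables = synthesizer.sample(scale=1.0)
```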
Hi @tomrod, thanks for the kind words and the feedback.
We'll keep this feature request open for tracking and updates. To help us prioritize, it would be great if you could share a bit more about your use case. What kind of data are you working with, and how do you plan to use the synthetic data?
I would second this. I am using Databricks, and we are experimenting with SDV to create synthetic data we can use to POC ideas when we don't necessarily have the data to back them up. When working with time-series data, for example, it can take quite a large cluster just to fit a model to a small subset of the data (think 2,000 indexes across 365 days).
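For scale, a rough sketch of that workload shape using SDV's sequential API; the sizes and column names are illustrative, and on real data like this the fit alone is the expensive, single-node step:

```python
# Illustrative only: ~2,000 entities x 365 daily rows, similar to the above.
import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

n_entities, n_days = 2000, 365
data = pd.DataFrame({
    "entity_id": np.repeat(np.arange(n_entities), n_days),
    "day": np.tile(pd.date_range("2023-01-01", periods=n_days), n_entities),
    "value": np.random.rand(n_entities * n_days),
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)
metadata.update_column("entity_id", sdtype="id")
metadata.set_sequence_key("entity_id")
metadata.set_sequence_index("day")

synthesizer = PARSynthesizer(metadata, epochs=128)
synthesizer.fit(data)                       # the step that demands the cluster
synthetic = synthesizer.sample(num_sequences=10)
```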
I would pile on and say that using `pandas.DataFrame` objects with `MultiTableMetadata` and `HMASynthesizer` raises the following difficulties:

- `detect_from_dataframes` cannot simply be run on a subset of the data, because the relational integrity constraints would most surely be broken and thus go undetected by the method (a toy illustration follows below);
- `detect_from_dataframes` is often the only option when working with legacy systems for which documentation is missing and/or badly designed (at least according to today's standards).

It is my humble opinion that this must surely be a universal problem, in some sense. The decision to handle it with Spark/PySpark may not be one shared by all, but the issues associated with "big data" are, and with the rising popularity of Databricks, it's bound to become more and more common.
Respectfully, this is "barely a new feature" in that, yes, it requires porting the code to handle the PySpark backend, an admittedly nontrivial task, yet one through which the model's underlying mechanisms should remain unchanged (but then again, what do I know...).
Glossing over the codebase "much too quickly", it would appear as though `multi_table.py`, `single_table.py`, and `_utils.py` would at the very least be affected -- it probably runs much deeper than this.
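For what it's worth, a loudly hypothetical sketch (none of this is SDV code) of why the underlying mechanisms could stay unchanged: if the data-handling layers were written against the pandas API on Spark (`pyspark.pandas`), the same code path could serve both backends:

```python
# Hypothetical design sketch, not SDV's implementation: pyspark.pandas mirrors
# most pandas.DataFrame methods, so shared logic can be backend-agnostic.
import pandas as pd
import pyspark.pandas as ps

def normalize_numeric(df, column):
    # Identical code whether df is a pandas or a pyspark.pandas DataFrame.
    mean, std = df[column].mean(), df[column].std()
    return (df[column] - mean) / std

local = pd.DataFrame({"amount": [1.0, 2.0, 3.0]})
distributed = ps.DataFrame({"amount": [1.0, 2.0, 3.0]})

print(normalize_numeric(local, "amount"))        # computed in-process
print(normalize_numeric(distributed, "amount"))  # computed on the cluster
```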
Cheers to the dev team.