sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Model multi-table datasets using non-parametric models (CTGAN, CopulaGAN, etc.) #457

Closed abedshantti closed 1 year ago

abedshantti commented 3 years ago

Problem Description

This is more of a question than a feature request (but it doesn't fit the question issues either). I have noticed that Gaussian Copula is the default generative model for the relational database HMA1 model. Is there a specific reason why the Gaussian model was used? Would the other synthesizers, such as CTGAN, work well for relational data?

Expected behavior

Using CTGAN might generate even more realistic synthetic child datasets than the Gaussian Copula model. I am not sure whether this is something that was implemented earlier with relational databases.

Ability to contribute

I am available to contribute if the developers want to add CTGAN as an option when implementing HMA1. Otherwise, I am happy to implement my own solution and submit a pull request if everything works as expected.

MLjungg commented 2 years ago

Hi SDV,

I'm curious about the problem description as well – any comment @csala, @npatki?

Edit: Question answered here https://github.com/sdv-dev/SDV/issues/526.

npatki commented 2 years ago

Right now, the HMA1 algorithm is only compatible with single table models that are parametric. GaussianCopula is currently our only parametric single table model. You can read more about HMA1 in the original SDV paper.
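
To illustrate why the parametric requirement matters: as described in the SDV paper, HMA summarizes the child rows of each parent row as a small, fixed set of model parameters and appends them to the parent table as extra columns. The sketch below is only a rough illustration of that idea, not SDV's implementation. It assumes two hypothetical toy tables (`users`, `sessions`), a hypothetical helper `extend_parent`, and a plain univariate Gaussian (mean and standard deviation) standing in for the full GaussianCopula.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy tables: one parent table (users) and one child table (sessions).
users = [{"user_id": 1, "age": 34}, {"user_id": 2, "age": 51}]
sessions = [
    {"user_id": 1, "duration": 10.0},
    {"user_id": 1, "duration": 14.0},
    {"user_id": 2, "duration": 30.0},
    {"user_id": 2, "duration": 30.0},
    {"user_id": 2, "duration": 36.0},
]


def extend_parent(parents, children, key, column):
    """Append per-parent Gaussian parameters of one child column to each parent row.

    Illustrative helper only (an assumption for this sketch, not an SDV function).
    """
    # Group the child values by the foreign key that points at the parent.
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(child[column])

    extended = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        row = dict(parent)
        # A parametric model reduces to a few numbers that fit in flat columns.
        row["num_children"] = len(values)
        row[f"{column}_mean"] = mean(values) if values else None
        row[f"{column}_std"] = pstdev(values) if values else None
        extended.append(row)
    return extended


extended_users = extend_parent(users, sessions, "user_id", "duration")
for row in extended_users:
    print(row)
```

Each parent row ends up with the same handful of extra columns, so the extended table can itself be modeled as a single table. A model like CTGAN has no comparably compact, fixed-size parameter set per parent row, which is why it does not slot into this scheme directly.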

We can keep this issue open as a feature request for multi-table algorithms that can use non-parametric models (like CTGAN, CopulaGAN, etc.)

npatki commented 1 year ago

I'm closing this issue as resolved, as we now have an HSASynthesizer that is capable of using non-parametric models on individual tables.

This is currently available in our private library to licensed users. If you'd like to explore this option, feel free to contact us.