sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Diffusion model-based synthetic data generation #1061

Open onacrame opened 2 years ago

onacrame commented 2 years ago

Interesting use of diffusion models to generate synthetic data:

https://github.com/rotot0/tab-ddpm/blob/main/

https://arxiv.org/abs/2209.15421

npatki commented 2 years ago

Thanks for filing @onacrame. We can take a look and keep this issue open to communicate any updates.

To help us prioritize, I'm curious if there are any use cases where you think a diffusion model would work best? And also any metrics of particular interest when generating synthetic data?

jjmarks commented 1 year ago

> To help us prioritize, I'm curious if there are any use cases where you think a diffusion model would work best? And also any metrics of particular interest when generating synthetic data?

This paper (Feb. 2023) suggests that transformer-based models improve on the TVAE and CTGAN baselines, specifically with regard to machine learning efficacy.