sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

How to improve the performance of synthesizers? #2038

Closed hanzigs closed 2 weeks ago

hanzigs commented 1 month ago

Hi, this is a very good library for synthetic data generation. I have a dataset that has been properly transformed (normalization, standardisation, pre-processing) with outliers removed. LGBM and XGB models produce very good F1 and PR-AUC scores on the validation data.

Here is a sample of the data: [image]

The class distribution is: [image]

I have tried 4 synthesizers: [image]

I could not improve the similarity between the real and generated data. The KS statistic p-value, Jensen-Shannon divergence, Earth Mover's Distance, PCA similarity and cosine similarity all come out poor.

Synthesizers used: [image]

Sample model: [image]

I have tried different numbers of epochs. I'm not sure how to improve this, or why it is not working. Any suggestions, please?

srinify commented 1 month ago

Hi there @hanzigs my response will be multi-faceted as there's a lot of nuance here!

  1. In our experience, choosing the right metrics for evaluating the quality of synthetic data is important, and the choice is strongly tied to your specific use case. We wrote a blog post on this exact topic if you're curious. Depending on the use case and project at hand, you'll pick and optimize for different metrics. It's very challenging to optimize for all metrics at once because there are usually tradeoffs.

  2. With that in mind, we actually don't recommend using p-values to measure quality. I'd recommend reading our explanation on this topic here in our docs for SDMetrics.

  3. Some of our models, like Gaussian Copulas, are highly tunable. You can change the distributions, for example, at the column level. Tuning the SDV synthesizers can go a long way to improving the quality of the generated synthetic data using the metrics from the first part of my response.
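
To make point 2 a bit more concrete, here is a minimal sketch in plain Python of a statistic-based similarity score: one minus the two-sample KS statistic, which is the idea behind the KSComplement metric in SDMetrics. The helper names `ks_statistic` and `ks_complement` and the toy data are hypothetical, written only for illustration; this is not the SDMetrics implementation. One common concern with p-values is that they shrink as the sample grows even when the distributions differ by the same amount, whereas the statistic itself stays put.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value that appears in either sample.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def ks_complement(real, synthetic):
    """Similarity score in [0, 1]; 1.0 means identical empirical distributions."""
    return 1.0 - ks_statistic(real, synthetic)


# Toy data (made up for illustration): a column and a shifted copy of it.
real = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [x + 2 for x in real]

print(ks_complement(real, real))       # identical samples
print(ks_complement(real, shifted))    # shifted samples score lower
# Repeating every row 10 times leaves the score unchanged, because the
# empirical CDFs are the same; a p-value would change with sample size.
print(ks_complement(real * 10, shifted * 10))
```

A score like this can be compared across datasets of different sizes, which is part of why it tends to be easier to interpret than a p-value when judging synthetic data quality.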

srinify commented 2 weeks ago

Hi there @hanzigs hopefully this answer was helpful! I haven't heard from you in 2 weeks so I'm going to go ahead and close this issue out. Feel free to comment and tag me or open a new issue if you have more questions on this front!

hanzigs commented 2 weeks ago

Apologies for the delay. I'm still testing this, but I couldn't get the expected similarity in a generic way. I have also tried other metrics from the SDMetrics Glossary.