Closed hanzigs closed 2 weeks ago
Hi there @hanzigs my response will be multi-faceted as there's a lot of nuance here!
In our experience, choosing the right metrics when evaluating the quality of synthetic data is important, and the choice is very strongly tied to your specific use case. We wrote a blog post on this exact topic if you're curious. Depending on the use case and project at hand, you'll pick and optimize for different metrics. It's very challenging to optimize for all metrics at once because there are usually tradeoffs.
With that in mind, we actually don't recommend using p-values to measure quality. I'd recommend reading our explanation on this topic here in our docs for SDMetrics.
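One concrete reason p-values make poor quality scores: a two-sample test's p-value depends heavily on sample size, so at the sizes typical for synthetic data it will flag even a tiny, practically irrelevant difference as "significant". Here is a minimal sketch using scipy's `ks_2samp` (the 0.05 mean shift and sample sizes are illustrative, not from this issue):

```python
# Sketch: why a KS p-value is a poor quality score. At large sample
# sizes the test flags even a tiny, practically irrelevant shift.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=100_000)
synthetic = rng.normal(0.05, 1.0, size=100_000)  # nearly identical distribution

result = ks_2samp(real, synthetic)
print(f"KS statistic: {result.statistic:.4f}")  # small effect size
print(f"p-value:      {result.pvalue:.2e}")     # yet extremely "significant"
```

The KS statistic itself (an effect size between 0 and 1) is a more useful quality signal than the p-value, which mostly reflects how much data you have.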
Some of our models, like Gaussian Copulas, are highly tunable. You can change the distributions, for example, at the column level. Tuning the SDV synthesizers can go a long way toward improving the quality of the generated synthetic data, as measured by the metrics from the first part of my response.
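To make the column-level tuning idea concrete, here is a minimal sketch using scipy rather than SDV's API: a skewed column modeled with a normal marginal matches poorly, while a gamma marginal matches well. This is the kind of per-column distribution choice the Gaussian Copula synthesizer exposes (the column data and distribution names here are illustrative):

```python
# Sketch: the marginal distribution you choose for a column matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
column = rng.gamma(shape=2.0, scale=3.0, size=5_000)  # skewed "real" column

# Fit two candidate marginals, then compare the KS distance of each fit.
norm_params = stats.norm.fit(column)
gamma_params = stats.gamma.fit(column, floc=0)

d_norm = stats.kstest(column, "norm", args=norm_params).statistic
d_gamma = stats.kstest(column, "gamma", args=gamma_params).statistic
print(f"KS distance, normal marginal: {d_norm:.3f}")
print(f"KS distance, gamma marginal:  {d_gamma:.3f}")  # much smaller
```

In recent SDV versions the equivalent knob is the Gaussian Copula synthesizer's per-column numerical distribution setting; check the SDV docs for the exact parameter names in the version you're running.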
Hi there @hanzigs hopefully this answer was helpful! I haven't heard from you in 2 weeks so I'm going to go ahead and close this issue out. Feel free to comment and tag me or open a new issue if you have more questions on this front!
Apologies for the delay. I'm still testing this; I couldn't get the expected similarity in a generic way. I have tried other metrics from the SDMetrics Glossary as well.
Hi, this is a very good library for synthetic data generation. I have a dataset that is properly transformed with normalization, standardisation, and pre-processing, and with outliers removed. The LGBM and XGB algorithms produce very good F1 scores and PR-AUC on validation data. Here is the sample
Class Distribution is
I have tried 4 synthesizers
I could not improve the similarity between the real and generated data; the KS statistic p-value, Jensen-Shannon divergence, Earth Mover's Distance, PCA similarity, and cosine similarity all come out poor.
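For reference, the distribution-level metrics above can be computed per column with scipy; a minimal sketch (the data here is illustrative, standing in for one real column and its synthetic counterpart):

```python
# Sketch: comparing one real column against its synthetic counterpart.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(7)
real = rng.normal(0.0, 1.0, size=10_000)
synthetic = rng.normal(0.3, 1.2, size=10_000)  # imperfect synthetic copy

# KS statistic: 0 means identical empirical CDFs, 1 means disjoint.
ks = ks_2samp(real, synthetic).statistic

# Jensen-Shannon distance needs binned probability vectors on shared bins.
bins = np.histogram_bin_edges(np.concatenate([real, synthetic]), bins=50)
p, _ = np.histogram(real, bins=bins, density=True)
q, _ = np.histogram(synthetic, bins=bins, density=True)
js = jensenshannon(p, q)

# Earth Mover's (Wasserstein-1) distance works directly on the samples.
emd = wasserstein_distance(real, synthetic)

print(f"KS: {ks:.3f}  JS: {js:.3f}  EMD: {emd:.3f}")
```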
Synthesizers used
sample model
Have tried different Epochs
I'm not sure how to improve this, or why it is not working.
Any suggestions please?