sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Update the synthetic data that's available for the multi-table demo #501

Closed npatki closed 11 months ago

npatki commented 12 months ago

Problem Description

The SDMetrics library comes with built-in multi-table demo data that you can use to explore the reports. It includes the real data, the synthetic data, and the metadata, all hardcoded in this folder.

The problem is that the synthetic data was created a long time ago using very old versions of the SDV. Because those older versions had many bugs, the synthetic data doesn't match the real data on several important qualities. In particular, BoundaryAdherence is unmet for transactions.amount, users.age, and transactions.timestamp because, at the time, SDV was not adhering to the min/max values of the real data.
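To make the failing check concrete, here is a minimal pure-Python sketch of what a boundary-adherence score measures: the fraction of synthetic values that fall inside the [min, max] range of the real column. The function name, the handling of missing values, and the sample ages are assumptions for illustration; this is not the SDMetrics implementation.

```python
def boundary_adherence(real_values, synthetic_values):
    """Fraction of non-missing synthetic values inside [min(real), max(real)].

    Illustrative sketch only; the library's own metric may treat
    missing values and edge cases differently.
    """
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        raise ValueError("Both columns need at least one non-missing value.")
    low, high = min(real), max(real)
    in_bounds = sum(1 for v in synthetic if low <= v <= high)
    return in_bounds / len(synthetic)


# Hypothetical ages, not taken from the demo data.
real_age = [18, 25, 40, 63]
old_synthetic_age = [17, 22, 35, 70]  # 17 and 70 fall outside [18, 63]
new_synthetic_age = [18, 30, 45, 63]  # every value is inside [18, 63]

print(boundary_adherence(real_age, old_synthetic_age))  # 0.5
print(boundary_adherence(real_age, new_synthetic_age))  # 1.0
```

The same comparison works for any orderable values, so it would apply to a timestamp column as well as to numeric ones.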

Expected behavior

Update the synthetic data available for the multi-table demo. We can do this by:

  1. Keeping the same metadata and real data
  2. Running the real data through the HSASynthesizer
  3. Sampling new synthetic data and saving it in place of the old synthetic data
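The steps above could be sketched roughly as follows. The helper name `regenerate_synthetic_data`, the output directory, and the one-CSV-per-table file naming are hypothetical; the synthesizer is passed in as an argument and is only assumed to expose `fit(data)` and `sample(scale=...)`, with tables that support `to_csv` (as pandas DataFrames do).

```python
from pathlib import Path


def regenerate_synthetic_data(synthesizer, real_data, output_dir, scale=1.0):
    """Fit on the real tables, sample new synthetic tables, and save them.

    synthesizer: object assumed to provide fit(data) and sample(scale=...).
    real_data:   dict mapping table name -> table (e.g. a pandas DataFrame).
    output_dir:  hypothetical destination folder for the new files.
    """
    # Step 1 (keeping the metadata and real data) needs no code:
    # real_data is used as-is and is never modified here.
    synthesizer.fit(real_data)                         # step 2
    synthetic_data = synthesizer.sample(scale=scale)   # step 3: sample

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for table_name, table in synthetic_data.items():
        # Assumed layout: one CSV per table, named after the table.
        table.to_csv(output_dir / f"{table_name}.csv", index=False)
    return synthetic_data
```

Passing the synthesizer in keeps the sketch independent of any particular SDV version; the metadata would be supplied when constructing the synthesizer, which is outside this helper.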

Additional context

After this update, the new version of the Diagnostic Report should have a score of 1.0 (i.e., BoundaryAdherence should be met).