sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Single Table Quality Report is not rounding correctly #419

Closed npatki closed 1 year ago

npatki commented 1 year ago

Environment Details

Error Description

When generating the Quality Report with verbose=True, the report should display the overall score and each property score rounded to 2 decimal places. We did this to make the output more readable for the end user.

In practice, there are edge cases where a score is not rounded to 2 decimal places. In the output below, Column Pair Trends is printed at full floating-point precision.

Generating report ...
(1/2) Evaluating Column Shapes: : 100%|██████████| 9/9 [00:00<00:00, 462.83it/s]
(2/2) Evaluating Column Pair Trends: : 100%|██████████| 36/36 [00:00<00:00, 81.39it/s]

Overall Quality Score: 87.18%

Properties:
- Column Shapes: 92.17%
- Column Pair Trends: 82.19999999999999%

Steps to reproduce

I'm not sure this always triggers the issue, but I first noticed it when running the SDV CTGANSynthesizer demo.

from sdv.datasets.demo import download_demo
from sdv.single_table import CTGANSynthesizer
from sdv.evaluation.single_table import evaluate_quality

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_rows=500)

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)

Additional Context

We are meant to round the final score and the property scores to the same number of digits. I think the multiplication by 100 is messing up the rounding: if the score is rounded first and then multiplied by 100, the product can pick up floating-point error again. It would be best to do the rounding directly within the print statement.

https://github.com/sdv-dev/SDMetrics/blob/d34cae04e73bbeb4542454339fa9227bdde6fe09/sdmetrics/reports/single_table/quality_report.py#L76-L87
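As a rough illustration of rounding within the print statement, here is a minimal sketch. It is not the SDMetrics code: the helper name `format_score` and the example scores are hypothetical, picked to resemble the report above. It relies on Python's `%` format type, which scales by 100 and fixes the number of decimals when the string is built, so no intermediate float is printed.

```python
def format_score(score: float) -> str:
    """Format a 0-1 score as a percentage with exactly 2 decimal places."""
    # Hypothetical helper (not part of SDMetrics). Scaling by 100 and
    # rounding both happen inside the format spec, so an intermediate
    # float such as round(score, 4) * 100 is never printed directly.
    return f'{score:.2%}'


# Hypothetical property scores, chosen to resemble the report above.
scores = {'Column Shapes': 0.9217, 'Column Pair Trends': 0.822}

lines = [f'- {name}: {format_score(value)}' for name, value in scores.items()]
print('\n'.join(lines))
```

One consequence of this choice is that trailing zeros are kept, so a score of 0.822 would print as 82.20% rather than 82.2%.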