sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Check if QualityReport needs the synthetic data to match the metadata #509

Closed frances-h closed 9 months ago

frances-h commented 10 months ago

Error Description

Similar to #508, we'd like to verify that the QualityReport still runs correctly when the synthetic data does not exactly match the metadata. The QualityReport should be tested with multiple datasets that have missing or extra columns. If it runs as expected, the requirement that the synthetic data match the metadata should be relaxed for both the QualityReport and the DiagnosticReport, and the error message should be updated accordingly.
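To make "missing or extra columns" concrete, here is a minimal sketch of how the mismatch between a synthetic table and the metadata could be listed. The helper `compare_columns` is hypothetical and not part of SDMetrics; it only assumes the metadata is a dict with a `'columns'` key, as in the reproduction steps for this issue.

```python
def compare_columns(synthetic_columns, metadata):
    """Return (missing, extra) column names relative to the metadata.

    Hypothetical helper for illustration only; not an SDMetrics function.
    """
    expected = set(metadata['columns'])
    actual = set(synthetic_columns)
    missing = sorted(expected - actual)  # in the metadata, absent from the data
    extra = sorted(actual - expected)    # in the data, absent from the metadata
    return missing, extra


metadata = {
    'columns': {
        'id': {'sdtype': 'id'},
        'val1': {'sdtype': 'categorical'},
        'val2': {'sdtype': 'numerical'},
    },
    'primary_key': 'id',
}

# Column names of a synthetic table that lacks 'val2' and adds 'extra_col'.
missing, extra = compare_columns(['id', 'extra_col', 'val1'], metadata)
print(missing, extra)  # ['val2'] ['extra_col']
```

With a pandas DataFrame, the first argument would be `synthetic_data.columns`.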

Steps to reproduce

import pandas as pd
from sdmetrics.reports.single_table import QualityReport

# Real data: contains every column described in the metadata.
data = pd.DataFrame({
    'id': [0, 1, 2],
    'val1': ['a', 'a', 'b'],
    'val2': [0.1, 2.4, 5.7]
})
# Synthetic data: 'val2' is missing and 'extra_col' is not in the metadata.
synthetic_data = pd.DataFrame({
    'id': [1, 2, 3],
    'extra_col': ['x', 'y', 'z'],
    'val1': ['c', 'd', 'd']
})

metadata = {
    'columns': {
        'id': {'sdtype': 'id'},
        'val1': {'sdtype': 'categorical'},
        'val2': {'sdtype': 'numerical'}
    },
    'primary_key': 'id'
}

# Generate the report with the mismatched synthetic data.
report = QualityReport()
report.generate(data, synthetic_data, metadata)
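To cover several mismatch cases rather than a single one, the check could be run over a small set of table variants. The sketch below is a hypothetical harness, not SDMetrics code: `make_variant`, `run_case` and the stand-in `strict_generate` are illustrative names, the `val2` values in `base` are made up, and tables are plain dicts of lists so the example runs without pandas.

```python
def make_variant(table, drop=(), extra=None):
    """Copy a dict-of-lists table, dropping some columns and adding others."""
    variant = {name: list(values) for name, values in table.items()
               if name not in drop}
    if extra:
        variant.update(extra)
    return variant


def run_case(generate, real, synthetic, metadata):
    """Call generate(real, synthetic, metadata) and report whether it raised."""
    try:
        generate(real, synthetic, metadata)
    except Exception as error:
        return f'{type(error).__name__}: {error}'
    return 'ok'


def strict_generate(real, synthetic, metadata):
    # Stand-in for a report that requires an exact column match.
    # This is NOT the SDMetrics implementation.
    if set(synthetic) != set(metadata['columns']):
        raise ValueError('columns do not match the metadata')


metadata = {'columns': {'id': {'sdtype': 'id'},
                        'val1': {'sdtype': 'categorical'},
                        'val2': {'sdtype': 'numerical'}},
            'primary_key': 'id'}

# Illustrative synthetic table that matches the metadata exactly.
base = {'id': [1, 2, 3], 'val1': ['c', 'd', 'd'], 'val2': [0.3, 1.9, 4.2]}
cases = {
    'matching': make_variant(base),
    'missing_val2': make_variant(base, drop=('val2',)),
    'extra_col': make_variant(base, extra={'extra_col': ['x', 'y', 'z']}),
}

results = {name: run_case(strict_generate, None, table, metadata)
           for name, table in cases.items()}
for name, outcome in results.items():
    print(name, '->', outcome)
```

To exercise the real report, one would presumably pass `QualityReport().generate` as `generate`, with `data` as `real` and `pd.DataFrame(table)` as `synthetic`, using a fresh `QualityReport` per case.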