sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
210 stars 45 forks

`ValueError` in DiagnosticReport if synthetic data does not match metadata #508

Closed frances-h closed 11 months ago

frances-h commented 12 months ago

Error Description

Currently, the DiagnosticReport raises an error if the synthetic data does not match the given metadata. Because the DiagnosticReport has metrics designed to evaluate exactly this situation, the report should not error when the synthetic data does not match the metadata. The report should still validate that the real data matches the metadata, and the error message should be updated to indicate that only the real data has missing/extra columns.
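
To make the requested behavior concrete, here is a minimal sketch of a check that compares only the real data's columns against the metadata and leaves the synthetic data unchecked. The helper name `validate_real_columns` and the wording of the error message are assumptions made for illustration; this is not SDMetrics' actual implementation or API.

```python
def validate_real_columns(real_columns, metadata):
    """Hypothetical helper: check only the real data's columns against the metadata."""
    expected = set(metadata['columns'])
    actual = set(real_columns)
    missing = sorted(expected - actual)
    extra = sorted(actual - expected)
    if missing or extra:
        # The message names the real data only, as the issue requests.
        raise ValueError(
            'The real data does not match the metadata. '
            f'Missing columns: {missing}. Extra columns: {extra}.'
        )


metadata = {
    'columns': {
        'id': {'sdtype': 'id'},
        'val1': {'sdtype': 'categorical'},
        'val2': {'sdtype': 'numerical'},
    },
    'primary_key': 'id',
}

# Real data that matches the metadata passes silently.
validate_real_columns(['id', 'val1', 'val2'], metadata)

# The synthetic columns are deliberately never passed to the check, so a
# mismatch here (missing 'val2', extra 'extra_col') cannot raise an error.
synthetic_columns = ['id', 'extra_col', 'val1']

# Real data with a missing column still raises.
try:
    validate_real_columns(['id', 'val1'], metadata)
except ValueError as error:
    message = str(error)
```

With a pandas DataFrame, `data.columns` could be passed as `real_columns`, since the helper only needs an iterable of column names.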

Steps to reproduce

import pandas as pd
from sdmetrics.reports.single_table import DiagnosticReport

# Real data: its columns match the metadata below.
data = pd.DataFrame({
    'id': [0, 1, 2],
    'val1': ['a', 'a', 'b'],
    'val2': [0.1, 2.4, 5.7]
})

# Synthetic data: 'val2' is missing and 'extra_col' is not in the metadata.
synthetic_data = pd.DataFrame({
    'id': [1, 2, 3],
    'extra_col': ['x', 'y', 'z'],
    'val1': ['c', 'd', 'd']
})

metadata = {
    'columns': {
        'id': {'sdtype': 'id'},
        'val1': {'sdtype': 'categorical'},
        'val2': {'sdtype': 'numerical'}
    },
    'primary_key': 'id'
}

report = DiagnosticReport()
# Currently raises ValueError because synthetic_data does not match the metadata.
report.generate(data, synthetic_data, metadata)