sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

ColumnPairTrends score depends on the data index #582

Closed R-Palazzo closed 2 months ago

R-Palazzo commented 3 months ago


Error Description

By design, metric and property scores should be independent of the indexes of the real and synthetic data. However, this is currently not the case for the ColumnPairTrends property, as shown below. The issue comes from the discretization step: when numerical and datetime columns are converted to categorical, the indexes are not preserved.
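To illustrate the kind of failure described above, here is a minimal pandas-only sketch. It does not reproduce the SDMetrics internals: the binning with `np.digitize` and the variable names are assumptions chosen for illustration. It shows how a discretized column that comes back with a fresh default index can be misaligned when assigned to a DataFrame whose index is not `0..n-1`.

```python
import numpy as np
import pandas as pd

synthetic = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'a']}, index=[0, 4, 2])

# Hypothetical discretization step: bin the raw values and wrap them in a
# new Series, which gets a default RangeIndex (0, 1, 2) instead of (0, 4, 2).
bins = np.array([0.5, 1.5, 2.5, 3.5])
binned = pd.Series(np.digitize(synthetic['A'].to_numpy(), bins))

# Assigning a Series aligns on index labels, so label 4 finds no match
# and that row becomes NaN.
misaligned = synthetic.copy()
misaligned['A'] = binned

# Assigning the underlying array is positional and keeps every row.
aligned = synthetic.copy()
aligned['A'] = binned.to_numpy()

print(misaligned)
print(aligned)
```

Whether the actual bug follows exactly this path is not confirmed here; the sketch only shows that label-based alignment is enough to make identical data look different.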

Steps to reproduce

The code below should output a metric score of 1.0, since the real and synthetic data are the same (they only have different indexes).

import pandas as pd
from sdmetrics.reports.single_table._properties import ColumnPairTrends

real_data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'a']
}, index=[0, 1, 2])

synthetic_data = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'a']
}, index=[0, 4, 2])

metadata = {
    'columns': {
        'A': {'sdtype': 'numerical'},
        'B': {'sdtype': 'categorical'}
    }
}

property = ColumnPairTrends()
property._generate_details(real_data, synthetic_data, metadata)

The current output is:

[Screenshot of the current output, taken 2024-06-07; image not available]
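Until the property handles indexes itself, one possible workaround is to reset both indexes before computing the score. This is a sketch under the assumption that the discrepancy comes only from the index labels; it has not been verified against the SDMetrics code, and the `*_reset` names are illustrative.

```python
import pandas as pd

real_data = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'a']}, index=[0, 1, 2])
synthetic_data = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'a']}, index=[0, 4, 2])

# Drop the original labels so both frames share the default 0..n-1 index.
real_reset = real_data.reset_index(drop=True)
synthetic_reset = synthetic_data.reset_index(drop=True)

# With identical values and identical indexes, the frames compare equal.
same_after_reset = real_reset.equals(synthetic_reset)
print(same_after_reset)
```

The reset frames could then be passed to the property in place of the originals.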