sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
201 stars 45 forks source link

The `Cardinality` property is not returning a DataFrame #405

Closed npatki closed 1 year ago

npatki commented 1 year ago

Environment Details

Error Description

In the new version of SDmetrics, we are renaming the 'Table Relationships' property to 'Cardinality' but it should otherwise behave the same. However, I am seeing that get_details for this property is returning a dictionary instead of a DataFrame, as expected.

Steps to reproduce

Reproduce using the demo data.

from sdmetrics import load_demo
from sdmetrics.reports.multi_table import QualityReport

real_data, synthetic_data, metadata = load_demo(modality='multi_table')

report = QualityReport()
report.generate(real_data, synthetic_data, metadata)
report.get_details(
    property_name='Cardinality',
    table_name='sessions'
)

Output:

{('sessions', 'transactions'): {'score': 0.7}}

(Note that one of the relationships is also missing from this output. But this is being covered in #404)

Expected Behavior

Nothing should change from the older version except for the property name ('Table Relationships' --> 'Cardinality'). In the older version, the same method call returns this:

image
fealho commented 1 year ago

@npatki What is the expected behavior for Column Pair Trends and Column Shapes when no table_name is passed? Because right now those also return a dictionary of dataframes, so it's weird that Cardinality returns a dataframe and the others don't.

Below are examples of what's returned, just to clarify things.

    # Assert Column Shapes details
    pd.testing.assert_frame_equal(details[0], pd.DataFrame({
        'Column': ['col2', 'col3'],
        'Metric': ['TVComplement', 'TVComplement'],
        'Score': [.75, .75]
    }))

    # Assert Column Pair Trends details
    pd.testing.assert_frame_equal(details[1], pd.DataFrame({
        'Column 1': ['col2'],
        'Column 2': ['col3'],
        'Metric': ['ContingencySimilarity'],
        'Score': [.25],
        'Real Correlation': [np.nan],
        'Synthetic Correlation': [np.nan],
    }))

    # Assert Cardinality details, with and without table name
    expected = pd.DataFrame({
        'Child Table': ['table1'],
        'Parent Table': ['table2'],
        'Metric': ['CardinalityShapeSimilariy'],
        'Quality Score': [.75]
    })
    pd.testing.assert_frame_equal(details[2], expected)
    pd.testing.assert_frame_equal(details[5], expected)

    # Assert Column Shapes details without table_name. NOTE: this is a dict
    pd.testing.assert_frame_equal(details[3]['table1'], pd.DataFrame({
        'Column': ['col2', 'col3'],
        'Metric': ['TVComplement', 'TVComplement'],
        'Score': [.75, .75]
    }))
    pd.testing.assert_frame_equal(details[3]['table2'], pd.DataFrame({
        'Column': ['col4', 'col5', 'col7'],
        'Metric': ['KSComplement', 'KSComplement', 'KSComplement'],
        'Score': [.75, .75, 1]
    }))

    # Assert Column Pair Trends details without table_name. NOTE: this is a dict
    pd.testing.assert_frame_equal(details[4]['table1'], pd.DataFrame({
        'Column 1': ['col2'],
        'Column 2': ['col3'],
        'Metric': ['ContingencySimilarity'],
        'Score': [.25],
        'Real Correlation': [np.nan],
        'Synthetic Correlation': [np.nan],
    }))
    pd.testing.assert_frame_equal(details[4]['table2'], pd.DataFrame({
        'Column 1': ['col4', 'col4', 'col5'],
        'Column 2': ['col5', 'col7', 'col7'],
        'Metric': ['CorrelationSimilarity', 'CorrelationSimilarity', 'CorrelationSimilarity'],
        'Score': [0.9901306731066666, 0.9853027960145061, 0.9678805694257717],
        'Real Correlation': [0.946664, 0.966247, 0.862622],
        'Synthetic Correlation': [0.926925, 0.936853, 0.798384],
    }))
npatki commented 1 year ago

@fealho, according to our docs, we say:

If [table_name is] provided, you'll receive filtered results for the table.

So the expected behavior is to return unfiltered results if you don't provide a table_name. That is, return 1 single dataframe with all the results from all tables. I think the discrepancy is that we're missing a Table column in all cases to identify the table name.

Note that for single table, we should not add the Table column (leave as is)