Closed npatki closed 1 year ago
@npatki What is the expected behavior for Column Pair Trends
and Column Shapes
when no table_name
is passed? Because right now those also return a dictionary of dataframes, so it's weird that Cardinality
returns a dataframe and the others don't.
Below are examples of what's returned, just to clarify things.
# Assert Column Shapes details
pd.testing.assert_frame_equal(details[0], pd.DataFrame({
'Column': ['col2', 'col3'],
'Metric': ['TVComplement', 'TVComplement'],
'Score': [.75, .75]
}))
# Assert Column Pair Trends details
pd.testing.assert_frame_equal(details[1], pd.DataFrame({
'Column 1': ['col2'],
'Column 2': ['col3'],
'Metric': ['ContingencySimilarity'],
'Score': [.25],
'Real Correlation': [np.nan],
'Synthetic Correlation': [np.nan],
}))
# Assert Cardinality details, with and without table name
expected = pd.DataFrame({
'Child Table': ['table1'],
'Parent Table': ['table2'],
'Metric': ['CardinalityShapeSimilariy'],
'Quality Score': [.75]
})
pd.testing.assert_frame_equal(details[2], expected)
pd.testing.assert_frame_equal(details[5], expected)
# Assert Column Shapes details without table_name. NOTE: this is a dict
pd.testing.assert_frame_equal(details[3]['table1'], pd.DataFrame({
'Column': ['col2', 'col3'],
'Metric': ['TVComplement', 'TVComplement'],
'Score': [.75, .75]
}))
pd.testing.assert_frame_equal(details[3]['table2'], pd.DataFrame({
'Column': ['col4', 'col5', 'col7'],
'Metric': ['KSComplement', 'KSComplement', 'KSComplement'],
'Score': [.75, .75, 1]
}))
# Assert Column Pair Trends details without table_name. NOTE: this is a dict
pd.testing.assert_frame_equal(details[4]['table1'], pd.DataFrame({
'Column 1': ['col2'],
'Column 2': ['col3'],
'Metric': ['ContingencySimilarity'],
'Score': [.25],
'Real Correlation': [np.nan],
'Synthetic Correlation': [np.nan],
}))
pd.testing.assert_frame_equal(details[4]['table2'], pd.DataFrame({
'Column 1': ['col4', 'col4', 'col5'],
'Column 2': ['col5', 'col7', 'col7'],
'Metric': ['CorrelationSimilarity', 'CorrelationSimilarity', 'CorrelationSimilarity'],
'Score': [0.9901306731066666, 0.9853027960145061, 0.9678805694257717],
'Real Correlation': [0.946664, 0.966247, 0.862622],
'Synthetic Correlation': [0.926925, 0.936853, 0.798384],
}))
@fealho, according to our docs, we say:
If [
table_name
is] provided, you'll receive filtered results for the table.
So the expected behavior is to return unfiltered results if you don't provide a table_name. That is, return 1 single dataframe with all the results from all tables. I think the discrepancy is that we're missing a Table
column in all cases to identify the table name.
table_name
, the Table
column will contain a constant valuetable_name
, you will get all results so the Table
column will varyNote that for single table, we should not add the Table
column (leave as is)
Environment Details
Error Description
In the new version of SDmetrics, we are renaming the
'Table Relationships'
property to'Cardinality'
but it should otherwise behave the same. However, I am seeing thatget_details
for this property is returning a dictionary instead of a DataFrame, as expected.Steps to reproduce
Reproduce using the demo data.
Output:
(Note that one of the relationships is also missing from this output. But this is being covered in #404)
Expected Behavior
Nothing should change from the older version except for the property name (
'Table Relationships'
-->'Cardinality'
). In the older version, the same method call returns this: