sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
201 stars, 45 forks

Visualize cardinality of foreign key columns #283

Closed: npatki closed 1 year ago

npatki commented 1 year ago

I'm filing this issue on behalf of a user request on our Slack.

Problem Description

Currently, users can only plot the data in statistical columns: utils.get_relationship_plot supports only columns that are numerical, categorical, boolean or datetime.

It would be nice to also support a visualization of the foreign key/primary key relationship, specifically its cardinality.

Expected behavior

Create a new visualization utils.get_cardinality_plot. This should plot the cardinality (# of children) that each parent row has, colored by real vs. synthetic data.

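Parameters (names as used in the example call below): real_data, synthetic_data, parent_table_name, child_table_name, child_foreign_key, metadata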

Output: A plotly.Figure object with a bar graph. The graph shows the # of children that each parent row has. The color represents real vs. synthetic data.
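To make the "# of children that each parent row has" concrete, here is a minimal sketch of one way the values behind such a bar graph could be computed. It is not the SDMetrics implementation: the helper name cardinality_distribution is hypothetical, the toy data is made up, and plain lists of key values stand in for the real tables so the example has no dependencies. It assumes the bars show, for each cardinality, how many parent rows have that many children.

```python
from collections import Counter


def cardinality_distribution(parent_keys, child_foreign_keys):
    """Map each cardinality (# of children) to the number of parent rows with it.

    Hypothetical helper for illustration only; not part of SDMetrics.
    """
    # Count how many child rows reference each parent key.
    children_per_parent = Counter(child_foreign_keys)
    # A Counter returns 0 for missing keys, so childless parents get cardinality 0.
    # Child rows whose key matches no parent are ignored here.
    cardinalities = [children_per_parent[key] for key in parent_keys]
    # Tally how often each cardinality occurs, sorted by cardinality.
    return dict(sorted(Counter(cardinalities).items()))


# Toy data, made up for illustration: primary keys of the parent table and
# the foreign key column of the child table.
real_users = [1, 2, 3, 4]
real_transactions_user_id = [1, 1, 2, 1, 3]
synthetic_users = [10, 11, 12, 13]
synthetic_transactions_user_id = [10, 11, 11, 12, 13, 13]

real_dist = cardinality_distribution(real_users, real_transactions_user_id)
synthetic_dist = cardinality_distribution(synthetic_users, synthetic_transactions_user_id)

print(real_dist)
print(synthetic_dist)
```

Each dictionary gives one series of bars (x: # of children, y: # of parent rows), so the real and synthetic series could be drawn side by side in two colors.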

from sdmetrics.reports import utils

fig = utils.get_cardinality_plot(
    real_data=real_tables,
    synthetic_data=synthetic_tables,
    parent_table_name='users',
    child_table_name='transactions',
    child_foreign_key='user_id',
    metadata=my_multi_table_metadata
)

fig.show()

npatki commented 1 year ago

Example

See an example below for the final visualization.

[Image: example cardinality plot]

Code is available in this private notebook