sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
201 stars 45 forks

Possible performance degradation for high cardinality columns in Contingency Similarity (affecting Quality Report) #589

Open npatki opened 2 months ago

npatki commented 2 months ago

Environment Details

Error Description

In the Quality Report, the Column Pair Trends and Intertable Trends properties both use the ContingencySimilarity metric to compute a score.

This underlying metric may perform poorly when a column has extremely high cardinality. To score two columns A and B, the metric computes their cross-tabulation, whose size depends on the cardinality of each column. For example, if column A is categorical with cardinality a and column B is categorical with cardinality b, then the contingency table contains a × b values. This can be slow if a or b is very large.
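To make the a × b growth concrete, here is a minimal sketch of a contingency-style comparison built on `pandas.crosstab`. It is not the SDMetrics implementation: the function name `contingency_similarity_sketch` is hypothetical, and it assumes the score is one minus the total variation distance between the two normalized contingency tables.

```python
import numpy as np
import pandas as pd


def contingency_similarity_sketch(real, synthetic, col_a, col_b):
    """Illustrative only: 1 - total variation distance between two tables.

    Returns the score and the shape of the aligned contingency table.
    """
    # Normalized cross-tabulations: each cell is the share of rows
    # holding that (col_a, col_b) combination.
    real_table = pd.crosstab(real[col_a], real[col_b], normalize=True)
    synth_table = pd.crosstab(synthetic[col_a], synthetic[col_b], normalize=True)

    # Align both tables on the union of categories; missing cells become 0.
    real_table, synth_table = real_table.align(
        synth_table, join="outer", fill_value=0.0
    )

    # The work below touches every cell, i.e. roughly a x b values.
    tvd = 0.5 * np.abs(real_table - synth_table).to_numpy().sum()
    return 1.0 - tvd, real_table.shape


# Every (A, B) combination occurs exactly once, so the table is a x b.
a, b = 50, 40
grid = pd.DataFrame({
    "A": np.repeat(np.arange(a), b),
    "B": np.tile(np.arange(b), a),
})
score, shape = contingency_similarity_sketch(grid, grid, "A", "B")
print(score, shape)  # 1.0 (50, 40)
```

With a = 50 and b = 40 the aligned table already has 2,000 cells. The same construction with tens of thousands of categories per column would produce a table far larger than the number of rows that populate it.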

Additional Context

We are not interested in replacing ContingencySimilarity with another metric. Rather, we should optimize its performance. Some ideas include:

Any solution will have to be vetted to ensure that the overall quality score being returned does not differ too much from the status quo.
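One way such vetting could be set up is sketched below. The helper `score_drift` and its arguments are hypothetical. It assumes both the current metric and a candidate optimization can be called as `fn(real_pair, synthetic_pair)` and return a score in [0, 1]. What counts as an acceptable drift is left to the maintainers.

```python
from itertools import combinations

import pandas as pd


def score_drift(real, synthetic, baseline_fn, candidate_fn):
    """Absolute score difference per column pair (illustrative helper)."""
    drifts = {}
    for col_a, col_b in combinations(real.columns, 2):
        real_pair = real[[col_a, col_b]]
        synth_pair = synthetic[[col_a, col_b]]
        # Score the same two-column slice with both implementations.
        baseline = baseline_fn(real_pair, synth_pair)
        candidate = candidate_fn(real_pair, synth_pair)
        drifts[(col_a, col_b)] = abs(baseline - candidate)
    return drifts
```

In practice `baseline_fn` could wrap the existing `ContingencySimilarity.compute`, and the largest value in the returned dictionary could then be compared against whatever tolerance is agreed on.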