sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Try to improve performance of contingency_similarity #622

Closed: amontanez24 closed this issue 2 months ago

amontanez24 commented 2 months ago

Problem Description

As a user, I'd like to get the results of my metrics reports as quickly as possible.

We audited the QualityReport because it seemed slow. The conclusion was that most of the time is spent in the contingency similarity metric, specifically in these lines: https://github.com/sdv-dev/SDMetrics/blob/685731fbfe8ae2744793f7ee93b1dd9700a2f0ef/sdmetrics/column_pairs/statistical/contingency_similarity.py#L45-L54

This is the performance report's visualization: [image]

Expected behavior

The goal of this issue is to improve the performance of contingency_similarity without changing the algorithm at all. Optimizations that are in scope include

The optimizations should not change the overall algorithm of the metric.
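One way to guard the "same algorithm" constraint while optimizing is to keep a small, slow reference implementation and check that any faster version returns the same scores. The sketch below is not the SDMetrics code. It assumes the metric is defined as 1 minus the total variation distance between the normalized contingency tables of a column pair, and the function name `contingency_similarity_reference` and the list-of-tuples input format are hypothetical, chosen only for illustration. It also assumes both inputs are non-empty.

```python
from collections import Counter


def contingency_similarity_reference(real_pairs, synthetic_pairs):
    """Hypothetical reference score for a pair of categorical columns.

    Each argument is a non-empty list of (value_a, value_b) tuples, one
    tuple per row. Returns 1 minus the total variation distance between
    the two normalized contingency tables.
    """
    real_counts = Counter(real_pairs)
    synthetic_counts = Counter(synthetic_pairs)
    n_real = len(real_pairs)
    n_synthetic = len(synthetic_pairs)

    # Union of category combinations seen in either dataset. A combination
    # missing from one side has frequency 0 there (Counter returns 0).
    combos = set(real_counts) | set(synthetic_counts)
    total_abs_diff = sum(
        abs(real_counts[c] / n_real - synthetic_counts[c] / n_synthetic)
        for c in combos
    )
    return 1 - 0.5 * total_abs_diff


real = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "x")]
synthetic = [("a", "x"), ("a", "x"), ("b", "x"), ("b", "y")]
score = contingency_similarity_reference(real, synthetic)
identical_score = contingency_similarity_reference(real, real)
```

An optimized version could then be compared against this reference on a handful of small column pairs, including pairs where some category combinations appear in only one of the two datasets.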

Additional context