Problem Description
As a user, I'd like to get the results of my metrics reports as quickly as possible.
We performed an audit on the QualityReport since it seemed to be slow. The conclusion was that most of the time is lost in the contingency similarity metric. More specifically, these lines: https://github.com/sdv-dev/SDMetrics/blob/685731fbfe8ae2744793f7ee93b1dd9700a2f0ef/sdmetrics/column_pairs/statistical/contingency_similarity.py#L45-L54
A visualization from the performance report accompanied the audit.

Expected behavior
Without changing the algorithm at all, the goal of this issue is to improve the performance of contingency_similarity. Optimizations that are in scope include:
- Trying different pandas or numpy functions instead of crosstab
- Doing any type conversions at a higher level (e.g. the astype(str) calls are happening multiple times on the same columns)
- Seeing if there is a more efficient way to compute the table
The optimizations should not change the overall algorithm of the metric.

Additional context
If not many optimizations can be made, we can follow up with a different issue.
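One in-scope direction sketched below is computing the joint frequencies without pd.crosstab. This is not the metric's actual code (the helper names and columns are hypothetical); it assumes the metric is, roughly, one minus half the total variation distance between the real and synthetic frequency tables, and uses DataFrame.value_counts to skip materializing the 2-D table:

```python
import pandas as pd


def joint_frequencies(df: pd.DataFrame, col_a: str, col_b: str) -> pd.Series:
    """Relative frequency of each (col_a, col_b) combination.

    Equivalent to pd.crosstab(df[col_a], df[col_b], normalize='all')
    flattened to a Series, but it avoids building the 2-D table.
    """
    return df[[col_a, col_b]].value_counts(normalize=True)


def contingency_score(real, synthetic, col_a, col_b):
    """One minus half the total variation distance between the joint tables."""
    f_real = joint_frequencies(real, col_a, col_b)
    f_synth = joint_frequencies(synthetic, col_a, col_b)
    # Align on the union of category pairs; pairs missing on one side count as 0.
    diff = f_real.sub(f_synth, fill_value=0).abs().sum()
    return 1 - diff / 2
```

On identical data this returns 1.0, and completely disjoint category pairs give 0.0; whether it benchmarks faster than crosstab on real workloads would need to be measured.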
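The second in-scope item, hoisting the astype(str) calls, can be sketched as follows. The helper name is hypothetical and this is only an illustration of the idea, not the library's code: convert each column once up front and reuse the result for every column pair, instead of re-casting inside each per-pair call.

```python
import itertools

import pandas as pd


def convert_once(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: cast the discrete columns to string a single
    time, rather than calling .astype(str) inside every per-pair metric."""
    return df[columns].astype(str)


# Usage sketch: with N columns there are N*(N-1)/2 pairs, so converting
# up front turns O(N^2) repeated casts into O(N) casts.
df = pd.DataFrame({"A": [1, 2, 2], "B": [0.5, 0.5, 1.5], "C": ["a", "b", "a"]})
cols = ["A", "B", "C"]
converted = convert_once(df, cols)
pairs = list(itertools.combinations(cols, 2))  # every pair reuses `converted`
```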