synthesized-io / insight

🧿 Metrics & Monitoring of Datasets
BSD 3-Clause "New" or "Revised" License

Optimization list for metrics #161

Open marqueewinq opened 5 months ago

marqueewinq commented 5 months ago

In classes like `KendallTauCorrelation`, series are converted to integers when they are of datetime type. This conversion is expensive. The conversion to categorical variables in `KendallTauCorrelation` is also duplicated for `sr_a` and `sr_b` and could be factored into a shared helper.
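A minimal sketch of such a helper (the name `as_ordinal` and its placement are assumptions, not the library's actual API): it converts a datetime series to integer nanoseconds exactly once and passes other dtypes through, so both `sr_a` and `sr_b` go through the same code path.

```python
import pandas as pd


def as_ordinal(sr: pd.Series) -> pd.Series:
    """Hypothetical helper: convert datetime series to int64 nanoseconds once.

    Non-datetime series are returned unchanged, so callers can apply this
    uniformly to both inputs without a dtype check at each call site.
    """
    if pd.api.types.is_datetime64_any_dtype(sr):
        return sr.astype("int64")
    return sr


sr_a = as_ordinal(pd.Series(pd.to_datetime(["2020-01-01", "2020-01-02"])))
sr_b = as_ordinal(pd.Series([1.0, 2.0]))
```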

In Mean._compute_metric, subtracting and then adding an array of zeros is redundant. It could be simplified to np.nanmean(sr.values).

The `StandardDeviation` class performs a full sort before trimming outliers. A full `O(n log n)` sort may be unnecessary: selecting the trim boundaries is enough, especially when a significant number of values are removed.
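One way to avoid the full sort is to find the trim boundaries with `np.quantile` (which uses selection rather than a complete ordering) and mask the array. The 5%/95% bounds below are illustrative assumptions, not the metric's actual trim fractions:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)

# Selecting the two boundary quantiles avoids ordering every element;
# the mask then keeps only the values inside the trimmed range.
lo, hi = np.quantile(values, [0.05, 0.95])
trimmed = values[(values >= lo) & (values <= hi)]
std = trimmed.std()
```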

In `EarthMoversDistance`, the operations on pandas series can be performed more efficiently. We also use a dict to count unique values, but pandas provides `value_counts` for exactly this, so the hand-rolled counting likely loses efficiency as well.
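A sketch of the counting step with pandas built-ins (variable names are illustrative): `value_counts(normalize=True)` replaces the dict loop, and reindexing both histograms on the union of observed categories aligns them for the distance computation.

```python
import pandas as pd

sr_a = pd.Series(["a", "b", "b", "c"])
sr_b = pd.Series(["a", "a", "c", "c"])

# value_counts replaces a manual dict-based tally in one vectorized call.
counts_a = sr_a.value_counts(normalize=True)
counts_b = sr_b.value_counts(normalize=True)

# Align both distributions on the shared support, filling absent
# categories with probability zero.
support = counts_a.index.union(counts_b.index)
p = counts_a.reindex(support, fill_value=0).to_numpy()
q = counts_b.reindex(support, fill_value=0).to_numpy()
```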

We can use np.nan_to_num in EarthMoversDistanceBinned to handle nans more explicitly.
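The idea in one line; filling with `0.0` is the assumed default here, the actual replacement value would depend on how the binned distances are consumed:

```python
import numpy as np

binned = np.array([0.2, np.nan, 0.5, np.nan])

# nan_to_num makes the NaN policy explicit at the call site instead of
# relying on downstream code to skip missing bins.
cleaned = np.nan_to_num(binned, nan=0.0)
```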

In several metrics, there is a check for an empty series or other special conditions that returns a default value. These checks should be moved up so the expensive work is skipped entirely (not critical). The `check_column_types` method is quite similar across the different classes, so we might want to abstract it into a shared base (not critical). Overall, it seems there could be less code duplication (not critical).

### Tasks
- [ ] Refactor datetime conversion in `KendallTauCorrelation`
- [ ] Implement a utility function for categorical conversion in `KendallTauCorrelation`
- [ ] Update `Mean._compute_metric` to use `np.nanmean(sr.values)`
- [ ] Update `StandardDeviation` to decide if sorting is necessary
- [ ] Use pandas ops in `EarthMoversDistance`
- [ ] Use `np.nan_to_num` in `EarthMoversDistanceBinned` to handle nans
- [ ] Move up the checks for empty series / special conditions
- [ ] Abstract away the similar `check_column_types` methods across all classes