vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License
144 stars, 26 forks

Energy distance using medians? #22

multimeric closed this issue 3 years ago

multimeric commented 3 years ago

Hi, thank you for your phenomenal work writing and documenting this library.

As I'm sure you're aware, there has been some literature suggesting that an energy statistic that is more robust to outliers can be obtained by taking the median rather than the mean when averaging the distances between samples. See: James, N. A., Kejariwal, A., & Matteson, D. S. (2016). Leveraging cloud data to mitigate user experience from ‘breaking bad.’ 2016 IEEE International Conference on Big Data (Big Data), 3499–3508. https://doi.org/10.1109/BigData.2016.7841013. The relevant part is section 3a of that article, "Robustness against Anomalies".

From looking at this library, it seems to me that this change would be as simple as allowing a configurable "average" function which would replace the use of mean in this code:

https://github.com/vnmabus/dcor/blob/e7351553fb277f271ede1bf3e7148b408185707a/dcor/_energy.py#L24-L28
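To make the idea concrete, here is a rough, standard-library-only sketch of an energy distance with a pluggable averaging function. It is not dcor's actual code: the names `pairwise_distances`, `energy_distance`, and the `average` parameter are hypothetical, and I'm assuming the within-sample averages run over all ordered pairs, including each point paired with itself, as a mean over a full distance matrix would.

```python
import math
import statistics
from typing import Callable, List, Sequence

Point = Sequence[float]


def pairwise_distances(a: Sequence[Point], b: Sequence[Point]) -> List[float]:
    """Euclidean distances between every point of `a` and every point of `b`."""
    return [math.dist(p, q) for p in a for q in b]


def energy_distance(
    x: Sequence[Point],
    y: Sequence[Point],
    average: Callable[[List[float]], float] = statistics.fmean,
) -> float:
    """Energy distance estimate: 2*avg(d_xy) - avg(d_xx) - avg(d_yy).

    `average` is a hypothetical parameter; it defaults to the mean and can
    be swapped for `statistics.median` to get a more outlier-robust variant.
    """
    d_xy = pairwise_distances(x, y)  # between-sample distances
    d_xx = pairwise_distances(x, x)  # within-sample distances (self pairs included)
    d_yy = pairwise_distances(y, y)
    return 2 * average(d_xy) - average(d_xx) - average(d_yy)


# One extreme value in `y` dominates the mean-based estimate but barely
# affects the median-based one.
x = [(0.0,), (0.0,), (0.0,)]
y = [(1.0,), (1.0,), (100.0,)]
mean_based = energy_distance(x, y)
median_based = energy_distance(x, y, average=statistics.median)
```

Whether self pairs should be included when taking the median is a design choice I haven't settled on; excluding them would change the within-sample terms.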

Would you be interested in such an implementation?

vnmabus commented 3 years ago

I was not aware of that paper. If you open a PR that exposes an average parameter for the relevant functions, I have no problem reviewing it.