vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License

Clarification of distance correlation - dcor vs scipy #21

Closed amjass12 closed 3 years ago

amjass12 commented 3 years ago

Hi!

I have started using dcor because I need to compute the correlation between two variables/vectors for every pair of columns in a dataframe. I am using distance correlation because I need to detect non-linear relationships as well as linear ones.

Having read the documentation, I know this is the correct implementation for this purpose, however, as I understand it, Scipy also provides a distance correlation function. I am getting different results when using both dcor and scipy and was wondering if you could explain why? I am unsure if Scipy is actually using the same distance correlation, or if their implementation contains something obvious I have missed which leads to the different results:

from scipy.spatial import distance
distance.correlation(data['column1'], data['column2'])
# result: 0.57

import dcor
dcor.distance_correlation(data['column1'], data['column2'])
# result: 0.41

There is a large discrepancy here and I would appreciate clarification!

thank you!

vnmabus commented 3 years ago

This is because scipy is not computing distance correlation, but transforming the usual (Pearson) correlation R into a (semi)metric, as 1 - R, so that highly correlated variables (correlation near 1) are close using this metric (distance near 0). The naming of that functionality is unfortunate, and I am afraid that it has confused some people before (see https://stackoverflow.com/questions/35988933/scipy-distance-correlation-is-higher-than-1 and https://stackoverflow.com/questions/60392972/scipy-distance-correlation-scale, for example).
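To make the 1 - R transformation concrete, here is a minimal sketch in plain NumPy. The helper name `correlation_distance` is hypothetical and chosen only for illustration; it reproduces the formula described above rather than calling scipy itself.

```python
import numpy as np


def correlation_distance(u, v):
    """Return 1 - R, where R is the Pearson correlation of u and v.

    Hypothetical helper for illustration: it mirrors the 1 - R formula
    described above and is not scipy's own code.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    r = np.corrcoef(u, v)[0, 1]  # Pearson correlation coefficient
    return 1.0 - r


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect positive linear relationship: R = 1, so the "distance" is 0.
same = correlation_distance(x, 2 * x + 1)

# Perfect negative linear relationship: R = -1, so the "distance" is 2.
opposite = correlation_distance(x, -x)
```

Because R ranges over [-1, 1], this quantity ranges over [0, 2], which is why it can exceed 1 (the situation in the first Stack Overflow question linked above). Distance correlation, by contrast, always lies in [0, 1].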

amjass12 commented 3 years ago

Thank you @vnmabus for clarifying, this makes perfect sense!

So, just to confirm: dcor is the right package for computing distance correlation, which by definition can detect both linear and non-linear dependence between pairs of variables? (Sorry, I just want to be absolutely sure I am using the intended analysis!)

thanks again

vnmabus commented 3 years ago

Yes, this package can find nonlinear correlations, as it implements Székely's distance correlation (https://en.wikipedia.org/wiki/Distance_correlation).
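As a rough illustration of what "finding nonlinear correlations" means, the sketch below implements the standard (biased) sample distance correlation for one-dimensional data from scratch in NumPy. The function names are hypothetical and this is not dcor's code; in practice `dcor.distance_correlation` is the implementation to use.

```python
import numpy as np


def centered_distances(x):
    """Double-centered matrix of pairwise distances |x_i - x_j| (1-D input)."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])
    # Subtract row and column means, then add back the grand mean.
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation_1d(x, y):
    """Biased sample distance correlation (hypothetical illustrative helper)."""
    a = centered_distances(x)
    b = centered_distances(y)
    dcov2 = max((a * b).mean(), 0.0)         # squared distance covariance
    dvar2 = (a * a).mean() * (b * b).mean()  # product of squared distance variances
    if dvar2 == 0:
        return 0.0
    return float(np.sqrt(dcov2 / np.sqrt(dvar2)))


x = np.linspace(-1, 1, 201)
y = x ** 2  # fully determined by x, but not linearly

pearson = np.corrcoef(x, y)[0, 1]            # essentially zero by symmetry
nonlinear = distance_correlation_1d(x, y)    # clearly greater than zero
linear = distance_correlation_1d(x, 2 * x + 1)  # 1 for an exact linear relation
```

Here Pearson's R misses the quadratic relationship entirely, while the distance correlation is well above zero, and an exact linear relationship still gives a distance correlation of 1.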

amjass12 commented 3 years ago

perfect, thank you for confirming and thanks for your time.