Distance correlation of matrix and vector.

vnmabus / dcor

Distance correlation and related E-statistics in Python

https://dcor.readthedocs.io

MIT License


Distance correlation of matrix and vector. #3

Closed. CompRhys closed this issue 5 years ago

CompRhys commented 5 years ago

dcor returns a scalar for the distance correlation of a matrix and a vector. I don't yet understand why. Isn't distance correlation defined between two vectors? If so, I would expect a vector of correlations as the output, one per column of the matrix.

Could you explain what's going on?

vnmabus commented 5 years ago

Can you provide an example of the input, the current output and the desired output? Currently, the functions only accept samples from two random vectors. I was trying to implement pairwise computation of these measures (look at the develop branch), but it is not publicly available right now. I intended to use a separate function for that, because I think it is clearer that way.

CompRhys commented 5 years ago

Sure. I think the issue is that I don't follow what you mean by instances of random vectors:

import numpy as np
import dcor
a = np.array([1, 2, 3, 4])
b = np.array([5, 8, 6, 2])
c = np.column_stack((a, b))  # i.e. a (4, 2) matrix; column_stack takes a tuple of arrays

So for dcor.distance_correlation(a, a) we'd expect 1.0, and for dcor.distance_correlation(a, b) I get 0.795. For dcor.distance_correlation(a, c) I'd expect to get back the vector [[1.0] [0.795]], but I instead get a single scalar, 0.886.

vnmabus commented 5 years ago

distance_correlation interprets those as follows:

a and b both contain 4 evaluations of a random variable.
c contains 4 evaluations of a random vector with 2 elements. Thus distance_correlation(a, c) is well defined, because distance correlation is defined even for two random vectors of different dimensions, and the result is a single number.
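To make that interpretation concrete, here is a minimal from-scratch sketch in NumPy. It is not dcor's implementation, and the helper names `centered_distances` and `distance_correlation_sketch` are made up for illustration. It assumes the usual biased (V-statistic) sample estimator, in which each input is reduced to an n x n matrix of Euclidean distances between its rows, so the number of columns in each input does not have to match.

```python
import numpy as np


def centered_distances(x):
    """Double-centered matrix of pairwise Euclidean distances between rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # A 1-D array is treated as n samples of a scalar random variable.
        x = x[:, np.newaxis]
    d = np.linalg.norm(x[:, np.newaxis, :] - x[np.newaxis, :, :], axis=-1)
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation_sketch(x, y):
    """Biased sample distance correlation of two samples with the same n."""
    a = centered_distances(x)
    b = centered_distances(y)
    dcov_sqr = max((a * b).mean(), 0.0)  # guard against tiny negative round-off
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    return float(np.sqrt(dcov_sqr / denom)) if denom > 0 else 0.0


a = np.array([1, 2, 3, 4])
b = np.array([5, 8, 6, 2])
c = np.column_stack((a, b))  # 4 samples of a 2-element random vector

print(distance_correlation_sketch(a, b))  # roughly 0.795
print(distance_correlation_sketch(a, c))  # roughly 0.886, a single number
```

Both calls return one scalar because `a`, `b` and `c` all have 4 rows, and each is turned into a 4 x 4 distance matrix before anything is compared.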

CompRhys commented 5 years ago

Ahh okay, now I see, thanks! I hadn't really thought about the fact that the two vectors could have different dimensions, since each distance matrix is constructed from norms between samples and so is n x n either way. That's what was confusing me.

A pairwise implementation would be good, but for now I can just refactor my code to use dcor.distance_covariance. That avoids redundantly recalculating dvar(Y) on every iteration when looping over arrays of random variables.
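The refactoring described above could look roughly like the following. This is a hedged, NumPy-only sketch rather than code built on dcor itself: `centered_distances` and `columnwise_distance_correlation` are hypothetical helpers, and the biased sample estimator is assumed. The point is only that the centered distance matrix and distance variance of the fixed sample are computed once and reused for every column.

```python
import numpy as np


def centered_distances(x):
    """Double-centered matrix of pairwise Euclidean distances between rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, np.newaxis]
    d = np.linalg.norm(x[:, np.newaxis, :] - x[np.newaxis, :, :], axis=-1)
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def columnwise_distance_correlation(columns, y):
    """Distance correlation of each column of `columns` with the fixed sample y."""
    b = centered_distances(y)   # computed once for the fixed sample
    dvar_sqr_y = (b * b).mean() # squared distance variance of y, also cached
    results = []
    for col in np.asarray(columns, dtype=float).T:
        a = centered_distances(col)
        dcov_sqr = max((a * b).mean(), 0.0)
        denom = np.sqrt((a * a).mean() * dvar_sqr_y)
        results.append(np.sqrt(dcov_sqr / denom) if denom > 0 else 0.0)
    return np.array(results)


a = np.array([1, 2, 3, 4])
b = np.array([5, 8, 6, 2])
c = np.column_stack((a, b))

pairwise = columnwise_distance_correlation(c, a)
print(pairwise)  # roughly [1.0, 0.795], one value per column of c
```

This yields the per-column vector originally expected from the matrix input, while the joint call on the whole matrix keeps its meaning as a single correlation between a scalar variable and a 2-element random vector.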