vnmabus / dcor

Distance correlation and related E-statistics in Python
https://dcor.readthedocs.io
MIT License
145 stars 26 forks source link

Counting the distance from a point to itself #26

Closed multimeric closed 3 years ago

multimeric commented 3 years ago

Hi, I've hit a bit of a problem. I was trying to work out why I was getting different results from the ecp R package versus dcor. After some intense investigation, I think the cause seems to be at the point of taking the mean of each within-sample distance. Note, this is before we apply the coefficient or consider the between-sample distances. Precisely, I'm referring to the mean taken here: https://github.com/vnmabus/dcor/blob/161a6f5928ec0f30ce89fcfd5e90e6ed9315e383/dcor/_energy.py#L41-L42

In all the Székely and Rizzo papers (e.g. Székely & Rizzo, 2004), this mean is defined as the arithmetic mean, and the same as you have used in dcor: image

However in the Matteson and James papers I have been looking at (e.g. Matteson & James, 2014; James et al., 2016), they seem to define it as follows:

image

What they seem to be doing here is summing the lower triangle of the matrix, excluding the diagonal, and then divided by the combination n choose 2. So if we had a sample with 5 items, the full distance matrix would be 5 x 5 = 25 items, but the lower triangle would only have 10 items in it. They would sum these distances and divide by 5 choose 2, which is 10. So this is also taking the mean, but it's the mean excluding the diagonal, which is of course always 0 in a within-sample distance matrix. The ultimate outcome is that their "mean" is actually \frac{n}{n-1} \mu, which is larger than it should be, as it isn't counting the 0s on the diagonal.

Note that this is also visible in the implementation of their work, in the ecp package. Here, they sum the matrix but then divide by n \times n - 1, which is equivalent to the above, but not equivalent to the true mean: https://github.com/zwenyu/ecp/blob/65a9bb56308d25ce3c6be4d6388137f428118248/src/energyChangePoint.cpp#L112

My question is this: are they simply wrong? If no, is there any theory supporting this alternative formula? If there is, should this be something supported in dcor? Fortunately it kind of already is thanks to my customizable average feature. But it could be called out specifically. I appreciate your input here as you likely understand this domain better than I do.

multimeric commented 3 years ago

I've coded up a hacky implementation of this average in a way that's usable in dcor:

def imbalanced_average(arr):
    nrow, ncol = arr.shape
    if nrow == ncol and np.all(np.diag(arr) == 0):
        # Under these conditions, this is probably within-sample distance, so correct the mean
        return np.sum(arr) / (nrow * (nrow - 1))
    else:
        # Otherwise, this is a between-sample distance, so we don't have to correct for 0s
        return np.mean(arr)

test = dcor.homogeneity.energy_test_statistic(sample_a, sample_b, average=imbalanced_average)

Whether or not this is wise, I can't say.

vnmabus commented 3 years ago

I really do not know why they use that expression (have you tried asking them?). I think they are not wrong, in the sense that asymptotically the values in the diagonal do not matter and that estimator will tend to the energy distance of the population. Maybe they are trying to reduce a bit the computation time, as they sum in a loop. However in numpy the sum function is heavily optimized (e. g. using SIMD instructions), so I think that using that formula would make it slower, not faster.

If there is not a better motivation for that, I would prefer to leave the code as it is now. For matching the results of ecp I think it is REALLY improbable that your hacky implementation gives a false positive for within-sample distances, as you are doing an exact comparation with 0. The only case in which you WILL have a false positive is if you try to compute energy_distance(X, X) (an even more hacky implementation would keep the count of how many times has been called and could work even in that case, but that would be a HORRIBLE function). Note that for a high value of n the results should be quite similar.

multimeric commented 3 years ago

Yes I will try to delve into why they might be doing this. I guess the impact for dcor if there is a good motivation behind this, is that I might want to add a flag to the average function for same_sample=True so I don't have to guess when this is the case.

multimeric commented 3 years ago

Okay, the helpful answer I got back from Matteson is that they define energy distance using U-Statistics (unbiased), while Székely and Rizzo use V-statistics (which are also what you have implemented in dcor). I can't personally explain the advantages of either, but there seems to be a theoretical justification for each. Possibly even a better one for U-statistics since they're unbiased.

Would you consider a flag for U vs V statistics mode for distance calculations?

vnmabus commented 3 years ago

I would consider it, but I am not sure how it will fit with the custom mean callable.

multimeric commented 3 years ago

Well I guess the U/V setting would define a subset of the array to pass in to the average function. That would let you use all combinations of U/V and mean/median.

vnmabus commented 3 years ago

I am not sure that the resulting estimator is unbiased with an arbitrary average. I do not know enough about the theory of U-statistics. If you want to research that, we could add it.

Another possibility is to restrict the possible averages to "mean", "median" and "mean-unbiased". And as an alternative, we could accept TWO callables, for the inter and intra means (but then the burden of choosing a right mean falls to the user).

multimeric commented 3 years ago

James et al., 2016 don't really go into why using the median is theoretically justified, so I'm not sure. I guess it depends to what extent you want to prevent users from doing something stupid. If it were me I would allow all combinations of statistic-type and average, and just point out that certain combinations of parameters correspond to their uses in different papers.

Having a "mean" or "median" string which is internally converted to np.mean/np.median could be a nice user-friendly change, but I would rather keep the statistic type separate.

I'm envisaging a function signature something like:

def energy_test_statistic(
    x: ArrayLike,
    y: ArrayLike,
    *, 
    exponent: float=1,
    stat: Literal['u', 'v']='v',
    average: Callable[[ArrayLike], float]=np.mean
):
    pass