Help with understanding the homogeneity test

srujan741 commented 4 years ago

I have an experiment in which I have two groups of customers with the same attributes. I wanted to run a multivariate homogeneity test on them, so I called dcor.homogeneity.energy_test() on the two groups. My problem is that I always end up with a p-value of 1 or close to 1. I simulated a 2-D dataset for two cases: (a) two distinct, well-separated clusters, and (b) overlapping clusters. The p-value came out as 1 in both cases, although the value of the test statistic was different. How does the homogeneity test work? Any help is much appreciated.

vnmabus commented 4 years ago

First, if you are obtaining such high p-values for clearly distinct distributions, there may be a bug in the code, or you may be calling the method with the wrong parameters, because that should not happen. Can you provide an example of how you are using the method?

As for the explanation and understanding, the complete procedure is explained in the original article of Székely and Rizzo.

I will summarize the method:

The null hypothesis is that the two samples have the same distribution. The alternative hypothesis is that their distributions are different (it does not matter how).
In the article they prove that the expected energy statistic (energy_test_statistic in the code) between two samples converges if the samples have the same distribution, but tends to infinity (as the sample sizes grow) if they have different distributions.
So we will reject the null hypothesis if the energy statistic is "too high". But how do we decide what counts as "too high"? Because our samples have a finite size, the statistic will never actually be near infinity.
Here is where we use the idea of a permutation test. Essentially, under the null hypothesis, all the observations come from the same distribution. Thus, if we permute the observations, so that now some observations may switch to a different sample, under the null hypothesis, the energy statistic would be similar to the original one: there is no reason for the original one to be special.
Under the alternative hypothesis, by contrast, the samples obtained from a permutation both come from a common distribution, which is a mixture of the original distributions of each sample. When we computed the original statistic, however, each sample had its own distribution. Thus, the original statistic is expected to be larger than the statistics obtained from the permutations.
Thus, we can perform a large number of random permutations (the number of permutations is the parameter num_resamples). We then compare the resulting statistics with the original one and compute the proportion of them that are larger than the original. This proportion is the estimated p-value.
Under the alternative hypothesis, this p-value should be very small, as the statistic should be more extreme for the original data. Under the null hypothesis, the original statistic is not special in any way, so this p-value would be distributed uniformly between 0 and 1: the probability that it is less than α is α. Thus, if we reject the null hypothesis whenever the p-value is less than 0.05, we will wrongly reject a true null hypothesis about one time in 20.
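The steps above can be sketched in plain NumPy. This is only a minimal illustration, not dcor's implementation: the helper names (pairwise_mean_distance, energy_statistic, permutation_energy_test), the n*m/(n+m) scaling, the ">=" comparison and the +1 correction in the p-value are all assumptions of this sketch.

```python
import numpy as np


def pairwise_mean_distance(a, b):
    """Mean Euclidean distance between every row of a and every row of b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).mean()


def energy_statistic(x, y):
    """Energy distance between samples x and y, scaled by n*m/(n+m)."""
    n, m = len(x), len(y)
    energy_distance = (
        2 * pairwise_mean_distance(x, y)
        - pairwise_mean_distance(x, x)
        - pairwise_mean_distance(y, y)
    )
    return n * m / (n + m) * energy_distance


def permutation_energy_test(x, y, num_resamples=200, seed=0):
    """Return (statistic, estimated p-value) for H0: same distribution."""
    rng = np.random.default_rng(seed)
    observed = energy_statistic(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    at_least_as_large = 0
    for _ in range(num_resamples):
        # Shuffle the rows, then split again into two pseudo-samples.
        shuffled = rng.permutation(pooled)
        if energy_statistic(shuffled[:n], shuffled[n:]) >= observed:
            at_least_as_large += 1
    # Assumption: the +1 counts the original arrangement as one permutation.
    p_value = (at_least_as_large + 1) / (num_resamples + 1)
    return observed, p_value


rng = np.random.default_rng(42)
separated_p = permutation_energy_test(
    rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))
)[1]
overlapping_p = permutation_energy_test(
    rng.normal(0, 1, size=(50, 2)), rng.normal(0, 1, size=(50, 2))
)[1]
print("well-separated clusters: p =", separated_p)
print("same distribution:       p =", overlapping_p)
```

With well-separated clusters the estimated p-value should be close to its minimum of 1/(num_resamples + 1), while for two samples from the same distribution it can fall anywhere in (0, 1]. Note that in this sketch num_resamples=0 gives a p-value of exactly 1 whatever the data, so a p-value stuck at 1 may simply mean that no permutations were run; whether that applies to a particular dcor call is worth checking against its documentation.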

vnmabus commented 4 years ago

I will close this as there has been no answer from @srujan741.

vnmabus / dcor

Help with understanding the homogeneity test #11