mariarizzo / energy

energy package for R
https://mariarizzo.github.io/energy/index.html
43 stars 7 forks source link

dcor.test p-value dependent on R #4

Closed tomato42 closed 2 years ago

tomato42 commented 3 years ago

While testing some data I've noticed that the calculated p-value is not only constant for individual runs of the dcor.test function, for non-independent data it actually decreases as R increases...

Reproducer:

x = rnorm(1000, sd=6)
y = sin(x) + rnorm(1000, sd=1)
dcor.test(x, y, R=10)

produces:

        dCor independence test (permutation test)

data:  index 1, replicates 10
dCor = 0.1071, p-value = 0.09091
sample estimates:
     dCov      dCor   dVar(X)   dVar(Y) 
0.1883853 0.1070995 3.9104040 0.7912209 

But If I keep increasing the R, to 20:

        dCor independence test (permutation test)

data:  index 1, replicates 20
dCor = 0.1114, p-value = 0.04762
sample estimates:
     dCov      dCor   dVar(X)   dVar(Y) 
0.1907829 0.1113982 3.7785888 0.7762351 

50:

        dCor independence test (permutation test)

data:  index 1, replicates 50
dCor = 0.1114, p-value = 0.01961
sample estimates:
     dCov      dCor   dVar(X)   dVar(Y) 
0.1907829 0.1113982 3.7785888 0.7762351

or 500:

        dCor independence test (permutation test)

data:  index 1, replicates 500
dCor = 0.1114, p-value = 0.001996
sample estimates:
     dCov      dCor   dVar(X)   dVar(Y) 
0.1907829 0.1113982 3.7785888 0.7762351
mariarizzo commented 3 years ago

Dear Hubert Kario,

There is nothing wrong or strange about your example. You have an example of dependent data. What is your null hypothesis? It should be that the samples are independent, which is that population dcor(x, y)=0. We reject H_0 for large values of dcor. The p-value in the first case tells you that your observed statistic is the largest among the ten replicates, all of which are generated under the null hypothesis of independence.

A permutation test should not be run with only 10 replicates. The permutation test is exact if we generate all possible n! = 1000! Permutations. That is exp(5912), a number that overflows to Inf. That is not feasible so we generate a large number of randomly generated permutations.

If we only generate 10 replicates, the possible p-values are just multiples of 1/11.

The reason the p-values seem to be “decreasing” is that the accuracy of the estimation is improving as the number of replications increases. In your example, the statistic is in the extreme upper tail of the null distribution. In fact, what you are getting is that your test statistic is the largest. It is not even close to the values you get under independence. For example, p=1/501 in the R=500 example tells you that it was larger than all 500 permutation replicates.

You can see how far out in the upper tail your statistic tends to be in the attached report. I used 2000 permutations.

To summarize, no in general the p-value is not constant and in general, the p-value does not go to 0 as R increases. What happens is that as R increases, the p-value gets more accurate, and if the true p-value is very small, it will be converging to that very small number. If the true p-value was say, 0.2, then as R increases, you will get average p-values close to 0.2. Perhaps you could try an example where the data is not so strongly dependent.

In fact, I tried increasing R up to 10000 and the p-values (which are random variables anyway) seem to settle around .001.

Maria Rizzo, Professor Department of Mathematics & Statistics Bowling Green State University


From: Hubert Kario @.> Sent: Wednesday, October 13, 2021 1:53:40 PM To: mariarizzo/energy @.> Cc: Subscribed @.***> Subject: [EXTERNAL] [mariarizzo/energy] dcor.test p-value dependent on R (#4)

While testing some data I've noticed that the calculated p-value is not only constant for individual runs of the dcor.test function, for non-independent data it actually decreases as R increases...

Reproducer:

x = rnorm(1000, sd=6) y = sin(x) + rnorm(1000, sd=1) dcor.test(x, y, R=10)

produces:

    dCor independence test (permutation test)

data: index 1, replicates 10 dCor = 0.1071, p-value = 0.09091 sample estimates: dCov dCor dVar(X) dVar(Y) 0.1883853 0.1070995 3.9104040 0.7912209

But If I keep increasing the R, to 20:

    dCor independence test (permutation test)

data: index 1, replicates 20 dCor = 0.1114, p-value = 0.04762 sample estimates: dCov dCor dVar(X) dVar(Y) 0.1907829 0.1113982 3.7785888 0.7762351

50:

    dCor independence test (permutation test)

data: index 1, replicates 50 dCor = 0.1114, p-value = 0.01961 sample estimates: dCov dCor dVar(X) dVar(Y) 0.1907829 0.1113982 3.7785888 0.7762351

or 500:

    dCor independence test (permutation test)

data: index 1, replicates 500 dCor = 0.1114, p-value = 0.001996 sample estimates: dCov dCor dVar(X) dVar(Y) 0.1907829 0.1113982 3.7785888 0.7762351

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fmariarizzo%2Fenergy%2Fissues%2F4&data=04%7C01%7Cmrizzo%40bgsu.edu%7Cdd3996e13ed44167ff7308d98e7263ed%7Ccdcb729d51064d7cb75ba30c455d5b0a%7C1%7C0%7C637697444235313041%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=kJPwpQjkUGMOprMK42kYF1MTMXUWDKVvPVBmgxbJLn4%3D&reserved=0, or unsubscribehttps://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAA7UGDFVXCWCZWPUFTM4U4TUGXBSJANCNFSM5F5YXO5A&data=04%7C01%7Cmrizzo%40bgsu.edu%7Cdd3996e13ed44167ff7308d98e7263ed%7Ccdcb729d51064d7cb75ba30c455d5b0a%7C1%7C0%7C637697444235323031%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=ovTsO5aaowMnZUnDcr1pbARngjMDOrXstLzNprcn%2Fjo%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cmrizzo%40bgsu.edu%7Cdd3996e13ed44167ff7308d98e7263ed%7Ccdcb729d51064d7cb75ba30c455d5b0a%7C1%7C0%7C637697444235333036%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=cxwuFdWv6Y7rZAOtWP73g0KLyjDykA4VGfjwKmy8GJc%3D&reserved=0 or Androidhttps://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cmrizzo%40bgsu.edu%7Cdd3996e13ed44167ff7308d98e7263ed%7Ccdcb729d51064d7cb75ba30c455d5b0a%7C1%7C0%7C637697444235343021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=7gt4Yn5ibnMp0ttsdyukfNnLJ6F2mRoYDBRlwRJBQfE%3D&reserved=0.

tomato42 commented 3 years ago

Ah, so because p-value can only have values that are multiples of 1/(R+1), the test returns the lowest possible value: 1/(R+1), if the estimate is lower than it.

Maybe the method could print a warning message in such a case? Similar to how ks.test() complains about ties:

> a = ks.test(round(rnorm(100)), round(rnorm(100)))
Warning message:
In ks.test(round(rnorm(100)), round(rnorm(100))) :
  p-value will be approximate in the presence of ties

if returning 0 p-value isn't expected in such case.

Or explain the relationship between R and p-value resolution in the man page?

The reason why I was running the test with low R values is because I was doing some exploratory testing of larger data sets, that show dependence only for large sample size, but then even R=20 takes few minutes to calculate, let alone reasonable values like 1000 or 2000.