Wrong expression for variance used - p-values computed here may not be correct

Knusper commented 2 months ago

Prompted by a referee report I just went down a rabbit hole regarding the generalised Kendall's tau coefficient.

First, the main paper that is being referenced here is a bit misleading. Akritas & Seibert (1996; Mon. Not. R. Astron. Soc. 278, 919) develops a procedure for partial correlation coefficients with upper limits based on Kendall's tau. In this setup one has three sets of n observables (T1, T2, T3), and the question is whether the correlation between T1 and T2 is dependent on a selection bias that might introduce correlations between T1 and T3 or T2 and T3. The formalism for this case relies on the generalised Kendall Taus for (T1,T2), (T1,T3), and (T2,T3), that has been given in Nelson, Isobe, and Feigelson (1986; Astrophys. J. 306, 490).

The main issue is then, that Akritas & Seibert do not provide an estimate for the expectation value of the variance under the null-hypothesis being true. In this code here you wrote for the variance:

var = 2*(2*n+5)/(9*n*(n-1))

However, this expression is only approximately true for large samples, where the number of upper limits is small. If we have a significant fraction of upper limits, the variance depends on how the upper limits are distributed with respect to the proportions of the rest of the sample. In particular, the expression that needs to be evaluated for the variance is given by Eq. (28) in Isobe et al.

I haven't done the algebra, but I believe reduces to 2*(2*n+5)/(9*n*(n-1)) if there are no ties or upper limits. It looks structural similar to what is derived in the proof for 2*(2*n+5)/(9*n*(n-1)) on the Wikipeida page.

I'll file a PR in the coming days with the correct formula, so that we can check the impact of this.

sflury commented 2 months ago

Thanks for the insight! Let me know what you find. Definitely looks like there's a difference in the variance, but I will wait to implement it until you file your PR.

Knusper commented 2 months ago

So, I implemented this, pushed to the fork that I made in my github, I need to document the changes still, but you can already have a look. (I also added an option to change back to lower limits, if someone likes to use this for liftetimes or so...)

https://github.com/Knusper/kendall

I added basically an option, so that the user can change between Isobe method and simple method.

One thing that I noticed, is that the sign definition for J_kij' is reversed in Akritas with respect to Isobe. This does not matter, since-a_ij -b_ij == a_ij b_ij, but I found it irritating. I changed the sign now in the code accordingly.

As a test case I used the fiducial small sample in the Isobe paper, and I get the same result. For my science data the values I get do not differ significantly from what i get with your original code (i.e., the errors from the MC simulation dominate).

Moreover, If I use no censoring, I get the same p-value as with the standard expression for the variance, as suspected. With censors, the variance from Isobe is slightly smaller than the theoretical value in the case of uncensored data, thus the p-values also decrease. This is actually expected, since we now estimate the variance from the data and do not use a theoretical value. It is also proven, that this is expected: https://sci-hub.st/10.2307/2530458 -- the upper bound in Eq. 1 is the theoretical result, and for censored data the variance will decrease. (I don't find this intuitive at all...)

Now, there exists an R code, that uses an improved estimate of the variance with respect to the Isobe et al. paper. At some point we might want to include this as well here, but this is quite sophisticated and I don't understand a bit in the associated paper and I am also not so good with R:

code: https://rdrr.io/cran/NADA/src/R/ken.R
docs: https://rdrr.io/cran/NADA/man/cenken.html
paper: https://sci-hub.st/https://doi.org/10.2307/2291140

The p-values from this routine are slightly larger then the Isobe et al. values, but again below the theoretical result in the absence of upper limits.

sflury / kendall

Wrong expression for variance used - p-values computed here may not be correct #1