gnu-octave / statistics

The Statistics package for GNU Octave
GNU General Public License v3.0
25 stars 24 forks source link

kstest uses wrong n for p-value calculation #164

Closed ecky-l closed 1 month ago

ecky-l commented 2 months ago

The kstest function creates an empirical CDF (using F_n = ecdf(...)) and then calculates the value D_n = sup_x |F_n - F| to a computed or given CDF F. Fn and F have the same size.

The test statistics D_n is the supremum of all differences between F_n and F, and the set of these differences has cardinality n. Therefore, n should be the size of the second output from the call to ecdf (line 142, where x is set to the ordinate of ecdf), because there are just that many CDF-differences calculated (line 191 interpolates the CDF to x). But instead, n is taken as the size of the input x to kstest (line 141, just before the call to ecdf). Then this n is used in the calculation of the pValue (lines 216, 242 or 247).

This does not generate an error - But it produces wrong results. Attached is a script to reproduce it. It creates an array R of poisson distributed random numbers and plots the ecdf(R) together with a calculated CDF of the poisson distribution with same \lambda. The plot shows clearly, that the CDF's are very close. But kstest rejects the H0 hypothesis, saying that R is not poisson distributed, and gives a very low p value:

pkg load statistics

R = poissrnd (1,100,1);
[F,X,Flo,Fup] = ecdf (R);
Xp = linspace (min (X), max (X), 1000)';
P = makedist ("poisson", "lambda", 1);
Fp = cdf(P, Xp);

figure
hold on
stairs (X, F, "-b")
stairs (X, Flo, "--g")
stairs (X, Fup, "--c")
stairs (Xp, Fp, "-r")
legend (
    "edcf(R)",
    "lower bound of ecdf",
    "upper bound of ecdf",
    "calculated poisson CDF",
    "location", "east")

[h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")

Output:

h = 1
p = 1.8543e-12
ks = 0.3679
cv = 0.1340

poisson_cdf

When I swap the lines 141 and 142 in kstest, so that n is taken as the length of the ecdf ordinates instead of the input x, the result looks better:

>> [h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")
h = 0
p = 0.3108
ks = 0.3679
cv = 0.5193

And this is, btw, also more in line with the result that you get with R, using the KSgeneral package:

> R = rpois(512,1)
> F = stepfun(c(0:100), c(0, ppois(0:100,1)))
> KSgeneral::disc_ks_test(R,F)

    One-sample Kolmogorov-Smirnov test

data:  R
D = 0.047879, p-value = 0.4922
alternative hypothesis: two-sided
pr0m1th3as commented 1 month ago

Thanks for pointing this out and for the PR as well. Closing as fixed