The kstest function creates an empirical CDF (F_n = ecdf(...)) and then calculates D_n = sup_x |F_n(x) - F(x)| against a computed or given CDF F; F_n and F have the same size.
The test statistic D_n is the supremum of the differences between F_n and F, and the set of these differences has cardinality n. Therefore, n should be the size of the second output of the call to ecdf (line 142, where x is set to the ordinates of ecdf), because exactly that many CDF differences are calculated (line 191 interpolates the CDF to x). Instead, n is taken as the size of the input x to kstest (line 141, just before the call to ecdf). This n is then used in the calculation of the p-value (lines 216, 242 and 247).
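The influence of n on the p-value can be sketched with the standard asymptotic two-sided Kolmogorov formula, p ≈ 2 Σ_{k≥1} (-1)^(k-1) exp(-2 k² n D²). This is an illustration in Python, not the exact formula kstest uses, but it shows how the same statistic D (the ks = 0.3679 reported below) gives radically different p-values for the two choices of n:

```python
import math

def kolmogorov_pvalue(D, n, terms=100):
    """Asymptotic two-sided p-value for the one-sample KS statistic D
    with effective sample size n: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 n D^2)."""
    s = sum((-1) ** (k - 1) * math.exp(-2 * k ** 2 * n * D ** 2)
            for k in range(1, terms + 1))
    return min(max(2 * s, 0.0), 1.0)

D = 0.3679                              # the ks statistic from the example below
print(kolmogorov_pvalue(D, 100))        # n = sample size: p on the order of 1e-12
print(kolmogorov_pvalue(D, 6))          # n = number of ECDF steps: p ~ 0.4
```

The exact values differ slightly from what kstest prints (kstest applies a small-sample correction), but the orders of magnitude match: with n = 100 the hypothesis is rejected overwhelmingly, with n near the number of distinct ECDF ordinates it is not.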
This does not raise an error, but it produces wrong results. Attached is a script to reproduce it. It creates an array R of Poisson-distributed random numbers and plots ecdf(R) together with the calculated CDF of the Poisson distribution with the same lambda. The plot clearly shows that the CDFs are very close, yet kstest rejects the null hypothesis, concluding that R is not Poisson distributed, and returns a very low p-value:
pkg load statistics
R = poissrnd (1,100,1);
[F,X,Flo,Fup] = ecdf (R);
Xp = linspace (min (X), max (X), 1000)';
P = makedist ("poisson", "lambda", 1);
Fp = cdf(P, Xp);
figure
hold on
stairs (X, F, "-b")
stairs (X, Flo, "--g")
stairs (X, Fup, "--c")
stairs (Xp, Fp, "-r")
legend (
"ecdf(R)",
"lower bound of ecdf",
"upper bound of ecdf",
"calculated poisson CDF",
"location", "east")
[h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")
Output:
h = 1
p = 1.8543e-12
ks = 0.3679
cv = 0.1340
When I swap lines 141 and 142 in kstest, so that n is taken as the length of the ecdf ordinates instead of the length of the input x, the result looks better:
>> [h,p,ks,cv] = kstest(R, "cdf", @P.cdf, "tail", "unequal")
h = 0
p = 0.3108
ks = 0.3679
cv = 0.5193
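The gap between the two choices of n is so large here because the sample is discrete: ecdf collapses tied values, so for 100 draws from Poisson(1) the number of ECDF ordinates is roughly the number of distinct values, far below the sample size. A small self-contained Python sketch (illustration only, not the Octave ecdf or poissrnd implementation; the Poisson draws use Knuth's algorithm):

```python
import math
import random

random.seed(0)

def poisson_sample(lam, size):
    # Knuth's multiplication algorithm for Poisson draws (illustration only).
    out = []
    L = math.exp(-lam)
    for _ in range(size):
        k, p = 0, 1.0
        while True:
            p *= random.random()
            if p <= L:
                break
            k += 1
        out.append(k)
    return out

R = poisson_sample(1, 100)
n_input = len(R)          # what kstest currently uses as n (line 141)
n_steps = len(set(R))     # distinct values = number of ECDF step positions
print(n_input, n_steps)   # 100 vs. a handful: the two candidate n's diverge
```

For a continuous distribution, ties have probability zero and the two counts coincide (up to the extra endpoint ecdf prepends), which may be why the discrepancy only shows up clearly with discrete data like this Poisson sample.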
And this is, by the way, also more in line with the result you get in R with the KSgeneral package:
> R = rpois(512,1)
> F = stepfun(c(0:100), c(0, ppois(0:100,1)))
> KSgeneral::disc_ks_test(R,F)
One-sample Kolmogorov-Smirnov test
data: R
D = 0.047879, p-value = 0.4922
alternative hypothesis: two-sided