numpy / numpy

The fundamental package for scientific computing with Python.
https://numpy.org

Validate the statistics of the random distributions #15911

Open charris opened 4 years ago

charris commented 4 years ago

Possible candidate for GSOC. The testing will be long running, so should be separate and run occasionally. Organization open to discussion.

anirudh2290 commented 4 years ago

Is NumPy participating in GSoC this year?

charris commented 4 years ago

No, I'm thinking of next year.

GabrielSimonetto commented 3 years ago

@charris do you have any references on this subject, or any starting points? I would like to understand how this could be done, and how we would know that the work is done.

mdhaber commented 1 year ago

@charris were you still interested in this being done?

I'm assuming you meant to perform hypothesis tests in which the null hypothesis is that a sample from the RNG is drawn from a hypothesized distribution. A small p-value would be reason to take a closer look. Of course, you could get false positives (especially if many tests are to be performed), and a large p-value is inconclusive.

SciPy has two tests that would work out of the box for any univariate distribution: the one-sample Kolmogorov-Smirnov test (`ks_1samp`) and the Cramér-von Mises test (`cramervonmises`), both shown in the example below.

goodness_of_fit can perform randomized versions of the two tests above as well as the Anderson-Darling test and Filliben's test. The caveat is that it needs to be able to produce random samples from the hypothesized distribution. Without care, this could lead to a circular test. If you trust NumPy's uniform RNG, though, you can generate samples using inverse transform sampling or something from scipy.stats.sampling.

import numpy as np
from scipy import stats
rng = np.random.default_rng(35823458245)
n = 1000

shape = 1.
loc = 0.
scale = 1.
sample = rng.gamma(shape=shape, scale=scale, size=n)  # sample from the generator under test

dist = stats.gamma
known_params = dict(a=shape, loc=loc, scale=scale)

res = stats.ks_1samp(sample, dist(**known_params).cdf)  # one-sample Kolmogorov-Smirnov test
assert res.pvalue > 0.05

res = stats.cramervonmises(sample, dist(**known_params).cdf)  # Cramér-von Mises test
assert res.pvalue > 0.05

for statistic in ('ad', 'filliben'):  # also supports 'ks', 'cvm'
    # As written, this test is somewhat circular, since goodness_of_fit will use
    # `dist.rvs` to draw its Monte Carlo samples, and `rvs` methods rely on NumPy.
    # It can be modified to perform a valid test, though.
    res = stats.goodness_of_fit(dist, sample, statistic=statistic,
                                known_params=known_params)
    assert res.pvalue > 0.05
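
One way around that circularity, sketched here under the assumption that NumPy's uniform RNG can be trusted: build a reference sample by inverse transform sampling through the gamma ppf, which depends only on the uniform generator, and compare it with the sample from the generator under test. The two-sample KS test below is my choice of comparison, not something prescribed above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(35823458245)
n = 1000
dist = stats.gamma(a=1., loc=0., scale=1.)

sample = rng.gamma(shape=1., scale=1., size=n)  # sampler under test
reference = dist.ppf(rng.uniform(size=n))       # inverse transform sampling; relies only on the uniform RNG

res = stats.ks_2samp(sample, reference)         # two-sample Kolmogorov-Smirnov test
print(res.pvalue)                               # a small p-value would warrant a closer look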

Also, since the parameters of the underlying distribution are known, one could take the CDF of the sampled values and use any test developed to validate a uniform RNG (e.g. here).
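
A minimal sketch of that idea, reusing the gamma example and standing in ks_1samp for whichever uniform-RNG test one prefers:

import numpy as np
from scipy import stats

rng = np.random.default_rng(87342925)
sample = rng.gamma(shape=1., scale=1., size=1000)

u = stats.gamma(a=1.).cdf(sample)            # probability integral transform: ~Uniform(0, 1) if the sampler is correct
res = stats.ks_1samp(u, stats.uniform.cdf)   # or apply any uniform-RNG test battery to `u`
print(res.pvalue)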

Perhaps @steppi has other ideas (e.g. Bayesian methods)?

charris commented 1 year ago

Long ago the legacy random distributions were tested to see if they produced the correct distributions. The usual method is to map the output values so that the result should be uniformly distributed and check that. @rkern, @bashtage would have a better idea whether it would be desirable to do that with the new code.

bashtage commented 1 year ago

Would be a good project for someone. I think it would be particularly helpful to verify the distribution near various boundaries, especially when the variates are generated using a rejection sampler. This would lead to practical advice about the effective range of parameter values, e.g., a formal requirement in the distribution that x >= 0 might mean that either x == 0 or x > 1e-8.
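
As a purely hypothetical illustration of that kind of boundary probe, using gamma with a very small shape parameter as the stress case: check how often the sampler returns exactly 0, what the smallest nonzero variate is, and how much mass the distribution actually places below it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1234)
a = 1e-3                                           # tiny shape parameter stresses the sampler near x == 0
x = rng.gamma(shape=a, size=1_000_000)

frac_zero = np.mean(x == 0)                        # how often "x >= 0" collapses to exactly x == 0
smallest_nonzero = x[x > 0].min()                  # effective lower bound of the generated variates
mass_below = stats.gamma(a).cdf(smallest_nonzero)  # probability the distribution puts below that bound
print(frac_zero, smallest_nonzero, mass_below)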

mdhaber commented 1 year ago

Example: `zipf` with `a` near 1 or large (gh-9829). This is something we're working toward for distributions in SciPy.

jakevdp commented 1 year ago

I'm interested to see what ideas are generated here. In JAX we use KS tests and chi-squared tests to validate distributions in jax.random, but currently this leads to very slow tests (we generate many samples for statistical robustness), and even with these many samples we still see statistically expected false-positive test failures when we change the random seed! (After all, a threshold of p < 0.05 means that even a correct implementation is expected to fail roughly one test in 20 due to statistical noise.)

I'm sure there are better ideas out there, and I'd love to hear them!

WarrenWeckesser commented 1 year ago

There must be an extensive body of literature and prior work on this. Is anyone familiar with any fundamental literature or existing software for this? For the low-level RNG algorithms, there are the "diehard tests".

For the reasons already mentioned, a single p-value from a single sample has limited value as a rigorous test. FWIW, when I added the multivariate_hypergeometric distribution to NumPy in https://github.com/numpy/numpy/pull/13794, I validated the algorithms by running many trials, with each trial generating a p-value from a G-test (a slight variation of the chi-squared test) that compared the actual and expected frequencies of the samples. I then plotted a histogram of the p-values and visually checked for any anomalies (so it was not an automated process).
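
Roughly, that workflow looks like the sketch below, with a plain binomial standing in for the multivariate hypergeometric case and a second-level KS test on the p-values standing in for the visual histogram check:

import numpy as np
from scipy import stats

rng = np.random.default_rng(8675309)
n_trials, n_draws = 500, 10_000
n_binom, p = 5, 0.3                                  # binomial(5, 0.3) as an arbitrary stand-in

k = np.arange(n_binom + 1)
expected = n_draws * stats.binom(n_binom, p).pmf(k)  # expected frequency of each outcome

pvalues = np.empty(n_trials)
for i in range(n_trials):
    draws = rng.binomial(n_binom, p, size=n_draws)
    observed = np.bincount(draws, minlength=n_binom + 1)
    # G-test: the power-divergence statistic with the log-likelihood-ratio lambda
    res = stats.power_divergence(observed, expected, lambda_="log-likelihood")
    pvalues[i] = res.pvalue

# Under a correct implementation the p-values should look uniform on [0, 1]:
# plot a histogram, or automate the check with a second-level test.
print(stats.ks_1samp(pvalues, stats.uniform.cdf))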

charris commented 1 year ago

I think diehard is considered outdated. @rkern Recommendations?

rkern commented 1 year ago

> There must be an extensive body of literature and prior work on this.

Yes and no! On the no side, even the low-level PRNG tests are of the form "make a bunch of NHSTs and check the p-values; be suspicious of clusters of extreme values". And their concerns are largely distinct from ours. The low-level PRNG tests mostly spend their effort on testing for independence since the uniform distribution properties are much easier for these algorithms to obtain. We have complementary concerns here: independence is relatively straightforward when relying on a quality low-level PRNG, while the fine details of the distribution, particularly in tail regions and at extreme parameters, are of more concern.

On the yes side, testing whether a set of samples follows a given distribution is the bread and butter of NHSTing in general. The problem for us is that we are generally implementing both the numerical CDF/PPFs and the rvs at the same time, often in terms of each other, and testing both of them at once. We're looking for pretty fine deviations on large numbers of draws, while real-data statistical practice is looking for rather larger deviations in smaller data (and the low-level PRNG test batteries are looking for even finer deviations).

NHSTs are okay for screening for problem areas, but once alerted to a potential problem area, then you have to go in and look at the generated values and compare them with the distributions to see how the issue actually presents itself. Because it's floating point, there are going to be issues somewhere; there is little chance of recovering the perfect uniformity of the BitGenerator.

> I think diehard is considered outdated.

FWIW, PractRand is what I use. TestU01 is also well-regarded (if you hear about passing/failing "BigCrush", that's TestU01). Both are ... venerable, so I use lemire/testingRNG to get them built and runnable with the latest community patches.

But neither is particularly relevant for this issue. They test low-level bit-spouting PRNG algorithms, not floating point continuous distributions. Reading about their test suite philosophy and interpretation of extreme p-values is mildly applicable, but not by much.