Investigate test weaknesses: multiple testing problem

rurban / dieharder

A fixed version of Robert G. Brown's "dieharder" tests for random number generators.

Investigate test weaknesses: multiple testing problem #6

Closed rurban closed 3 years ago

rurban commented 3 years ago

Wang Yi claims in https://github.com/wangyi-fudan/wyhash/issues/75 that all tests are systematically wrong because of the multiple comparisons problem (https://en.wikipedia.org/wiki/Multiple_comparisons_problem): when you run 1000 statistical tests, it is natural that some of them show a p-value like 0.001. You can rerun your program with a different seed: a "weakness" due to random chance will disappear with a different seed, while a systematic failure will persist.
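To put numbers on that argument, here is a minimal sketch. It assumes an ideal generator (so every p-value is uniform on [0, 1)) and independent tests; the function names are made up for illustration and are not part of dieharder or wyhash.

```python
import random


def expected_false_alarms(n_tests: int, alpha: float) -> float:
    """Expected number of tests with p < alpha when the generator is fine."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float) -> float:
    """Chance that at least one of n_tests independent tests shows p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def flagged_tests(n_tests: int, alpha: float, seed: int) -> set[int]:
    """Indices of tests whose simulated p-value falls below alpha.

    Assumption: ideal generator, so each p-value is drawn uniformly.
    """
    rng = random.Random(seed)
    return {i for i in range(n_tests) if rng.random() < alpha}


def persistent_flags(n_tests: int, alpha: float, seeds: list[int]) -> set[int]:
    """Tests flagged under every seed (seeds must be non-empty).

    Chance flags rarely survive this intersection; a systematic
    failure would show up under every seed.
    """
    runs = [flagged_tests(n_tests, alpha, s) for s in seeds]
    return set.intersection(*runs)
```

With 1000 tests and a threshold of 0.001, `expected_false_alarms` gives one spurious hit on average, and `prob_at_least_one` gives roughly 0.63, so a single p-value of 0.001 in such a run is not evidence of a defect by itself.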

rurban commented 3 years ago

In this case it's not a Marsaglia (diehard) or STS test, but a new rgb_lagged_sum test with ntup=17.

rurban commented 3 years ago

Ok, so the recommended procedure if you get unexpected weak results is to use -Y 1.

Xtrategy "resolve ambiguity"

1 - 'resolve ambiguity' (RA) mode. If a test returns 'weak', this is an undesired result. What does that mean, after all? If you run a long test series, you will see occasional weak returns for a perfect generator because p is uniformly distributed and will appear in any finite interval from time to time. Even if a test run returns more than one weak result, you cannot be certain that the generator is failing. RA mode adds psamples (usually in blocks of 100) until the test result ends up solidly not weak or proceeds to unambiguous failure. This is morally equivalent to running the test several times to see if a weak result is reproducible, but eliminates the bias of personal judgement in the process since the default failure threshold is very small and very unlikely to be reached by random chance even in many runs.

Do this for all GOOD and WEAK results. See QUALITY.md
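The RA loop described above can be sketched roughly as follows. This is not dieharder's code: the function names, the two-sided thresholds (0.005 for weak, 1e-6 for failure), the cap of 2000 psamples, and the use of an asymptotic Kolmogorov-Smirnov p-value are all assumptions made for illustration.

```python
import math
import random
from typing import Callable


def ks_uniform_pvalue(pvalues: list[float]) -> float:
    """Asymptotic KS p-value for 'pvalues are uniform on [0, 1]'."""
    xs = sorted(pvalues)
    n = len(xs)
    # Largest gap between the empirical CDF and the uniform CDF.
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series below converges slowly here; Q(lam) is ~1
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * total))


def resolve_ambiguity(
    draw_pvalue: Callable[[], float],
    weak: float = 0.005,      # assumed 'weak' threshold
    fail: float = 1e-6,       # assumed 'failed' threshold
    block: int = 100,         # psamples added per round
    max_psamples: int = 2000  # assumed cap so the loop always ends
) -> tuple[str, int, float]:
    """Add psamples in blocks until the verdict is no longer 'weak'."""
    psamples: list[float] = []
    p = 0.5
    while len(psamples) < max_psamples:
        psamples.extend(draw_pvalue() for _ in range(block))
        p = ks_uniform_pvalue(psamples)
        tail = min(p, 1.0 - p)  # treat both tails as suspicious
        if tail < fail:
            return "FAILED", len(psamples), p
        if tail >= weak:
            return "PASSED", len(psamples), p
    return "UNRESOLVED", len(psamples), p
```

Here `draw_pvalue` stands in for one run of a single test on fresh generator output. A source whose p-values are biased, for example `lambda: 0.5 * rng.random()`, is driven to "FAILED" after the first block, while an ideal source normally ends as "PASSED".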

rurban commented 3 years ago

I can either

eliminate expected outliers as in smhasher (needs a lot of time and memory, which is harder with that many PRNG results),
or subtract the alpha (family-wise error rate) (with this I lose some raw data, but it should be done),
or use a different testing strategy (-Y1 in dieharder), which rechecks weak results with different seeds to resolve potential ambiguities in bad p-values (this is already built in).
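For the family-wise error rate option, one common approach is a Bonferroni or Šidák correction of the per-test threshold. Whether that is exactly what is meant above is an assumption; the sketch below only shows what such a correction looks like, with hypothetical function names.

```python
def bonferroni_threshold(alpha_family: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate <= alpha_family."""
    return alpha_family / n_tests


def sidak_threshold(alpha_family: float, n_tests: int) -> float:
    """Slightly less conservative threshold, exact for independent tests."""
    return 1.0 - (1.0 - alpha_family) ** (1.0 / n_tests)


def significant_after_correction(
    pvalues: list[float], alpha_family: float = 0.05
) -> list[int]:
    """Indices of p-values that remain suspicious after Bonferroni correction."""
    threshold = bonferroni_threshold(alpha_family, len(pvalues))
    return [i for i, p in enumerate(pvalues) if p < threshold]
```

With 1000 tests and a family-wise rate of 0.05, the Bonferroni threshold is 0.00005, so an isolated p-value of 0.001 is no longer flagged while a p-value of 1e-10 still is.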

rurban commented 3 years ago

Our -Y1 strategy is recommended by TestU01. That's what we use, together with -k2.

When a p-value is extremely close to 0 or to 1 (for example, if it is less than 10^−10), one can obviously conclude that the generator fails the test. If the p-value is suspicious but failure is not clear enough, (p = 0.0005, for example), then the test can be replicated independently until either failure becomes obvious or suspicion disappears (i.e., one finds that the suspect p-value was obtained only by chance). This approach is possible because there is no limit (other than CPU time) on the amount of data that can be produced by a RNG to increase the sample size and the power of the test. (TestU01 manual)