rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand

Hypergeometric incorrect samples #1508

Open benjamin-lieser opened 1 week ago

benjamin-lieser commented 1 week ago

Hypergeometric::new(100, 50, 49) produces samples that are very likely not drawn from this distribution.

The distribution is not very extreme, so I would expect this to sample correctly.

One piece of evidence is in the failed KS test (see https://github.com/rust-random/rand/pull/1504)

I also ran a chi-squared test, which gives a p-value of 0.0 for a million samples (a sketch of how the observed frequencies can be collected is included after the two arrays below):

The frequencies I sample:

array([0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 4.00000e+00, 3.70000e+01,
       1.99000e+02, 8.04000e+02, 2.90700e+03, 8.51200e+03, 1.99080e+04,
       3.89010e+04, 7.11980e+04, 1.12618e+05, 1.42632e+05, 1.42973e+05,
       1.42408e+05, 1.21861e+05, 8.93170e+04, 5.53630e+04, 2.98190e+04,
       1.32620e+04, 5.03400e+03, 1.67100e+03, 4.55000e+02, 9.80000e+01,
       1.60000e+01, 3.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00,
       0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00, 0.00000e+00])

the theoretical ones:

array([5.05494304e-22, 6.19230523e-19, 2.42738365e-16, 4.56348126e-14,
       4.93312324e-12, 3.40385504e-10, 1.60467452e-08, 5.42150748e-07,
       1.35989479e-05, 2.60193203e-04, 3.87924412e-03, 4.58456124e-02,
       4.35533318e-01, 3.36461453e+00, 2.13412693e+01, 1.12041664e+02,
       4.90182279e+02, 1.79733502e+03, 5.54966604e+03, 1.44875492e+04,
       3.20795733e+04, 6.04095861e+04, 9.69418655e+04, 1.32768207e+05,
       1.55338802e+05, 1.55338802e+05, 1.32768207e+05, 9.69418655e+04,
       6.04095861e+04, 3.20795733e+04, 1.44875492e+04, 5.54966604e+03,
       1.79733502e+03, 4.90182279e+02, 1.12041664e+02, 2.13412693e+01,
       3.36461453e+00, 4.35533318e-01, 4.58456124e-02, 3.87924412e-03,
       2.60193203e-04, 1.35989479e-05, 5.42150748e-07, 1.60467452e-08,
       3.40385504e-10, 4.93312324e-12, 4.56348126e-14, 2.42738365e-16,
       6.19230523e-19, 5.05494304e-22])
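
A minimal sketch of how such observed frequencies can be collected with rand_distr (not the exact script used above; the seed and output format are arbitrary):

```rust
use rand::SeedableRng;
use rand_distr::{Distribution, Hypergeometric};

fn main() {
    // Draw 1_000_000 samples from Hypergeometric::new(100, 50, 49)
    // and tally how often each value occurs.
    let dist = Hypergeometric::new(100, 50, 49).unwrap();
    let mut rng = rand::rngs::StdRng::seed_from_u64(0); // arbitrary seed
    let mut counts = [0u64; 50]; // support is 0..=49 for these parameters
    for _ in 0..1_000_000 {
        counts[dist.sample(&mut rng) as usize] += 1;
    }
    for (value, count) in counts.iter().enumerate() {
        println!("{value:2} {count}");
    }
}
```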
WarrenWeckesser commented 1 week ago

It looks like there is something wrong with the rejection-acceptance sampling method. For various input parameters, I generated 10000000 samples, and compared the frequencies of the output values to the theoretical frequencies. ("Compare" means I just printed the observed and expected frequencies, and looked for big discrepancies. I used a Python script for that, and used the pmf method of scipy.stats.hypergeom to calculate the expected frequencies.) I get incorrect results with inputs such as (65, 30, 28), (48, 25, 20), and (40, 20, 19), which all use the rejection-acceptance method. I haven't seen any unexpected discrepancies when the inverse-transform sampling method is used.
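For reference, the expected frequencies can also be computed without SciPy, directly from the hypergeometric pmf recurrence; the helper below is an illustrative sketch, not part of the crate:

```rust
// Expected counts for Hypergeometric(n1 + n2, n1, k), scaled to `samples`
// draws, using the pmf recurrence
//   pmf(x + 1) / pmf(x) = (n1 - x)(k - x) / ((x + 1)(n2 - k + x + 1)).
fn expected_counts(n1: u64, n2: u64, k: u64, samples: f64) -> Vec<f64> {
    let lo = k.saturating_sub(n2); // lowest possible value: max(0, k - n2)
    let hi = k.min(n1); // highest possible value
    let mut pmf = vec![0.0f64; (hi - lo + 1) as usize];
    pmf[0] = 1.0; // unnormalized; normalized below
    for (idx, x) in (lo..hi).enumerate() {
        let ratio = ((n1 - x) * (k - x)) as f64 / ((x + 1) * (n2 + x + 1 - k)) as f64;
        pmf[idx + 1] = pmf[idx] * ratio;
    }
    let total: f64 = pmf.iter().sum();
    pmf.iter().map(|p| p / total * samples).collect()
}

fn main() {
    // Expected frequencies for 10_000_000 draws with parameters (65, 30, 28);
    // the support starts at max(0, k - n2), which is 0 here.
    for (value, expected) in expected_counts(30, 35, 28, 1e7).iter().enumerate() {
        println!("{value:2} {expected:.6e}");
    }
}
```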

WarrenWeckesser commented 1 week ago

> The (100,50,49) case uses the inverse sampling code

Interesting. That's not what I observe.

With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. The code that determines the method is

https://github.com/rust-random/rand/blob/0fba9401c4a301d8edd99bacdb751a0600b55a38/rand_distr/src/hypergeometric.rs#L196-L198

With those inputs, n1 = n2 = 50, n = 100, k = 49, and the mode m = 25. The value that determines the sampling method is m - max(0, k - n2) = 25, which is greater than the threshold HIN_THRESHOLD = 10, so RejectionAcceptance is used.
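
A small sketch of that selection arithmetic (the mode formula is the standard floor((k + 1)(n1 + 1) / (n + 2)); HIN_THRESHOLD = 10 and the comparison are taken from the description above, so the exact form in the crate may differ slightly):

```rust
fn main() {
    // Parameters for Hypergeometric::new(100, 50, 49): n1 successes,
    // n2 failures, k draws.
    let (n1, n2, k) = (50u64, 50u64, 49u64);
    let n = n1 + n2;

    // Mode of the distribution: floor((k + 1)(n1 + 1) / (n + 2)) = 25.
    let m = ((k + 1) as f64 * (n1 + 1) as f64 / (n + 2) as f64).floor();

    // Lower end of the support: max(0, k - n2) = 0.
    let x_low = k.saturating_sub(n2) as f64;

    const HIN_THRESHOLD: f64 = 10.0; // as described above
    if m - x_low > HIN_THRESHOLD {
        println!("m - max(0, k - n2) = {} -> RejectionAcceptance", m - x_low);
    } else {
        println!("m - max(0, k - n2) = {} -> inverse transform (HIN)", m - x_low);
    }
}
```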

benjamin-lieser commented 1 week ago

> > The (100,50,49) case uses the inverse sampling code
>
> Interesting. That's not what I observe.
>
> With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. The code that determines the method is
>
> https://github.com/rust-random/rand/blob/0fba9401c4a301d8edd99bacdb751a0600b55a38/rand_distr/src/hypergeometric.rs#L196-L198
>
> With those inputs, n1 = n2 = 50, n = 100, k = 49, and the mode m = 25. The value that determines the sampling method is m - max(0, k - n2) = 25, which is greater than the threshold HIN_THRESHOLD = 10, so RejectionAcceptance is used.

I was mistaken; I had it in a debugger from the KS tests, but I think I forgot to comment out the other hyperparameter.

benjamin-lieser commented 1 week ago

I would wait a bit to see if someone with experience with the algorithm (maybe @teryror?) wants to investigate. Otherwise, I will try it myself.

WarrenWeckesser commented 1 week ago

It turns out the problem is a bug in the original algorithm. The R project ran into the same issue years ago: https://bugs.r-project.org/show_bug.cgi?id=7314

The fix is to change this line

https://github.com/rust-random/rand/blob/f5185d91fae77b1c75344c22036dbaceaeaaba1e/rand_distr/src/hypergeometric.rs#L362

to

f /= (n1 - i + 1) as f64 * (k - i + 1) as f64; 

There is a separate (but apparently not so significant) bug: in ln_of_factorial(v), Stirling's approximation is used instead of actually computing ln(v!) = ln(gamma(v + 1)) accurately. That approximation is not very good for small to moderate values of v. For example, with Hypergeometric::new(40, 20, 19), the function ln_of_factorial(v) is called with values ranging from 7 to 13. For the input 7, it returns 6.621371043387192, but the correct value of ln(7!) is 8.525161361065415. I haven't tried to figure out how this affects the correctness of the algorithm.
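
The reported value 6.621371043387192 for the input 7 matches the truncated form v*ln(v) - v (which drops the 0.5*ln(2*pi*v) term), so a comparison sketch assuming that form:

```rust
// Truncated Stirling form (assumed here, based on the value reported above).
fn truncated_stirling(v: f64) -> f64 {
    v * v.ln() - v
}

// Exact ln(v!) for integer v, as a sum of logs.
fn ln_factorial_exact(v: u64) -> f64 {
    (2..=v).map(|i| (i as f64).ln()).sum()
}

fn main() {
    // Range of inputs mentioned above for Hypergeometric::new(40, 20, 19).
    for v in 7..=13u64 {
        println!(
            "v = {:2}  truncated Stirling = {:.6}  exact ln(v!) = {:.6}",
            v,
            truncated_stirling(v as f64),
            ln_factorial_exact(v)
        );
    }
}
```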

benjamin-lieser commented 1 week ago

Really good catch :)

I actually tried to see if fixing the Stirling approximation helps, but it did not have any measurable effect on the KS statistic (though this was with the other bug still present). It would be good to know the minimal values it can be called with.