Closed: benjamin-lieser closed this issue 4 days ago
It looks like there is something wrong with the rejection-acceptance sampling method. For various input parameters, I generated 10000000 samples and compared the frequencies of the output values to the theoretical frequencies. ("Compare" means I just printed the observed and expected frequencies and looked for big discrepancies. I used a Python script for that, and used the pmf method of scipy.stats.hypergeom to calculate the expected frequencies.) I get incorrect results with inputs such as (65, 30, 28), (48, 25, 20), and (40, 20, 19), which all use the rejection-acceptance method. I haven't seen any unexpected discrepancies when the inverse-transform sampling method is used.
The (100,50,49) case uses the inverse sampling code

Interesting. That's not what I observe. With the master branch, when I call Hypergeometric::new(100, 50, 49), it selects the method RejectionAcceptance. The code that determines the method is:

With those inputs, n1 = n2 = 50, n = 100, k = 49, and the mode m = 25. The value that determines the sampling method is m - max(0, k - n2) = 25, which is greater than the threshold HIN_THRESHOLD = 10, so RejectionAcceptance is used.
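To make the arithmetic concrete, here is a small sketch (not the crate's actual code) that reproduces the numbers above, assuming the usual mode formula floor((n1 + 1)(k + 1) / (n + 2)) and the threshold value quoted in this thread:

```rust
fn main() {
    let (n1, n2, k): (u64, u64, u64) = (50, 50, 49);
    let n = n1 + n2; // 100

    // Mode of the hypergeometric distribution.
    let m = ((n1 + 1) as f64 * (k + 1) as f64 / (n + 2) as f64).floor(); // 25.0

    // Quantity compared against the threshold to pick the sampling method.
    const HIN_THRESHOLD: f64 = 10.0;
    let value = m - f64::max(0.0, k as f64 - n2 as f64); // 25.0 - 0.0 = 25.0

    assert!(value > HIN_THRESHOLD);
    println!(
        "m = {}, m - max(0, k - n2) = {} > {} => RejectionAcceptance",
        m, value, HIN_THRESHOLD
    );
}
```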
I was mistaken. I had it in a debugger from the KS tests, but I think I forgot to comment out the other hyperparameter.
I would wait a bit to see if someone with experience with the algorithm (maybe @teryror?) wants to investigate. Otherwise I will try it myself.
It turns out the problem is a bug in the original algorithm. R discovered this years ago: https://bugs.r-project.org/show_bug.cgi?id=7314
The fix is to change this line to:
f /= (n1 - i + 1) as f64 * (k - i + 1) as f64;
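For context, my reading of why the + 1 terms matter (based on the hypergeometric pmf recurrence, not on the crate's code): the pmf satisfies

$$\frac{p(i)}{p(i-1)} = \frac{(n_1 - i + 1)(k - i + 1)}{i\,(n_2 - k + i)}$$

so when the algorithm walks the pmf downward from the mode and divides f by the step ratio at index i, the numerator factors have to be n1 - i + 1 and k - i + 1. Without the + 1 terms, each downward step divides by the wrong ratio and f no longer tracks the pmf.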
There is a separate (but apparently not so significant) bug: in ln_of_factorial(v), Stirling's approximation is used instead of actually computing ln(v!) = ln(gamma(v + 1)) accurately. That approximation is not very good for small to moderate values of v. For example, with Hypergeometric::new(40, 20, 19), the function ln_of_factorial(v) is called with values ranging from 7 to 13. For the input 7, it returns 6.621371043387192, but the correct value of ln(7!) is 8.525161361065415. I haven't tried to figure out how this affects the correctness of the algorithm.
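To illustrate the size of the error, a quick check for v = 7 (my own snippet; the crate's exact approximation formula may differ slightly from the basic Stirling form used here):

```rust
fn main() {
    // Exact ln(7!) as a sum of logarithms, for comparison with the value
    // 6.621371043387192 reported above for ln_of_factorial(7).
    let exact: f64 = (1..=7u32).map(|i| f64::from(i).ln()).sum();
    println!("ln(7!) = {exact}"); // 8.525161361065415

    // The basic Stirling form v*ln(v) - v gives a similarly large underestimate.
    let v = 7.0f64;
    println!("v*ln(v) - v = {}", v * v.ln() - v); // ~6.62
}
```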
Really good catch :)
I actually tried to see if fixing the Stirling approximation helps, but it did not have any measurable effect on the KS statistic (though this was with the other bug still present). It would be good to know the minimal values it can be called with.
Hypergeometric::new(100, 50, 49) produces samples which are very likely not from this distribution. The distribution is not very extreme, so I would expect it to sample correctly.

One piece of evidence is the failed KS test (see https://github.com/rust-random/rand/pull/1504).
I also did a chi-squared test, which gives a p-value of 0.0 for a million samples:
The frequencies I sampled:

The theoretical ones:
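For completeness, a hypothetical sketch of how observed counts and theoretical probabilities can be turned into a chi-squared p-value (my own code, not the script used here; it assumes the statrs crate for the chi-squared CDF):

```rust
use statrs::distribution::{ChiSquared, ContinuousCDF};

/// `observed`: sampled counts per outcome; `expected`: theoretical counts
/// (pmf times the number of samples). Returns the upper-tail p-value.
fn chi_squared_p_value(observed: &[u64], expected: &[f64]) -> f64 {
    let mut stat = 0.0;
    let mut bins = 0usize;
    for (&o, &e) in observed.iter().zip(expected) {
        // Skip bins with negligible expected counts (rule of thumb: at least 5).
        if e >= 5.0 {
            stat += (o as f64 - e).powi(2) / e;
            bins += 1;
        }
    }
    // Degrees of freedom: number of bins used minus one.
    let chi2 = ChiSquared::new((bins - 1) as f64).expect("valid degrees of freedom");
    1.0 - chi2.cdf(stat)
}
```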