jamaltas closed this 1 month ago
This appears to be an alternative to #1484.
Okay, I had time to sit down tonight, and I believe I've solved the issues both here and in #1484.
For the binomial_1 case (n = 1, p = 0.9), the calculation can be dramatically simplified, and I've added this code path. I imagine this is a common use case, and it is a simple addition.
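A minimal sketch of that fast path (illustrative only; the actual code in the PR may differ): a single uniform draw replaces the whole inversion loop.

```rust
use rand::Rng;

// Sketch of the n == 1 fast path: one Bernoulli(p) trial.
// `sample_n1` is an illustrative name, not the PR's.
fn sample_n1<R: Rng + ?Sized>(rng: &mut R, p: f64) -> u64 {
    u64::from(rng.random::<f64>() < p)
}
```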
For the binomial_small case, the powi() algorithm is significantly faster when n*p is below roughly 10e-10, so I've added this code path too, although it remains constrained by i32::MAX. I can see the argument both for and against carving out this path; we do see fairly significant performance gains, so I think it doesn't hurt, but I won't argue much either way.
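As a sketch of what that retained powi() path looks like (illustrative names and structure, assuming 0 < p < 1; not the PR's exact diff):

```rust
use rand::Rng;

// Hedged sketch of the powi()-based inversion kept for the small-n*p path.
// f64::powi takes an i32 exponent, which is why this path keeps the
// n <= i32::MAX constraint.
fn sample_binv_powi<R: Rng + ?Sized>(rng: &mut R, n: i32, p: f64) -> u64 {
    let q = 1.0 - p;
    let s = p / q;
    let mut r = q.powi(n); // P(X = 0) = q^n
    let mut u: f64 = rng.random();
    for x in 0..n as u64 {
        if u <= r {
            return x;
        }
        u -= r;
        // Recurrence: P(X = x + 1) = P(X = x) * (n - x)/(x + 1) * p/q
        r *= s * ((n as u64 - x) as f64 / (x + 1) as f64);
    }
    n as u64 // all earlier masses exhausted, so the answer is n
}
```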
For the remaining np < BINV_THRESHOLD values, the code uses my updated algorithm for r. This is both more efficient in benchmarks and lets us drop the i32::MAX constraint.
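The core of the change for this path is how the starting mass r is computed. A minimal sketch (function name is illustrative, not the PR's):

```rust
// Starting mass r = P(X = 0) = (1 - p)^n computed via exp(n * ln(q)), in
// the spirit of numpy: n stays a u64 and no i32 cast or i32::MAX check is
// needed, unlike f64::powi.
fn initial_mass(n: u64, p: f64) -> f64 {
    let q = 1.0 - p;
    (n as f64 * q.ln()).exp()
}
```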
Benchmarks show significant performance gains across the board, with no change in two cases: binomial_small, which continues to use the powi() algorithm, and binomial_1e12, which benchmarks the BTPE algorithm that I did not change:
Original (master branch):
binomial/binomial
time: [186779.4481 cycles 187011.4868 cycles 187239.1338 cycles]
thrpt: [23.4049 cpb 23.3764 cpb 23.3474 cpb]
binomial/binomial_small
time: [28230.1376 cycles 28379.1098 cycles 28556.1121 cycles]
thrpt: [3.5695 cpb 3.5474 cpb 3.5288 cpb]
binomial/binomial_1
time: [32013.4558 cycles 32054.6770 cycles 32098.4895 cycles]
thrpt: [4.0123 cpb 4.0068 cpb 4.0017 cpb]
binomial/binomial_10
time: [146572.2548 cycles 146784.9231 cycles 147007.3199 cycles]
thrpt: [18.3759 cpb 18.3481 cpb 18.3215 cpb]
binomial/binomial_100
time: [147255.9131 cycles 147529.2020 cycles 147813.5971 cycles]
thrpt: [18.4767 cpb 18.4412 cpb 18.4070 cpb]
binomial/binomial_1000
time: [348917.4704 cycles 399027.3114 cycles 448866.4078 cycles]
thrpt: [56.1083 cpb 49.8784 cpb 43.6147 cpb]
binomial/binomial_1e12
time: [133848.6357 cycles 136388.7594 cycles 139883.7218 cycles]
thrpt: [17.4855 cpb 17.0486 cpb 16.7311 cpb]
This commit (#1486), newly optimized:
binomial/binomial
time: [143210.8756 cycles 143539.7949 cycles 143940.9181 cycles]
thrpt: [17.9926 cpb 17.9425 cpb 17.9014 cpb]
Performance has improved by ~23%.
binomial/binomial_small
time: [26698.7778 cycles 26718.3567 cycles 26741.9949 cycles]
thrpt: [3.3427 cpb 3.3398 cpb 3.3373 cpb]
Performance is unchanged.
binomial/binomial_1
time: [22258.1045 cycles 22271.8067 cycles 22288.2595 cycles]
thrpt: [2.7860 cpb 2.7840 cpb 2.7823 cpb]
Performance has improved by ~30%.
binomial/binomial_10
time: [107829.8224 cycles 107948.2544 cycles 108114.5181 cycles]
thrpt: [13.5143 cpb 13.4935 cpb 13.4787 cpb]
Performance has improved by ~26%.
binomial/binomial_100
time: [113915.2959 cycles 113955.7108 cycles 113998.6718 cycles]
thrpt: [14.2498 cpb 14.2445 cpb 14.2394 cpb]
Performance has improved by ~22%.
binomial/binomial_1000
time: [280200.3562 cycles 280355.2869 cycles 280542.6277 cycles]
thrpt: [35.0678 cpb 35.0444 cpb 35.0250 cpb]
Performance has improved by ~20%.
binomial/binomial_1e12
time: [137204.5220 cycles 137475.2389 cycles 137807.7701 cycles]
thrpt: [17.2260 cpb 17.1844 cpb 17.1506 cpb]
Performance is unchanged.
The n == 1 case is trivial; no binv needed:
let u: f64 = rng.random();
if u < p { 1 } else { 0 }
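For what it's worth, rand already exposes this single-trial draw directly, assuming the current Rng API: rng.random_bool(p), or equivalently the Bernoulli distribution; either expresses the same thing.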
I also think we can unify the code paths.
Also, I'm not sure if this catches the q == 1.0 problem.
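If the q == 1.0 concern is about 1.0 - p rounding to exactly 1.0 for very small p, one possible mitigation (a sketch under that reading, not something this PR claims to implement) is to take the logarithm via ln_1p:

```rust
// Possible guard against q = 1.0 - p rounding to exactly 1.0 for tiny p:
// ln_1p(-p) computes ln(1 - p) without forming 1 - p, preserving precision.
fn initial_mass_stable(n: u64, p: f64) -> f64 {
    (n as f64 * (-p).ln_1p()).exp()
}
```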
CHANGELOG.md entry
Summary
Slightly refactor the algorithm to match the binomial inversion (BINV) algorithm used by numpy, in order to remove the n <= i32::MAX issues.
Motivation
In my use case (large n, small p; seen frequently in computational biology, with large populations of cells and some small mutation rate p), I encountered frequent crashes.
Details
I found numpy's C implementation of the BINV algorithm and noted it has no i32::MAX constraint on entry. The issue with the current implementation stems from powi() taking an i32 exponent. I modified the current algorithm to match the implementation numpy uses, which relaxes this constraint. I tested the new algorithm against a handful of (n, p) values for large n and p, and it matches what I get in Python. I don't think this introduces any new panic/crash conditions, but I may have missed something.
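For concreteness, here is a self-contained sketch of what a numpy-style BINV sampler of this shape can look like. The function name, the restart bound, and its constants are my assumptions from reading numpy's C code, not the exact diff in this PR:

```rust
use rand::Rng;

// Hedged sketch of a numpy-style BINV sampler with no i32::MAX bound on n.
// Assumes 0.0 < p < 1.0 and that n*p is below the BINV threshold.
fn sample_binomial_inversion<R: Rng + ?Sized>(rng: &mut R, n: u64, p: f64) -> u64 {
    let q = 1.0 - p;
    let s = p / q;
    let np = n as f64 * p;
    // q^n via exp(n * ln(q)): no powi, so no i32::MAX constraint on n.
    let r0 = (n as f64 * q.ln()).exp();
    // Cap how far the search may walk before restarting with a fresh u;
    // roughly np plus ten standard deviations, in the spirit of numpy's bound.
    let bound = (n as f64).min(np + 10.0 * (np * q + 1.0).sqrt()) as u64;
    loop {
        let mut u: f64 = rng.random();
        let mut r = r0;
        let mut x = 0u64;
        while u > r {
            x += 1;
            if x > bound {
                break; // numerical trouble: restart the whole draw
            }
            u -= r; // subtract P(X = x - 1) before advancing the mass
            // Recurrence: P(X = x) = P(X = x - 1) * (n - x + 1)/x * p/q
            r *= s * ((n - x + 1) as f64 / x as f64);
        }
        if x <= bound {
            return x;
        }
    }
}
```

The restart-on-overrun structure bounds the walk in the rare case where accumulated floating-point error leaves u larger than the remaining probability mass.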