rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand

Make sure BTPE is not entered when np < 10 #1484

Closed benjamin-lieser closed 1 month ago

benjamin-lieser commented 3 months ago

Summary

This is an attempt to solve the issue in #1378.

Details

The old implementation would enter BTPE even for small np when n > i32::MAX, and would then panic because the algorithm needs np to be large enough. I have fixed this by casting n to f64 and replacing powi with powf.

The implementation is still not perfect: for example, with n = 1 << 61 and p = 1e-18 we have np ≈ 2.3, but it only ever samples 0. This is because q = 1 - p is exactly 1.0 in f64. numpy has the same behavior (there already at p = 1e-17, because they use a higher BINV threshold).

I guess we either have to catch this case (and then maybe do a Poisson approximation, which should be very precise in this regime) or do more of the calculations in log-space, but that would be slower.
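For reference, a minimal self-contained reproduction of that failure mode (this just mirrors the description above; it is not the actual rand_distr code):

```rust
fn main() {
    let n: u64 = 1 << 61;
    let p: f64 = 1e-18;
    let q = 1.0 - p; // rounds to exactly 1.0, since p is below f64's epsilon / 2
    let r = q.powf(n as f64); // therefore also exactly 1.0
    // BINV starts from r = q^n and returns 0 whenever the uniform draw u
    // satisfies u <= r, so with r == 1.0 every sample is 0.
    println!("np = {:.3}, q = {}, q^n = {}", n as f64 * p, q, r);
}
```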

benjamin-lieser commented 3 months ago

The second commit adds a Poisson path for the case where BINV is numerically unstable.
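A rough sketch of that fallback idea (not the code from the commit; it assumes rand 0.9's `Rng::random`): when q = 1 - p rounds to exactly 1.0 we are deep in the BINV regime (np < 10), so the simple multiplicative Poisson method is sufficient.

```rust
use rand::Rng;

/// Sketch only: sample Poisson(lambda = n * p) with Knuth's multiplicative
/// method, as a stand-in for BINV when q = 1 - p rounds to exactly 1.0.
fn poisson_fallback<R: Rng + ?Sized>(rng: &mut R, n: u64, p: f64) -> u64 {
    let limit = (-(n as f64 * p)).exp(); // e^{-lambda}
    let mut prod: f64 = rng.random();
    let mut k = 0;
    while prod > limit {
        prod *= rng.random::<f64>();
        k += 1;
    }
    k
}
```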

dhardy commented 3 months ago

Removing the q == 1 special case (Poisson sampling) does not improve benchmarks either. None of the benchmarks used hit the removed if condition, so the cause is likely the switch to powf.

This approach may still be the best option.


Further benchmarks:

Original (`master` branch):
binomial/binomial       time:   [113597.2571 cycles 113851.0806 cycles 114134.4589 cycles]
                        thrpt:  [14.2668 cpb 14.2314 cpb 14.1997 cpb]
binomial/binomial_small time:   [45699.4943 cycles 45875.8845 cycles 46060.8705 cycles]
                        thrpt:  [5.7576 cpb 5.7345 cpb 5.7124 cpb]
binomial/binomial_1     time:   [25123.5419 cycles 25138.9251 cycles 25155.1745 cycles]
                        thrpt:  [3.1444 cpb 3.1424 cpb 3.1404 cpb]
binomial/binomial_10    time:   [67701.8112 cycles 69642.0678 cycles 71619.5343 cycles]
                        thrpt:  [8.9524 cpb 8.7053 cpb 8.4627 cpb]
binomial/binomial_100   time:   [71981.0843 cycles 72269.4885 cycles 72659.3663 cycles]
                        thrpt:  [9.0824 cpb 9.0337 cpb 8.9976 cpb]
binomial/binomial_1000  time:   [312402.0450 cycles 313115.0284 cycles 313906.5481 cycles]
                        thrpt:  [39.2383 cpb 39.1394 cpb 39.0503 cpb]
binomial/binomial_1e12  time:   [145857.1377 cycles 145947.6446 cycles 146049.8795 cycles]
                        thrpt:  [18.2562 cpb 18.2435 cpb 18.2321 cpb]

#1484:
binomial/binomial       time:   [140103.0524 cycles 140148.2440 cycles 140209.3686 cycles]
                        thrpt:  [17.5262 cpb 17.5185 cpb 17.5129 cpb]
binomial/binomial_small time:   [101471.0977 cycles 101671.4520 cycles 101873.9594 cycles]
                        thrpt:  [12.7342 cpb 12.7089 cpb 12.6839 cpb]
binomial/binomial_1     time:   [66256.7652 cycles 66283.7976 cycles 66311.9841 cycles]
                        thrpt:  [8.2890 cpb 8.2855 cpb 8.2821 cpb]
binomial/binomial_10    time:   [112029.5595 cycles 112050.2365 cycles 112072.3636 cycles]
                        thrpt:  [14.0090 cpb 14.0063 cpb 14.0037 cpb]
binomial/binomial_100   time:   [113882.4511 cycles 114051.8796 cycles 114241.4027 cycles]
                        thrpt:  [14.2802 cpb 14.2565 cpb 14.2353 cpb]
binomial/binomial_1000  time:   [311388.9158 cycles 311770.8383 cycles 312189.4034 cycles]
                        thrpt:  [39.0237 cpb 38.9714 cpb 38.9236 cpb]
binomial/binomial_1e12  time:   [149142.5352 cycles 149243.5861 cycles 149350.1778 cycles]
                        thrpt:  [18.6688 cpb 18.6554 cpb 18.6428 cpb]

#1484 + #[cold]:
binomial/binomial       time:   [143735.5673 cycles 143788.3539 cycles 143848.9685 cycles]
                        thrpt:  [17.9811 cpb 17.9735 cpb 17.9669 cpb]
binomial/binomial_small time:   [112645.4022 cycles 112703.7493 cycles 112762.5052 cycles]
                        thrpt:  [14.0953 cpb 14.0880 cpb 14.0807 cpb]
binomial/binomial_1     time:   [64115.4011 cycles 64132.3238 cycles 64148.6840 cycles]
                        thrpt:  [8.0186 cpb 8.0165 cpb 8.0144 cpb]
binomial/binomial_10    time:   [111157.3925 cycles 111191.5991 cycles 111224.7641 cycles]
                        thrpt:  [13.9031 cpb 13.8989 cpb 13.8947 cpb]
binomial/binomial_100   time:   [111793.6174 cycles 111810.0698 cycles 111827.5709 cycles]
                        thrpt:  [13.9784 cpb 13.9763 cpb 13.9742 cpb]
binomial/binomial_1000  time:   [311506.0892 cycles 311679.2985 cycles 311857.8579 cycles]
                        thrpt:  [38.9822 cpb 38.9599 cpb 38.9383 cpb]
binomial/binomial_1e12  time:   [146403.7167 cycles 146454.9550 cycles 146508.4058 cycles]
                        thrpt:  [18.3136 cpb 18.3069 cpb 18.3005 cpb]

#1484 without Poisson sampling:
binomial/binomial       time:   [140592.0514 cycles 140647.8662 cycles 140707.3793 cycles]
                        thrpt:  [17.5884 cpb 17.5810 cpb 17.5740 cpb]
binomial/binomial_small time:   [41225.0324 cycles 41274.1435 cycles 41332.3819 cycles]
                        thrpt:  [5.1665 cpb 5.1593 cpb 5.1531 cpb]
binomial/binomial_1     time:   [63006.7000 cycles 63053.2162 cycles 63103.6871 cycles]
                        thrpt:  [7.8880 cpb 7.8817 cpb 7.8758 cpb]
binomial/binomial_10    time:   [108515.6262 cycles 108708.6526 cycles 108911.2978 cycles]
                        thrpt:  [13.6139 cpb 13.5886 cpb 13.5645 cpb]
binomial/binomial_100   time:   [112087.5229 cycles 112175.7373 cycles 112268.9435 cycles]
                        thrpt:  [14.0336 cpb 14.0220 cpb 14.0109 cpb]
binomial/binomial_1000  time:   [304940.7548 cycles 305098.0557 cycles 305270.9483 cycles]
                        thrpt:  [38.1589 cpb 38.1373 cpb 38.1176 cpb]
binomial/binomial_1e12  time:   [143742.0920 cycles 143998.6577 cycles 144359.3408 cycles]
                        thrpt:  [18.0449 cpb 17.9998 cpb 17.9678 cpb]

#1486:
binomial/binomial       time:   [192845.2040 cycles 192903.1009 cycles 192965.7219 cycles]
                        thrpt:  [24.1207 cpb 24.1129 cpb 24.1057 cpb]
binomial/binomial_small time:   [36715.1786 cycles 36719.2505 cycles 36723.0147 cycles]
                        thrpt:  [4.5904 cpb 4.5899 cpb 4.5894 cpb]
binomial/binomial_1     time:   [59773.3497 cycles 59805.5619 cycles 59844.2904 cycles]
                        thrpt:  [7.4805 cpb 7.4757 cpb 7.4717 cpb]
binomial/binomial_10    time:   [108832.5020 cycles 108875.0936 cycles 108914.4595 cycles]
                        thrpt:  [13.6143 cpb 13.6094 cpb 13.6041 cpb]
binomial/binomial_100   time:   [108932.2965 cycles 108968.2073 cycles 109008.4956 cycles]
                        thrpt:  [13.6261 cpb 13.6210 cpb 13.6165 cpb]
binomial/binomial_1000  time:   [308951.8288 cycles 309038.7736 cycles 309122.8502 cycles]
                        thrpt:  [38.6404 cpb 38.6298 cpb 38.6190 cpb]
binomial/binomial_1e12  time:   [145420.6104 cycles 145478.5829 cycles 145548.0519 cycles]
                        thrpt:  [18.1935 cpb 18.1848 cpb 18.1776 cpb]

The binomial_small case (n=1e6, p=1e-30) is the worst affected (about 2.6x the cost). Is it worth the complexity of an additional code path for when n is small enough to use powi? Probably not, but some of you likely have more experience with this sampler than I do.

dhardy commented 3 months ago

This should merge master and revert the relevant part of #1480. (Or are there unsolved cases in #1378?)

jamaltas commented 3 months ago

I wonder if a combination of the commits is best. I suspect (but am not certain) that the changes I proposed in #1486 may help some of the benchmarks above, and the changes to `r` are equally applicable to the changes proposed and benchmarked here. powf is known to be slow relative to powi, which is likely the reason for the benchmark differences you've shown.

dhardy commented 3 months ago

@jamaltas I benchmarked #1486 above.

jamaltas commented 3 months ago

@dhardy Yep sorry, missed that a few hours ago and forgot to edit after I realized. I've updated #1486 with some changes and benchmarks that should resolve the issues!

benjamin-lieser commented 3 months ago

We could also push the powf calculation into the constructor of Binomial. If only a handful of samples are drawn, this should surpass even the performance of the old implementation.
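A hedged sketch of that idea (field and type names are illustrative, not the actual rand_distr layout): the constructor pays for the single powf, and each BINV sample then starts from the stored value.

```rust
// Illustrative only: precompute the BINV constants, including the expensive
// q^n powf, when the distribution is constructed rather than on every sample.
struct BinvState {
    s: f64,  // p / q
    a: f64,  // (n + 1) * s
    r0: f64, // q^n; each sample's search starts from this value
}

impl BinvState {
    fn new(n: u64, p: f64) -> Self {
        let q = 1.0 - p;
        let s = p / q;
        BinvState {
            s,
            a: (n as f64 + 1.0) * s,
            r0: q.powf(n as f64),
        }
    }
}
```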

benjamin-lieser commented 3 months ago

It should now be faster than the original in the benchmarks, at the cost of a bigger Binomial struct and of the use case where we only draw one sample.

dhardy commented 2 months ago

New benchmarks:

#1484 (Binomial is 64 bytes!)
binomial/binomial       time:   [86172.4139 cycles 86283.1244 cycles 86396.2794 cycles]
                        thrpt:  [10.7995 cpb 10.7854 cpb 10.7716 cpb]
binomial/binomial_small time:   [15255.7991 cycles 15364.9577 cycles 15528.2203 cycles]
                        thrpt:  [1.9410 cpb 1.9206 cpb 1.9070 cpb]
binomial/binomial_1     time:   [15642.2938 cycles 15658.0865 cycles 15673.4381 cycles]
                        thrpt:  [1.9592 cpb 1.9573 cpb 1.9553 cpb]
binomial/binomial_10    time:   [44580.3549 cycles 44641.6310 cycles 44702.1187 cycles]
                        thrpt:  [5.5878 cpb 5.5802 cpb 5.5725 cpb]
binomial/binomial_100   time:   [46835.2626 cycles 46886.2751 cycles 46945.4188 cycles]
                        thrpt:  [5.8682 cpb 5.8608 cpb 5.8544 cpb]
binomial/binomial_1000  time:   [303408.6133 cycles 303517.4591 cycles 303627.5685 cycles]
                        thrpt:  [37.9534 cpb 37.9397 cpb 37.9261 cpb]
binomial/binomial_1e12  time:   [142787.7876 cycles 142833.4652 cycles 142884.0874 cycles]
                        thrpt:  [17.8605 cpb 17.8542 cpb 17.8485 cpb]

#1486 (Binomial remains 16 bytes)
binomial/binomial       time:   [131996.3806 cycles 132075.6648 cycles 132147.5931 cycles]
                        thrpt:  [16.5184 cpb 16.5095 cpb 16.4995 cpb]
binomial/binomial_small time:   [45887.2370 cycles 46044.1826 cycles 46209.5481 cycles]
                        thrpt:  [5.7762 cpb 5.7555 cpb 5.7359 cpb]
binomial/binomial_1     time:   [18069.7436 cycles 18084.6158 cycles 18099.9501 cycles]
                        thrpt:  [2.2625 cpb 2.2606 cpb 2.2587 cpb]
binomial/binomial_10    time:   [104317.2648 cycles 104374.2808 cycles 104435.8105 cycles]
                        thrpt:  [13.0545 cpb 13.0468 cpb 13.0397 cpb]
binomial/binomial_100   time:   [104437.1813 cycles 104504.4548 cycles 104570.4002 cycles]
                        thrpt:  [13.0713 cpb 13.0631 cpb 13.0546 cpb]
binomial/binomial_1000  time:   [311378.3359 cycles 311572.0775 cycles 311780.1669 cycles]
                        thrpt:  [38.9725 cpb 38.9465 cpb 38.9223 cpb]
binomial/binomial_1e12  time:   [144903.0739 cycles 144946.6092 cycles 144994.7044 cycles]
                        thrpt:  [18.1243 cpb 18.1183 cpb 18.1129 cpb]

Our implementation of Poisson includes two paths. Only the first should be relevant here, since we know np is small. A direct implementation of that should save 16 bytes at the cost of less code re-use, but that could also be resolved via an `enum Inner { .. }` pattern for Poisson with pub(crate) parts.
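A hedged sketch of that `enum Inner` idea (the variant and field names are hypothetical, not the actual rand_distr fields):

```rust
// Illustrative only: Poisson keeps its public API, while the pub(crate) enum
// lets Binomial embed just the small-lambda machinery it needs.
pub struct Poisson {
    inner: Inner,
}

pub(crate) enum Inner {
    /// Direct multiplicative method for small lambda; the only variant the
    /// Binomial fallback needs, since np < 10 there.
    Direct { exp_lambda: f64 },
    /// Rejection method for large lambda, with more precomputed state.
    Rejection {
        lambda: f64,
        log_lambda: f64,
        sqrt_2lambda: f64,
        magic_val: f64,
    },
}
```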

benjamin-lieser commented 2 months ago

Do you think the general approach of putting precomputation in the structs is a good one? If yes, I would invest a bit of time into that to get a good size/performance tradeoff.

dhardy commented 2 months ago

> Do you think the general approach of putting precomputation in the structs is a good one? If yes, I would invest a bit of time into that to get a good size/performance tradeoff.

As you say, it's all about tradeoffs (code size and complexity vs working state size vs CPU time). Increasing the state size of Binomial to 64 bytes is utterly irrelevant to some usages, but could possibly be a big deal in some others. It's hard to optimise for the general case.

The code as-is is probably fine, but if there are easy ways to drop the state size we should consider them.

Further, could you merge or rebase on master, then amend/remove the doc added in #1480, please?

benjamin-lieser commented 2 months ago

Thanks @dhardy for the work on the Poisson. The struct is now a bit smaller. I also changed some minor things like #1445, some names and comments. There is room for more speed by also precomputing for BTPE, but this would increase the size again, and I am not sure it makes as big a difference for BTPE as it does for BINV.

The struct has quite some padding: 7 bytes from the flipped bool and 7 bytes from the unnecessarily large tag on the enum. I am not sure how to make this better without making the code quite a bit more complex (putting flipped inside of the enum variants).
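A hypothetical illustration of the padding issue and of the (more complex) fix of moving flipped into the variants; the fields are made up, only the layout effect matters:

```rust
use std::mem::size_of;

// Made-up payloads, just to show where the padding goes on a 64-bit target.
enum MethodA {
    Binv { r0: f64, s: f64, a: f64 },
    Btpe { m: f64 },
}

struct WithSeparateBool {
    flipped: bool,   // 1 byte + 7 bytes padding next to the 8-byte-aligned enum
    method: MethodA, // 1-byte tag also padded to 8 bytes before the f64 payload
}

// The alternative: flipped lives inside each variant, where it can share the
// space already wasted next to the tag.
enum MethodB {
    Binv { flipped: bool, r0: f64, s: f64, a: f64 },
    Btpe { flipped: bool, m: f64 },
}

fn main() {
    println!("{}", size_of::<MethodA>());          // typically 32
    println!("{}", size_of::<WithSeparateBool>()); // typically 40
    println!("{}", size_of::<MethodB>());          // typically 32 again
}
```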

dhardy commented 2 months ago

Thanks @benjamin-lieser. I still need to do a more detailed review, but broadly this looks good.

There's a thing known as "niche optimization" and some interest in applying it more broadly in Rust, but I'm not sure that got anywhere. Anyway, I suppose that 48 bytes will do! (If people want, we could try to make a size-optimized variant, but that applies to quite a few distributions and mostly there doesn't seem to be enough interest in taking that route.)
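For context, the niche optimization rustc already performs relies on forbidden values in the payload type; there is no stable way yet to declare such a niche for an arbitrary field like an f64, which is roughly what it would take to shrink the tag here.

```rust
use std::mem::size_of;
use std::num::NonZeroU64;

fn main() {
    // The forbidden value 0 of NonZeroU64 encodes None, so Option needs no
    // separate tag byte; both types are 8 bytes.
    assert_eq!(size_of::<Option<NonZeroU64>>(), size_of::<u64>());
}
```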

Benchmarks now (where master is actually #1493, and results are 1/1000 of above due to removal of an inner loop):

$ cargo bench -- binomial --baseline master
   Compiling rand_distr v0.5.0-alpha.1 (/home/dhardy/projects/rand/rand/rand_distr)
   Compiling benches v0.1.0 (/home/dhardy/projects/rand/rand/benches)
    Finished `bench` profile [optimized] target(s) in 5.06s
     Running benches/array.rs (target/release/deps/array-2d32515a24330317)
     Running benches/bool.rs (target/release/deps/bool-3dcf572dd75fcd39)
     Running benches/distr.rs (target/release/deps/distr-20f4307022d697eb)
binomial/small          time:   [13.9097 cycles 13.9228 cycles 13.9356 cycles]
                        change: [-69.708% -69.492% -69.334%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe
binomial/1              time:   [15.6490 cycles 15.6711 cycles 15.7013 cycles]
                        change: [-37.425% -37.368% -37.305%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
binomial/10             time:   [41.9191 cycles 41.9800 cycles 42.0418 cycles]
                        change: [-34.211% -34.087% -33.964%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
binomial/100            time:   [43.7516 cycles 43.7905 cycles 43.8298 cycles]
                        change: [-37.930% -37.879% -37.824%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
binomial/1000           time:   [294.6777 cycles 294.7826 cycles 294.8830 cycles]
                        change: [-0.0344% +0.0362% +0.1176%] (p = 0.35 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  6 (6.00%) low mild
  6 (6.00%) high mild
  2 (2.00%) high severe
binomial/1e12           time:   [132.6029 cycles 132.7085 cycles 132.8178 cycles]
                        change: [-2.2883% -1.7838% -1.4818%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

Removing the #[inline] on fn sample marginally improves results, and since optimization is usually the only reason to want inlining anyway, I recommend we remove it.

dhardy commented 1 month ago

Thanks!