vks closed this issue 1 month ago
So it uses a maximum of two steps, a bit like Canon's method. Might be generally preferable to #531, but probably still has a significant cost overhead?
At any rate, it may be worth investigating (implementing and benchmarking at least), but not something I'm going to put on my to-do list.
Initial benchmarks for `f64` on the `OpenClosed01` distribution (test `distr_openclosed01_f64`):

- 1,089 ns/iter (+/- 7) = 7346 MB/s
- 1,528 ns/iter (+/- 17) = 5235 MB/s
- 1,312 ns/iter (+/- 15) = 6097 MB/s
- `u64` to `f64`: 991 ns/iter (+/- 10) = 8072 MB/s
EDIT: testing was done on a Zen 3 x86_64 processor, but I didn't pass `-C target-cpu=native`, so `rep bsf` was being used instead of `tzcnt`. Rerunning with `-C target-cpu=native` seemed to make all the microbenchmarks slower, even the existing implementation, which is odd:

- 1,217 ns/iter (+/- 25) = 6573 MB/s
- 1,617 ns/iter (+/- 100) = 4947 MB/s
- 1,458 ns/iter (+/- 20) = 5486 MB/s
- `u64` to `f64`: 963 ns/iter (+/- 10) = 8307 MB/s
Thanks. The overhead is not negligible, but it is small enough that this could be offered as an alternative to the current implementation behind a feature flag, if there is genuine interest in using it.
Background
Motivation: It's possible to get higher-quality floats without having to add a loop.
Application: I don't have a concrete application, but this approach can generate floats below 2^-53, and it does not generate 0 (which should have a probability of 2^-1075). It can also generate more distinct floats than our current approach.
Feature request
Implement another (0, 1] distribution.
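To illustrate the two-step idea (this is not the implementation benchmarked above, and the function name and bookkeeping are hypothetical): the leftover low bits of the first draw can extend the exponent geometrically via a trailing-zero count, with at most one extra draw when all leftover bits are zero, in the spirit of Canon's method:

```rust
/// Hypothetical sketch of a higher-precision (0, 1] sampler: the top 53
/// bits form the mantissa, and the trailing zeros of the 11 leftover bits
/// extend the exponent geometrically. Only when all 11 low bits are zero
/// is a second draw needed, so at most two steps are used.
fn high_precision_open_closed01(next: &mut impl FnMut() -> u64) -> f64 {
    let x = next();
    let mantissa = (x >> 11) as f64; // 53 random bits
    let low = x & 0x7FF; // 11 leftover bits
    let extra = if low != 0 {
        low.trailing_zeros()
    } else {
        // Second (and final) step: continue the geometric exponent draw.
        let y = next();
        11 + if y != 0 { y.trailing_zeros() } else { 64 }
    };
    // Result is in (0, 1]: the +1 excludes 0, and the largest mantissa
    // with extra == 0 yields exactly 1.0. The exponent never drops below
    // -128, so no subnormals are produced.
    (mantissa + 1.0) * 2f64.powi(-53 - extra as i32)
}

fn main() {
    let mut max_rng = || u64::MAX;
    assert_eq!(high_precision_open_closed01(&mut max_rng), 1.0);

    let mut zero_rng = || 0u64;
    let tiny = high_precision_open_closed01(&mut zero_rng);
    assert!(tiny > 0.0 && tiny < 2f64.powi(-53)); // below the usual granularity
}
```

This sketch shows the control flow only; the exact rounding and weighting needed for a strictly uniform distribution are omitted.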