vks opened 5 years ago
Impressive performance taking advantage of how "wide" modern CPUs are — but whether this is useful in practice is another question. For ciphers it does sometimes make sense to focus on "byte fill" performance, but is this useful for general-purpose random number generation, or some specific application?
I think in practice it replaces dSFMT, which used to be faster. All applications requiring several random numbers at a time benefit. For general-purpose number generation, using this with `rand_core::block::BlockRng64` is probably faster than the non-SIMD version as well.
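The block pattern referred to here can be sketched in plain Rust. The real `rand_core::block::BlockRng64` wraps a `BlockRngCore` implementation from the `rand_core` crate; the minimal stand-in below just mimics the idea without the crate: the core fills a whole buffer of words at once (which is where SIMD pays off) and the wrapper hands them out one at a time. Names are hypothetical, and splitmix64 stands in for the vectorized generator.

```rust
// Hypothetical sketch of the BlockRng64 pattern (not the rand_core API):
// refill a block of outputs in one go, then serve them individually.
struct BlockRng {
    buf: [u64; 8], // one block of pre-generated outputs
    index: usize,  // next unserved entry; buf.len() means "empty"
    state: u64,    // splitmix64 state stands in for a SIMD core here
}

impl BlockRng {
    fn new(seed: u64) -> Self {
        BlockRng { buf: [0; 8], index: 8, state: seed }
    }

    /// Regenerate the whole block at once. A SIMD core would fill
    /// all eight words in parallel; splitmix64 fills them serially.
    fn refill(&mut self) {
        for w in self.buf.iter_mut() {
            self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = self.state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            *w = z ^ (z >> 31);
        }
        self.index = 0;
    }

    /// Serve one value, refilling the block when it runs out.
    fn next_u64(&mut self) -> u64 {
        if self.index == self.buf.len() {
            self.refill();
        }
        let v = self.buf[self.index];
        self.index += 1;
        v
    }
}

fn main() {
    let mut rng = BlockRng::new(42);
    println!("{} {}", rng.next_u64(), rng.next_u64());
}
```

The trade-off is latency versus throughput: single calls pay the occasional refill cost, but the refill loop itself is the part a vectorized generator can make fast.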
Personally, I think `ChaCha8` is a better choice. The performance is similar, and the cryptographic strength makes it usable for more general purposes.
It's actually slower by a factor of two.
I imagine it's going to depend significantly on the actual application. You're using micro-benchmarks here? I actually think that for most applications (even ones heavily using RNGs) the speed of the RNG is not very relevant.
I was referring to the `gen_u64_*` benchmarks in this repository. I agree that they are not very relevant benchmarks for most applications.
Wow, really nice! Note that (as I mentioned in the Julia discussion) dSFMT returns only 52 significant bits, thus restricting the values in [0..1) to half of the possible values (IEEE doubles have 53 bits of precision, as one bit is implied).
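The implied-bit point can be made concrete: an IEEE double stores 52 significand bits plus one implied leading bit, so a uniform value in [0..1) can carry 53 random bits, and a generator that fills only the 52 stored bits reaches half as many values. A sketch of the usual full-precision conversion from a `u64`:

```rust
/// Convert a u64 to a uniform double in [0, 1) using all 53 bits
/// of double precision: keep the top 53 bits and scale by 2^-53.
fn u64_to_unit_f64(x: u64) -> f64 {
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

fn main() {
    // The largest input maps to 1 - 2^-53, just below 1.0.
    println!("{}", u64_to_unit_f64(u64::MAX));
}
```

The resulting values are spaced 2^-53 apart, i.e. every multiple of 2^-53 in [0..1) is reachable, which is exactly twice the resolution of a 52-bit fill.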
I also think that the Julia guys found that an 8-fold unroll is even faster. But of course you're using a lot of space for the copies (i.e., you don't get a longer period, just faster generation).
And yes, in 90% of applications the speed of the PRNG is irrelevant.
I'm not particularly fond of block generation, but this is what is done in high-performance scientific computing. The Intel Math Kernel Library, for example, provides only block generation. If you try to generate one value at a time, the performance is abysmal, but for large amounts it is unbeatable (high vectorization).
There is some discussion on how to vectorize xoshiro256++ at https://github.com/JuliaLang/julia/issues/27614. The method relies on interleaving 4 xoshiro256++ generators. I implemented it, and the results are impressive (see `gen_bytes_fill`): the implementation is 3.3 times faster than the non-vectorized xoshiro256++ generator and more than 2.2 times faster than splitmix64 or chacha8. It is also faster than dSFMT. However, the state size is blown up to 128 bytes, which is almost as large as chacha's state (136 bytes).