vks opened 5 years ago
Impressive performance taking advantage of how "wide" modern CPUs are — but whether this is useful in practice is another question. For ciphers it does sometimes make sense to focus on "byte fill" performance, but is this useful for general-purpose random number generation, or some specific application?
I think in practice it replaces dSFMT, which used to be faster. All applications requiring several random numbers at a time benefit. For general-purpose number generation, using this with `rand_core::block::BlockRng64` is probably faster than the non-SIMD version as well.
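The block pattern referred to here can be sketched in plain Rust. The real `rand_core::block::BlockRng64` wraps a `BlockRngCore` implementation from the `rand_core` crate; the minimal stand-in below just mimics the idea without the crate: the core fills a whole buffer of words at once (which is where SIMD pays off) and the wrapper hands them out one at a time. Names are hypothetical, and splitmix64 stands in for the vectorized generator.

```rust
// Hypothetical sketch of the BlockRng64 pattern (not the rand_core API):
// refill a block of outputs in one go, then serve them individually.
struct BlockRng {
    buf: [u64; 8], // one block of pre-generated outputs
    index: usize,  // next unserved entry; buf.len() means "empty"
    state: u64,    // splitmix64 state stands in for a SIMD core here
}

impl BlockRng {
    fn new(seed: u64) -> Self {
        BlockRng { buf: [0; 8], index: 8, state: seed }
    }

    /// Regenerate the whole block at once. A SIMD core would fill
    /// all eight words in parallel; splitmix64 fills them serially.
    fn refill(&mut self) {
        for w in self.buf.iter_mut() {
            self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = self.state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            *w = z ^ (z >> 31);
        }
        self.index = 0;
    }

    /// Serve one value, refilling the block when it runs out.
    fn next_u64(&mut self) -> u64 {
        if self.index == self.buf.len() {
            self.refill();
        }
        let v = self.buf[self.index];
        self.index += 1;
        v
    }
}

fn main() {
    let mut rng = BlockRng::new(42);
    println!("{} {}", rng.next_u64(), rng.next_u64());
}
```

The trade-off is latency versus throughput: single calls pay the occasional refill cost, but the refill loop itself is the part a vectorized generator can make fast.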
Personally, I think `ChaCha8` is a better choice. The performance is similar, and the cryptographic strength makes it usable for more general purposes.
It's actually slower by a factor of two.
I imagine it's going to depend significantly on the actual application. You're using micro-benchmarks here? I actually think that for most applications (even ones heavily using RNGs) the speed of the RNG is not very relevant.
I was referring to the `gen_u64_*` benchmarks in this repository. I agree that they are not very relevant benchmarks for most applications.
Wow, really nice! Note that (as I mentioned in the Julia discussion) dSFMT returns only 52 significant bits, thus restricting the values in [0..1) to half of the possible values (IEEE doubles have 53 bits of precision, as one bit is implied).
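The implied-bit point can be made concrete: an IEEE double stores 52 significand bits plus one implied leading bit, so a uniform value in [0..1) can carry 53 random bits, and a generator that fills only the 52 stored bits reaches half as many values. A sketch of the usual full-precision conversion from a `u64`:

```rust
/// Convert a u64 to a uniform double in [0, 1) using all 53 bits
/// of double precision: keep the top 53 bits and scale by 2^-53.
fn u64_to_unit_f64(x: u64) -> f64 {
    (x >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

fn main() {
    // The largest input maps to 1 - 2^-53, just below 1.0.
    println!("{}", u64_to_unit_f64(u64::MAX));
}
```

The resulting values are spaced 2^-53 apart, i.e. every multiple of 2^-53 in [0..1) is reachable, which is exactly twice the resolution of a 52-bit fill.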
I also think that the Julia guys found that an 8-fold unroll is even faster. But of course you're using a lot of space for the copies (i.e., you don't get a longer period, just faster generation).
And yes, in 90% of applications the speed of the PRNG is irrelevant.
I'm not particularly fond of block generation, but this is what is done in high-performance scientific computing. The Intel Math Kernel Library, for example, provides only block generation. If you try to generate one value at a time, the performance is abysmal, but for large amounts it is unbeatable (high vectorization).
There is some discussion on how to vectorize xoshiro256++ at https://github.com/JuliaLang/julia/issues/27614. The method relies on interleaving 4 xoshiro256++ generators. I implemented it, and the results are impressive (see `gen_bytes_fill`): the implementation is 3.3 times faster than the non-vectorized xoshiro256++ generator and more than 2.2 times faster than splitmix64 or chacha8. It is also faster than dSFMT. However, the state size is blown up to 128 bytes, which is almost as large as chacha's state (136 bytes).