Closed pitdicker closed 6 years ago
I managed to get equal performance, but it seems like `store_aligned` doesn't give enough speedup (maybe for very new hardware?).
> it seems like `store_aligned` doesn't give enough speedup (maybe for very new hardware?).
The newer the hardware the faster unaligned loads are. For modern hardware, they are pretty much as fast as aligned loads. For older hardware, unaligned loads are slower, but depending on how much work you do per load, the speed-up you might get from switching from unaligned loads to aligned ones might be minimal.
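One common pattern behind this trade-off is to handle the unaligned edges of a buffer separately so the hot loop can assume alignment. A minimal stable-Rust sketch (using `align_to` with `u64` chunks standing in for SIMD lanes; the function name and byte-summing task are just for illustration):

```rust
// Sketch: process a byte buffer with an aligned wide-word core loop,
// falling back to scalar handling for the unaligned prefix/suffix.
fn sum_bytes(data: &[u8]) -> u64 {
    // `align_to` splits the slice into an unaligned prefix, an aligned
    // middle (u64 chunks here, standing in for SIMD lanes), and a suffix.
    let (prefix, aligned, suffix) = unsafe { data.align_to::<u64>() };
    let mut sum: u64 = prefix.iter().map(|&b| b as u64).sum();
    for &word in aligned {
        // Sum the 8 bytes of each aligned word.
        sum += word.to_le_bytes().iter().map(|&b| b as u64).sum::<u64>();
    }
    sum + suffix.iter().map(|&b| b as u64).sum::<u64>()
}

fn main() {
    let data = vec![7u8; 32];
    assert_eq!(sum_bytes(&data), 7 * 32);
    println!("ok");
}
```

On recent hardware the aligned/unaligned split often buys little, as noted above, so it is mainly worth the complexity on older targets.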
> various convenience stuff: SIMD `rotate_left`/`right`
I think we should add these to `stdsimd`, feel free to open a PR there and I'll review these.
> Should we consider offering an SIMD PRNG? Concerns are portability, quality, correlation between streams/lanes, reproducibility(?).
IMO yes we should but as you say, there are several considerations. Of course the naive approach is to use multiple simple PRNGs in parallel (and this may also be the fastest), but ideally there would be some mixing between the different parts of state (this should extend cycle length at least).
Note that some generators compute with higher state sizes internally than they output, so SIMD may not always be appropriate (e.g. MCGs and LCGs may use 128-bit arithmetic to output 64-bit values).
However, first, we might look at whether any of the existing generators can have SIMD optimisations, and then look for existing designs for SIMD generators.
Most modern SIMD designs also use AES-NI.
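The "naive approach" mentioned above can be sketched quickly: run N copies of a simple scalar generator in lockstep, one per lane, with no mixing between lanes. A minimal stable-Rust illustration using Marsaglia's xorshift64 step and `[u64; 4]` arrays standing in for a `u64x4` vector (the seeds are arbitrary example values):

```rust
// Sketch of the naive SIMD approach: four independent xorshift64 streams
// advanced in lockstep. With vector types, each shift/xor below would be
// a single SIMD instruction across all lanes.
struct XorshiftX4 {
    state: [u64; 4],
}

impl XorshiftX4 {
    fn new(seeds: [u64; 4]) -> Self {
        // xorshift has an all-zero fixed point, so zero seeds are rejected.
        assert!(seeds.iter().all(|&s| s != 0), "seeds must be non-zero");
        Self { state: seeds }
    }

    // One xorshift64 step (shift constants 13/7/17) applied to every lane.
    fn next(&mut self) -> [u64; 4] {
        for s in self.state.iter_mut() {
            *s ^= *s << 13;
            *s ^= *s >> 7;
            *s ^= *s << 17;
        }
        self.state
    }
}

fn main() {
    let mut rng = XorshiftX4::new([1, 2, 3, 4]);
    let a = rng.next();
    let b = rng.next();
    assert_ne!(a, b); // the state advances every step
    println!("{:?}", a);
}
```

The cycle length and quality per lane are unchanged from the scalar generator, which is exactly the limitation the state-mixing discussion above is about.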
This paper is useful: Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs
It lists a couple of methods:

- Parameterization: give each stream its own parameters, `(param1, param1), (param2, param2), ...`
- Block splitting (requires at least twice as much memory)
- Random seeding, with probability that the streams do not overlap ≈ `(1 - streams * stream_length / period) ^ (streams - 1)`, and probability `p` that the streams' random keys are not all distinct: `1 - p ≈ streams^2 / period`
- For combined generators, probability of overlap ≈ `streams^2 * stream_length / (period_1 * period_2)`
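To get a feel for the orders of magnitude in the random-seeding estimate, here is a quick numeric sanity check. The stream counts below are hypothetical, chosen only to show why period size dominates the overlap risk:

```rust
// Rough estimate from above: the chance that `streams` random starting
// points give overlapping subsequences is about
// streams^2 * stream_length / period (valid only while the result << 1).
fn approx_overlap_prob(streams: f64, stream_length: f64, period: f64) -> f64 {
    streams * streams * stream_length / period
}

fn main() {
    // Hypothetical workload: 2^16 streams of 2^32 draws each.
    let streams = (1u64 << 16) as f64;
    let len = (1u64 << 32) as f64;
    // From a 2^64-period generator the estimate saturates: overlap is
    // essentially certain.
    assert!(approx_overlap_prob(streams, len, 2f64.powi(64)) >= 1.0);
    // A 2^128-period generator makes it negligible (~2^-64).
    assert!(approx_overlap_prob(streams, len, 2f64.powi(128)) < 1e-18);
    println!("ok");
}
```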
I published a crate with my PRNG tests: https://github.com/TheIronBorn/simd_prngs
Most are just parallelized scalar algorithms. I'm not aware of a decent method to mix state. Pseudo-Random Number Generators for Vector Processors and Multicore Processors might have something.
The only non-cryptographic PRNGs designed for SIMD I'm aware of are those of Random123.
Some notes from those tests:

- Some are `VeryFast` (algs w/t counters)
- `xorshift128+` & `xorshift128` are very fast (using all bits)
- `Pcg64` variants use 128-bit multiplication, difficult with SIMD (`cat_rng.rs`)
- Seeding: `seed_fast`, xorshift jumps, or rand's `seed_from_u64`.

Shuffling the 16 bytes of `xorshift128+x2` according to some randomly shuffled unique indices actually bumps it from 128MB in PractRand to 64GB (binary rank failure):

```rust
let x = u8x16::from_bits(xorshift128plus_x2);
let indices = [9, 15, 0, 8, 3, 11, 6, 13, 4, 14, 10, 1, 7, 5, 2, 12]; // rng.shuffle(0..16)
let r: u8x16 = unsafe { simd_shuffle16(x, x, indices) };
return u64x2::from_bits(r);
```
Mixing state output might allow us to use a poorer, faster PRNG (I did have to try several indices though).
I don't know what effect this might have on correlation.
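For readers without the nightly `simd_shuffle16` intrinsic, the same permutation can be written in stable Rust with plain arrays (the index table is the one from the snippet above; with SIMD the whole function is a single shuffle instruction):

```rust
// Stable-Rust sketch of the byte-shuffle mixing step: permute the 16
// output bytes of a 2-lane 64-bit generator with a fixed index table.
const INDICES: [usize; 16] = [9, 15, 0, 8, 3, 11, 6, 13, 4, 14, 10, 1, 7, 5, 2, 12];

fn shuffle_output(lanes: [u64; 2]) -> [u64; 2] {
    // Flatten the two 64-bit lanes to 16 bytes.
    let mut bytes = [0u8; 16];
    bytes[..8].copy_from_slice(&lanes[0].to_le_bytes());
    bytes[8..].copy_from_slice(&lanes[1].to_le_bytes());
    // Permute the bytes with the fixed index table.
    let mut shuffled = [0u8; 16];
    for (i, &src) in INDICES.iter().enumerate() {
        shuffled[i] = bytes[src];
    }
    // Repack into two lanes.
    let mut lo = [0u8; 8];
    let mut hi = [0u8; 8];
    lo.copy_from_slice(&shuffled[..8]);
    hi.copy_from_slice(&shuffled[8..]);
    [u64::from_le_bytes(lo), u64::from_le_bytes(hi)]
}

fn main() {
    let out = shuffle_output([0x0706050403020100, 0x0f0e0d0c0b0a0908]);
    // A shuffle only permutes bytes, so the byte multiset is unchanged.
    let mut bytes: Vec<u8> = Vec::new();
    bytes.extend_from_slice(&out[0].to_le_bytes());
    bytes.extend_from_slice(&out[1].to_le_bytes());
    bytes.sort();
    assert_eq!(bytes, (0u8..16).collect::<Vec<u8>>());
    println!("ok");
}
```

Note this only reorders bytes; as discussed below, it may hide patterns from the tests rather than remove them.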
You improved test performance just by shuffling the output? Interesting. I wonder if this just makes correlations harder to spot though? If PractRand is still seeing all the same output then literally the only change is the byte order. I think many of the tests are engineered to look for certain types of pattern, and some of those may not show up with the extra shuffle.
On the other hand I think shuffling the state could significantly improve the quality (cycle length could in theory be increased by the power of the number of streams), though in effect this produces a totally different generator, hence there could be problems and it might be preferable to use different constants. I'm not an expert on this; it would be much better to have input from someone who actually designs PRNGs.
Nice work @TheIronBorn assembling that collection of SIMD generators.
What's XSM? A pseudo-LCG?
It would be interesting to include 96-bit / 128-bit LCGs. Basically half of PCG, with a bit better performance:
```
test lcg32x2 ... bench: 3,152 ns/iter (+/- 60) = 2598 MB/s
test lcg32x4 ... bench: 3,161 ns/iter (+/- 64) = 5183 MB/s
test lcg32x8 ... bench: 3,841 ns/iter (+/- 150) = 8531 MB/s
```
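The "half of PCG" idea above can be sketched as a truncated 128-bit LCG: the full PCG state transition, but outputting the high bits directly instead of applying the output permutation. A minimal sketch (the multiplier is PCG's published 128-bit default; the increment is an arbitrary odd constant chosen here for illustration):

```rust
// Truncated 128-bit LCG: state = state * MUL + INC (mod 2^128), output
// the high 64 bits. The low bits of an LCG are weak, so truncation keeps
// only the strongest half.
struct Lcg128 {
    state: u128,
}

impl Lcg128 {
    // PCG's default 128-bit multiplier.
    const MUL: u128 = 0x2360ed051fc65da44385df649fccf645;
    // Any odd increment gives a full-period LCG; this one is arbitrary.
    const INC: u128 = 1442695040888963407;

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_mul(Self::MUL).wrapping_add(Self::INC);
        (self.state >> 64) as u64
    }
}

fn main() {
    let mut rng = Lcg128 { state: 12345 };
    let first = rng.next_u64();
    let mut again = Lcg128 { state: 12345 };
    assert_eq!(again.next_u64(), first); // deterministic for a given seed
    println!("{}", first);
}
```

As noted earlier in the thread, the 128-bit multiplication is exactly the part that is awkward to vectorize, which is why these were benchmarked as multiple scalar lanes.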
I'm not quite sure where this is going, but publishing some of these at least as part of Rand would be nice. I guess we should get back to #431.
Could you make that LCG code available? Perhaps just a pull request on https://github.com/TheIronBorn/simd_prngs.
And here’s @pitdicker’s evaluation of XSM https://github.com/dhardy/rand/issues/60#issuecomment-346896365
> What's XSM? A pseudo-LCG?
https://github.com/dhardy/rand/issues/60#issuecomment-346896365
I managed to get `xorshift128+_x4` to 4GB by swizzling the full state before the regular xorshift, but it's got 512 bits of state so that isn't too impressive.
No, that's not impressive. It seems to me that multiplication or even addition by a constant would cause better bit-avalanche than bit-shift-and-XOR operations, but then Xorshift stays away from these for performance.
I'm not sure whether it's worth continuing down this path. It's essentially building a new PRNG using only automated tests for evaluation. But I'll have a go at building a mixing LCG/MCG if you like.
Intel has a mixing LCG here https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/. (Perhaps I'm reading it incorrectly and it's not actually mixing state)
My quick testing doesn't indicate it's very fast and they don't give many details on its quality.
I was only looking into the state mixing to try to find some way of achieving:

> ideally there would be some mixing between the different parts of state
Probably not the best approach to test mixing with a poorer quality generator like `xorshift128+`. Also we still won't have the math guarantees of the period/etc of the mixed generator.
I was just experimenting with a generator based on this. Unfortunately performance is abysmal using `simd_shuffle16` etc (2-3GB/s), which means it's probably not worth further investigation.
In all honesty, I think it might be time to close this issue. Let's recap.
- SIMD types are now supported by the `Standard` and `Uniform` distributions (thanks especially to @pitdicker and @TheIronBorn!).
- There has been some discussion of supporting SIMD types for other distributions, notably `Normal` and the Ziggurat algorithm, and possible alternatives (Central Limit Theorem).
- There has been some discussion on adapting the `RngCore` type, though for now it looks like the current API might be sufficient.
- There has been a lot of discussion on parallel/SIMD RNG implementations and @TheIronBorn has assembled a collection of SIMD generators. I remain sceptical about simply bolting together multiple small RNGs since there are several existing PRNGs with 128-bit output at least, and larger generators tend to have better quality than smaller ones, although for some applications multiple small generators may be faster and have sufficient quality.
At this point I think it makes sense to create smaller, more topical discussions on issues of continued interest.
I also suggest that when considering SIMD PRNGs we should try benchmarking with something more complex than micro-benchmarks where possible. @gnzlbg's Ambient Occlusion ray casting benchmark may be useful for this (though multiple applications would be preferred).
In fact, it may be worth opening an issue just to talk about benchmarking (SIMD as well as scalar PRNGs).
FYI the purpose of `aobench` is to compare Rust's code generation with that of ISPC, so I will keep the benchmark algorithmically/data-structure-wise as close as possible to the benchmark implementation in ISPC (see the `volta` subdirectory).
However, ISPC only provides one PRNG, and if we can beat it conveniently in Rust, that benchmark might be a good place to try that out and even report it. Just keep in mind what the original intent of the benchmark is, and if you submit a PR with better PRNGs, put them behind a feature flag so that we can still use the benchmark in the `lolbench` suite to track Rust run-time performance regressions.
Hi! This topic is interesting to me: I have a pet project where I'm using the Monte Carlo method to simulate football matches. I've used this project as an excuse to learn more about SIMD and related topics. I also wanted to learn Rust, so naturally I ended up here. I noticed that the current `rand` SIMD feature doesn't really vectorize random number generation, so I've ended up implementing some vectorized PRNGs here: https://github.com/martinothamar/simd-rand
I have some portable implementations using nightly portable SIMD, and some using SIMD intrinsics directly from `std::arch`. The fastest one is Xoshiro256+ for AVX-512, which has reached ~50 GiB/s on my laptop when generating `u64x8` vectors.
I thought I'd mention it here in case there is any interest in this still
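For reference, this is the scalar xoshiro256+ step (Blackman & Vigna) that vectorized versions like those linked above run per lane, e.g. with `u64x8` state words on AVX-512. The seed values in the usage below are arbitrary; real code must seed each lane differently (and never all-zero):

```rust
// Scalar xoshiro256+ step; SIMD versions run N copies of exactly this
// recurrence in lockstep, one per lane.
struct Xoshiro256Plus {
    s: [u64; 4],
}

impl Xoshiro256Plus {
    fn next_u64(&mut self) -> u64 {
        // Output is the "+" part: s[0] + s[3], taken before updating state.
        let result = self.s[0].wrapping_add(self.s[3]);
        let t = self.s[1] << 17;
        self.s[2] ^= self.s[0];
        self.s[3] ^= self.s[1];
        self.s[1] ^= self.s[2];
        self.s[0] ^= self.s[3];
        self.s[2] ^= t;
        self.s[3] = self.s[3].rotate_left(45);
        result
    }
}

fn main() {
    let mut rng = Xoshiro256Plus { s: [1, 2, 3, 4] };
    // First output is s[0] + s[3] of the initial state: 1 + 4 = 5.
    assert_eq!(rng.next_u64(), 5);
    println!("{}", rng.next_u64());
}
```

Every operation here (add, shift, xor, rotate) maps directly to a vector instruction, which is why this family vectorizes so well.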
Disclaimer: I don't really know what I am talking about here.
The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNGs may lend themselves well to this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the `BlockRngCore` trait?

Making use of the generated values can be a more interesting problem. This involves the distributions code.

- Converting to floating point: the `Standard` uniform distribution should not cause any problems, and neither should the range code.
- I suppose the other distributions are going to give trouble; at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejection sampling, and one of a couple of SIMD values gets rejected? Return `NaN`? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the `Normal` distribution, although it would need a library to provide functions like `sin` and `cos`.
- Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.
Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.
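For context on the integer-range point above, the "widening multiply" method (often attributed to Lemire) in scalar form is: multiply a full-width random word by the range and keep the high half of the double-width product. The SIMD difficulty is that every `u32` lane needs a `u64` intermediate, and `u64` lanes a `u128` one. A sketch (the bias-correcting rejection step is omitted for brevity):

```rust
// Map a uniform u32 into [0, range) via the high half of a widening
// multiply; slightly biased without the rejection step.
fn reduce(x: u32, range: u32) -> u32 {
    (((x as u64) * (range as u64)) >> 32) as u32
}

fn main() {
    assert_eq!(reduce(0, 10), 0);
    assert_eq!(reduce(u32::MAX, 10), 9);
    println!("ok");
}
```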