rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand

Investigate SIMD support #377

Closed pitdicker closed 6 years ago

pitdicker commented 6 years ago

Disclaimer: I don't really know what I am talking about here.

The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNGs may lend themselves well to this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait?
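To make the lane-packing idea concrete, here is a hedged sketch (my own illustration, not Rand API; the struct name and lane count are invented): four xorshift128+ generators stepped in lock-step over fixed-size arrays, which is exactly the shape that auto-vectorizers or explicit SIMD types can exploit.

```rust
/// Sketch (not Rand API): four xorshift128+ generators packed side by side.
/// Each lane carries its own (s0, s1) state; stepping all lanes in
/// lock-step over fixed-size arrays is the pattern SIMD handles well.
pub struct Xorshift128PlusX4 {
    s0: [u64; 4],
    s1: [u64; 4],
}

impl Xorshift128PlusX4 {
    /// Seeds must be chosen so no lane's (s0, s1) pair is all-zero.
    pub fn from_seeds(s0: [u64; 4], s1: [u64; 4]) -> Self {
        Self { s0, s1 }
    }

    /// Advance all four lanes and return one u64 per lane.
    pub fn next_u64x4(&mut self) -> [u64; 4] {
        let mut out = [0u64; 4];
        for lane in 0..4 {
            let mut x = self.s0[lane];
            let y = self.s1[lane];
            self.s0[lane] = y;
            x ^= x << 23;
            self.s1[lane] = x ^ y ^ (x >> 17) ^ (y >> 26);
            out[lane] = self.s1[lane].wrapping_add(y);
        }
        out
    }
}

fn main() {
    let mut rng = Xorshift128PlusX4::from_seeds([1, 2, 3, 4], [5, 6, 7, 8]);
    println!("{:?}", rng.next_u64x4());
}
```

Returning a fixed-size array from each step is also roughly the shape the BlockRngCore trait expects.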

Making use of the generated values can be a more interesting problem. This involves the distributions code.

Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code.

I suppose the other distributions are going to give trouble; at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejection sampling, and only some lanes of a SIMD value are rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos.
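As a sketch of why Box-Muller maps well onto lanes (my own illustration, not Rand code): it is branch-free, needing only ln, sqrt, sin and cos per lane; here the lanes are emulated with plain arrays.

```rust
use std::f64::consts::TAU; // 2*pi

/// Sketch: Box-Muller over four lanes at once. Branch-free, so it
/// vectorizes if a vector math library supplies ln/sqrt/sin/cos.
///
/// `u1` lanes must lie in (0, 1] (so ln is finite); `u2` lanes in [0, 1).
pub fn box_muller_x4(u1: [f64; 4], u2: [f64; 4]) -> ([f64; 4], [f64; 4]) {
    let mut z0 = [0.0; 4];
    let mut z1 = [0.0; 4];
    for lane in 0..4 {
        let r = (-2.0 * u1[lane].ln()).sqrt();
        let theta = TAU * u2[lane];
        z0[lane] = r * theta.cos();
        z1[lane] = r * theta.sin();
    }
    (z0, z1)
}

fn main() {
    let (z0, z1) = box_muller_x4([0.5, 0.25, 0.75, 1.0], [0.0, 0.25, 0.5, 0.75]);
    println!("{:?} {:?}", z0, z1);
}
```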

Integer ranges: I don't expect the widening multiply method to work for integer range reduction, since SIMD instruction sets offer neither a widening 64×64→128-bit multiply nor integer division or modulus. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and would cause bias.
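For reference, the widening-multiply reduction being discussed looks like this in scalar code; the 64×64→128-bit multiply in the middle is exactly what SIMD instruction sets lack.

```rust
/// Scalar widening-multiply range reduction: maps a uniform u64 into
/// [0, range) via a 64x64 -> 128-bit multiply, keeping the high half.
/// SIMD instruction sets generally lack this widening multiply, which
/// is why the comment above doubts it will vectorize.
/// (A fully unbiased version also needs a rejection step, omitted here.)
pub fn reduce(x: u64, range: u64) -> u64 {
    ((x as u128 * range as u128) >> 64) as u64
}

fn main() {
    // Every output lands in [0, 10).
    for x in [0, u64::MAX / 2, u64::MAX] {
        println!("{}", reduce(x, 10));
    }
}
```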

Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.

TheIronBorn commented 6 years ago

I managed to get equal performance, but it seems like store_aligned doesn't give enough speedup (maybe for very new hardware?).

gnzlbg commented 6 years ago

it seems like store_aligned doesn't give enough speedup (maybe for very new hardware?).

The newer the hardware the faster unaligned loads are. For modern hardware, they are pretty much as fast as aligned loads. For older hardware, unaligned loads are slower, but depending on how much work you do per load, the speed-up you might get from switching from unaligned loads to aligned ones might be minimal.

various convenience stuff: SIMD rotate_left/right

I think we should add these to stdsimd, feel free to open a PR there and I'll review these.

TheIronBorn commented 6 years ago

Should we consider offering an SIMD PRNG? Concerns are portability, quality, correlation between streams/lanes, reproducibility(?).

dhardy commented 6 years ago

IMO yes we should but as you say, there are several considerations. Of course the naive approach is to use multiple simple PRNGs in parallel (and this may also be the fastest), but ideally there would be some mixing between the different parts of state (this should extend cycle length at least).

Note that some generators compute with higher state sizes internally than they output, so SIMD may not always be appropriate (e.g. MCGs and LCGs may use 128-bit arithmetic to output 64-bit values).

However, first, we might look at whether any of the existing generators can have SIMD optimisations, and then look for existing designs for SIMD generators.

vks commented 6 years ago

Most modern SIMD designs also use AES-NI.

TheIronBorn commented 6 years ago

Correlation

This paper is useful: Random Numbers for Parallel Computers: Requirements and Methods, With Emphasis on GPUs

It lists a couple methods:

  1. A different generator for each stream (i.e. different parameters)
    • viable unless the parameters are used in shifts/rotates (variable per-lane shifts/rotates are only available on newer hardware, mostly AVX-512). In that case we could iterate through parameters, i.e. with 2 streams: (param1, param1), (param2, param2), ... (requires at least twice as much memory)
    • We could even iterate through more parameter sets than vector lanes.
  2. Splitting a single RNG into equally-spaced blocks (streams)
    • requires a PRNG that can easily compute seed "distance"
    • xorshift jumps, etc
  3. A single RNG with a “random” seed for each stream
    • probability that none of those streams overlap ≈ (1 - streams * stream_length / period) ^ (streams - 1)
  4. Counter-based RNGs
    • this is an excellent solution: jumping is trivial, we could even do it dynamically if needed.
    • probability that the streams' random keys are not all distinct: p ≈ streams^2 / period
    • AES or other hashes/ciphers. HighwayHash, CLHash I hear are fast with SIMD
      • Random123 has several: ARS-5/7 (non-cryptographic) might be the fastest around
    • We could adapt some of these techniques to the counters of other PRNGs like SFC https://sourceforge.net/p/pracrand/discussion/366935/thread/a7761577
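Here is a hedged sketch of option 4 (my own illustration; it uses the SplitMix64 finalizer as a stand-in for the "cipher", whereas real counter-based designs like Random123's ARS/Philox use stronger rounds): the state is just a counter, so jumping to any stream position is a single assignment.

```rust
/// Sketch of a counter-based generator: output = mix(key ^ counter).
/// The mixing function is SplitMix64's finalizer, used here only to
/// illustrate the structure; it is not a vetted counter-based design.
pub struct CounterRng {
    key: u64,
    counter: u64,
}

impl CounterRng {
    pub fn new(key: u64, counter: u64) -> Self {
        Self { key, counter }
    }

    /// Jumping is trivial: just set the counter.
    pub fn jump_to(&mut self, counter: u64) {
        self.counter = counter;
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut z = self.key ^ self.counter;
        self.counter = self.counter.wrapping_add(1);
        // SplitMix64 finalizer (Stafford mix constants).
        z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut rng = CounterRng::new(42, 0);
    println!("{} {}", rng.next_u64(), rng.next_u64());
}
```

Because each output depends only on (key, counter), lanes can hold consecutive counter values and streams can hold distinct keys, with no state to carry between blocks.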

Other approaches

TheIronBorn commented 6 years ago

I published a crate with my PRNG tests: https://github.com/TheIronBorn/simd_prngs

Most are just parallelized scalar algorithms. I'm not aware of a decent method to mix state. Pseudo-Random Number Generators for Vector Processors and Multicore Processors might have something.

The only non-cryptographic PRNGs designed for SIMD I'm aware of are those of Random123.

A few other notes:

TheIronBorn commented 6 years ago

Shuffling the 16 bytes of xorshift128+x2 according to a fixed, randomly chosen permutation actually bumps its PractRand failure point from 128MB to 64GB (binary rank failure)

// packed_simd-era code; with std::simd this would be `simd_swizzle!`.
let x = u8x16::from_bits(xorshift128plus_x2); // reinterpret u64x2 as 16 bytes
// a fixed permutation of 0..16, chosen once at random (via rng.shuffle)
let indices = [9, 15, 0, 8, 3, 11, 6, 13, 4, 14, 10, 1, 7, 5, 2, 12];
let r: u8x16 = unsafe { simd_shuffle16(x, x, indices) };
return u64x2::from_bits(r); // reinterpret back to two u64 lanes

Mixing state output might allow us to use a poorer, faster PRNG. (I did have to try several indices though).

I don't know what effect this might have on correlation.

dhardy commented 6 years ago

You improved test performance just by shuffling the output? Interesting. I wonder if this just makes correlations harder to spot though? If PractRand is still seeing all the same output then literally the only change is the byte order. I think many of the tests are engineered to look for certain types of pattern, and some of those may not show up with the extra shuffle.

On the other hand I think shuffling the state could significantly improve the quality (cycle length could in theory be increased by the power of the number of streams), though in effect this produces a totally different generator, hence there could be problems and it might be preferable to use different constants. I'm not an expert on this; it would be much better to have input from someone who actually designs PRNGs.

dhardy commented 6 years ago

Nice work @TheIronBorn assembling that collection of SIMD generators.

What's XSM? A pseudo-LCG?

It would be interesting to include 96-bit / 128-bit LCGs. Basically half of PCG, with a bit better performance:

test lcg32x2                 ... bench:       3,152 ns/iter (+/- 60) = 2598 MB/s
test lcg32x4                 ... bench:       3,161 ns/iter (+/- 64) = 5183 MB/s
test lcg32x8                 ... bench:       3,841 ns/iter (+/- 150) = 8531 MB/s
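A hedged reconstruction of the generator being described (my own sketch, not dhardy's actual benchmark code; the multiplier is an arbitrary odd constant for illustration, not a vetted spectral-test constant): a 128-bit-state LCG that returns the high bits of state, i.e. the state-transition half of PCG without the output permutation.

```rust
/// Sketch of a 128-bit LCG returning the high 64 bits of state: this is
/// the state-transition half of PCG, minus PCG's output permutation.
/// MULT is illustrative only; a real implementation should use a
/// spectral-test-vetted constant.
pub struct Lcg128 {
    state: u128,
    inc: u128, // must be odd
}

const MULT: u128 = 0x2360_ED05_1FC6_5DA4_4385_DF64_9FCC_F645;

impl Lcg128 {
    pub fn new(seed: u128, stream: u128) -> Self {
        Self { state: seed, inc: stream | 1 } // force the increment odd
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_mul(MULT).wrapping_add(self.inc);
        (self.state >> 64) as u64 // the high bits have the best quality
    }
}

fn main() {
    let mut rng = Lcg128::new(1, 2);
    println!("{} {}", rng.next_u64(), rng.next_u64());
}
```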

I'm not quite sure where this is going, but publishing some of these at least as part of Rand would be nice. I guess we should get back to #431.

TheIronBorn commented 6 years ago

Could you make that LCG code available? Perhaps just a pull request on https://github.com/TheIronBorn/simd_prngs.

And here’s @pitdicker’s evaluation of XSM https://github.com/dhardy/rand/issues/60#issuecomment-346896365

pitdicker commented 6 years ago

What's XSM? A pseudo-LCG?

https://github.com/dhardy/rand/issues/60#issuecomment-346896365

TheIronBorn commented 6 years ago

I managed to get xorshift128+_x4 to 4GB by swizzling the full state before the regular xorshift step, but with 512 bits of state that isn't too impressive.

dhardy commented 6 years ago

No, that's not impressive. It seems to me that multiplication or even addition by a constant would cause better bit-avalanche than bit-shift-and-XOR operations, but then Xorshift stays away from these for performance.

I'm not sure whether it's worth continuing down this path. It's essentially building a new PRNG using only automated tests for evaluation. But I'll have a go at building a mixing LCG/MCG if you like.

TheIronBorn commented 6 years ago

Intel has a mixing LCG here https://software.intel.com/en-us/articles/fast-random-number-generator-on-the-intel-pentiumr-4-processor/. (Perhaps I'm reading it incorrectly and it's not actually mixing state)

My quick testing doesn't indicate it's very fast and they don't give many details on its quality.

TheIronBorn commented 6 years ago

I was only looking into the state mixing to try to find some way of achieving

ideally there would be some mixing between the different parts of state

Probably not the best approach to test mixing with a poorer quality generator like xorshift128+. Also we still won't have the math guarantees of the period/etc of the mixed generator.

dhardy commented 6 years ago

I was just experimenting with a generator based on this. Unfortunately performance is abysmal using simd_shuffle16 etc (2-3GB/s) which means it's probably not worth further investigation.

dhardy commented 6 years ago

In all honesty, I think it might be time to close this issue. Let's recap.

#523 landed support for SIMD types on the Standard and Uniform distributions (thanks especially to @pitdicker and @TheIronBorn!).

There has been some discussion of supporting SIMD types for other distributions, notably Normal and the Ziggurat algorithm, and possible alternatives (Central Limit Theorem).

There has been some discussion on adapting the RngCore type, though for now it looks like the current API might be sufficient.

There has been a lot of discussion on parallel/SIMD RNG implementations, and @TheIronBorn has assembled a collection of SIMD generators. I remain sceptical about simply bolting together multiple small RNGs, since there are several existing PRNGs with at least 128-bit output, and larger generators tend to have better quality than smaller ones; although for some applications multiple small generators may be faster and still of sufficient quality.


At this point I think it makes sense to create smaller, more topical discussions on issues of continued interest.

I also suggest that when considering SIMD PRNGs we should try benchmarking with something more complex than micro-benchmarks where possible. @gnzlbg's Ambient Occlusion ray casting benchmark may be useful for this (though multiple applications would be preferred).

In fact, it may be worth opening an issue just to talk about benchmarking (SIMD as well as scalar PRNGs).

gnzlbg commented 6 years ago

FYI the purpose of aobench is to compare Rust's code generation with that of ISPC, so I will keep the benchmark algorithmically/data-structure-wise as close as possible to the benchmark implementation in ISPC (see the volta subdirectory).

However, ISPC only provides one PRNG, and if we can beat it conveniently in Rust, that benchmark might be a good place to try that out and even report it. Just keep in mind what the original intent of the benchmark is, and if you submit a PR with better PRNGs, put them behind a feature flag so that we can still use the benchmark in the lolbench suite to track Rust run-time performance regressions.

martinothamar commented 1 year ago

Hi! This topic is interesting to me. I have a pet project where I'm using the Monte Carlo method to simulate football matches, and I've used it as an excuse to learn more about SIMD and related topics. I also wanted to learn Rust, so naturally I ended up here. I noticed that the current rand SIMD feature doesn't really vectorize random number generation, so I've ended up implementing some vectorized PRNGs here: https://github.com/martinothamar/simd-rand

I have some portable implementations which are using nightly portable SIMD, and some that are using SIMD intrinsics directly from std::arch. The fastest one is Xoshiro256+ for AVX512 which has reached ~50 GiB/s on my laptop when generating u64x8 vectors.

I thought I'd mention it here in case there is any interest in this still