I’ve actually been playing with this a lot. Let me see about putting something worth scrutiny on github.
On Apr 5, 2018, at 12:58, Paul Dicker notifications@github.com wrote:
Disclaimer: I don't really know what I am talking about here.
The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait?
Making use of the generated values can be a more interesting problem. This involves the distributions code.
Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code.
I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos.
Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.
Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.
I think this might give us some direction https://github.com/lemire/SIMDxorshift, specifically the avx_randombound_epu32 function: https://github.com/lemire/SIMDxorshift/blob/1766a98f4be5e1d2e78d9e2c3623800fbf92931f/src/simdxorshift128plus.c#L90
Also of note: for multi-rngs we could use several sets of constants at once. More or less alternating between Xorshift64(10, 13, 10), Xorshift64(8, 9, 22), Xorshift64(2, 7, 3), and Xorshift64(23, 3, 24) for example. Might help raise statistical quality.
Great if you would want to help out a bit here!
I believe @vks has also been exploring this, for example in https://github.com/vks/xoroshiro/blob/master/src/rng/xoroshiro128simd.rs
I think this might give us some direction lemire/SIMDxorshift
:+1: I can't read those function calls yet, but he is doing two widening multiplies, even and odd, and keeping the high parts? And without checking against low they will be biased, but this is not always a problem, especially if the range is small.
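For reference, a scalar sketch of that widening-multiply range reduction (roughly what avx_randombound_epu32 does per lane with even/odd 32x32->64 multiplies); this is an illustration, not code from either library:
fn bounded(x: u32, range: u32) -> u32 {
    // Map x, uniform in [0, 2^32), onto [0, range) by taking the high half of
    // the 64-bit product; slightly biased unless range divides 2^32 evenly.
    ((x as u64 * range as u64) >> 32) as u32
}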
Also of note: for multi-rngs we could use several sets of constants at once.
We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.
@TheIronBorn Would you be interested in exploring what it would take to have a Distribution take the output of such an RNG, and transform it while staying in the SIMD area?
Converting it to a float, or a floating point range, are the simplest methods.
We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.
Are you asking if such an implementation would require each lane to have the same value upon seeding?
Float gen is pretty simple. Here's a quick gist https://gist.github.com/TheIronBorn/1ba4da07d6e8e05ab20cb780c383dbec
Added range sampling
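For readers following along, the usual bits-to-float trick is short enough to show here; this is a generic sketch (not the gist's exact code), written against today's nightly portable-SIMD types rather than the stdsimd of the time:
#![feature(portable_simd)]
use std::simd::{f32x4, u32x4};

// Convert 32 random bits per lane into a float in [0, 1): keep the 24 most
// significant bits and scale by 2^-24, the same precision as the scalar
// multiply-based conversion.
fn u32x4_to_unit_f32x4(bits: u32x4) -> f32x4 {
    (bits >> u32x4::splat(8)).cast::<f32>() * f32x4::splat(1.0 / (1 << 24) as f32)
}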
Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.
Distribution::sample(..) -> T is already generic over the output type, so in theory the trait can support SIMD distributions, given native consumers of the SIMD types. I guess it should be possible to make adapter implementations, i.e. automatically implement Distribution<(f64, f64)> for any type supporting Distribution<f64x2>.
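A sketch of that adapter idea (the wrapper name is invented here, and this assumes the rand 0.5 Distribution trait plus a SIMD f64x2 that supports lane indexing, e.g. today's std::simd):
#![feature(portable_simd)]
use std::simd::f64x2;
use rand::Rng;
use rand::distributions::Distribution;

// Any distribution that yields an f64x2 can also serve scalar pairs, by
// sampling one vector and splitting its lanes.
struct AsTuple<D>(D);

impl<D: Distribution<f64x2>> Distribution<(f64, f64)> for AsTuple<D> {
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> (f64, f64) {
        let v = self.0.sample(rng);
        (v[0], v[1])
    }
}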
For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching, but this significantly increases complexity and I'm not sure it would even be faster in many cases.
One specific case might be worth a specific solution: StandardBlockRng<T>. It may be possible to make a wrapper around any BlockRngCore which immediately applies the Standard distribution to convert its results buffer to an array of type T, then draws values from this buffer. In theory this might allow some optimisation between the RNG (generation) and the distribution (conversion) steps, although probably whatever loop the RNG step requires to fill the buffer would prevent this. This has the added advantages of a common design with and without SIMD, and that it can simply implement Iterator<Item=T>.
Thank you for the code, @TheIronBorn, something to look into. Can I also ask you to create a branch in your fork of rand with the changes? Or did you have another way set up to test things?
Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.
When I was exploring papers around RNGs and related topics, GPU processing was the current trend, and exploring the possibilities of SIMD also seems to come up regularly. As it is a topic we did not give any attention until now, it is at least good to explore a bit...
But I have no idea yet how useful it will really be. Especially for some distributions it may be hard for SIMD to provide any benefit over the traditional methods.
One thing that seems like an obvious problem is that Distribution takes an Rng, but Rng exposes no methods to give something like next_u32x4. I think a lot can be solved by adding another method and associated type to RngCore. This type would indicate the native value produced by this RNG. It is something I thought would be useful for months, but never spoke about...
So our current Xorshift implementation natively generates u32s, Xoroshiro128+ u64s, ChaCha [u32; 16], etc. With an addition like this it is possible to implement RNGs for specific types not covered by the trait. For example one specialized for u16 (maybe for embedded?), one for SIMD with a type like next_u32x4, or one designed for floats with f64. I created an incomplete branch with the changes. @dhardy Please don't shoot it down instantly...
For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching
Why would we need caching? To return one value at a time instead of a SIMD type? Or did you write this because the smaller number of sampling algorithms may need some mutable state?
Caching would just be to return one value at a time, yes; the trouble with that approach is that there is significant overhead storing/retrieving the cached values.
@pitdicker your branch looks interesting, though you could probably just do RngCore: Iterator instead. Also what would the OsRng implementation look like?
The BlockRngCore::Results return values are probably going to be too big for many uses and there is no provided way of breaking the block up into smaller segments; this can be worked around however by choosing an RNG with Item type of the correct size.
There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type (the same problem we hit in #307). It might however make sense to use a completely separate trait, possibly just Iterator, and then just implement SIMD distributions as simple functions usable via Iterator::map; this is not well integrated however.
But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.
Here's my branch https://github.com/TheIronBorn/rand/tree/simd/src/simd
Blending sets of constants for Xorshift seems to only optimize well for machines later than somewhere around Haswell: https://godbolt.org/g/G2x1E2. (Adapted from O'Neill's Xorshift library) I don't have such a machine so I can't confirm actual speed.
Removing the multiply helps.
There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type
I realized just after posting. So it was a bad idea from me, feel free to shoot it down :smile:.
Here's my branch TheIronBorn/rand:src/simd@simd
Nice, I have been playing with it. And SFC seems a pretty good choice here.
I think I have found a way to support SIMD without extra traits like SimdRng and SimdDistribution. The idea is that RngCore::fill_bytes does a simple copy of its result (like u32x4) to the byte slice if they have exactly the same length. And the integer distributions for Standard do the opposite: they initialize an empty u32x4, and pass it to fill_bytes as a byte slice. The optimizer is able to work things out.
I don't think the BlockRng wrapper is going to work well in combination with a simple RNG adapted for SIMD. I expect the bookkeeping to take as much time as just generating a new batch of values. So we may just as well always generate a new batch, and use as much as needed.
I have a proof of concept in https://github.com/pitdicker/rand/commits/simd_support. The code in sfc.rs is very bad and ugly. I am not sure using BlockRngCore in there is useful. But the trick with using fill_bytes seems to work.
But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.
I agree we shouldn't turn everything upside down only to support SIMD, especially when we are not sure how useful it will be.
Does implementing Standard and Range for SIMD types fall for you in the category 'develop infrastructure' or 'extension of existing functionality'? Without tests it takes about 125 extra loc: https://github.com/pitdicker/rand/compare/c03deb6...pitdicker:6918e11. The first commit is a necessary change in rand_core to get the inlining right for fill_bytes, and I think that is something that should be in rand_core 0.1. The rest is definitely not interesting until after rand 0.5.
Nice, I didn't expect fill_bytes to work so well!
Implementing existing traits/distributions is fine — well, we have to look at how much code gets added, but probably fine.
@TheIronBorn I have tried to clean up my branch a bit, removed some unnecessary unsafe code etc. Do you want to have a look and maybe experiment from there?
I have tried to make the fill_bytes method on Sfc32x* in such a way that just about all of it is optimized out when the target is a SIMD type, but to also support filling arbitrary slices, as it is supposed to be able to do. One thing missing is byte-swapping on big-endian architectures. This can probably be done pretty efficiently with SIMD, do you know a method?
One nice benchmark on my PC: Sfc32x4Rng can generate f32 floating point values at 10GB/s, 4x as fast as the current Xorshift. And we would have to get very creative with supporting Distribution on tuples or slices for even Xoroshiro128+ to get 5GB/s (might be worth the experiment though).
You could probably use simd_shuffle for byte-swapping. It's not available via #![feature(stdsimd)] though. I suppose the better thing to do is implement it in stdsimd.
Recently I switched to the SIMD-enabled Fast Mersenne Twister and could see an improvement compared to mersenne_twister
If you care about performance, you should probably not use the Mersenne Twister, it is a lot slower than more modern RNGs like PCG and xoroshiro, while having worse statistical quality.
I just pushed an update to my repo which implements Box-Muller https://github.com/TheIronBorn/rand/commit/f0b2f387d159cefb707955f308bc045275302584.
distr_normal ... bench: 33,329 ns/iter (+/- 53,512) = 240 MB/s
distr_normal_Sfc32x4 ... bench: 92,156 ns/iter (+/- 58,657) = 347 MB/s
The performance probably suffers significantly as my machine doesn't have SIMD FMA instructions.
The API could be better and the math could probably be optimized, but it seems to work fine.
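For readers unfamiliar with it, the textbook scalar Box-Muller transform looks like this (a reference sketch, not the code in the commit); the SIMD version applies the same formulas lane-wise, which is why vectorized ln/sin/cos are needed:
use rand::Rng;

// Turns two independent uniforms into two independent standard normals.
fn box_muller<R: Rng + ?Sized>(rng: &mut R) -> (f64, f64) {
    let u1: f64 = 1.0 - rng.gen::<f64>(); // in (0, 1], so ln(u1) is finite
    let u2: f64 = rng.gen();              // in [0, 1)
    let r = (-2.0 * u1.ln()).sqrt();
    let theta = 2.0 * std::f64::consts::PI * u2;
    (r * theta.cos(), r * theta.sin())
}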
I also just pushed some helper functions which should make implementing a SIMD RNG easier https://github.com/TheIronBorn/rand/commit/8da1cd68a4e668a0713a25318b6d1a0dd2a427eb
A nice example of what SIMD RNGs can do for you: Monte Carlo estimation of pi. https://gist.github.com/TheIronBorn/ba1f1e39b6466353b1eee6cf450fe247
test scalar ... bench: 26,424,886 ns/iter (+/- 10,327,348) = 605 MB/s
test simd ... bench: 28,363,059 ns/iter (+/- 48,045,043) = 2256 MB/s
I just pushed an update to my repo which implements Box-Muller
Wow, respect for the math implementations there! Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.
The performance is not super impressive (it has to approximate sin and cos after all)... I wonder if this is a situation where SIMD really makes sense. I'll have a closer look at your code later.
Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.
I'm not sure what you are looking for, but there are bindings to cephes, special functions implemented in Rust and special functions implemented for statrs.
Implementing Box-Muller and other complex stuff with SIMD requires complex math like sin/cos/log. For most machines those instructions aren't available, but it is still possible to have fast functions modeled after something like the Cephes library. I don't know of any such Rust implementations.
I don't think most of those intrinsics are even in stdsimd yet: https://github.com/rust-lang-nursery/stdsimd/issues/310
Sorry I didn't read that right. Those libraries look useful thanks
Most of the my math impls come from https://github.com/miloyip/normaldist-benchmark
Also of note: the performance should be better on Haswell and above (I have only Sandy Bridge) as the math functions make heavy use of SIMD FMA operations.
To showcase the portable packed SIMD vector facilities in std::simd I was porting the Ambient Occlusion ray casting benchmark (aobench) from ISPC to Rust: https://github.com/gnzlbg/aobench where I noticed that pseudo-random number generation was a bottleneck of the Rust version (using the rand crate).
The scalar version of the benchmark needs to generate random f32s, and the vectorized version of the benchmark needs to generate pseudo-random SIMD vectors of f32s. I've switched from the rand crate to a hacked-in pseudo-random number generator for the scalar version (src here), and explicitly vectorized it to generate SIMD vectors (src here). I've used the same algorithm for both versions since the point of the benchmark was to test the portable SIMD facilities in std::simd.
I've added some benchmarks (src here):
scalar (Xeon E5-2690 v4 @ 2.60GHz): throughput: 174*10^6 f32/s, 5.7ns per function call
vector (Xeon E5-2690 v4 @ 2.60GHz): throughput: 2072*10^6 f32/s (12x larger), 3.8ns per function call (generates one f32x8 per call)
scalar (Intel Core i5 @1.8 GHz): throughput: 190*10^6 f32/s, 5.2ns per function call
vector (Intel Core i5 @1.8 GHz): throughput: 673*10^6 f32/s (3.5x larger), 11.9ns per function call (generates one f32x8 per call)
These numbers do not make much sense to me (feel free to investigate further), but they hint that explicitly vectorized PRNG might make sense in some cases. For example, if my intent was to populate a large vector of f32s with pseudo-random numbers, on my laptop the vector version has twice the latency but still 3.5x higher throughput.
It would be cool if some of the pseudo-random number generators in the rand crate could be explicitly vectorized to generate random SIMD vectors.
Check out my branch of the Rand library, might have some code you could use.
I should update it to v0.5.0.
You could also try swapping the generator for the SIMDized SFC generator in my branch. I found it much faster than most others.
Oh! You added sin/cos to stdsimd, thank you @gnzlbg! This should allow better SIMD support for the library. Let me quickly adapt and run benchmarks
First impression is that stdsimd's sin/cos are slower than my hacked-together polynomial approximations. (On a Sandy Bridge machine)
(Using the Box-Muller transform to generate normally distributed random numbers)
On newer machines, stdsimd's version is probably significantly faster.
Perhaps for older machines we could use the polar variant of Box-Muller, which doesn't need sin/cos. (Both versions still use my hacked natural logarithm).
It needs rejection sampling, but only for rare cases. Even the most naive simd rejection sampling implementation might be fast enough:
loop {
    let r: f32x8 = rng.gen();
    // accept only when every lane lies in (0, 1)
    if (r.ne(f32x8::splat(0.0)) & r.lt(f32x8::splat(1.0))).all() {
        return r;
    }
}
(Hell, such a rejection sampling might even be enough for general use)
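For comparison, the textbook Marsaglia polar variant looks like this in scalar form (a reference sketch, not the branch code): no sin/cos, at the price of the rejection loop that the SIMD version has to handle lane-wise as above:
use rand::Rng;

fn polar_normal<R: Rng + ?Sized>(rng: &mut R) -> (f64, f64) {
    loop {
        // Two uniforms on (-1, 1); accept only points inside the unit circle.
        let u = 2.0 * rng.gen::<f64>() - 1.0;
        let v = 2.0 * rng.gen::<f64>() - 1.0;
        let s = u * u + v * v;
        if s > 0.0 && s < 1.0 {
            let f = (-2.0 * s.ln() / s).sqrt();
            return (u * f, v * f);
        }
    }
}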
// scalar w/ SmallRng
test distr_normal ... bench: 22,639 ns/iter (+/- 42,859) = 361 MB/s
// simd box-muller w/ four Sfc32s in parallel
// with sin/cos
test distr_normal_sfc32x4_f32x2 ... bench: 154,028 ns/iter (+/- 661,069) = 53 MB/s
test distr_normal_sfc32x4_f32x4 ... bench: 74,048 ns/iter (+/- 27,974) = 221 MB/s
test distr_normal_sfc32x4_f32x8 ... bench: 684,396 ns/iter (+/- 998,530) = 47 MB/s
// with polar method
test distr_normal_sfc32x4_polar_f32x8 ... bench: 69,253 ns/iter (+/- 38,512) = 946 MB/s // (no rejection sampling, just to obtain an upper limit)
test distr_normal_sfc32x4_polar_f32x8_reject ... bench: 122,029 ns/iter (+/- 139,657) = 537 MB/s
I added the optimization:
if mask_is_to_be_rejected {
    // resample, but keep the lanes that were already accepted
    u = mask.select(rng.sample(range), u);
}
Using the above rejection sampling method, I got these benchmarks (code available on my branch):
test distr_uniform_i32 ... bench: 5,166 ns/iter (+/- 3,536) = 774 MB/s
test distr_uniform_i32x4 ... bench: 11,494 ns/iter (+/- 9,834) = 1392 MB/s
Decent speedup.
Here's a better one:
test distr_uniform_i16 ... bench: 9,492 ns/iter (+/- 13,556) = 210 MB/s
test distr_uniform_i16x16 ... bench: 19,285 ns/iter (+/- 24,979) = 1659 MB/s
The large speedup is likely because Uniform<i16> more or less discards half the bits of Uniform<u32> while Uniform<i16x16> does not. Discarding half of Uniform<u32x4> would be comparable.
There are probably more optimizations to be made, and some tuning is probably in order for various rejection rates. Does anyone know the worst/best case for UniformInt::sample?
Oh! You added sin/cos to stdsimd, thank you @gnzlbg! This should allow better SIMD support for the library. Let me quickly adapt and run benchmarks
[...]
First impression is that stdsimd's sin/cos are slower than my hacked together polynomial approximations. (On a Sandy Bridge machine)
If you don't need full precision you can experiment using these instead:
#![feature(platform_intrinsics)]
extern "platform-intrinsic" {
fn simd_fsin<T>(x: T) -> T;
fn simd_fcos<T>(x: T) -> T;
}
// ...
let x: f32x2 = ...;
let x = simd_fsin(x);
These are not exposed by std::simd yet and they are 200% unstable, so don't release anything using these, but we already expose similar "estimate" intrinsics (sqrte, rsqrte, ...), so if someone makes a case for these in the stdsimd repo, I could imagine that we could expose these as well. By a case I mean some benchmarks, and maybe a summary of which architectures/features expose these (a survey of x86, arm, ppc, and mips).
fsin/fcos might make a small improvement:
name stdsimd_trig ns/iter ftrig ns/iter diff ns/iter diff % speedup
+::norm_bm_f32x16 19,798,087 (105 MB/s) 17,295,863 (121 MB/s) -2,502,224 -12.64% x 1.14
+::norm_bm_f32x2 2,807,414 (93 MB/s) 2,693,984 (97 MB/s) -113,430 -4.04% x 1.04
-::norm_bm_f32x4 3,464,032 (151 MB/s) 3,677,753 (142 MB/s) 213,721 6.17% x 0.94
-::norm_bm_f32x8 7,569,272 (138 MB/s) 7,731,884 (135 MB/s) 162,612 2.15% x 0.98
+::norm_bm_f64x2 4,834,865 (108 MB/s) 3,915,587 (133 MB/s) -919,278 -19.01% x 1.23
-::norm_bm_f64x4 9,061,893 (115 MB/s) 9,346,309 (112 MB/s) 284,416 3.14% x 0.97
-::norm_bm_f64x8 16,960,038 (123 MB/s) 18,187,133 (115 MB/s) 1,227,095 7.24% x 0.93
I found a sufficiently fast normal distribution method for older hardware: the central limit theorem.
It can be used to generate scalars:
let normal: f64 = (rng.gen::<f64x4>().sum() - IH_MEAN) * ih_std_dev;
// test clt_f32x16_f32 ... bench: 271,874 ns/iter (+/- 115,838) = 241 MB/s
// test clt_f32x2_f32 ... bench: 103,488 ns/iter (+/- 76,453) = 633 MB/s
// test clt_f32x4_f32 ... bench: 83,799 ns/iter (+/- 57,234) = 782 MB/s
// test clt_f32x8_f32 ... bench: 149,204 ns/iter (+/- 54,927) = 439 MB/s
// test clt_f64x2_f64 ... bench: 75,939 ns/iter (+/- 34,830) = 1726 MB/s
// test clt_f64x4_f64 ... bench: 123,902 ns/iter (+/- 92,866) = 1057 MB/s
// test clt_f64x8_f64 ... bench: 233,040 ns/iter (+/- 146,729) = 562 MB/s
or vectors:
let mut sum = f64x4::default();
for _ in 0..4 {
sum += rng.gen::<f64x4>();
}
let normal: f64x4 = (sum - IH_MEAN) * ih_std_dev;
// test ctl_f32x16_fx ... bench: 726,348 ns/iter (+/- 405,316) = 1443 MB/s
// test ctl_f32x2_fx ... bench: 370,790 ns/iter (+/- 346,423) = 353 MB/s
// test ctl_f32x4_fx ... bench: 217,103 ns/iter (+/- 69,860) = 1207 MB/s
// test ctl_f32x8_fx ... bench: 403,474 ns/iter (+/- 361,454) = 1299 MB/s
// test ctl_f64x2_fx ... bench: 219,494 ns/iter (+/- 105,493) = 1194 MB/s
// test ctl_f64x4_fx ... bench: 418,359 ns/iter (+/- 358,857) = 1253 MB/s
// test ctl_f64x8_fx ... bench: 712,107 ns/iter (+/- 500,121) = 1472 MB/s
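The IH_MEAN / ih_std_dev values above are not shown; presumably they are the Irwin-Hall normalization constants: the sum of n uniform(0,1) samples has mean n/2 and standard deviation sqrt(n/12), so subtracting the mean and dividing by the standard deviation gives an approximately standard normal value. A sketch of assumed definitions for the scalar f64x4 example (n = 4), not code from the branch; note the multiplied value would have to be the reciprocal 1/sigma despite its name:
// Assumed normalization constants, matching the scalar example above (n = 4):
const IH_MEAN: f64 = 4.0 / 2.0;
let ih_std_dev = 1.0 / (4.0f64 / 12.0).sqrt();
let normal: f64 = (rng.gen::<f64x4>().sum() - IH_MEAN) * ih_std_dev;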
Ziggurat standard normal with XorShiftRng for comparison (the CLT method might even be faster for some vectors):
test distr_stdnorm_xorshift ... bench: 29,458 ns/iter (+/- 48,023) = 271 MB/s
test clt_xorshift_f64x2_f64 ... bench: 398,253 ns/iter (+/- 382,346) = 329 MB/s
And it appears to have similar statistical quality to the ziggurat algorithm:
// 2^19 samples
method (mean - 0.0).abs() (standard deviation - 1.0).abs()
zigg 1.3504e-3 1.3796e-3
clt_f32x2_f32 1.5408e-4 9.2259e-4
clt_f32x4_f32 1.4903e-4 7.6055e-4
clt_f32x8_f32 5.3196e-4 1.9712e-4
clt_f32x16_f32 1.4224e-3 1.2537e-3
clt_f64x2_f64 2.2513e-3 9.8080e-6
clt_f64x4_f64 1.3423e-3 5.9939e-4
clt_f64x8_f64 3.5632e-4 4.4575e-4
(wikipedia offers some concerns though: https://en.wikipedia.org/wiki/Normal_distribution#Central_limit_theorem)
As noted above, using fill_bytes works reasonably well, so I don't think anything stands in the way of basic support for SIMD types (u64x4, f64x4, etc.) on the Standard distribution. Does anyone want to get started on this?
If fill_bytes ends up being sub-optimal for performance we can discuss ways to optimise it later.
Support for trig functions on SIMD types is certainly neat, but a bit beyond the scope of Rand... or perhaps we should open a new issue to discuss SIMD Normal samples etc.
You’re asking about the speed of gen::<u32x4>() using rand’s current fill_bytes? Give me a minute, I’ll have some benchmarks.
All XorShiftRng. (No rust flags / profile.bench features)
test distr_standard_i128 ... bench: 6,842 ns/iter (+/- 6,224) = 2338 MB/s
test distr_standard_i16 ... bench: 2,116 ns/iter (+/- 493) = 945 MB/s
test distr_standard_i32 ... bench: 2,121 ns/iter (+/- 434) = 1885 MB/s
test distr_standard_i64 ... bench: 4,169 ns/iter (+/- 598) = 1918 MB/s
test distr_standard_i8 ... bench: 2,206 ns/iter (+/- 226) = 453 MB/s
test distr_standard_i16x16 ... bench: 26,898 ns/iter (+/- 3,864) = 1189 MB/s
test distr_standard_i16x8 ... bench: 17,125 ns/iter (+/- 6,384) = 934 MB/s
test distr_standard_i32x4 ... bench: 16,546 ns/iter (+/- 4,092) = 967 MB/s
test distr_standard_i32x8 ... bench: 27,547 ns/iter (+/- 7,210) = 1161 MB/s
test distr_standard_i64x2 ... bench: 14,665 ns/iter (+/- 14,600) = 1091 MB/s
test distr_standard_i64x4 ... bench: 27,504 ns/iter (+/- 7,988) = 1163 MB/s
test distr_standard_i8x16 ... bench: 16,492 ns/iter (+/- 3,818) = 970 MB/s
test distr_standard_i8x32 ... bench: 26,914 ns/iter (+/- 5,585) = 1188 MB/s
(Seems that concatenating two next_u64s is faster than fill_bytes_via_next)
What @dhardy means is: time to start making PRs!
I just rebased my branch with the basics, @TheIronBorn do you want to review?
And do you want to make a list of the other functionalities you have implemented? I'm not sure everything should be in rand itself, but that is something to look at now. And you had multiple helper functions for PRNGs using SIMD?
fill_bytes_via_next uses next_u64 so that should be obvious. What @pitdicker did above was implement fill_bytes directly for a SIMD PRNG, which apparently performed well.
Oh, thanks @pitdicker!
Yeah I’ve got quite a few functionalities uncommitted locally. I’m not sure how viable they all are but I’ll try to list them.
The helper functions were mostly to ease my own PRNG testing. I’m not sure of the API but they were useful for me
- RngCore functions via SIMD
- SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)
- UniformInt for integer vectors (with rejection sampling)
- SIMD rejection sampling trait
- sample_single with leading zeros approx
- wmul (could probably be better optimized; there are mull instructions for some lane widths, mulh for fewer)
- gen_range(vec(0), vec(N)) case
- shuffle (generates M indices at once for a uNxM vector, especially helpful for block RNGs)
- Bernoulli and Standard
- rotate_left/right, AsByteSliceMut
- seq module (probably at least sample_indices_inplace)
- WeightedChoice (via SIMD binary search, perhaps even linear?)
- Cauchy (tangent might be slow on older hardware)
- Binomial, Poisson, & Gamma via some complicated rejection sampling
I doubt large lookup tables (ziggurat) will ever be viable (though perhaps with VGATHER instructions?). Small tables (like Alphanumeric) are possible with a couple of comparisons or SIMD vector shuffling (perfect for random hex characters).
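As an illustration of that small-table idea (a sketch using today's portable-SIMD swizzle_dyn, not code from the branch): sixteen random hex characters from sixteen random bytes in a single table lookup:
#![feature(portable_simd)]
use std::simd::u8x16;

fn random_hex_chars(random_bytes: u8x16) -> u8x16 {
    let table = u8x16::from_array(*b"0123456789abcdef");
    // Keep the low nibble of every byte (an index in 0..16) and use it to
    // select lane-wise from the 16-entry table.
    let idx = random_bytes & u8x16::splat(0x0f);
    table.swizzle_dyn(idx)
}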
(inlining throughout my branch is inconsistent and might be doing more harm than good)
Edit: the …, gen_below optimizations, bool & char distrs, and other PRNGs are local and not available in my branch
Quite a list already!
- SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)
Yes, I think this one is a good start for a PR. I spent a lot of effort optimizing the scalar fill_bytes implementations, so that may not be worth the energy (but feel free to try :smile:).
- SIMD rejection sampling trait
This one will probably be difficult to merge, but can you describe it some more?
And did you somewhere implement the endianness conversion functions? That is somewhat of a requirement for SIMD RNGs.
The rejection sampling trait is available in my branch. It allows a user to pass a comparison function similar to sort_by, and a distribution to sample from. It returns only once each lane of the SIMD mask returned by the comparator is true.
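A hypothetical sketch of such a helper (the names here are invented, not the branch's API): keep sampling a vector until the caller's test accepts it. The test would typically reduce a SIMD mask with all(), and a smarter version would use select() to keep already-accepted lanes, as in the optimization shown earlier:
use rand::Rng;
use rand::distributions::Distribution;

fn sample_until<T, D, R, F>(rng: &mut R, distr: &D, accept: F) -> T
where
    D: Distribution<T>,
    R: Rng + ?Sized,
    F: Fn(&T) -> bool,
{
    // Naive whole-vector rejection: resample everything until all lanes pass.
    loop {
        let candidate = distr.sample(rng);
        if accept(&candidate) {
            return candidate;
        }
    }
}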
I didn’t implement endianness conversion. Thinking on it now it shouldn’t be too difficult to write.
I added swap_bytes using stdsimd's intrinsics: https://github.com/TheIronBorn/rand/commit/3c6d9d49a2d9aa9ae7b908d8253983d218ab59b3
We should switch to the safe portable shuffles when they land.
fill_bytes_via_simd for scalar RNGs doesn't seem worthwhile:
test gen_bytes_hc128 ... bench: 976,078 ns/iter (+/- 717,525) = 1049 MB/s
test gen_bytes_hc128_simd ... bench: 1,147,668 ns/iter (+/- 1,017,842) = 892 MB/s