I’ve actually been playing with this a lot. Let me see about putting something worth scrutiny on github.
On Apr 5, 2018, at 12:58, Paul Dicker notifications@github.com wrote:
Disclaimer: I don't really know what I am talking about here.
The first step, as far as Rand is concerned, is to generate a small array of values from an RNG. Some RNG's may lend itself well for this, like ChaCha, and possibly HC-128, although that one does not use SIMD. Other options are packing together multiple small RNGs, like Xorshift+ or Xoroshiro+. The result type of such an RNG should be some array, so maybe we can make use of the BlockRngCore trait?
Making use of the generated values can be a more interesting problem. This involves the distributions code.
Converting to floating point: the Standard uniform distribution should not cause any problems, and neither should the range code.
I suppose the other distributions are going to give trouble, at least anything that needs branches or a look-up table (i.e. the current ziggurat code) is not going to work well. And what should happen when an algorithm needs to do rejections sampling, and one of a couple of SIMD values get rejected? Return NaN? For most distributions there should exist an algorithm suitable for SIMD. The Box-Muller transform for example looks promising for the Normal distribution, although it would need a library to provide functions like sin and cos.
Integer ranges: I don't expect the widening multiply method to work for integer range reduction, and division or modulus also do not seem to be available. And I suppose we have no way to indicate a value is rejected because it falls outside the acceptable zone and can cause bias.
Now I don't know how far we should go, and if Rand should provide any implementations at all. But I think we should make sure the API does not prevent Rand from being used in this area.
I think this might give us some direction https://github.com/lemire/SIMDxorshift, specifically the avx_randombound_epu32 function: https://github.com/lemire/SIMDxorshift/blob/1766a98f4be5e1d2e78d9e2c3623800fbf92931f/src/simdxorshift128plus.c#L90
Also of note: for multi-rngs we could use several sets of constants at once. More or less alternating between Xorshift64(10, 13, 10), Xorshift64(8, 9, 22), Xorshift64(2, 7, 3), and Xorshift64(23, 3, 24) for example. Might help raise statistical quality.
Great if you would want to help out a bit here!
I believe @vks has also been exploring this, for example in https://github.com/vks/xoroshiro/blob/master/src/rng/xoroshiro128simd.rs
I think this might give us some direction lemire/SIMDxorshift
:+1: I can't read those function calls yet, but he is doing two widening multiplies, even and odd, and keeping the high parts? And without checking against low they will be biased, but this is not always a problem, especially if the range is small.
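For reference, a scalar sketch of that widening-multiply range reduction (roughly what avx_randombound_epu32 does per lane with even/odd 32x32->64 multiplies); this is an illustration, not code from either library:
fn bounded(x: u32, range: u32) -> u32 {
    // Map x, uniform in [0, 2^32), onto [0, range) by taking the high half of
    // the 64-bit product; slightly biased unless range divides 2^32 evenly.
    ((x as u64 * range as u64) >> 32) as u32
}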
Also of note: for multi-rngs we could use several sets of constants at once.
We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.
@TheIronBorn Would you be interested in exploring what it would take to have a Distribution take the output of such an RNG, and transform it while staying in the SIMD area?
Converting it to a float, or a floating point range, are the simplest methods.
We could, and that seems like a good idea. But just to make sure, that doesn't mean all four (/eight?) can use the same seed.
Are you asking if such an implementation would require each lane to have the same value upon seeding?
Float gen is pretty simple. Here's a quick gist https://gist.github.com/TheIronBorn/1ba4da07d6e8e05ab20cb780c383dbec
Added range sampling
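For readers following along, the usual bits-to-float trick is short enough to show here; this is a generic sketch (not the gist's exact code), written against today's nightly portable-SIMD types rather than the stdsimd of the time:
#![feature(portable_simd)]
use std::simd::{f32x4, u32x4};

// Convert 32 random bits per lane into a float in [0, 1): keep the 24 most
// significant bits and scale by 2^-24, the same precision as the scalar
// multiply-based conversion.
fn u32x4_to_unit_f32x4(bits: u32x4) -> f32x4 {
    (bits >> u32x4::splat(8)).cast::<f32>() * f32x4::splat(1.0 / (1 << 24) as f32)
}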
Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.
Distribution::sample(..) -> T is already generic over the output type, so in theory the trait can support SIMD distributions, given native consumers of the SIMD types. I guess it should be possible to make adapter implementations, i.e. automatically implement Distribution<(f64, f64)> for any type supporting Distribution<f64x2>.
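A sketch of that adapter idea (the wrapper name is invented here, and this assumes the rand 0.5 Distribution trait plus a SIMD f64x2 that supports lane indexing, e.g. today's std::simd):
#![feature(portable_simd)]
use std::simd::f64x2;
use rand::Rng;
use rand::distributions::Distribution;

// Any distribution that yields an f64x2 can also serve scalar pairs, by
// sampling one vector and splitting its lanes.
struct AsTuple<D>(D);

impl<D: Distribution<f64x2>> Distribution<(f64, f64)> for AsTuple<D> {
    fn sample<R: Rng + ?Sized>(&self, rng: &mut R) -> (f64, f64) {
        let v = self.0.sample(rng);
        (v[0], v[1])
    }
}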
For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching, but this significantly increases complexity and I'm not sure it would even be faster in many cases.
One specific case might be worth a specific solution: StandardBlockRng<T>. It may be possible to make a wrapper around any BlockRngCore which immediately applies the Standard distribution to convert its results buffer to an array of type T, then draws values from this buffer. In theory this might allow some optimisation between the RNG (generation) and the distribution (conversion) steps, although probably whatever loop the RNG step requires to fill the buffer would prevent this. This has the added advantages of a common design with and without SIMD, and that it can simply implement Iterator<Item=T>.
Thank you for the code, @TheIronBorn, something to look into. Can I also ask you to create a branch in your fork of rand with the changes? Or did you have another way set up to test things?
Speeding up block RNGs with SIMD sounds reasonable; anything else sounds application-specific IMO.
When I was exploring papers around RNGs and related topics, GPU processing was the current trend, and exploring the possibilities of SIMD also seems to come up regularly. As it is a topic we did not give any attention until now, it is at least good to explore a bit...
But I have no idea yet how useful it will really be. Especially for some distributions it may be hard for SIMD to provide any benefit over the traditional methods.
One thing that seems like an obvious problem is that Distribution takes an Rng, but Rng exposes no methods to give something like next_u32x4. I think a lot can be solved by adding another method and associated type to RngCore. This type would indicate the native value produced by this RNG. It is something I thought would be useful for months, but never spoke about...
So our current Xorshift implementation natively generates u32s, Xoroshiro128+ u64s, ChaCha [u32; 16], etc. With an addition like this it is possible to implement RNGs for specific types not covered by the trait. For example one specialized for u16 (maybe for embedded?), one for SIMD with a type like next_u32x4, or one designed for floats with f64. I created an incomplete branch with the changes. @dhardy Please don't shoot it down instantly...
For anything else, e.g. attempting to produce Normal samples in blocks for individual consumption, I guess something like the MutDistribution or CachedDistribution (i.e. equivalent to the old Sample trait) is needed to allow caching
Why would we need caching? To return one value at a time instead of a SIMD type? Or did you write this because the smaller number of sampling algorithms may need some mutable state?
Caching would just be to return one value at a time, yes; the trouble with that approach is that there is significant overhead storing/retrieving the cached values.
@pitdicker your branch looks interesting, though you could probably just do RngCore: Iterator instead. Also what would the OsRng implementation look like?
The BlockRngCore::Results return values are probably going to be too big for many uses and there is no provided way of breaking the block up into smaller segments; this can be worked around however by choosing an RNG with Item type of the correct size.
There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type (the same problem we hit in #307). It might however make sense to use a completely separate trait, possibly just Iterator, and then just implement SIMD distributions as simple functions usable via Iterator::map; this is not well integrated however.
But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.
Here's my branch https://github.com/TheIronBorn/rand/tree/simd/src/simd
Blending sets of constants for Xorshift seems to only optimize well for machines later than somewhere around Haswell: https://godbolt.org/g/G2x1E2. (Adapted from O'Neill's Xorshift library) I don't have such a machine so I can't confirm actual speed.
Removing the multiply helps.
There is a significant reason I don't want to add an associated type to RngCore: it is no longer object-safe without specifying the type
I realized just after posting. So it was a bad idea from me, feel free to shoot it down :smile:.
Here's my branch TheIronBorn/rand:src/simd@simd
Nice, I have been playing with it. And SFC seems a pretty good choice here.
I think I have found a way to support SIMD without extra traits like SimdRng and SimdDistribution. The idea is that RngCore::fill_bytes does a simple copy of its result (like u32x4) to the byte slice if they have exactly the same length. And the integer distributions for Standard do the opposite: they initialize an empty u32x4, and pass it to fill_bytes as a byte slice. The optimizer is able to work things out.
I don't think the BlockRng wrapper is going to work well in combination with a simple RNG adapted for SIMD. I expect the bookkeeping to take as much time as just generating a new batch of values. So we may just as well always generate a new batch, and use as much as needed.
I have a proof of concept in https://github.com/pitdicker/rand/commits/simd_support. The code in sfc.rs is very bad and ugly. I am not sure using BlockRngCore in there is useful. But the trick with using fill_bytes seems to work.
But I hold to my previous point: other than a few cases where SIMD can fit into the existing API (e.g. ChaCha core) I see little point trying to develop the infrastructure without at least one (preferably several) application able to consume SIMD values.
I agree we shouldn't turn everything upside down only to support SIMD, especially when we are not sure how useful it will be.
Does implementing Standard and Range for SIMD types fall for you in the category 'develop infrastructure' or 'extension of existing functionality'? Without tests it takes about 125 extra loc: https://github.com/pitdicker/rand/compare/c03deb6...pitdicker:6918e11. The first commit is a necessary change in rand_core to get the inlining right for fill_bytes, and I think that is something that should be in rand_core 0.1. The rest is definitely not interesting until after rand 0.5.
Nice, I didn't expect fill_bytes to work so well!
Implementing existing traits/distributions is fine — well, we have to look at how much code gets added, but probably fine.
@TheIronBorn I have tried to clean up my branch a bit, removed some unnecessary unsafe code etc. Do you want to have a look and maybe experiment from there?
I have tried to make the fill_bytes method on Sfc32x* in such a way that just about all of it is optimized out when the target is a SIMD type, but to also support filling arbitrary slices, as it is supposed to be able to do. One thing missing is byte-swapping on big-endian architectures. This can probably be done pretty efficiently with SIMD, do you know a method?
One nice benchmark on my PC: Sfc32x4Rng can generate f32 floating point values at 10GB/s, 4x as fast as the current Xorshift. And we would have to get very creative with supporting Distribution on tuples or slices for even Xoroshiro128+ to get 5GB/s (might be worth the experiment though).
You could probably use simd_shuffle for byte-swapping. It's not available via #![feature(stdsimd)] though. I suppose the better thing to do is implement it in stdsimd.
Recently I switched to the SIMD-enabled Fast Mersenne Twister and could see an improvement compared to mersenne_twister
If you care about performance, you should probably not use the Mersenne Twister, it is a lot slower than more modern RNGs like PCG and xoroshiro, while having worse statistical quality.
I just pushed an update to my repo which implements Box-Muller https://github.com/TheIronBorn/rand/commit/f0b2f387d159cefb707955f308bc045275302584.
distr_normal ... bench: 33,329 ns/iter (+/- 53,512) = 240 MB/s
distr_normal_Sfc32x4 ... bench: 92,156 ns/iter (+/- 58,657) = 347 MB/s
The performance probably suffers significantly as my machine doesn't have SIMD FMA instructions.
The API could be better and the math could probably be optimized, but it seems to work fine.
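For readers unfamiliar with it, the textbook scalar Box-Muller transform looks like this (a reference sketch, not the code in the commit); the SIMD version applies the same formulas lane-wise, which is why vectorized ln/sin/cos are needed:
use rand::Rng;

// Turns two independent uniforms into two independent standard normals.
fn box_muller<R: Rng + ?Sized>(rng: &mut R) -> (f64, f64) {
    let u1: f64 = 1.0 - rng.gen::<f64>(); // in (0, 1], so ln(u1) is finite
    let u2: f64 = rng.gen();              // in [0, 1)
    let r = (-2.0 * u1.ln()).sqrt();
    let theta = 2.0 * std::f64::consts::PI * u2;
    (r * theta.cos(), r * theta.sin())
}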
I also just pushed some helper functions which should make implementing a SIMD RNG easier https://github.com/TheIronBorn/rand/commit/8da1cd68a4e668a0713a25318b6d1a0dd2a427eb
A nice example of what SIMD RNGs can do for you: Monte Carlo estimation of pi. https://gist.github.com/TheIronBorn/ba1f1e39b6466353b1eee6cf450fe247
test scalar ... bench: 26,424,886 ns/iter (+/- 10,327,348) = 605 MB/s
test simd ... bench: 28,363,059 ns/iter (+/- 48,045,043) = 2256 MB/s
I just pushed an update to my repo which implements Box-Muller
Wow, respect for the math implementations there! Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.
The performance is not super impressive (it has to approximate sin and cos after all)... I wonder if this is a situation where SIMD really makes sense. I'll have a closer look at your code later.
Did you use methods similar to those in Cephes library, or another source? We really need something like that to be a crate in Rust.
I'm not sure what you are looking for, but there are bindings to cephes, special functions implemented in Rust and special functions implemented for statrs.
Implementing Box-Muller and other complex stuff with SIMD requires complex math like sin/cos/log. For most machines those instructions aren't available, but it is still possible to have fast functions modeled after something like the Cephes library. I don't know of any such Rust implementations.
I don't think most of those intrinsics are even in stdsimd yet: https://github.com/rust-lang-nursery/stdsimd/issues/310
Sorry I didn't read that right. Those libraries look useful thanks
Most of the my math impls come from https://github.com/miloyip/normaldist-benchmark
Also of note: the performance should be better on Haswell and above (I have only Sandy Bridge) as the math functions make heavy use of SIMD FMA operations.
To showcase the portable packed SIMD vector facilities in std::simd I was porting the Ambient Occlusion ray casting benchmark (aobench) from ISPC to Rust: https://github.com/gnzlbg/aobench where I noticed that pseudo-random number generation was a bottleneck of the Rust version (using the rand crate).
The scalar version of the benchmark needs to generate random f32s, and the vectorized version of the benchmark needs to generate pseudo-random SIMD vectors of f32s. I've switched from the rand crate to a hacked-in pseudo-random number generator for the scalar version (src here), and explicitly vectorized it to generate SIMD vectors (src here). I've used the same algorithm for both versions since the point of the benchmark was to test the portable SIMD facilities in std::simd.
I've added some benchmarks (src here):
scalar (Xeon E5-2690 v4 @ 2.60GHz): throughput: 174*10^6 f32/s, 5.7ns per function call
vector (Xeon E5-2690 v4 @ 2.60GHz): throughput: 2072*10^6 f32/s (12x larger), 3.8ns per function call (generates one f32x8 per call)
scalar (Intel Core i5 @1.8 GHz): throughput: 190*10^6 f32/s, 5.2ns per function call
vector (Intel Core i5 @1.8 GHz): throughput: 673*10^6 f32/s (3.5x larger), 11.9ns per function call (generates one f32x8 per call)
These numbers do not make much sense to me (feel free to investigate further), but they hint that explicitly vectorized PRNG might make sense in some cases. For example, if my intent was to populate a large vector of f32s with pseudo-random numbers, on my laptop the vector version has twice the latency but still 3.5x higher throughput.
It would be cool if some of the pseudo-random number generators in the rand crate could be explicitly vectorized to generate random SIMD vectors.
Check out my branch of the Rand library, might have some code you could use.
I should update it to v0.5.0.
You could also try swapping the generator for the SIMDized SFC generator in my branch. I found it much faster than most others.
Oh! You added sin/cos to stdsimd, thank you @gnzlbg! This should allow better SIMD support for the library. Let me quickly adapt and run benchmarks
First impression is that stdsimd's sin/cos are slower than my hacked-together polynomial approximations. (On a Sandy Bridge machine)
(Using the Box-Muller transform to generate normally distributed random numbers)
On newer machines, stdsimd's version is probably significantly faster.
Perhaps for older machines we could use the polar variant of Box-Muller, which doesn't need sin/cos. (Both versions still use my hacked natural logarithm).
It needs rejection sampling, but only for rare cases. Even the most naive simd rejection sampling implementation might be fast enough:
loop {
    let r: f32x8 = rng.gen();
    // accept only when every lane lies in (0, 1)
    if (r.ne(f32x8::splat(0.0)) & r.lt(f32x8::splat(1.0))).all() {
        return r;
    }
}
(Hell, such a rejection sampling might even be enough for general use)
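For comparison, the textbook Marsaglia polar variant looks like this in scalar form (a reference sketch, not the branch code): no sin/cos, at the price of the rejection loop that the SIMD version has to handle lane-wise as above:
use rand::Rng;

fn polar_normal<R: Rng + ?Sized>(rng: &mut R) -> (f64, f64) {
    loop {
        // Two uniforms on (-1, 1); accept only points inside the unit circle.
        let u = 2.0 * rng.gen::<f64>() - 1.0;
        let v = 2.0 * rng.gen::<f64>() - 1.0;
        let s = u * u + v * v;
        if s > 0.0 && s < 1.0 {
            let f = (-2.0 * s.ln() / s).sqrt();
            return (u * f, v * f);
        }
    }
}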
// scalar w/ SmallRng
test distr_normal ... bench: 22,639 ns/iter (+/- 42,859) = 361 MB/s
// simd box-muller w/ four Sfc32s in parallel
// with sin/cos
test distr_normal_sfc32x4_f32x2 ... bench: 154,028 ns/iter (+/- 661,069) = 53 MB/s
test distr_normal_sfc32x4_f32x4 ... bench: 74,048 ns/iter (+/- 27,974) = 221 MB/s
test distr_normal_sfc32x4_f32x8 ... bench: 684,396 ns/iter (+/- 998,530) = 47 MB/s
// with polar method
test distr_normal_sfc32x4_polar_f32x8 ... bench: 69,253 ns/iter (+/- 38,512) = 946 MB/s // (no rejection sampling, just to obtain an upper limit)
test distr_normal_sfc32x4_polar_f32x8_reject ... bench: 122,029 ns/iter (+/- 139,657) = 537 MB/s
I added the optimization:
if mask_is_to_be_rejected {
    // resample, but keep the lanes that were already accepted
    u = mask.select(rng.sample(range), u);
}
Using the above rejection sampling method, I got these benchmarks (code available on my branch):
test distr_uniform_i32 ... bench: 5,166 ns/iter (+/- 3,536) = 774 MB/s
test distr_uniform_i32x4 ... bench: 11,494 ns/iter (+/- 9,834) = 1392 MB/s
Decent speedup.
Here's a better one:
test distr_uniform_i16 ... bench: 9,492 ns/iter (+/- 13,556) = 210 MB/s
test distr_uniform_i16x16 ... bench: 19,285 ns/iter (+/- 24,979) = 1659 MB/s
The large speedup is likely because Uniform<i16> more or less discards half the bits of Uniform<u32> while Uniform<i16x16> does not. Discarding half of Uniform<u32x4> would be comparable.
There are probably more optimizations to be made, and some tuning is probably in order for various rejection rates. Does anyone know the worst/best case for UniformInt::sample?
Oh! You added sin/cos to stdsimd, thank you @gnzlbg! This should allow better SIMD support for the library. Let me quickly adapt and run benchmarks
[...]
First impression is that stdsimd's sin/cos are slower than my hacked together polynomial approximations. (On a Sandy Bridge machine)
If you don't need full precision you can experiment using these instead:
#![feature(platform_intrinsics)]
extern "platform-intrinsic" {
fn simd_fsin<T>(x: T) -> T;
fn simd_fcos<T>(x: T) -> T;
}
// ...
let x: f32x2 = ...;
let x = simd_fsin(x);
These are not exposed by std::simd yet and they are 200% unstable, so don't release anything using these, but we already expose similar "estimate" intrinsics (sqrte, rsqrte, ...), so if someone makes a case for these in the stdsimd repo, I could imagine that we could expose these as well. By a case I mean some benchmarks, and maybe a summary of which architectures/features expose these (a survey of x86, arm, ppc, and mips).
fsin/fcos might make a small improvement:
name stdsimd_trig ns/iter ftrig ns/iter diff ns/iter diff % speedup
+::norm_bm_f32x16 19,798,087 (105 MB/s) 17,295,863 (121 MB/s) -2,502,224 -12.64% x 1.14
+::norm_bm_f32x2 2,807,414 (93 MB/s) 2,693,984 (97 MB/s) -113,430 -4.04% x 1.04
-::norm_bm_f32x4 3,464,032 (151 MB/s) 3,677,753 (142 MB/s) 213,721 6.17% x 0.94
-::norm_bm_f32x8 7,569,272 (138 MB/s) 7,731,884 (135 MB/s) 162,612 2.15% x 0.98
+::norm_bm_f64x2 4,834,865 (108 MB/s) 3,915,587 (133 MB/s) -919,278 -19.01% x 1.23
-::norm_bm_f64x4 9,061,893 (115 MB/s) 9,346,309 (112 MB/s) 284,416 3.14% x 0.97
-::norm_bm_f64x8 16,960,038 (123 MB/s) 18,187,133 (115 MB/s) 1,227,095 7.24% x 0.93
I found a sufficiently fast normal distribution method for older hardware: the central limit theorem.
It can be used to generate scalars:
let normal: f64 = (rng.gen::<f64x4>().sum() - IH_MEAN) * ih_std_dev;
// test clt_f32x16_f32 ... bench: 271,874 ns/iter (+/- 115,838) = 241 MB/s
// test clt_f32x2_f32 ... bench: 103,488 ns/iter (+/- 76,453) = 633 MB/s
// test clt_f32x4_f32 ... bench: 83,799 ns/iter (+/- 57,234) = 782 MB/s
// test clt_f32x8_f32 ... bench: 149,204 ns/iter (+/- 54,927) = 439 MB/s
// test clt_f64x2_f64 ... bench: 75,939 ns/iter (+/- 34,830) = 1726 MB/s
// test clt_f64x4_f64 ... bench: 123,902 ns/iter (+/- 92,866) = 1057 MB/s
// test clt_f64x8_f64 ... bench: 233,040 ns/iter (+/- 146,729) = 562 MB/s
or vectors:
let mut sum = f64x4::default();
for _ in 0..4 {
sum += rng.gen::<f64x4>();
}
let normal: f64x4 = (sum - IH_MEAN) * ih_std_dev;
// test ctl_f32x16_fx ... bench: 726,348 ns/iter (+/- 405,316) = 1443 MB/s
// test ctl_f32x2_fx ... bench: 370,790 ns/iter (+/- 346,423) = 353 MB/s
// test ctl_f32x4_fx ... bench: 217,103 ns/iter (+/- 69,860) = 1207 MB/s
// test ctl_f32x8_fx ... bench: 403,474 ns/iter (+/- 361,454) = 1299 MB/s
// test ctl_f64x2_fx ... bench: 219,494 ns/iter (+/- 105,493) = 1194 MB/s
// test ctl_f64x4_fx ... bench: 418,359 ns/iter (+/- 358,857) = 1253 MB/s
// test ctl_f64x8_fx ... bench: 712,107 ns/iter (+/- 500,121) = 1472 MB/s
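The IH_MEAN / ih_std_dev values above are not shown; presumably they are the Irwin-Hall normalization constants: the sum of n uniform(0,1) samples has mean n/2 and standard deviation sqrt(n/12), so subtracting the mean and dividing by the standard deviation gives an approximately standard normal value. A sketch of assumed definitions for the scalar f64x4 example (n = 4), not code from the branch; note the multiplied value would have to be the reciprocal 1/sigma despite its name:
// Assumed normalization constants, matching the scalar example above (n = 4):
const IH_MEAN: f64 = 4.0 / 2.0;
let ih_std_dev = 1.0 / (4.0f64 / 12.0).sqrt();
let normal: f64 = (rng.gen::<f64x4>().sum() - IH_MEAN) * ih_std_dev;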
Ziggurat standard normal with XorShiftRng for comparison (the CLT method might even be faster for some vectors):
test distr_stdnorm_xorshift ... bench: 29,458 ns/iter (+/- 48,023) = 271 MB/s
test clt_xorshift_f64x2_f64 ... bench: 398,253 ns/iter (+/- 382,346) = 329 MB/s
And it appears to have similar statistical quality to the ziggurat algorithm:
// 2^19 samples
method (mean - 0.0).abs() (standard deviation - 1.0).abs()
zigg 1.3504e-3 1.3796e-3
clt_f32x2_f32 1.5408e-4 9.2259e-4
clt_f32x4_f32 1.4903e-4 7.6055e-4
clt_f32x8_f32 5.3196e-4 1.9712e-4
clt_f32x16_f32 1.4224e-3 1.2537e-3
clt_f64x2_f64 2.2513e-3 9.8080e-6
clt_f64x4_f64 1.3423e-3 5.9939e-4
clt_f64x8_f64 3.5632e-4 4.4575e-4
(wikipedia offers some concerns though: https://en.wikipedia.org/wiki/Normal_distribution#Central_limit_theorem)
As noted above, using fill_bytes works reasonably well, so I don't think anything stands in the way of basic support for SIMD types (u64x4, f64x4, etc.) on the Standard distribution. Does anyone want to get started on this?
If fill_bytes ends up being sub-optimal for performance we can discuss ways to optimise it later.
Support for trig functions on SIMD types is certainly neat, but a bit beyond the scope of Rand... or perhaps we should open a new issue to discuss SIMD Normal samples etc.
You’re asking about the speed of gen::<u32x4>() using rand’s current fill_bytes? Give me a minute, I’ll have some benchmarks.
All XorShiftRng. (No rust flags / profile.bench features)
test distr_standard_i128 ... bench: 6,842 ns/iter (+/- 6,224) = 2338 MB/s
test distr_standard_i16 ... bench: 2,116 ns/iter (+/- 493) = 945 MB/s
test distr_standard_i32 ... bench: 2,121 ns/iter (+/- 434) = 1885 MB/s
test distr_standard_i64 ... bench: 4,169 ns/iter (+/- 598) = 1918 MB/s
test distr_standard_i8 ... bench: 2,206 ns/iter (+/- 226) = 453 MB/s
test distr_standard_i16x16 ... bench: 26,898 ns/iter (+/- 3,864) = 1189 MB/s
test distr_standard_i16x8 ... bench: 17,125 ns/iter (+/- 6,384) = 934 MB/s
test distr_standard_i32x4 ... bench: 16,546 ns/iter (+/- 4,092) = 967 MB/s
test distr_standard_i32x8 ... bench: 27,547 ns/iter (+/- 7,210) = 1161 MB/s
test distr_standard_i64x2 ... bench: 14,665 ns/iter (+/- 14,600) = 1091 MB/s
test distr_standard_i64x4 ... bench: 27,504 ns/iter (+/- 7,988) = 1163 MB/s
test distr_standard_i8x16 ... bench: 16,492 ns/iter (+/- 3,818) = 970 MB/s
test distr_standard_i8x32 ... bench: 26,914 ns/iter (+/- 5,585) = 1188 MB/s
(Seems that concatenating two next_u64s is faster than fill_bytes_via_next)
What @dhardy means is: time to start making PRs!
I just rebased my branch with the basics, @TheIronBorn do you want to review?
And do you want to make a list of the other functionalities you have implemented? I'm not sure everything should be in rand itself, but that is something to look at now. And you had multiple helper functions for PRNGs using SIMD?
fill_bytes_via_next uses next_u64 so that should be obvious. What @pitdicker did above was implement fill_bytes directly for a SIMD PRNG, which apparently performed well.
Oh, thanks @pitdicker!
Yeah I’ve got quite a few functionalities uncommitted locally. I’m not sure how viable they all are but I’ll try to list them.
The helper functions were mostly to ease my own PRNG testing. I’m not sure of the API but they were useful for me
- RngCore functions via SIMD
- SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)
- UniformInt for integer vectors (with rejection sampling)
- SIMD rejection sampling trait
- sample_single with leading zeros approx
- wmul (could probably be better optimized; there are mull instructions for some lane widths, mulh for fewer)
- gen_range(vec(0), vec(N)) case
- shuffle (generates M indices at once for a uNxM vector, especially helpful for block RNGs)
- Bernoulli and Standard
- rotate_left/right, AsByteSliceMut
- seq module (probably at least sample_indices_inplace)
- WeightedChoice (via SIMD binary search, perhaps even linear?)
- Cauchy (tangent might be slow on older hardware)
- Binomial, Poisson, & Gamma via some complicated rejection sampling
I doubt large lookup tables (ziggurat) will ever be viable (though perhaps with VGATHER instructions?). Small tables (like Alphanumeric) are possible with a couple of comparisons or SIMD vector shuffling (perfect for random hex characters).
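As an illustration of that small-table idea (a sketch using today's portable-SIMD swizzle_dyn, not code from the branch): sixteen random hex characters from sixteen random bytes in a single table lookup:
#![feature(portable_simd)]
use std::simd::u8x16;

fn random_hex_chars(random_bytes: u8x16) -> u8x16 {
    let table = u8x16::from_array(*b"0123456789abcdef");
    // Keep the low nibble of every byte (an index in 0..16) and use it to
    // select lane-wise from the 16-entry table.
    let idx = random_bytes & u8x16::splat(0x0f);
    table.swizzle_dyn(idx)
}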
(inlining throughout my branch is inconsistent and might be doing more harm than good)
Edit: the …, gen_below optimizations, bool & char distrs, and other PRNGs are local and not available in my branch
Quite a list already!
- SIMD fill_bytes_via_simd optimizations (need to look into this more, could it help scalar block RNGs?)
Yes, I think this one is a good start for a PR. I spent a lot of effort optimizing the scalar fill_bytes implementations, so that may not be worth the energy (but feel free to try :smile:).
- SIMD rejection sampling trait
This one will probably be difficult to merge, but can you describe it some more?
And did you somewhere implement the endianness conversion functions? That is somewhat of a requirement for SIMD RNGs.
The rejection sampling trait is available in my branch. It allows a user to pass a comparison function similar to sort_by, and a distribution to sample from. It returns only once each lane of the SIMD mask returned by the comparator is true.
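A hypothetical sketch of such a helper (the names here are invented, not the branch's API): keep sampling a vector until the caller's test accepts it. The test would typically reduce a SIMD mask with all(), and a smarter version would use select() to keep already-accepted lanes, as in the optimization shown earlier:
use rand::Rng;
use rand::distributions::Distribution;

fn sample_until<T, D, R, F>(rng: &mut R, distr: &D, accept: F) -> T
where
    D: Distribution<T>,
    R: Rng + ?Sized,
    F: Fn(&T) -> bool,
{
    // Naive whole-vector rejection: resample everything until all lanes pass.
    loop {
        let candidate = distr.sample(rng);
        if accept(&candidate) {
            return candidate;
        }
    }
}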
I didn’t implement endianness conversion. Thinking on it now it shouldn’t be too difficult to write.
I added swap_bytes using stdsimd's intrinsics: https://github.com/TheIronBorn/rand/commit/3c6d9d49a2d9aa9ae7b908d8253983d218ab59b3
We should switch to the safe portable shuffles when they land.
fill_bytes_via_simd for scalar RNGs doesn't seem worthwhile:
test gen_bytes_hc128 ... bench: 976,078 ns/iter (+/- 717,525) = 1049 MB/s
test gen_bytes_hc128_simd ... bench: 1,147,668 ns/iter (+/- 1,017,842) = 892 MB/s