oconnor663 / blake2_simd

high-performance implementations of BLAKE2b/s/bp/sp in pure Rust with dynamic SIMD

very weird throughput variance in BLAKE2bp #8

Closed · oconnor663 closed this issue 5 years ago

oconnor663 commented 5 years ago

I've noticed that in my BLAKE2bp benchmarks, throughput drops measurably when we properly initialize the input buffer (for example with RngCore::fill_bytes, but even just explicitly writing zeros, rather than letting Vec request all-zero pages that never need to be touched). Notably this doesn't apply to hash_many, only to blake2bp. And the weirdest part: the size of the effect depends on what byte offset the input starts at:

https://gist.github.com/oconnor663/e9c5a45df3803f92baa85e09c35718a6

[image: offset]

That's hashing a 1 GB buffer with blake2bp and measuring throughput (inverse seconds). The x-axis is the offset within a larger buffer. That is, we allocate a Vec of more than 1 GB and then pull a 1 GB slice from that allocation at the given offset. The high-water mark for throughput is on par with hash_many, but performance dips in several regions, by different amounts. The effect has a period of 512 bytes, at which point the regions repeat. The region offsets are [49, 80, 112, 176, 208, 432, 464], and they're completely stable from run to run. (Note that subtracting 48 gives [1, 32, 64, 128, 160, 384, 416], and the differences are [31, 32, 64, 32, 224, 32].) The effect is pronounced and stable with an input 1 GB long, but shrinks as the input gets smaller, disappearing around 1 MB. The offsets I noted are the most obvious features in the graph, but there are also smaller stable features. For example, point 464 is the start of a region higher than 463, but it's also consistently spiked above the rest of its region (465, etc.).
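
Roughly, the experiment looks like the sketch below. This is a minimal standalone version, not the actual benchmark code (the real harness lives in benches/), and it assumes this crate's blake2bp::blake2bp convenience function:

```rust
// Minimal sketch of the offset sweep, not the actual benchmark code.
// Assumes the blake2b_simd crate's blake2bp::blake2bp convenience function.
use std::time::Instant;

const GB: usize = 1 << 30;

fn main() {
    // Allocate a bit more than 1 GB so a 1 GB window can start at any offset.
    let mut big = vec![0u8; GB + 512];
    // Write every byte so the kernel can't hand us untouched all-zero pages.
    // (Per the description above, even writing zeros changes the results.)
    for (i, byte) in big.iter_mut().enumerate() {
        *byte = i as u8;
    }
    // A few of the interesting offsets; a full sweep of 0..512 takes a while.
    for &offset in &[0usize, 49, 64, 336] {
        let input = &big[offset..offset + GB];
        let start = Instant::now();
        let hash = blake2b_simd::blake2bp::blake2bp(input);
        let secs = start.elapsed().as_secs_f64();
        println!("offset {:3}: {:.3} GB/s ({})", offset, 1.0 / secs, hash.to_hex());
    }
}
```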

What the hell am I seeing?

oconnor663 commented 5 years ago

Further random note, my vector allocations always have an offset of 16 in a 4096 byte page.
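
(For anyone checking this on their own machine, the page offset of an allocation is just the pointer modulo the page size; a quick sketch:)

```rust
// Quick sketch: print where a big Vec lands within a 4096-byte page.
fn main() {
    let big = vec![0u8; 1 << 30];
    println!("page offset: {}", big.as_ptr() as usize % 4096);
}
```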

oconnor663 commented 5 years ago

Offset-dependent variation in BLAKE2bp throughput reproduces across several different implementations. Here's my implementation, which combines blake2b and hash_many using a stride argument (byte offset into an allocated Vec on the X axis, throughput in GB/s on the Y axis):

[image: stride]

Here's a different implementation of mine (the nostride branch):

[image: nostride]

And here's Samuel Neves' C implementation from https://github.com/sneves/blake2-avx2:

[image: neves]

oconnor663 commented 5 years ago

Here's the with-stride vs without-stride comparison again, this time reducing the compression function to one round to exaggerate the load performance differences between the different offsets.

[images: master_oneround_512, nostride_oneround_512]

oconnor663 commented 5 years ago

(EDIT: I've changed the methodology in later comments, so if you use the code today the bit about adding 16 is no longer applicable, and you'll probably need to adjust the offsets to observe the effect. If anyone is trying this and sees that the code doesn't work anymore, let me know, and I'll fix it.)

Here's an example of how to reproduce this. It runs for about 10 seconds with each of two different input offsets (the WORKER_OFFSET value, which again probably gets added to 16, the page offset of big vectors on my machine), and we see that the BLAKE2bp throughput is 0.992 GB/s at the first offset and 1.053 GB/s at the second.

```
$ cd benches/bench_multiprocess
$ MS_PER_BENCH=10000 WORKER_OFFSET=64 cargo run --release "blake2b_simd BLAKE2bp"
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/bench_multiprocess 'blake2b_simd BLAKE2bp'`
0.992
$ MS_PER_BENCH=10000 WORKER_OFFSET=336 cargo run --release "blake2b_simd BLAKE2bp"
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/bench_multiprocess 'blake2b_simd BLAKE2bp'`
1.053
```

And here's the same experiment repeated with the blake2-avx2 C implementation. I find that different offsets show the effect there.

```
$ cd benches/bench_multiprocess
$ MS_PER_BENCH=10000 WORKER_OFFSET=182 cargo run --release "blake2b_avx2_neves BLAKE2bp"
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/bench_multiprocess 'blake2b_avx2_neves BLAKE2bp'`
0.968
$ MS_PER_BENCH=10000 WORKER_OFFSET=336 cargo run --release "blake2b_avx2_neves BLAKE2bp"
    Finished release [optimized] target(s) in 0.10s
     Running `target/release/bench_multiprocess 'blake2b_avx2_neves BLAKE2bp'`
1.035
```

oconnor663 commented 5 years ago

So most of the above was studying offsets from "wherever the start of my vector happened to be allocated." It so happens that in my test environment (with a small and predictable number of vectors), that's usually 16 bytes into a 4096 byte page, but unsurprisingly that's not stable if I write a little program that keeps vectors around longer. (Also somewhere in here I switched from averaging runs to selecting the best run, and the result got a lot less noisy.)
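
(Concretely, "accounting for" the allocation offset just means something like the sketch below, where slice_at_page_offset is a hypothetical helper for illustration, not the actual benchmark code:)

```rust
const PAGE: usize = 4096;

// Return a `len`-byte slice whose start address is exactly `offset` bytes past
// a page boundary, regardless of where `buf` itself happened to be allocated.
// The buffer must be at least `len + PAGE + offset` bytes long.
// Hypothetical helper for illustration, not the actual benchmark code.
fn slice_at_page_offset(buf: &[u8], offset: usize, len: usize) -> &[u8] {
    let addr = buf.as_ptr() as usize;
    let to_next_page = (PAGE - addr % PAGE) % PAGE;
    let start = to_next_page + offset;
    &buf[start..start + len]
}
```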

If I account for where my vectors are getting allocated, so that the offsets on my X axis are with respect to actual page boundaries, I notice a decent spike at offset zero. Here's the result for the one-round hash function:

[image: page_boundary]

And a wider view of the same showing the 512-byte period:

[image: page_boundary_wide]

oconnor663 commented 5 years ago

Repeating that same methodology for blake2-avx2 (reduced from 12 rounds to 1 to exaggerate the effect of message loading, selecting the best-performing run at each offset to reduce noise, offsets measured from the page boundary):

[image: neves_zoom]

Same data, wider view:

[image: neves]

oconnor663 commented 5 years ago

Since this reproduces cleanly across implementations, I'm going to assume it's not my fault :) The simplest "fair" benchmarking strategy seems to be to make sure all inputs start at offset zero from a page boundary, which is a decent performance point but not the very best (the best seems to be 384, for some reason).

oconnor663 commented 5 years ago

Update: Ugh. Whether zero offset is a "good" point seems to depend on unpredictable implementation details of the compression loop. We'll have to either randomize the offset or sum over all offsets instead.
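
Something like the sketch below is what randomizing the offset per run could look like (illustrative only, not the actual benchmark code; it assumes a rand 0.8-style gen_range):

```rust
use rand::Rng;

const PAGE: usize = 4096;
const GB: usize = 1 << 30;

// Pick a fresh, random page offset for each benchmark run, so that no
// implementation gets lucky (or unlucky) with one particular alignment.
// The buffer must be at least GB + 2 * PAGE bytes long.
fn random_offset_slice(buf: &[u8]) -> &[u8] {
    let offset = rand::thread_rng().gen_range(0..PAGE);
    let addr = buf.as_ptr() as usize;
    let to_next_page = (PAGE - addr % PAGE) % PAGE;
    let start = to_next_page + offset;
    &buf[start..start + GB]
}
```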