Further random note: my vector allocations always have an offset of 16 within a 4096-byte page.
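For reference, a quick check like this (just a sketch, not part of the benchmark harness) is how I'm looking at where fresh Vecs land within their page:

```rust
// Quick check of where freshly allocated Vecs land within a 4096-byte page.
fn main() {
    let bufs: Vec<Vec<u8>> = (0..5).map(|_| vec![0u8; 1 << 20]).collect();
    for buf in &bufs {
        // Per the note above, these tend to print 16 on my machine.
        println!("page offset: {}", buf.as_ptr() as usize % 4096);
    }
}
```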
Offset-dependent variation in the BLAKE2bp throughput reproduces in several different implementations. Here's my implementation, which combines blake2b and hash_many using a stride argument (byte offset into an allocated Vec on the X axis, throughput in GB/s on the Y axis):
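Roughly, each point on these plots comes from something like the following (a simplified sketch, not the actual harness; it just uses the plain blake2b_simd::blake2b convenience function):

```rust
// Simplified sketch of the offset sweep behind these plots: X axis is the
// byte offset of the input slice within the Vec, Y axis is GB/s.
use std::time::Instant;

fn main() {
    const LEN: usize = 1 << 24; // 16 MiB per measurement
    let buf = vec![0u8; LEN + 512]; // slack so the slice can start at any swept offset
    for offset in (0..512).step_by(16) {
        let input = &buf[offset..offset + LEN];
        let start = Instant::now();
        let hash = blake2b_simd::blake2b(input);
        let secs = start.elapsed().as_secs_f64();
        // Print a few hash bytes so the call can't be optimized away.
        println!("offset {:3}: {:.3} GB/s ({})",
                 offset, LEN as f64 / secs / 1e9, &hash.to_hex()[..8]);
    }
}
```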
Here's a different implementation of mine (the nostride branch):
And here's Samuel Neves' C implementation from https://github.com/sneves/blake2-avx2:
Here's the with-stride vs without-stride comparison again, this time reducing the compression function to one round to exaggerate the load performance differences between the different offsets.
(EDIT: I've changed the methodology in later comments, so if you use the code today the bit about adding 16 is no longer applicable, and you'll probably need to adjust the offsets to observe the effect. If anyone is trying this and sees that the code doesn't work anymore, let me know, and I'll fix it.)
Here's an example of how to reproduce this. It runs for about 10 seconds at each of two different input offsets (the WORKER_OFFSET value, which again probably gets added to 16, the page offset where big vectors land on my machine), and we see that the BLAKE2bp throughput is 0.992 GB/s at the first offset and 1.053 GB/s at the second.
$ cd benches/bench_multiprocess
$ MS_PER_BENCH=10000 WORKER_OFFSET=64 cargo run --release "blake2b_simd BLAKE2bp"
Finished release [optimized] target(s) in 0.10s
Running `target/release/bench_multiprocess 'blake2b_simd BLAKE2bp'`
0.992
$ MS_PER_BENCH=10000 WORKER_OFFSET=336 cargo run --release "blake2b_simd BLAKE2bp"
Finished release [optimized] target(s) in 0.10s
Running `target/release/bench_multiprocess 'blake2b_simd BLAKE2bp'`
1.053
And here's the same experiment repeated with the blake2-avx2 C implementation. Different offsets show the effect for this implementation.
$ cd benches/bench_multiprocess
$ MS_PER_BENCH=10000 WORKER_OFFSET=182 cargo run --release "blake2b_avx2_neves BLAKE2bp"
Finished release [optimized] target(s) in 0.10s
Running `target/release/bench_multiprocess 'blake2b_avx2_neves BLAKE2bp'`
0.968
$ MS_PER_BENCH=10000 WORKER_OFFSET=336 cargo run --release "blake2b_avx2_neves BLAKE2bp"
Finished release [optimized] target(s) in 0.10s
Running `target/release/bench_multiprocess 'blake2b_avx2_neves BLAKE2bp'`
1.035
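For context, here's roughly what I'd expect those two environment variables to do inside the worker process. This is an illustrative sketch only, not the actual bench_multiprocess code, and it assumes blake2b_simd exposes a blake2bp::blake2bp convenience function:

```rust
// Illustrative sketch: WORKER_OFFSET slides the start of the input slice,
// MS_PER_BENCH controls how long we keep hashing before reporting GB/s.
use std::time::{Duration, Instant};

fn main() {
    let offset: usize = std::env::var("WORKER_OFFSET")
        .ok().and_then(|s| s.parse().ok()).unwrap_or(0);
    let ms: u64 = std::env::var("MS_PER_BENCH")
        .ok().and_then(|s| s.parse().ok()).unwrap_or(1000);

    const LEN: usize = 1 << 24; // 16 MiB per hash call
    let buf = vec![0u8; LEN + 4096];
    let input = &buf[offset..offset + LEN];

    let start = Instant::now();
    let deadline = start + Duration::from_millis(ms);
    let mut bytes = 0u64;
    while Instant::now() < deadline {
        // Assumption: a blake2bp::blake2bp() convenience function exists;
        // hashing through blake2bp::Params would work the same way.
        let _hash = blake2b_simd::blake2bp::blake2bp(input);
        bytes += LEN as u64;
    }
    println!("{:.3}", bytes as f64 / start.elapsed().as_secs_f64() / 1e9);
}
```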
So most of the above was studying offsets from "wherever the start of my vector happened to be allocated." It so happens that in my test environment (with a small and predictable number of vectors), that's usually 16 bytes into a 4096-byte page, but unsurprisingly that's not stable if I write a little program that keeps vectors around longer. (Also, somewhere in here I switched from averaging runs to selecting the best run, and the result got a lot less noisy.)
If I account for where my vectors are getting allocated, so that the offsets on my X axis are with respect to actual page boundaries, I notice a decent spike at offset zero. Here's the result for the one round hash function:
And a wider view of the same showing the 512-byte period:
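For concreteness, the page-boundary accounting is roughly this (a sketch, not the harness code): find where the Vec itself starts within its page, then slide the slice so the input begins at the page offset we actually want to measure.

```rust
// Sketch: produce a slice whose start sits exactly `want` bytes past a page
// boundary, regardless of where the Vec happened to be allocated.
fn slice_at_page_offset(buf: &[u8], want: usize, len: usize) -> &[u8] {
    let base = buf.as_ptr() as usize % 4096;
    let slide = (want + 4096 - base) % 4096;
    &buf[slide..slide + len]
}

fn main() {
    const LEN: usize = 1 << 20;
    let buf = vec![0u8; LEN + 4096];
    for want in (0..512).step_by(64) {
        let input = slice_at_page_offset(&buf, want, LEN);
        debug_assert_eq!(input.as_ptr() as usize % 4096, want);
        // ...hash `input` here and record throughput against `want`...
        let _ = blake2b_simd::blake2b(input);
    }
}
```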
Repeating that same methodology for blake2-avx2 (reduced from 12 rounds to 1 to exaggerate the effect of message loading, selecting the best-performing run at each offset to reduce noise, offsets measured from the page boundary):
Same data, wider view:
Since this reproduces cleanly across implementations, I'm going to assume it's not my fault :) The simplest "fair" benchmarking strategy seems to be to make sure all inputs start at zero offset from the page, which is a decent performance point but not the very best (the best seems to be 384 for some reason).
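A sketch of what "zero offset from the page" looks like in practice, using std::alloc with 4096-byte alignment (error handling is minimal, and this isn't the harness code):

```rust
// Sketch: allocate the input so it starts exactly at a page boundary.
use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    const LEN: usize = 1 << 24;
    let layout = Layout::from_size_align(LEN, 4096).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        let input = std::slice::from_raw_parts(ptr, LEN);
        assert_eq!(input.as_ptr() as usize % 4096, 0); // page-aligned input
        let hash = blake2b_simd::blake2b(input);
        println!("{}", hash.to_hex());
        dealloc(ptr, layout);
    }
}
```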
Update: Ugh. Whether zero-offset is a "good" point seems to depend on unpredictable implementation details of the compression loop. We have to randomize the offset or sum over all offsets instead.
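Something like the following is what I mean by randomizing the offset: pick a fresh page offset for each measurement and average, so no single lucky or unlucky offset dominates. Sketch only, with a tiny xorshift generator to avoid adding a dependency, and again assuming the blake2bp::blake2bp convenience function:

```rust
// Sketch: average throughput over randomly chosen page offsets.
use std::time::Instant;

fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

fn main() {
    const LEN: usize = 1 << 24;
    let buf = vec![0u8; LEN + 4096];
    let mut rng = 0x9e3779b97f4a7c15u64;
    let mut total_secs = 0.0;
    let mut total_bytes = 0u64;
    for _ in 0..64 {
        // Any offset within a page; the effect's period is 512, so this covers it.
        let offset = (xorshift(&mut rng) % 4096) as usize;
        let input = &buf[offset..offset + LEN];
        let start = Instant::now();
        let _hash = blake2b_simd::blake2bp::blake2bp(input);
        total_secs += start.elapsed().as_secs_f64();
        total_bytes += LEN as u64;
    }
    println!("{:.3} GB/s averaged over random offsets",
             total_bytes as f64 / total_secs / 1e9);
}
```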
I've noticed that in my BLAKE2bp benchmarks, throughput drops measurably when we properly initialize the input buffer. (For example with RandCore::fill_slice, but actually even with just explicitly setting it to zeros, rather than allowing Vec to request all-zero pages that don't need to be initialized.) Notably this doesn't apply to hash_many, only to blake2bp. And the weirdest part, the size of the effect depends on what byte offset the input starts at: https://gist.github.com/oconnor663/e9c5a45df3803f92baa85e09c35718a6
That's hashing a 1 GB buffer with blake2bp and measuring throughput (inverse seconds). The x-axis is the offset within a larger buffer. That is, we allocate a Vec of more than 1 GB and then pull a 1 GB slice from that allocation at the given offset. The high water mark for throughput is on par with hash_many, but performance dips in several regions, by different amounts. The effect has a period of 512 bytes, at which point the regions repeat. The region offsets are [49, 80, 112, 176, 208, 432, 464], and they're completely stable from run to run. (Note that subtracting 48 gives [1, 32, 64, 128, 160, 384, 416], and the differences are [31, 32, 64, 32, 224, 32].) The effect is pronounced and stable with an input 1 GB long, but shrinks as the input gets smaller, disappearing around 1 MB. The offsets I noted are the most obvious features in the graph, but there are also smaller stable features. For example, point 464 is the start of a region higher than 463, but it's also consistently spiked above the rest of its region (465, etc.).

What the hell am I seeing?
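For reference, the distinction I mean between "letting Vec hand back untouched zero pages" and "properly initializing the buffer" is roughly the sketch below; even the version that only writes zeros shows the slowdown, which is part of what's confusing. (This again assumes the blake2bp::blake2bp convenience function.)

```rust
// Sketch: two ways to produce the 1 GB input. The first never writes to the
// buffer before hashing, so it can stay backed by untouched zero pages; the
// second touches every byte (volatile, so the "pointless" zero stores aren't
// optimized out), forcing real pages to be faulted in first.
fn untouched_zero_pages(len: usize) -> Vec<u8> {
    vec![0u8; len]
}

fn explicitly_zeroed(len: usize) -> Vec<u8> {
    let mut buf = vec![0u8; len];
    for byte in buf.iter_mut() {
        unsafe { std::ptr::write_volatile(byte, 0) };
    }
    buf
}

fn main() {
    let len = 1 << 30; // 1 GB
    let a = untouched_zero_pages(len);
    let b = explicitly_zeroed(len);
    // Hashing `a` is the fast case in my measurements; hashing `b` (or a
    // buffer filled with random bytes) is the slow one, depending on offset.
    println!("{}", blake2b_simd::blake2bp::blake2bp(&a).to_hex());
    println!("{}", blake2b_simd::blake2bp::blake2bp(&b).to_hex());
}
```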