oconnor663 / blake2_simd

high-performance implementations of BLAKE2b/s/bp/sp in pure Rust with dynamic SIMD
MIT License

Refactor to use arrayvec 0.7 #24

Closed. workingjubilee closed this 3 years ago

workingjubilee commented 3 years ago

In spite of the large diff, the actual change is fairly small: use arrayvec 0.7 everywhere and, instead of passing the backing array as the type parameter (ArrayVec<[T; N]>), use <T, const N: usize> parameters (ArrayVec<T, N>).
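To illustrate the shape of that signature change, here is a minimal std-only sketch. `FixedVec` and `collect_prefix` are hypothetical names made up for this example; they are not arrayvec's implementation and not code from this PR. The point is only that the capacity becomes a const generic parameter next to the element type, rather than being carried by an array type.

```rust
/// Hypothetical stand-in for a fixed-capacity vector, for illustration only.
/// The capacity `N` is a const generic parameter (the arrayvec 0.7 style),
/// instead of being encoded in an array type parameter such as `[T; N]`.
#[derive(Debug)]
pub struct FixedVec<T: Copy + Default, const N: usize> {
    items: [T; N],
    len: usize,
}

impl<T: Copy + Default, const N: usize> FixedVec<T, N> {
    pub fn new() -> Self {
        FixedVec { items: [T::default(); N], len: 0 }
    }

    /// Appends `item`, or hands it back as `Err` when the vector is full.
    pub fn try_push(&mut self, item: T) -> Result<(), T> {
        if self.len == N {
            return Err(item);
        }
        self.items[self.len] = item;
        self.len += 1;
        Ok(())
    }

    /// The initialized prefix of the backing array.
    pub fn as_slice(&self) -> &[T] {
        &self.items[..self.len]
    }

    pub fn capacity(&self) -> usize {
        N
    }
}

/// Helpers become generic over the capacity rather than over an array type.
pub fn collect_prefix<const N: usize>(input: &[u8]) -> FixedVec<u8, N> {
    let mut out: FixedVec<u8, N> = FixedVec::new();
    for &byte in input.iter().take(N) {
        // Cannot fail: `take(N)` never yields more than N items.
        let _ = out.try_push(byte);
    }
    out
}
```

With this shape, a call site names the element type and capacity separately, for example `let v: FixedVec<u8, 4> = collect_prefix(&[1, 2, 3, 4, 5, 6]);`, which keeps the first four bytes.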

The bench differences appear to be largely negligible, with an admittedly notable hit on a few benches (about 7-8% on bench_long_blake2b_many_4x, bench_long_blake2bp, and bench_oneblock_blake2s_many_4x) and an explosive improvement on bench_long_blake2sp (x 2.12)! No idea why! I use an AMD processor and did not exhaustively bench and profile this; I just ran each a few times to make sure the diffs were roughly constant, so please feel free to do your own review.

 name                             main-all.txt ns/iter  branch-all.txt ns/iter   diff ns/iter   diff %  speedup
 bench_long_blake2b_avx2          43,285 (1514 MB/s)    43,280 (1514 MB/s)                 -5   -0.01%   x 1.00
 bench_long_blake2b_many_2x       59,737 (2194 MB/s)    59,750 (2193 MB/s)                 13    0.02%   x 1.00
 bench_long_blake2b_many_4x       65,285 (4015 MB/s)    70,791 (3703 MB/s)              5,506    8.43%   x 0.92
 bench_long_blake2b_portable      49,335 (1328 MB/s)    49,820 (1315 MB/s)                485    0.98%   x 0.99
 bench_long_blake2bp              16,564 (3956 MB/s)    17,854 (3670 MB/s)              1,290    7.79%   x 0.93
 bench_long_blake2s_many_4x       108,650 (2412 MB/s)   108,450 (2417 MB/s)              -200   -0.18%   x 1.00
 bench_long_blake2s_many_8x       123,798 (4235 MB/s)   121,030 (4331 MB/s)            -2,768   -2.24%   x 1.02
 bench_long_blake2s_portable      82,352 (795 MB/s)     82,782 (791 MB/s)                 430    0.52%   x 0.99
 bench_long_blake2s_sse41         68,866 (951 MB/s)     68,767 (953 MB/s)                 -99   -0.14%   x 1.00
 bench_long_blake2sp              32,954 (1988 MB/s)    15,509 (4225 MB/s)            -17,445  -52.94%   x 2.12
 bench_oneblock_blake2b_avx2      94 (1361 MB/s)        95 (1347 MB/s)                      1    1.06%   x 0.99
 bench_oneblock_blake2b_many_2x   197 (1299 MB/s)       190 (1347 MB/s)                    -7   -3.55%   x 1.04
 bench_oneblock_blake2b_many_4x   239 (2142 MB/s)       232 (2206 MB/s)                    -7   -2.93%   x 1.03
 bench_oneblock_blake2b_portable  110 (1163 MB/s)       110 (1163 MB/s)                     0    0.00%   x 1.00
 bench_oneblock_blake2s_many_4x   219 (1168 MB/s)       235 (1089 MB/s)                    16    7.31%   x 0.93
 bench_oneblock_blake2s_many_8x   291 (1759 MB/s)       286 (1790 MB/s)                    -5   -1.72%   x 1.02
 bench_oneblock_blake2s_portable  93 (688 MB/s)         93 (688 MB/s)                       0    0.00%   x 1.00
 bench_oneblock_blake2s_sse41     64 (1000 MB/s)        64 (1000 MB/s)                      0    0.00%   x 1.00
 bench_onebyte_blake2b_avx2       101                   101                                 0    0.00%   x 1.00
 bench_onebyte_blake2s_sse41      80                    80                                  0    0.00%   x 1.00
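
For anyone reading the table above, the "diff %" and "speedup" columns follow directly from the two ns/iter columns. A small sketch of that arithmetic (the function names are made up for this example and are not part of any benchmark tool):

```rust
/// Percentage change in ns/iter going from `old_ns` (main) to `new_ns` (branch).
/// Negative values mean the branch is faster.
pub fn diff_percent(old_ns: u64, new_ns: u64) -> f64 {
    (new_ns as f64 - old_ns as f64) / old_ns as f64 * 100.0
}

/// Speedup factor of the branch over main; above 1.0 means the branch is faster.
pub fn speedup(old_ns: u64, new_ns: u64) -> f64 {
    old_ns as f64 / new_ns as f64
}
```

For bench_long_blake2sp, `diff_percent(32_954, 15_509)` is about -52.94 and `speedup(32_954, 15_509)` is about 2.12, matching the row in the table.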
oconnor663 commented 3 years ago

Thank you! This looks good. I'll hold off on landing it until I have some time to read through it carefully, but this is definitely a change I was interested in making.

The built-in benchmark harness does tend to be a little finicky, since it doesn't include warm-up iterations or anything like that. bench_long_blake2sp should be mostly equivalent to bench_long_blake2s_many_8x, so I think what we're seeing is that bench_long_blake2sp underperformed by half on your main branch for "some random reason". I'm curious whether that blip might disappear if you ran it ten times in a row or something like that.
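
As a rough sketch of what "warm-up iterations plus several repeated runs" could look like outside the built-in harness, the helper below runs a closure a few times untimed, then times a number of runs and reports the median. `timed_runs` and `median` are hypothetical helpers written for this example, using only `std::time`; a real comparison would more likely use a dedicated benchmarking crate.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `warmup` times without timing it, then `runs` more times with
/// timing, and returns the timed samples sorted in ascending order.
pub fn timed_runs<F: FnMut()>(warmup: usize, runs: usize, mut work: F) -> Vec<Duration> {
    // Warm-up: let caches, branch predictors, and CPU frequency settle.
    for _ in 0..warmup {
        work();
    }
    let mut samples = Vec::with_capacity(runs);
    for _ in 0..runs {
        let start = Instant::now();
        work();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples
}

/// Median of an already sorted slice (upper median for even lengths).
/// Returns `None` for an empty slice.
pub fn median(sorted: &[Duration]) -> Option<Duration> {
    sorted.get(sorted.len() / 2).copied()
}
```

Taking the median of, say, ten runs makes a one-off blip like the one suspected here much less likely to dominate the reported number.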