lukechampine / blake3

An AVX-512 accelerated implementation of the BLAKE3 cryptographic hash function
MIT License

Remove permute() and alternate method of getting g() inlined leads to better performance #4

Closed (renthraysk closed 4 years ago)

renthraysk commented 4 years ago
name      old time/op    new time/op    delta
Write-4     6.18ns ± 0%    3.41ns ± 0%  -44.93%  (p=0.000 n=9+10)
Sum256-4    6.21µs ± 0%    3.51µs ± 0%  -43.43%  (p=0.000 n=9+9)
XOF-4       5.73ns ± 0%    3.13ns ± 0%  -45.36%  (p=0.000 n=9+10)

name      old speed      new speed      delta
Write-4    162MB/s ± 0%   294MB/s ± 0%  +81.53%  (p=0.000 n=9+10)
XOF-4      174MB/s ± 0%   319MB/s ± 0%  +83.04%  (p=0.000 n=9+10)

name      old alloc/op   new alloc/op   delta
Write-4      0.00B          0.00B          ~     (all equal)
Sum256-4     0.00B          0.00B          ~     (all equal)
XOF-4        0.00B          0.00B          ~     (all equal)

name      old allocs/op  new allocs/op  delta
Write-4       0.00           0.00          ~     (all equal)
Sum256-4      0.00           0.00          ~     (all equal)
XOF-4         0.00           0.00          ~     (all equal)

lukechampine commented 4 years ago

Thanks again -- another huge increase!

I decided not to keep the codegen, since it's not likely to ever change, and the generated code isn't terribly hard to verify. However, if someone wants to add an asm implementation, I would push for codegen there -- a tool like avo would make the asm much easier to review and verify.
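
For readers following along, here is a rough sketch of the general idea behind dropping a runtime permute(): the BLAKE3 message permutation is fixed, so the word order for every round can be worked out ahead of time and written into the code as constant indices. This is not the code from this PR. The names `buildSchedule` and `scheduleMatchesPermute` are made up for illustration, and only the permutation table itself comes from the BLAKE3 spec.

```go
package main

import "fmt"

// msgPermutation is the BLAKE3 message word permutation applied between rounds.
var msgPermutation = [16]int{2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8}

// permute reorders the message words for the next round at runtime.
// This is the kind of per-round shuffling that a precomputed schedule avoids.
func permute(m [16]uint32) [16]uint32 {
	var out [16]uint32
	for i, p := range msgPermutation {
		out[i] = m[p]
	}
	return out
}

// buildSchedule (hypothetical helper) precomputes, for each of the 7 rounds,
// which original message word ends up at each position.
func buildSchedule() [7][16]int {
	var s [7][16]int
	for i := range s[0] {
		s[0][i] = i // round 0 uses the message words in their original order
	}
	for r := 1; r < 7; r++ {
		for i, p := range msgPermutation {
			s[r][i] = s[r-1][p]
		}
	}
	return s
}

// scheduleMatchesPermute (hypothetical helper) checks that indexing the
// original block through the schedule yields the same words as calling
// permute once per round.
func scheduleMatchesPermute(m [16]uint32) bool {
	s := buildSchedule()
	cur := m
	for r := 0; r < 7; r++ {
		for i := range cur {
			if cur[i] != m[s[r][i]] {
				return false
			}
		}
		cur = permute(cur)
	}
	return true
}

func main() {
	var m [16]uint32
	for i := range m {
		m[i] = uint32(i) * 0x01010101 // distinct dummy message words
	}
	fmt.Println("schedule matches permute:", scheduleMatchesPermute(m))
	// Each row could be pasted (or generated) as constant indices for one round.
	for r, row := range buildSchedule() {
		fmt.Println(r, row)
	}
}
```

With the rows known up front, each round can be written as straight-line calls to g() with literal message indices, which is the sort of output a small generator (or avo, for asm) would emit and a reviewer could check against the table.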