Issue reported in the context of Kudelski Security's audit
The implementation does not leverage vectorized instructions. For example, on platforms supporting AVX2, a portable reference implementation is about 40% slower than an AVX2 implementation, as reported by a SUPERCOP benchmark on the Cannon Lake microarchitecture.
An AVX2 implementation of BLAKE2b can be found in the SUPERCOP archive as well as in libsodium.
An AVX512-optimized version of BLAKE2s (not BLAKE2b) is used in WireGuard.
Similar techniques may be used to optimize BLAKE2b for the AVX512 instruction set.
Yes, we are using the reference implementations for both BLAKE2 and Argon2, since neither is performance-critical. Supporting optimized implementations may be desirable.
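If optimized implementations are added later, one common approach is to keep the portable code as the fallback and select a vectorized backend at runtime. The following is a minimal sketch in Rust, not code from the audited project: the `Blake2bBackend` enum and `select_backend` function are hypothetical names introduced for illustration, and no AVX2 backend is actually provided. The `g` function is the scalar BLAKE2b mixing function as specified in RFC 7693; an AVX2 backend would typically compute four such mixes in parallel on 256-bit registers.

```rust
/// Hypothetical backend identifiers (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Blake2bBackend {
    Portable,
    Avx2,
}

/// Pick a backend based on what the running CPU reports.
/// On non-x86 targets this always returns the portable backend.
pub fn select_backend() -> Blake2bBackend {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Standard-library macro; performs the CPUID check at runtime.
        if std::is_x86_feature_detected!("avx2") {
            return Blake2bBackend::Avx2;
        }
    }
    Blake2bBackend::Portable
}

/// BLAKE2b mixing function G in portable scalar form (RFC 7693).
/// `v` is the 16-word working vector; `x` and `y` are message words.
pub fn g(v: &mut [u64; 16], a: usize, b: usize, c: usize, d: usize, x: u64, y: u64) {
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(x);
    v[d] = (v[d] ^ v[a]).rotate_right(32);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(24);
    v[a] = v[a].wrapping_add(v[b]).wrapping_add(y);
    v[d] = (v[d] ^ v[a]).rotate_right(16);
    v[c] = v[c].wrapping_add(v[d]);
    v[b] = (v[b] ^ v[c]).rotate_right(63);
}
```

A caller would invoke `select_backend()` once at startup and route compression calls accordingly. Keeping the portable path as the default means behavior is unchanged on CPUs without AVX2.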