Closed: arnetheduck closed this issue 3 months ago
The interesting bit that we can potentially reuse is the `hash_64b_blocks` function defined here:
https://github.com/potuz/mammon/blob/main/ssz/sha256.asm
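For reference, here is a minimal Python sketch of what a `hash_64b_blocks`-style routine computes; the batched interface is an assumption based on the name, and the real routine is hand-written SHA-NI assembly, not Python:

```python
import hashlib

def hash_64b_blocks(data: bytes) -> bytes:
    """Hash each consecutive 64-byte block down to 32 bytes.

    This is the core operation of SSZ merkleization: each pair of
    32-byte child nodes (one 64-byte block) is compressed into one
    32-byte parent node.  (Sketch only -- the assembly version
    processes many blocks per call using SHA extensions.)
    """
    assert len(data) % 64 == 0, "input must be whole 64-byte blocks"
    out = bytearray()
    for i in range(0, len(data), 64):
        out += hashlib.sha256(data[i:i + 64]).digest()
    return bytes(out)
```

Because every block is exactly one padded SHA-256 message, the per-block work is fixed, which is what makes the specialized assembly versions discussed below worthwhile.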
> The interesting bit that we can potentially reuse is the `hash_64b_blocks` function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm
That assembly is only for SHA-NI-capable CPUs; if there's interest in a general library implementation, I could cook up some AVX2 and pure SSE versions. I wouldn't know how to write (or test) the ARM versions. It's also possible to implement them with C intrinsics, though I don't know if that's easy to import into Nim. I don't know whether you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.
> I don't know whether you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.
I can imagine the ideal place for this code would be a patch to blst, where the platform-specific code already resides; that's also where we'd want to do any hardware detection. Given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do.
FWIW, I posted SSE, AVX, AVX2 and SHA modifications to Intel's assembly. If you raise an issue with your SHA provider, you may want to link to my repo so they can take the constants from there. Here are my benchmarks, running a generic release build:
| CPU | Nimbus | Prysm | Lighthouse (single thread) | Mammon |
|---|---|---|---|---|
| NUC i5-8259U @ 2.30GHz | 1112ms (AVX2) | 1125ms (AVX2) | 860ms (AVX2) | 563ms (AVX2), 596ms (AVX), 1042ms (SSE) |
| Ryzen 5 @ 3.60GHz | 251ms (SHA) | 292ms (SHA) | 760ms (SHA) | 161ms (SHA), 760ms (AVX2), 813ms (AVX), 1361ms (SSE) |
| XPS 13 i5-7200U @ 2.50GHz | 953ms (AVX2) | 1085ms (AVX2) | 998ms (AVX2) | 654ms (AVX2), 681ms (AVX), 964ms (SSE) |
It's true that it's Intel's assembly running on a Ryzen, but I am not buying AMD again.
> I can imagine the ideal place for this code would be a patch to blst, where the platform-specific code already resides; that's also where we'd want to do any hardware detection. Given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do.
Choosing the right implementation at runtime is just a one-line call to `cpuid`; here's mammon's implementation of that (I'm only supporting reasonable CPUs, though): https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L46
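For illustration, here's a rough Python equivalent of that dispatch, reading CPU feature flags from `/proc/cpuinfo` on Linux rather than issuing `cpuid` directly; the backend names are invented for the sketch, not taken from mammon:

```python
def pick_sha256_backend() -> str:
    """Select the fastest available SHA-256 backend from CPU flags.

    Sketch of cpuid-style runtime dispatch (Linux-only: parses
    /proc/cpuinfo).  Backend names here are hypothetical.
    """
    flags: set[str] = set()
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
                    break
    except OSError:
        return "generic"  # no /proc: fall back to the portable code
    # Prefer the SHA extensions, then wider vector units.
    for flag, backend in (("sha_ni", "sha-ext"), ("avx2", "avx2"),
                          ("avx", "avx"), ("sse4_1", "sse")):
        if flag in flags:
            return backend
    return "generic"
```

The check runs once at startup and the chosen function pointer is used thereafter, so the detection cost is negligible compared to the hashing itself.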
A somewhat old paper on Intel SHA-256 implementations: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf
cc @etan-status
more from @potuz: https://hackmd.io/@potuz/BJyrx9DOF
> more from @potuz: https://hackmd.io/@potuz/BJyrx9DOF
For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?
> more from @potuz: https://hackmd.io/@potuz/BJyrx9DOF
>
> For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?
Nope, the hardcoded padding block refers to the fact that you can have the scheduled words hardcoded in the assembly. The relevant part of the assembly code is here: https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L730-L738
And a little explanation of the phenomenon is in the document linked in the first description of this issue.
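Concretely: for a 64-byte input, the second compression runs on a block that is pure padding (a 1 bit, zeros, then the 512-bit length field), so its 64-entry message schedule is a compile-time constant. A Python sketch of computing those words, following the FIPS 180-4 schedule recurrence (for explanation only; the assembly stores the precomputed results):

```python
def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def padding_schedule() -> list[int]:
    """Message schedule W[0..63] for the SHA-256 padding block of a
    64-byte message.  The block is constant, so these words can be
    hardcoded in the assembly instead of recomputed per hash."""
    w = [0] * 64
    w[0] = 0x80000000   # the mandatory 1 bit that starts the padding
    w[15] = 512         # message length in bits (64 bytes)
    for t in range(16, 64):
        # sigma0/sigma1 as defined in FIPS 180-4, section 4.1.2
        sig0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        sig1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w[t] = (sig1 + w[t - 7] + sig0 + w[t - 16]) & 0xFFFFFFFF
    return w
```

Since all inputs here are fixed-length 64-byte blocks, this schedule (and even its sums with the round constants) never changes, which is exactly the saving the linked assembly exploits.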
We'll be using https://github.com/prysmaticlabs/hashtree at Prysm. It's still too early to adopt that library, but I'd be very happy to see some benchmarks from Nimbus if you test it.
v24.6.0 shipped hashtree.
@potuz found some nice cherries on the hash optimization tree: https://hackmd.io/@potuz/rJX9iD30u