Closed: arnetheduck closed this issue 3 months ago
The interesting bit that we can potentially reuse is the `hash_64b_blocks` function defined here:
https://github.com/potuz/mammon/blob/main/ssz/sha256.asm
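For reference, here is a minimal Python sketch of what a `hash_64b_blocks`-style routine computes; the batched interface is an assumption based on the name, and the real routine is hand-written SHA-NI assembly, not Python:

```python
import hashlib

def hash_64b_blocks(data: bytes) -> bytes:
    """Hash each consecutive 64-byte block down to 32 bytes.

    This is the core operation of SSZ merkleization: each pair of
    32-byte child nodes (one 64-byte block) is compressed into one
    32-byte parent node.  (Sketch only -- the assembly version
    processes many blocks per call using SHA extensions.)
    """
    assert len(data) % 64 == 0, "input must be whole 64-byte blocks"
    out = bytearray()
    for i in range(0, len(data), 64):
        out += hashlib.sha256(data[i:i + 64]).digest()
    return bytes(out)
```

Because every block is exactly one padded SHA-256 message, the per-block work is fixed, which is what makes the specialized assembly versions discussed below worthwhile.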
> The interesting bit that we can potentially reuse is the `hash_64b_blocks` function defined here: https://github.com/potuz/mammon/blob/main/ssz/sha256.asm
That assembly is only for SHA-NI-capable CPUs; if there's interest in a general library implementation, I could cook up some AVX2 and pure SSE versions. I wouldn't know how to write (or test) the ARM versions. It's also possible to implement them with C intrinsics, though I don't know if that's easy to import into Nim. I don't know whether you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.
> I don't know whether you use your own implementation or a third-party library, but if it's the latter I'd raise an issue with them.
I can imagine the ideal place for this code would be a patch to blst, where the platform-specific code already resides; that's also where we'd want to do any hardware detection. Given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do.
FWIW, I posted SSE, AVX, AVX2 and SHA modifications to Intel's assembly. If you raise an issue with your SHA provider, you may want to link to my repo so they can take the constants from there. Here are my benchmarks, running a generic release build:
| CPU | Nimbus | Prysm | Lighthouse (single thread) | Mammon |
|---|---|---|---|---|
| NUC i5-8259U @ 2.30GHz | 1112ms (AVX2) | 1125ms (AVX2) | 860ms (AVX2) | 563ms (AVX2), 596ms (AVX), 1042ms (SSE) |
| Ryzen 5 @ 3.60GHz | 251ms (SHA) | 292ms (SHA) | 760ms (SHA) | 161ms (SHA), 760ms (AVX2), 813ms (AVX), 1361ms (SSE) |
| XPS 13 i5-7200U @ 2.50GHz | 953ms (AVX2) | 1085ms (AVX2) | 998ms (AVX2) | 654ms (AVX2), 681ms (AVX), 964ms (SSE) |
It's true that it's Intel's assembly running on a Ryzen, but I am not buying AMD again.
> I can imagine the ideal place for this code would be a patch to blst, where the platform-specific code already resides; that's also where we'd want to do any hardware detection. Given the low penetration of these instructions, supporting multiple versions and choosing the right one at runtime seems like the right thing to do.
Choosing the right implementation at runtime is just a one-line call to `cpuid`; here's mammon's implementation of that (I'm only supporting reasonable CPUs, though): https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L46
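For illustration, here's a rough Python equivalent of that dispatch, reading CPU feature flags from `/proc/cpuinfo` on Linux rather than issuing `cpuid` directly; the backend names are invented for the sketch, not taken from mammon:

```python
def pick_sha256_backend() -> str:
    """Select the fastest available SHA-256 backend from CPU flags.

    Sketch of cpuid-style runtime dispatch (Linux-only: parses
    /proc/cpuinfo).  Backend names here are hypothetical.
    """
    flags: set[str] = set()
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    flags.update(line.split(":", 1)[1].split())
                    break
    except OSError:
        return "generic"  # no /proc: fall back to the portable code
    # Prefer the SHA extensions, then wider vector units.
    for flag, backend in (("sha_ni", "sha-ext"), ("avx2", "avx2"),
                          ("avx", "avx"), ("sse4_1", "sse")):
        if flag in flags:
            return backend
    return "generic"
```

The check runs once at startup and the chosen function pointer is used thereafter, so the detection cost is negligible compared to the hashing itself.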
A somewhat old paper on Intel SHA-256 implementations: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/sha-256-implementations-paper.pdf
cc @etan-status
more from @potuz: https://hackmd.io/@potuz/BJyrx9DOF
> more from @potuz: https://hackmd.io/@potuz/BJyrx9DOF
For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?
> more from @potuz: https://hackmd.io/@potuz/BJyrx9DOF
>
> For hardcoded padding, is it something like precomputing sha256(0x0000...0000) like done here for hash to curve?
Nope, the hardcoded padding block refers to the fact that you can have the scheduled words hardcoded in the assembly. The relevant part of the assembly code is here: https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L730-L738
And a little explanation of the phenomenon is in the document linked in the first description of this issue.
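Concretely: for a 64-byte input, the second compression runs on a block that is pure padding (a 1 bit, zeros, then the 512-bit length field), so its 64-entry message schedule is a compile-time constant. A Python sketch of computing those words, following the FIPS 180-4 schedule recurrence (for explanation only; the assembly stores the precomputed results):

```python
def rotr(x: int, n: int) -> int:
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF

def padding_schedule() -> list[int]:
    """Message schedule W[0..63] for the SHA-256 padding block of a
    64-byte message.  The block is constant, so these words can be
    hardcoded in the assembly instead of recomputed per hash."""
    w = [0] * 64
    w[0] = 0x80000000   # the mandatory 1 bit that starts the padding
    w[15] = 512         # message length in bits (64 bytes)
    for t in range(16, 64):
        # sigma0/sigma1 as defined in FIPS 180-4, section 4.1.2
        sig0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        sig1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w[t] = (sig1 + w[t - 7] + sig0 + w[t - 16]) & 0xFFFFFFFF
    return w
```

Since all inputs here are fixed-length 64-byte blocks, this schedule (and even its sums with the round constants) never changes, which is exactly the saving the linked assembly exploits.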
We'll be using https://github.com/prysmaticlabs/hashtree at Prysm. It's still too early to adopt that library, but I'd be very happy to see some benchmarks from Nimbus if you test it.
v24.6.0 shipped hashtree.
@potuz found some nice cherries on the hash optimization tree: https://hackmd.io/@potuz/rJX9iD30u