Open FiloSottile opened 2 years ago
https://go.dev/cl/404174 is the promised ScalarBaseMult optimization, so it's possible that the assembly is now slower than the fiat-crypto code!
Change https://go.dev/cl/404174 mentions this issue: crypto/elliptic: precompute ScalarBaseMult doublings
Here are benchmark comparisons of noasm vs. asm, both using latest:
name  noasm time/op  asm time/op  delta
crypto/internal/nistec:
ScalarMult/P256 153µs ± 0% 145µs ± 0% -5.84% (p=1.000 n=1+1)
ScalarBaseMult/P256 45.2µs ± 0% 23.5µs ± 0% -48.14% (p=1.000 n=1+1)
crypto/elliptic:
ScalarBaseMult/P256 52.3µs ± 0% 38.3µs ± 0% -26.78% (p=1.000 n=1+1)
ScalarMult/P256 161µs ± 0% 160µs ± 0% -1.02% (p=1.000 n=1+1)
crypto/ecdsa:
Sign/P256 96.2µs ± 0% 87.8µs ± 0% -8.71% (p=1.000 n=1+1)
Verify/P256 212µs ± 0% 196µs ± 0% -7.43% (p=1.000 n=1+1)
GenerateKey/P256 53.8µs ± 0% 40.0µs ± 0% -25.63% (p=1.000 n=1+1)
No meaningful difference in the crypto/tls benchmarks. It looks like the assembly version is still significantly faster than the pure Go version for some benchmarks, most notably ScalarBaseMult and GenerateKey.
In https://github.com/golang/go/issues/52182#issuecomment-1099583629, @laboger reports that the fiat-crypto (https://github.com/golang/go/issues/40171) code with @pmur's compiler improvements (https://go.dev/cl/393656) is within range of the assembly performance!
This is extremely impressive considering the fiat-crypto code also uses safer but slower complete formulas and a somewhat naive 4-bit scalar multiplication window.
The ScalarBaseMult benchmark is still significantly slower, because the assembly uses a large precomputed table, while the fiat-crypto code just runs ScalarMult. This is very much fixable.
I will land the ScalarBaseMult optimization in the fiat-crypto code, and then we can remove the ppc64le assembly entirely!