golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

crypto/internal/nistec: remove ppc64le assembly #52424

Open FiloSottile opened 2 years ago

FiloSottile commented 2 years ago

In https://github.com/golang/go/issues/52182#issuecomment-1099583629, @laboger reports that the fiat-crypto (https://github.com/golang/go/issues/40171) code with @pmur's compiler improvements (https://go.dev/cl/393656) is within range of the assembly performance!

This is extremely impressive considering the fiat-crypto code also uses safer but slower complete formulas and a somewhat naive 4-bit scalar multiplication window.
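For readers unfamiliar with the term, a 4-bit fixed-window scalar multiplication can be sketched roughly as follows. This is a toy illustration, not the nistec code: the "curve" is replaced by integers modulo a small number under addition, and `point`, `add`, and `scalarMultWindow4` are hypothetical names chosen for the example. Real implementations typically also select table entries in constant time, which this sketch does not attempt.

```go
package main

import "fmt"

// m is the modulus of a toy additive group Z/mZ that stands in for the
// elliptic curve group. It is an assumption made purely for illustration.
const m = 1_000_003

// point is a toy group element; 0 is the identity.
type point uint64

// add is the toy group operation (stands in for point addition).
func add(a, b point) point { return (a + b) % m }

// scalarMultWindow4 computes scalar*p, reading the big-endian scalar
// four bits at a time.
func scalarMultWindow4(p point, scalar []byte) point {
	// table[i] = i*p for i in 0..15, built with 15 additions.
	var table [16]point
	for i := 1; i < 16; i++ {
		table[i] = add(table[i-1], p)
	}

	var acc point
	for _, b := range scalar {
		// Process the high nibble, then the low nibble of each byte.
		for _, w := range [2]byte{b >> 4, b & 0x0f} {
			// Four doublings shift the accumulator left by one window.
			for j := 0; j < 4; j++ {
				acc = add(acc, acc)
			}
			// NOTE: direct indexing is not constant-time; it is used
			// here only to keep the sketch short.
			acc = add(acc, table[w])
		}
	}
	return acc
}

func main() {
	scalar := []byte{0x12, 0x34, 0x56}
	fmt.Println(scalarMultWindow4(7, scalar))
}
```

Each 4-bit window costs four doublings plus one addition, so a 256-bit scalar needs 256 doublings and 64 additions on top of building the table.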

name                 fiat-crypto time/op  assembly time/op  delta
ScalarBaseMult/P256           237µs ± 0%         52µs ± 0%  -78.22%  (p=1.000 n=1+1)
ScalarMult/P256               239µs ± 0%        213µs ± 0%  -10.95%  (p=1.000 n=1+1)

The ScalarBaseMult benchmark is still significantly slower, because the assembly uses a large precomputed table, while the fiat-crypto code just runs ScalarMult. This is very much fixable.

I will land the ScalarBaseMult optimization in the fiat-crypto code, and then we can remove the ppc64le assembly entirely!
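The idea behind a precomputed table for the fixed base point can be sketched like this. It is a minimal illustration under the same toy assumptions (integers modulo a small number instead of curve points); `generator`, `baseTable`, and `scalarBaseMult` are hypothetical names, and the layout of the real tables may well differ.

```go
package main

import "fmt"

// m is the modulus of a toy additive group Z/mZ standing in for the curve.
const m = 1_000_003

// point is a toy group element; 0 is the identity.
type point uint64

func add(a, b point) point { return (a + b) % m }

// generator plays the role of the fixed base point G (an assumed value).
const generator point = 7

// windows is the number of 4-bit windows in the 32-bit toy scalar.
const windows = 8

// baseTable[i][j] = j * 16^i * generator. All doublings happen once,
// when the table is built, instead of on every multiplication.
var baseTable = buildBaseTable()

func buildBaseTable() [windows][16]point {
	var t [windows][16]point
	base := generator
	for i := 0; i < windows; i++ {
		for j := 1; j < 16; j++ {
			t[i][j] = add(t[i][j-1], base)
		}
		// Four doublings move base from 16^i*G to 16^(i+1)*G.
		for d := 0; d < 4; d++ {
			base = add(base, base)
		}
	}
	return t
}

// scalarBaseMult computes k*generator using one table lookup and one
// addition per window, with no doublings at all.
func scalarBaseMult(k uint32) point {
	var acc point
	for i := 0; i < windows; i++ {
		w := (k >> (4 * i)) & 0x0f
		// NOTE: direct indexing is not constant-time; illustration only.
		acc = add(acc, baseTable[i][w])
	}
	return acc
}

func main() {
	fmt.Println(scalarBaseMult(0x12345678))
}
```

The trade-off is memory for time: the table grows with the number of windows, but the per-call work drops to one addition per window, which is why a base-point multiplication can be much faster than a generic ScalarMult.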

FiloSottile commented 2 years ago

https://go.dev/cl/404174 is the promised ScalarBaseMult optimization, so it's possible that the assembly is now slower than the fiat-crypto code!

gopherbot commented 2 years ago

Change https://go.dev/cl/404174 mentions this issue: crypto/elliptic: precompute ScalarBaseMult doublings

laboger commented 2 years ago

Here are comparisons of noasm vs. asm using the latest code:

crypto/internal/nistec:
ScalarMult/P256         153µs ± 0%     145µs ± 0%   -5.84%  (p=1.000 n=1+1)
ScalarBaseMult/P256    45.2µs ± 0%    23.5µs ± 0%  -48.14%  (p=1.000 n=1+1)

crypto/elliptic:
ScalarBaseMult/P256    52.3µs ± 0%    38.3µs ± 0%  -26.78%  (p=1.000 n=1+1)
ScalarMult/P256         161µs ± 0%     160µs ± 0%   -1.02%  (p=1.000 n=1+1)

crypto/ecdsa:
Sign/P256           96.2µs ± 0%    87.8µs ± 0%   -8.71%  (p=1.000 n=1+1)
Verify/P256          212µs ± 0%     196µs ± 0%   -7.43%  (p=1.000 n=1+1)
GenerateKey/P256    53.8µs ± 0%    40.0µs ± 0%  -25.63%  (p=1.000 n=1+1)

No meaningful difference in the crypto/tls benchmarks. It looks like the assembly version is still significantly faster than the pure Go version on some benchmarks, notably ScalarBaseMult and GenerateKey.
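For anyone who wants rough P-256 numbers on their own machine, a small standalone harness along these lines may help. It is only a sketch: the numbers in this thread came from the packages' own benchmark suites, the function names here are hypothetical, and the program does not switch between the assembly and pure Go code paths.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchGenerateKey times P-256 key generation, which is dominated by a
// base-point scalar multiplication.
func benchGenerateKey(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if _, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader); err != nil {
			b.Fatal(err)
		}
	}
}

// benchSign times ECDSA signing of a fixed SHA-256 digest.
func benchSign(b *testing.B) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		b.Fatal(err)
	}
	digest := sha256.Sum256([]byte("benchmark message"))
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := ecdsa.SignASN1(rand.Reader, key, digest[:]); err != nil {
			b.Fatal(err)
		}
	}
}

// benchVerify times ECDSA verification of a single precomputed signature.
func benchVerify(b *testing.B) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		b.Fatal(err)
	}
	digest := sha256.Sum256([]byte("benchmark message"))
	sig, err := ecdsa.SignASN1(rand.Reader, key, digest[:])
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if !ecdsa.VerifyASN1(&key.PublicKey, digest[:], sig) {
			b.Fatal("signature did not verify")
		}
	}
}

func main() {
	benchmarks := []struct {
		name string
		fn   func(*testing.B)
	}{
		{"GenerateKey/P256", benchGenerateKey},
		{"Sign/P256", benchSign},
		{"Verify/P256", benchVerify},
	}
	for _, bm := range benchmarks {
		// testing.Benchmark picks b.N automatically and reports ns/op.
		fmt.Printf("%-18s %s\n", bm.name, testing.Benchmark(bm.fn))
	}
}
```

Single runs like the n=1+1 results above are noisy, so it is worth repeating the measurement several times before drawing conclusions from small deltas.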