Open krizhanovsky opened 5 years ago
Updated benchmarks for ECDSA (performance core on i9-12900HK):
$ openssl version
OpenSSL 3.0.2 15 Mar 2022 (Library: OpenSSL 3.0.2 15 Mar 2022)
$ taskset --cpu-list 2 openssl speed ecdsa
....
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-olCZw9/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0x7ffaf3ffffebffff:0x98c007bc239ca7eb
sign verify sign/s verify/s
160 bits ecdsa (secp160r1) 0.0001s 0.0001s 10476.5 9825.9
192 bits ecdsa (nistp192) 0.0001s 0.0001s 8490.5 8223.9
224 bits ecdsa (nistp224) 0.0000s 0.0001s 34613.6 16034.2
256 bits ecdsa (nistp256) 0.0000s 0.0000s 63743.3 20365.9
384 bits ecdsa (nistp384) 0.0005s 0.0004s 2097.6 2455.4
521 bits ecdsa (nistp521) 0.0002s 0.0003s 5812.7 2880.5
163 bits ecdsa (nistk163) 0.0001s 0.0002s 8635.0 4368.9
233 bits ecdsa (nistk233) 0.0002s 0.0003s 6390.0 3231.3
283 bits ecdsa (nistk283) 0.0003s 0.0006s 3569.2 1808.5
409 bits ecdsa (nistk409) 0.0005s 0.0010s 2060.4 1051.3
571 bits ecdsa (nistk571) 0.0011s 0.0021s 924.0 470.3
163 bits ecdsa (nistb163) 0.0001s 0.0002s 8257.3 4171.4
233 bits ecdsa (nistb233) 0.0002s 0.0003s 6005.8 3078.1
283 bits ecdsa (nistb283) 0.0003s 0.0006s 3367.1 1706.5
409 bits ecdsa (nistb409) 0.0005s 0.0010s 1946.0 989.7
571 bits ecdsa (nistb571) 0.0012s 0.0023s 858.7 437.6
256 bits ecdsa (brainpoolP256r1) 0.0002s 0.0002s 4942.6 4953.8
256 bits ecdsa (brainpoolP256t1) 0.0002s 0.0002s 4939.3 5119.0
384 bits ecdsa (brainpoolP384r1) 0.0005s 0.0004s 2080.8 2338.1
384 bits ecdsa (brainpoolP384t1) 0.0005s 0.0004s 2113.9 2474.1
512 bits ecdsa (brainpoolP512r1) 0.0008s 0.0007s 1243.8 1458.2
512 bits ecdsa (brainpoolP512t1) 0.0008s 0.0006s 1267.2 1556.7
(Results are basically the same).
Scope
The following algorithms must be implemented or optimized in Tempesta TLS:
[ ] [TLS 1.3, #1031] Curve25519 (Montgomery, already in the kernel) as defined in RFC 7748 for ECDHE. (EdDSA certificates don't seem to be widespread enough, see https://github.com/letsencrypt/boulder/issues/3649). See High-performance Implementation of Elliptic Curve Cryptography Using Vector Instructions by Armando Faz-Hernández
[x] ~ChaCha20_Poly1305 (already in the kernel, also required for QUIC, usually preferred by mobile devices) must be implemented. MbedTLS uses an additional callback stream_func in mbedtls_cipher_base_t, which is used by ChaCha20 and ARC4 only, but maybe we can find a better solution.~ This algorithm is also slower than AES (https://github.com/tempesta-tech/tempesta/issues/1031#issuecomment-519985387), so the first version can ship without it.
[ ] #1064 improves MPI, but doesn't take specific steps to improve RSA performance, so the algorithm must be optimized. See the rsaz code and the referenced papers in OpenSSL. RSA is asymmetric in cost: client computations are less expensive than the server's. RSA is not recommended in terms of performance and DDoS resistance, so it probably makes sense to abandon it, or at least to recommend that users avoid it, and not to spend much development effort on it.
[x] ~#1064 was focused on SECP 256 elliptic curve, so SECP 384 should also be profiled and optimized. https://w3.lasca.ic.unicamp.br/media/publications/ST4-1.pdf addresses the curve-specific optimizations.~ SECP 384 should be removed.
[ ] The Intel Ice Lake CPU family no longer has the dramatic downclocking on AVX-512, so explore algorithms for SECP 256 (Hernandez proposed AVX2), Curve25519 (the kernel uses a plain MULX implementation, see Hernandez and Hisil) and RSA for AVX-512 (also see for generic bigints).
[ ] The kernel AES-GCM optimizations. Use Karatsuba precomputations for AES in the same TLS connection (see the notes below and the paper TLS performance characterization on modern x86 CPUs). Also, the AVX-512 version of the VPCLMULQDQ instruction can be used for faster carry-less multiplication (the current OpenSSL doesn't use it either, though).
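The cost asymmetry mentioned in the RSA item can be illustrated by counting modular multiplications in plain square-and-multiply exponentiation. This is only a rough sketch: the public exponent 65537, the stand-in 2048-bit modulus and the alternating-bit private exponent are illustrative assumptions, not values from this issue, and real implementations (CRT, windowed exponentiation, the rsaz code) narrow the gap without removing it.

```python
def modexp_counting(base: int, exponent: int, modulus: int):
    """Left-to-right square-and-multiply; returns (result, multiplication count)."""
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:
        # One squaring per exponent bit.
        result = (result * result) % modulus
        mults += 1
        if bit == "1":
            # One extra multiplication per set bit.
            result = (result * base) % modulus
            mults += 1
    return result, mults


# Hypothetical stand-ins, NOT a real RSA key.
MODULUS = (1 << 2048) - 1          # 2048-bit stand-in modulus
PUBLIC_E = 65537                   # commonly used public exponent (assumption)
PRIVATE_D = int("10" * 1024, 2)    # 2048-bit exponent with half of its bits set
MESSAGE = 0x1234567

_, public_cost = modexp_counting(MESSAGE, PUBLIC_E, MODULUS)
_, private_cost = modexp_counting(MESSAGE, PRIVATE_D, MODULUS)

print(public_cost, private_cost)   # 19 3072
```

With these stand-in values the private-key operation (done by the server when signing) needs roughly 160 times more multiplications than the public-key operation (done by the client when verifying), which is why RSA handshakes are an attractive DDoS vector.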
Testing
tls/t for curve25519
test_tls_cert: certificates and handshakes for RSASSA_PSS and certificates for other EC (TTLS_PK_ECKEY vs TTLS_PK_ECKEY_DH and TTLS_PK_ECKEY_ECDSA)
Notes
Deprecation of SECP 384
SECP 384 is technically legacy, and x448 provides better performance (checked w/ OpenSSL).
It seems that OpenSSL doesn't optimize the curve at all, since even 521 has better performance. However, CA/B Forum Baseline Requirements section 6.1.5 requires certificates to be signed with either RSA or NIST curves of 256, 384 or 521 bits. Let's leave RSA for legacy usage and remove secp384 completely. Also note that ECDSA secp256 outperforms Ed25519 for signing, so we should keep secp256 to support EC certificates. ECDHE is faster with x25519.
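The observation that nistp384 trails even nistp521 can be checked against the `openssl speed ecdsa` rows quoted at the top of this issue. The sketch below parses three of those rows, copied verbatim; the regular expression and the `parse_speed` helper are illustrative and assume the row layout shown above.

```python
import re

# Rows copied from the `openssl speed ecdsa` output above (OpenSSL 3.0.2).
ROWS = """\
256 bits ecdsa (nistp256) 0.0000s 0.0000s 63743.3 20365.9
384 bits ecdsa (nistp384) 0.0005s 0.0004s 2097.6 2455.4
521 bits ecdsa (nistp521) 0.0002s 0.0003s 5812.7 2880.5
"""

# bits, curve name, sign time, verify time, sign/s, verify/s
ROW_RE = re.compile(
    r"^\s*(\d+) bits ecdsa \((\w+)\)\s+\S+s\s+\S+s\s+([\d.]+)\s+([\d.]+)\s*$"
)


def parse_speed(text: str) -> dict:
    """Map curve name -> (sign/s, verify/s) for every matching row."""
    rates = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if match:
            rates[match.group(2)] = (float(match.group(3)), float(match.group(4)))
    return rates


rates = parse_speed(ROWS)
for curve, (sign, verify) in sorted(rates.items(), key=lambda kv: -kv[1][0]):
    print(f"{curve:10s} sign/s={sign:9.1f} verify/s={verify:9.1f}")
```

Sorted by signing rate, nistp256 comes first, nistp521 second and nistp384 last, which matches the conclusion that the 384-bit curve gets no dedicated optimization in this OpenSSL build.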
AES-GCM precomputations for Karatsuba multiplication
The paper TLS performance characterization on modern x86 CPUs references two original Intel papers:
The header comments for the Linux implementation explicitly say that it was developed following these two papers. The first one mentions hash key precomputations: Htbl in OpenSSL crypto/modes/asm/ghash-x86_64.pl and HashKey* offsets in linux/arch/x86/crypto/aesni-intel_avx-x86_64.S, so these precomputations are used in both implementations. The second one proposes to precompute the carry-less multiplication of the Bh and Bl parts in Karatsuba multiplication. There is also the Intel paper Carry-Less Multiplication Instruction and its Usage for Computing the GCM Mode, which doesn't consider the precomputation optimizations.
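To make the Karatsuba precomputation concrete, here is a minimal Python sketch of a 128x128-bit carry-less multiplication where the key-dependent values are computed once and reused. It assumes that the cached per-key value is the XOR of the two 64-bit halves Bh and Bl, which is the only key-only term in the Karatsuba middle product; this is one reading of the optimization, not a transcription of the kernel or OpenSSL code. The function names are hypothetical, `clmul` is a bit-by-bit software stand-in for PCLMULQDQ, and the GHASH reduction and bit reflection are left out.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product; software stand-in for PCLMULQDQ."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def precompute_key(b: int):
    """Done once per hash key: split B and cache Bh ^ Bl (assumed cached term)."""
    bh, bl = b >> 64, b & MASK64
    return bh, bl, bh ^ bl


def karatsuba_clmul128(a: int, key) -> int:
    """128x128 -> 256-bit carry-less product using three 64-bit multiplications."""
    bh, bl, b_mid = key
    ah, al = a >> 64, a & MASK64
    hi = clmul(ah, bh)                        # Ah * Bh
    lo = clmul(al, bl)                        # Al * Bl
    mid = clmul(ah ^ al, b_mid) ^ hi ^ lo     # Ah*Bl ^ Al*Bh
    return (hi << 128) ^ (mid << 64) ^ lo


key = precompute_key(0x0F1E2D3C4B5A69788796A5B4C3D2E1F0)
block = 0x0123456789ABCDEFFEDCBA9876543210
assert karatsuba_clmul128(block, key) == clmul(block, 0x0F1E2D3C4B5A69788796A5B4C3D2E1F0)
```

Within one TLS connection the hash key is fixed, so `precompute_key` runs once and every subsequent block saves the XOR of the key halves on top of the multiplication that Karatsuba already saves.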