randombit / botan

Cryptography Toolkit
https://botan.randombit.net
BSD 2-Clause "Simplified" License
2.59k stars 567 forks source link

Low performance creating 4096 bit private key #1761

Closed ingmarkrause closed 5 years ago

ingmarkrause commented 5 years ago

Hello,

Switching from GnuTLS to Botan 2.4.0, we faced a big difference in the performance when creating a 4096 bit private key. While this operation takes 2-3 seconds with GnuTLS, it takes 40-180 seconds with Botan.

This is the used code:

Botan::AutoSeeded_RNG rng;
Botan::RSA_PrivateKey privKey( rng, bitCount );

where bitCount=4096. We are using VS2017 (VS toolset v141), and dynamically link to botan.dll.

When trying to create a key via the CLI tool, we achieved similar results (40-180 seconds): .\botan-cli.exe keygen --params=4096

I did some profiling and found the following call tree:

capture

Looks as if the Montgomery_Exponentiator::execute spends most of the time, especially in BigInt::const_time_lookup()and bigint_monty_sqr()

Side information: botan-cli.exe speed RSA result in:

RSA-1024 3 keygen/sec; 256.35 ms/op 703591 cycles/op (1 op in 256 ms)
RSA-1024 EME-PKCS1-v1_5 5517 encrypt/sec; 0.18 ms/op 110.29 cycles/op (556 ops in 100.78 ms)
RSA-1024 EME-PKCS1-v1_5 648 decrypt/sec; 1.54 ms/op 4176.11 cycles/op (66 ops in 101.72 ms)
RSA-1024 OAEP(SHA-1) 5039 encrypt/sec; 0.20 ms/op 126.99 cycles/op (508 ops in 100.81 ms)
RSA-1024 OAEP(SHA-1) 727 decrypt/sec; 1.37 ms/op 3868.96 cycles/op (74 ops in 101.69 ms)
RSA-1024 EMSA-PKCS1-v1_5(SHA-1) 667 sign/sec; 1.50 ms/op 4025.61 cycles/op (67 ops in 100.39 ms)
RSA-1024 EMSA-PKCS1-v1_5(SHA-1) 5533 verify/sec; 0.18 ms/op 111.57 cycles/op (554 ops in 100.11 ms)
RSA-1024 PSSR(SHA-256) 660 sign/sec; 1.51 ms/op 3971.37 cycles/op (67 ops in 101.46 ms)
RSA-1024 PSSR(SHA-256) 5537 verify/sec; 0.18 ms/op 115.71 cycles/op (554 ops in 100.04 ms)
RSA-2048 0 keygen/sec; 3903.61 ms/op 10705416 cycles/op (1 op in 3904 ms)
RSA-2048 EME-PKCS1-v1_5 1505 encrypt/sec; 0.66 ms/op 1243.03 cycles/op (152 ops in 100.95 ms)
RSA-2048 EME-PKCS1-v1_5 113 decrypt/sec; 8.79 ms/op 24075 cycles/op (12 ops in 106 ms)
RSA-2048 OAEP(SHA-1) 1489 encrypt/sec; 0.67 ms/op 1255.93 cycles/op (149 ops in 100.00 ms)
RSA-2048 OAEP(SHA-1) 111 decrypt/sec; 8.96 ms/op 24764 cycles/op (12 ops in 107 ms)
RSA-2048 EMSA-PKCS1-v1_5(SHA-1) 113 sign/sec; 8.79 ms/op 24806 cycles/op (12 ops in 106 ms)
RSA-2048 EMSA-PKCS1-v1_5(SHA-1) 1471 verify/sec; 0.68 ms/op 1255.78 cycles/op (150 ops in 101.95 ms)
RSA-2048 PSSR(SHA-256) 117 sign/sec; 8.54 ms/op 24005 cycles/op (12 ops in 102 ms)
RSA-2048 PSSR(SHA-256) 1464 verify/sec; 0.68 ms/op 1283.46 cycles/op (148 ops in 101.02 ms)
RSA-3072 0 keygen/sec; 40974.98 ms/op 112357794 cycles/op (1 op in 40975 ms)
RSA-3072 EME-PKCS1-v1_5 585 encrypt/sec; 1.71 ms/op 4741.63 cycles/op (59 ops in 100.73 ms)
RSA-3072 EME-PKCS1-v1_5 27 decrypt/sec; 36.15 ms/op 98694 cycles/op (4 ops in 145 ms)
RSA-3072 OAEP(SHA-1) 580 encrypt/sec; 1.72 ms/op 4732.00 cycles/op (59 ops in 101.69 ms)
RSA-3072 OAEP(SHA-1) 28 decrypt/sec; 35.24 ms/op 96310 cycles/op (3 ops in 106 ms)
RSA-3072 EMSA-PKCS1-v1_5(SHA-1) 31 sign/sec; 31.91 ms/op 88186 cycles/op (4 ops in 128 ms)
RSA-3072 EMSA-PKCS1-v1_5(SHA-1) 615 verify/sec; 1.62 ms/op 4478.39 cycles/op (62 ops in 100.74 ms)
RSA-3072 PSSR(SHA-256) 31 sign/sec; 31.92 ms/op 88307 cycles/op (4 ops in 128 ms)
RSA-3072 PSSR(SHA-256) 609 verify/sec; 1.64 ms/op 4498.03 cycles/op (62 ops in 101.72 ms)
RSA-4096 0 keygen/sec; 68394.63 ms/op 187546543 cycles/op (1 op in 68395 ms)
RSA-4096 EME-PKCS1-v1_5 422 encrypt/sec; 2.37 ms/op 6589.26 cycles/op (43 ops in 101.74 ms)
RSA-4096 EME-PKCS1-v1_5 16 decrypt/sec; 59.34 ms/op 161341 cycles/op (2 ops in 119 ms)
RSA-4096 OAEP(SHA-1) 416 encrypt/sec; 2.40 ms/op 6576.86 cycles/op (42 ops in 100.74 ms)
RSA-4096 OAEP(SHA-1) 16 decrypt/sec; 58.86 ms/op 160777 cycles/op (2 ops in 118 ms)
RSA-4096 EMSA-PKCS1-v1_5(SHA-1) 17 sign/sec; 58.34 ms/op 160877 cycles/op (2 ops in 117 ms)
RSA-4096 EMSA-PKCS1-v1_5(SHA-1) 416 verify/sec; 2.40 ms/op 6543.86 cycles/op (42 ops in 100.74 ms)
RSA-4096 PSSR(SHA-256) 16 sign/sec; 59.85 ms/op 163815 cycles/op (2 ops in 120 ms)
RSA-4096 PSSR(SHA-256) 416 verify/sec; 2.40 ms/op 6577.86 cycles/op (42 ops in 100.74 ms)

When building botan, configure.py results in the following output:

   INFO: configure.py invoked with options "--enable-shared --cc=msvc --cpu=x86 --module-policy=bsi --enable-modules=tls,x509,sha1_sse2"
   INFO: Guessing target OS is windows (use --os to set)
   INFO: Canonicalized CPU target x86 to x86_32/x86_32
   INFO: Auto-detected compiler version 19.15
   INFO: Target is msvc:19.15-windows-x86_32-x86_32
   INFO: Skipping (incompatible OS): darwin_secrandom dev_random fd_unix getentropy proc_walk
   INFO: Skipping (incompatible compiler): aes_armv8 pmull sha1_armv8 sha1_x86 sha2_32_armv8 sha2_32_x86 shacal2_x86 threefish_avx2
   INFO: Skipping (not requested): adler32 aont certstor_sql certstor_sqlite3 codec_filt compression crc24 crc32 cryptobox dyn_load eme_raw ffi filters fpe_fe1 hotp nist_keywrap passhash9 pbkdf1 pgp_s2k pkcs11 psk_db rfc3394 sessions_sql sessions_sqlite3 srp6 tls_cbc tss xts
   INFO: Skipping (prohibited by module policy): aria bcrypt blake2 blowfish camellia cascade cast cbc_mac ccm cecpq1 cfb chacha chacha20poly1305 chacha_rng chacha_sse2 comb4p curve25519 des eax ed25519 elgamal emsa_raw emsa_x931 gost_28147 gost_3410 gost_3411 hkdf idea idea_sse2 kasumi kdf1 kdf2 keccak lion mce mceies md4 misty1 newhope noekeon noekeon_simd ocb ofb poly1305 prf_x942 rc4 rfc6979 rmd160 salsa20 seed serpent serpent_simd shacal2 shacal2_simd shacal2_x86 shake shake_cipher siphash siv skein sm2 sm3 sm4 sp800_56a streebog threefish threefish_avx2 tiger twofish whirlpool x919_mac xtea
   INFO: Skipping (requires external dependency): bearssl boost bzip2 lzma openssl sqlite3 tpm zlib
   INFO: Loading modules: aead aes aes_ni aes_ssse3 asn1 auto_rng base base64 bigint block cbc clmul clmul_ssse3 cmac cpuid ctr dh dl_algo dl_group dlies dsa ec_gfp ec_group ecc_key ecdh ecdsa ecgdsa ecies eckcdsa eme_oaep eme_pkcs1 emsa1 emsa_pkcs1 emsa_pssr entropy gcm gmac hash hash_id hex hmac hmac_drbg http_util iso9796 kdf kdf1_iso18033 keypair locking_allocator mac md5 mdx_hash mgf1 mode_pad modes mp numbertheory par_hash pbes2 pbkdf pbkdf2 pem pk_pad poly_dbl prf_tls pubkey rdrand rdrand_rng rdseed rng rsa sha1 sha1_sse2 sha2_32 sha2_64 sha3 simd socket sp800_108 sp800_56c stateful_rng stream system_rng tls utils win32_stats x509 xmss
   INFO: Using copy to link files into build dir (use --link-method to change)
   INFO: Botan 2.4.0 (revision git:a18c7979dab3a0b42cf0ee812c759e7d2fe23922) (unreleased undated) build setup is complete

Example for the resulting compilation:

cl /MD /bigobj  /DBOTAN_DLL=__declspec(dllexport) /EHs /GR /D_ENABLE_EXTENDED_ALIGNED_STORAGE /O2 /Oi /W4 /wd4250 /wd4251 /wd4275 /Ibuild\include /nologo /c src/lib/asn1/asn1_attribute.cpp /Fobuild\obj\lib\asn1_attribute.obj asn1_attribute.cpp
webmaster128 commented 5 years ago

~This is 32 bit, right?~ RSA in Windows 32 bit is very slow. Linux 32 and Windows 64 are both way better.

randombit commented 5 years ago

If at all possible use 2.8, as many optimizations have been made to RSA key generation since 2.4.0 (#1542 #1413 and several others). On my desktop (x86-64 Linux with GCC), 2.4 takes 6-10 seconds to generate a 4K bit RSA key, master is more like .3-1 second for same operation.

But @webmaster128 is correct MSVC 32-bit is the least well optimized of the options, we have very little inline asm for MSVC. That said you should still see a big improvement using latest release since most of the optimizations were algorithmic rather than additional inline asm. Compiling with GCC for 32-bit x86 with inline asm disabled, I see 4K keygen taking between ~ 2 and 5 seconds.

(RSA sign and verify are also 1.5-3x faster in 2.8 compared to 2.4, we have been doing a lot of optimization work in the last year, so that's another reason to upgrade.)

ingmarkrause commented 5 years ago

Thanks. 2.8 is indeed much faster (on my machine factor 20). In our case, it's currently not possible to switch to that new version, but probably in the near future. Here my results regarding the RSA 4096 key generation speed:

2.4.0: RSA-4096 0 keygen/sec; 54823.87 ms/op 150335295 cycles/op (1 op in 54824 ms)
2.8.0: RSA-4096 0 keygen/sec; 2817.95 ms/op 7727325 cycles/op (1 op in 2818 ms)
webmaster128 commented 5 years ago

Wow, happy to hear that! That's a huge win given people install fresh 32 bit versions of Windows 10 in 2018 -.- Will check RSA keygen times again. Thanks and congratz to that achievement!

ingmarkrause commented 5 years ago

@randombit: Can you give me a hint, which commit(s) where responsible for the performance improvement between 2.4.0 and 2.8.0? Or is it too distributed over the commits? Thanks in advance!

randombit commented 5 years ago

1413 and #1542 are probably the most important but #1564 #1523 #1472 and others improved the generic integer math and are probably contributing somewhat to the improvement. If you are going to backport I'd try just #1413 first since that landed right after 2.4 was released, and so should apply cleanly. #1542 arrived in 2.7 and may depend on other patches that were added between 2.4 and 2.7, so might require some work to get it applied to 2.4