Thanks @hannesm for that work! The rodata section has increased by about 23 kB (0x9780 B to 0xf3c0 B), which seems correct since the constant precomputed table has gone from 15 × 2 × 32 = 960 B to 32 × 8 × 3 × 32 ≈ 24 kB (a small size-check sketch follows the numbers below). This seems affordable to me :) I may have done something wrong, but I don't see a significant improvement :( (only one run, I'll retry several times to avoid a potential side effect in my VMs). (This is current head vs this PR.)
op | current head (ops/s) | this PR (ops/s)
---|---|---
ecdsa-generate P224 | 1458.572 (12667 iters in 8.685 s) | 1272.979 (12664 iters in 9.948 s)
ecdsa-generate P256 | 1465.086 (11918 iters in 8.135 s) | 1189.996 (11930 iters in 10.025 s)
ecdsa-generate P384 | 355.013 (3546 iters in 9.988 s) | 354.965 (3543 iters in 9.981 s)
ecdsa-generate P521 | 122.818 (1229 iters in 10.007 s) | 122.783 (1226 iters in 9.985 s)
ecdsa-generate Ed25519 | 16974.709 (172413 iters in 10.157 s) | 16969.520 (163934 iters in 9.660 s)
ecdsa-sign P224 | 1101.906 (10981 iters in 9.965 s) | 1100.878 (10936 iters in 9.934 s)
ecdsa-sign P256 | 1026.197 (10250 iters in 9.988 s) | 1024.318 (10199 iters in 9.957 s)
ecdsa-sign P384 | 320.084 (3201 iters in 10.000 s) | 320.348 (3201 iters in 9.992 s)
ecdsa-sign P521 | 97.783 (974 iters in 9.961 s) | 98.176 (977 iters in 9.951 s)
ecdsa-sign Ed25519 | 8371.510 (83333 iters in 9.954 s) | 8367.399 (82918 iters in 9.910 s)
ecdsa-verify P224 | 406.745 (4070 iters in 10.006 s) | 406.703 (4061 iters in 9.985 s)
ecdsa-verify P256 | 379.167 (3799 iters in 10.019 s) | 379.098 (3763 iters in 9.926 s)
ecdsa-verify P384 | 114.519 (1143 iters in 9.981 s) | 114.566 (1142 iters in 9.968 s)
ecdsa-verify P521 | 40.434 (404 iters in 9.992 s) | 40.432 (403 iters in 9.967 s)
ecdsa-verify Ed25519 | 5550.770 (54288 iters in 9.780 s) | 5539.897 (53763 iters in 9.705 s)
ecdh-secret P224 | 1193.372 (11938 iters in 10.004 s) | 1194.816 (11947 iters in 9.999 s)
ecdh-secret P256 | 1110.876 (11101 iters in 9.993 s) | 1109.396 (11086 iters in 9.993 s)
ecdh-secret P384 | 338.256 (3380 iters in 9.992 s) | 338.534 (3382 iters in 9.990 s)
ecdh-secret P521 | 118.926 (1194 iters in 10.040 s) | 118.943 (1186 iters in 9.971 s)
ecdh-secret X25519 | 9282.841 (92764 iters in 9.993 s) | 9301.330 (90252 iters in 9.703 s)
ecdh-share P224 | 1264.362 (11933 iters in 9.438 s) | 1186.644 (11684 iters in 9.846 s)
ecdh-share P256 | 1109.354 (11059 iters in 9.969 s) | 1109.134 (11116 iters in 10.022 s)
ecdh-share P384 | 336.237 (3385 iters in 10.067 s) | 358.229 (3359 iters in 9.377 s)
ecdh-share P521 | 119.199 (1165 iters in 9.774 s) | 121.154 (1200 iters in 9.905 s)
ecdh-share X25519 | 9362.748 (93457 iters in 9.982 s) | 9345.699 (92936 iters in 9.944 s)
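For the record, the table-size arithmetic above can be double-checked with a tiny C program. The array shapes below are assumptions chosen only to match the byte counts quoted (15 points of 2 field elements vs. 32 windows of 8 points of 3 field elements, 32 bytes per field element); they are not the actual mirage-crypto declarations.

```c
#include <stdio.h>
#include <stdint.h>

typedef uint8_t fe_bytes[32];        /* one serialized field element */

fe_bytes small_table[15][2];         /* assumed old layout: 15 * 2 * 32     =   960 B */
fe_bytes big_table[32][8][3];        /* assumed new layout: 32 * 8 * 3 * 32 = 24576 B */

int main(void) {
  size_t delta = sizeof(big_table) - sizeof(small_table);
  printf("small table: %5zu B\n", sizeof(small_table));
  printf("big table:   %5zu B\n", sizeof(big_table));
  printf("delta:       %5zu B (~%zu kB)\n", delta, delta / 1024);
  return 0;
}
```

The delta, 23616 B, matches the observed rodata growth of roughly 23 kB.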
My results are (again, on my laptop, an i7-5600U at 2.60 GHz):
op | main (36bc72f), ops/s | PR#196 (0ba3c80), ops/s | PR#196 / main | OpenSSL, ops/s |
---|---|---|---|---|
generate | 23690.458 | 42317.766 | 1.79 | |
sign | 12088.631 | 20398.831 | 1.69 | 21942.7 |
verify | 8887.761 | 10203.711 | 1.15 | 7080.6 |
dh-secret | 15478.436 | 15747.423 | 1.02 | |
dh-share | 15949.516 | 16072.456 | 1.01 | 26200.7 |
Thanks, I surely failed somewhere, I'll check twice! (And maybe my VM lacks some CPU features under Qubes?) For me, the speed improvement is worth the extra 23 kB of rodata :) and I see that this PR challenges OpenSSL on some operations!
Nice work! Though I also have a hard time reproducing the improvement: on my AMD Ryzen 7 3700X (3.6 GHz), outside of any VM environment, I get the following:
op | main (36bc72f), ops/s | PR#196 (0ba3c80), ops/s | PR#196 / main |
---|---|---|---|
generate | 43652 | 36535 | 0.84 |
sign | 21370 | 17937 | 0.84 |
verify | 14512 | 13610 | 0.93 |
dh-secret | 26489 | 26658 | 1.01 |
dh-share | 26484 | 26774 | 1.01 |
I retried e.g. `generate` multiple times to see if there was any side effect, but I consistently get a similar ratio. I'll try with a lower-end machine as well to see if there is a difference.
I think my numbers are sensitive to side effects (I started with 10 runs in a row and tried to limit my own impact while running the tests); I have a large variance (from 1× to 1.8× across different runs), both with the main branch and this PR :(
@palainp thanks for your numbers (also to @Firobe and @dinosaure) and your comment.
Do you see a lot of variance as well when executing "openssl speed ed25519 x25519"?
I updated my comment above to include the OpenSSL performance numbers. I relaunched OpenSSL and mirage-crypto several times and the results seem stable for me.
It does, but the amplitude is smaller (13500-18000 for x25519, 17000-18900 for ed25519-sign and 6700-7400 for ed25519-verify):
```
$ for i in `seq 1 10` ; do openssl speed ed25519 ecdhx25519 2>/dev/null | grep "253 bits"; done
253 bits ecdh (X25519) 0.0001s 14835.3
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17368.4 6777.5
253 bits ecdh (X25519) 0.0001s 18058.8
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 18919.5 6698.8
253 bits ecdh (X25519) 0.0001s 14456.1
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17022.5 6714.8
253 bits ecdh (X25519) 0.0001s 15185.1
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17378.2 6787.4
253 bits ecdh (X25519) 0.0001s 14219.3
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17355.8 7362.4
253 bits ecdh (X25519) 0.0001s 15398.2
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17353.6 7103.8
253 bits ecdh (X25519) 0.0001s 15570.9
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17346.1 6729.5
253 bits ecdh (X25519) 0.0001s 14264.1
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17318.0 6733.5
253 bits ecdh (X25519) 0.0001s 13489.1
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17091.9 6815.3
253 bits ecdh (X25519) 0.0001s 15376.2
253 bits EdDSA (Ed25519) 0.0001s 0.0001s 17366.7 7401.4
```
EDIT: well the range 13500-18000 is also quite a big range. What I had with mirage-crypto was:
Thanks again, so I'm rather undecided how to move forward. This PR clearly shows that performance depends a lot on the CPU being evaluated (well, plus the C compiler etc. -- keep in mind this change includes `#if defined(__clang__) __asm__("" : "+m" (t_bytes) : /* no inputs */ ); #endif`).
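For readers who have not seen this idiom before, here is a minimal, self-contained sketch of such a clang-only compiler barrier, in the spirit of the snippet above. The function name, array shapes and mask convention are illustrative assumptions rather than the actual mirage-crypto/BoringSSL source; the point is only that an empty `__asm__` with a `"+m"` constraint makes the compiler treat the buffer as read and written in memory, so it cannot fold or branch over the constant-time selection loop that follows.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Accumulate exactly one of eight 96-byte table entries into out; masks[i] is
 * expected to be 0xff for the selected entry and 0x00 otherwise, so no
 * data-dependent branch is needed. */
void select_entry(uint8_t out[96], const uint8_t table[8][96],
                  const uint8_t masks[8]) {
  uint8_t t_bytes[96];
  memset(t_bytes, 0, sizeof(t_bytes));
#if defined(__clang__)
  /* Empty asm claiming to read and write the whole array: forces clang to
   * materialize t_bytes here instead of optimizing across the loop below. */
  __asm__("" : "+m"(t_bytes) : /* no inputs */);
#endif
  for (int i = 0; i < 8; i++) {
    for (size_t j = 0; j < sizeof(t_bytes); j++) {
      t_bytes[j] ^= table[i][j] & masks[i];
    }
  }
  memcpy(out, t_bytes, sizeof(t_bytes));
}
```

The GCC bug report linked further down in this thread concerns the same pattern on GCC.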
I'm undecided whether there is a specific CPU we should take as "if it is faster on that CPU, then merge the PR". A speedup of 1.7 (the sign operation on my laptop) is nice, but if the same change results in a slowdown to 0.8 for @dinosaure and @Firobe, what is the value?
Does anyone who is deeper into performance have an opinion on this? After all, this PR adds quite some data (23 kB) to the binary.
Could someone with such a fancy new processor try only the first commit and post some measurements? From the commit message and the BoringSSL commit message, my intuition is that it should result in less work and thus better performance.
I get the following:
op | main (36bc72f), ops/s | PR#196 (4e8790da), ops/s | PR#196 / main |
---|---|---|---|
generate | 42355 | 42303 | 1.00 |
sign | 20580 | 20719 | 0.99 |
verify | 13952 | 14013 | 1.00 |
dh-secret | 25748 | 25898 | 0.99 |
dh-share | 26155 | 26221 | 1.00 |
which is within the margin of error of "no overhead". So if there is a noticeable improvement in other environments, I think it's good enough to be considered.
I just booted a live USB with Ubuntu 22.04 and I can confirm similar numbers and ratios (EDIT: similar to @Firobe's, the speed is almost the same as main) on my i7-1365U (yes, the same processor as the one with the lowest performance :) ) with only the first commit.
As for the C pragma comment, I wondered why GCC cannot benefit from that, which led me to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111774 (so it seems that enabling the same `__asm__` for GCC helps a bit, but it's not in the BoringSSL main tree :( ).
I removed the second commit (big 25519 tables) from this PR. I guess the first commit (fewer multiplications in square root) is not controversial.
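For context only, a sketch of the standard technique behind such a change (the actual commit is not quoted here, so details may differ): Ed25519 point decompression has to find an $x$ with $x^2 = u/v \pmod p$, where $u = y^2 - 1$, $v = d\,y^2 + 1$ and $p = 2^{255} - 19$. Instead of inverting $v$ and then taking a square root in two separate exponentiations, RFC 8032 computes the candidate root directly as

$$x = u\,v^3\,\bigl(u\,v^7\bigr)^{(p-5)/8} \pmod p,$$

followed by a check that $v x^2 = \pm u$. Folding the inversion into a single exponentiation saves a full field exponentiation at the cost of only a handful of extra multiplications, which is the kind of saving "fewer multiplications in square root" suggests.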
The other commit (big tables) increases code complexity and binary size in exchange for better speed on some CPUs, and it is unclear where it is worth it. It is still unclear to me how to decide on such controversial improvements (ones that are not improvements on all CPUs). When in doubt, I prefer less code (and complexity) and a smaller binary (of course, some configuration-time selection could be done as well, but to be honest, that would make the code paths even more complex, and I'd like to avoid it).
So, bottom line: I pushed the big table to https://github.com/hannesm/mirage-crypto/tree/25519-big-table - we can always revive that commit if there's an urgent need from someone.
See the first and last items of #193, which are implemented here. Comments about the tradeoff (code size vs performance) are welcome.