mirage / mirage-crypto

Cryptographic primitives for OCaml, in OCaml (also used in MirageOS)
ISC License
77 stars 43 forks source link

improvements for 25519 #196

Closed hannesm closed 9 months ago

hannesm commented 9 months ago

see first and last item of #193 which are implemented here. comments about the tradeoff (code size vs performance) are welcome.

palainp commented 9 months ago

Thanks @hannesm for that work! The rodata section has increased for about 23k (0x9780 B to 0xf3c0 B) which seems to be correct as the constant precomputed table has gone from 15232 = 960 B to 3283*32 = 24 kB. This seems affordable to me :) I may have done something wrong, but I don't see significant improvement :( (only one run, I'll retry several to avoid a potential side effect in my VMs). (this is current head vs this pr)

* [ecdsa-generate]  * [ecdsa-generate]
    P224:  1458.572 ops per second (12667 iters in 8.685)       P224:  1272.979 ops per second (12664 iters in 9.948)
    P256:  1465.086 ops per second (11918 iters in 8.135)       P256:  1189.996 ops per second (11930 iters in 10.025)
    P384:  355.013 ops per second (3546 iters in 9.988)     P384:  354.965 ops per second (3543 iters in 9.981)
    P521:  122.818 ops per second (1229 iters in 10.007)        P521:  122.783 ops per second (1226 iters in 9.985)
    Ed25519:  16974.709 ops per second (172413 iters in 10.157)     Ed25519:  16969.520 ops per second (163934 iters in 9.660)

* [ecdsa-sign]  * [ecdsa-sign]
    P224:  1101.906 ops per second (10981 iters in 9.965)       P224:  1100.878 ops per second (10936 iters in 9.934)
    P256:  1026.197 ops per second (10250 iters in 9.988)       P256:  1024.318 ops per second (10199 iters in 9.957)
    P384:  320.084 ops per second (3201 iters in 10.000)        P384:  320.348 ops per second (3201 iters in 9.992)
    P521:  97.783 ops per second (974 iters in 9.961)       P521:  98.176 ops per second (977 iters in 9.951)
    Ed25519:  8371.510 ops per second (83333 iters in 9.954)        Ed25519:  8367.399 ops per second (82918 iters in 9.910)

* [ecdsa-verify]    * [ecdsa-verify]
    P224:  406.745 ops per second (4070 iters in 10.006)        P224:  406.703 ops per second (4061 iters in 9.985)
    P256:  379.167 ops per second (3799 iters in 10.019)        P256:  379.098 ops per second (3763 iters in 9.926)
    P384:  114.519 ops per second (1143 iters in 9.981)     P384:  114.566 ops per second (1142 iters in 9.968)
    P521:  40.434 ops per second (404 iters in 9.992)       P521:  40.432 ops per second (403 iters in 9.967)
    Ed25519:  5550.770 ops per second (54288 iters in 9.780)        Ed25519:  5539.897 ops per second (53763 iters in 9.705)

* [ecdh-secret] * [ecdh-secret]
    P224:  1193.372 ops per second (11938 iters in 10.004)      P224:  1194.816 ops per second (11947 iters in 9.999)
    P256:  1110.876 ops per second (11101 iters in 9.993)       P256:  1109.396 ops per second (11086 iters in 9.993)
    P384:  338.256 ops per second (3380 iters in 9.992)     P384:  338.534 ops per second (3382 iters in 9.990)
    P521:  118.926 ops per second (1194 iters in 10.040)        P521:  118.943 ops per second (1186 iters in 9.971)
    X25519:  9282.841 ops per second (92764 iters in 9.993)     X25519:  9301.330 ops per second (90252 iters in 9.703)

* [ecdh-share]  * [ecdh-share]
    P224:  1264.362 ops per second (11933 iters in 9.438)       P224:  1186.644 ops per second (11684 iters in 9.846)
    P256:  1109.354 ops per second (11059 iters in 9.969)       P256:  1109.134 ops per second (11116 iters in 10.022)
    P384:  336.237 ops per second (3385 iters in 10.067)        P384:  358.229 ops per second (3359 iters in 9.377)
    P521:  119.199 ops per second (1165 iters in 9.774)     P521:  121.154 ops per second (1200 iters in 9.905)
    X25519:  9362.748 ops per second (93457 iters in 9.982)     X25519:  9345.699 ops per second (92936 iters in 9.944)
hannesm commented 9 months ago

My results are (again, on my laptop i7-5600U, 2.60GHz)

op main (36bc72f) PR#196 (0ba3c80) PR#196 / main OpenSSL
generate 23690.458 42317.766 1.79
sign 12088.631 20398.831 1.69 21942.7
verify 8887.761 10203.711 1.15 7080.6
dh-secret 15478.436 15747.423 1.02
dh-share 15949.516 16072.456 1.01 26200.7
palainp commented 9 months ago

Thanks I surely failed somewhere, I'll check twice! (And maybe my VM lacks some CPU features with Qubes?) For me, the speed improvement is worth the extra 23kB of rodata :) and I see that this pr challenges openSSL on some operations!

Firobe commented 9 months ago

Nice work! Though I also have a hard time reproducing the improvement: on my AMD Ryzen 7 3700X (3.6GHz), outside of any VM environment, I get the following:

op main (36bc72f) PR#196 (0ba3c80) PR#196 / main
generate 43652 36535 0.84
sign 21370 17937 0.84
verify 14512 13610 0.93
dh-secret 26489 26658 1.01
dh-share 26484 26774 1.01

I retried e‧g. generate multiple times to see if there was any side effect but I consistently get a similar ratio

I'll try with a lower-end machine as well to see if there is a difference

dinosaure commented 9 months ago
op main (36bc72f) PR#196 (0ba3c80) PR#196 / main OpenSSL
generate 62885 50135 0.80
sign 31090 24879 0.80 39678
verify 21694 19922 0.92 14442
dh-secret 38624 38854 1.00
dh-share 38961 39036 1.00 46380

I got the same result than @Firobe on AMD Ryzen 9 7950X (4.5 GHz) (no virtualisation too).

palainp commented 9 months ago

I think my numbers are sensitive to side effects (I started with 10 runs on a row and tried to limit my impact when running tests), I have a large variance (from 1 to 1.8x on different runs), both with the main branch and this pr :(

hannesm commented 9 months ago

@palainp thanks for your numbers (also to @Firobe and @dinosaure) and your comment.

Do you see a lot of variance as well when executing "openssl speed ed25519 x25519"?

dinosaure commented 9 months ago

I updated my results into my comment to integrate OpenSSL performances. I relaunch several time OpenSSL and mirage-crypto and results seem stable for me.

palainp commented 9 months ago

It does, but the amplitude is smaller (13500-18000 for x25519, 17000-18900 for ed25519-sign and 6700-7400 for ed25519-verify):

$ for i in `seq 1 10` ; do openssl speed ed25519 ecdhx25519 2>/dev/null | grep "253 bits"; done
 253 bits ecdh (X25519)   0.0001s  14835.3
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17368.4   6777.5
 253 bits ecdh (X25519)   0.0001s  18058.8
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  18919.5   6698.8
 253 bits ecdh (X25519)   0.0001s  14456.1
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17022.5   6714.8
 253 bits ecdh (X25519)   0.0001s  15185.1
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17378.2   6787.4
 253 bits ecdh (X25519)   0.0001s  14219.3
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17355.8   7362.4
 253 bits ecdh (X25519)   0.0001s  15398.2
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17353.6   7103.8
 253 bits ecdh (X25519)   0.0001s  15570.9
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17346.1   6729.5
 253 bits ecdh (X25519)   0.0001s  14264.1
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17318.0   6733.5
 253 bits ecdh (X25519)   0.0001s  13489.1
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17091.9   6815.3
 253 bits ecdh (X25519)   0.0001s  15376.2
 253 bits EdDSA (Ed25519)   0.0001s   0.0001s  17366.7   7401.4

EDIT: well the range 13500-18000 is also quite a big range. What I had with mirage-crypto was:

hannesm commented 9 months ago

Thanks again, so I'm rather undecided how to move. This PR clearly shows that performance depends a lot on the CPU being evaluated (well, plus C compiler etc. -- keep in mind this change includes #if defined(__clang__) __asm__("" : "+," (t_bytes) : /* no inputs */ ); #endif.

I'm undecided whether there is a specific CPU we should take as "if it is faster on that CPU, then merge the PR". A speedup of 1.7 (the sign operation on my laptop) is nice, but if the same change results in a negative speedup of 0.8 for @dinosaure and @Firobe - what is the value?

Anyone who is more deep into performance has an opinion on this? After all, this PR adds quite some data (23kB) to the binary.

hannesm commented 9 months ago

Could someone with such a fancy new processor give a try at the first commit only, and post some measurements? From the commit message and BoringSSL commit message, my intuition is that it should result in fewer work and thus more performance.

Firobe commented 9 months ago

I get the following:

op main (36bc72f) PR#196 4e8790da PR#196 / main
generate 42355 42303 1.00
sign 20580 20719 0.99
verify 13952 14013 1.00
dh-secret 25748 25898 0.99
dh-share 26155 26221 1.00

which is within the margin of error of "no overhead". So if there is noticeable improvement in other environment, I think it's good enough to be considered

palainp commented 9 months ago

I just started a USB live with Ubuntu 22.04 and I can confirm a similar numbers and ratio (EDIT: similar as @Firobe , the speed is almost the same as main) on my i7-1365U (yes the same processor as the one with the lowest performances :) ) with only the first commit. As for the C pragma comment, I wondered why gcc cannot benefits from that, which leads to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111774 (So it seems that enabling the same __asm__ for GCC is helping a bit, but it's not in the boringSSL main tree :( ).

hannesm commented 9 months ago

I removed the second commit (big 25519 tables) from this PR. I guess the first commit (fewer multiplications in square root) is not controversial.

The other (big tables) increases code complexity and binary size -- at the tradeoff of enhanced speed on some CPUs, but it is unclear where it is worth. It is still unclear to me how to decide on such controversial improvements (that are not improvements on all CPUs). When in doubt, I stick to fewer code (&complexity) & binary size (of course, some configuration time selection could be done as well, but to be honest, that would make the code paths even more complex - and I'd like to avoid it).

So, below the line: I pushed the big table to https://github.com/hannesm/mirage-crypto/tree/25519-big-table - we can always revive commit if there's urgent need from someone.