Closed: Firobe closed this 9 months ago.
Thanks for your PR. I'm positive about having this integrated into mirage-crypto. There are a couple of questions, though:
> constant-time behavior
What is the threat model in your mind? A remote attacker? Local side-channel attacks (memory caches)?
> relatively easy to audit
At least for me this will need some time, esp. with arrays and reference cells being introduced.
> The other cost is the pre-computation phase, which should be negligible for long-running services, and similar to the old algorithm for one-shot computations.

Since mirage-crypto-ec is used all over the place, for long-running services as well as short-lived applications, would you mind measuring this pre-computation phase?
> It also does need more memory to store the values.

Could you measure this as well? As mentioned above, mirage-crypto-ec is not only used in long-running services.

Would it be worth considering a lazily evaluated pre-computation of the tables? To decrease memory usage and pre-computation time, esp. in applications where e.g. P-384 or P-224 are not used? Do you have any thoughts about that?
> On a simple benchmark repeatedly signing messages

Is this pre-computation phase part of your benchmarks (i.e. did you start the timer at program execution start or once initialization was complete)?
I found a patch that I developed quite some time ago which may be useful https://github.com/hannesm/mirage-crypto/commit/c512a0a4eeb3d1511011affe423e4880431a497b - which adds ECDSA (sign & verify) to the bench/speed.ml.
Also, do you have a baseline? Either "go's EC implementation" or OpenSSL would be great to compare against your numbers on your CPU (`openssl speed ecdsap256`). Comparing the main branch of mirage-crypto with openssl in P256 sign, I observe on my laptop a discrepancy of factor 42 (openssl is 42 times as fast: 35389 sign/s vs 842.345).
Thanks for the comments!
> What is the threat model in your mind? A remote attacker? Local side-channel attacks (memory caches)?
The threat model here is local side-channel attacks, such as FLUSH+RELOAD used here or here.
> At least for me this will need some time, esp. with arrays and reference cells being introduced.
I wonder if it would be more efficient to first rewrite it in C, see how that performs, and rather review that (since it's slightly easier to review the timing security without the OCaml abstractions). In any case, I plan to do that in the coming days, so maybe you want to hold off review until I come back with results.
> Since mirage-crypto-ec is used all over the place, for long-running services as well as short-lived applications, would you mind measuring this pre-computation phase?
I consistently get the following pre-computation times on my CPU (on the `bytes` branch):
| | P-256 | P-224 | P-384 | P-521 |
| --- | --- | --- | --- | --- |
| Time (s) | 0.001 | 0.001 | 0.003 | 0.011 |
This is consistent with the size of the pre-computed tables for each curve (see below).
> Could you measure this as well? As mentioned above, mirage-crypto-ec is not only used in long-running services.
The current algorithm stores `15 * fe_length` points per curve (so, for example, 480 points for P256). Each point is 3 field elements of size `fe_length` on the heap, so excluding array/struct overhead, that's:
| | P-256 | P-224 | P-384 | P-521 |
| --- | --- | --- | --- | --- |
| Memory (KB) | 46 | 46 | 103 | 233 |
All in all roughly half a megabyte if we precompute everything.
> Would it be worth considering a lazily evaluated pre-computation of the tables? To decrease memory usage and pre-computation time, esp. in applications where e.g. P-384 or P-224 are not used? Do you have any thoughts about that?
That's a good idea! I'll add a commit to that effect, with maybe a new API call to force pre-computation (per curve) for applications that want predictable/consistent performance during other operations.
> Is this pre-computation phase part of your benchmarks (i.e. did you start the timer at program execution start or once initialization was complete)?
The pre-computation was indeed part of the timer, but quite hidden by the number of iterations. Here's what it looks like with a varying number of iterations, lazily pre-computing P256 only:
| | 1 sign | 10 signs | 100 signs | 1000 signs | 10000 signs |
| --- | --- | --- | --- | --- | --- |
| sign/s | 1850 | 4630 | 5540 | 5720 | 5930 |
So even on a single sign, we're still (slightly) above the old algorithm even on the cstruct -> `bytes` branch, if we pre-compute only what we use.
> I found a patch that I developed quite some time ago which may be useful hannesm@c512a0a - which adds ECDSA (sign & verify) to the bench/speed.ml.
That's great! Here's what I get with your patch (with `bytes`):
```
* [ecdsa-sign]
P224: 6269.457 ops per second (63211 iters in 10.082)
P256: 5693.522 ops per second (57012 iters in 10.013)
P384: 2372.992 ops per second (18155 iters in 7.651)
P521: 502.243 ops per second (5199 iters in 10.352)
```
I'm much in favor of including it in the main branch.
> Also, do you have a baseline? Either "go's EC implementation" or OpenSSL would be great to compare against your numbers on your CPU (`openssl speed ecdsap256`). Comparing the main branch of mirage-crypto with openssl in P256 sign, I observe on my laptop a discrepancy of factor 42 (openssl is 42 times as fast: 35389 sign/s vs 842.345).
I have a similar discrepancy: 57347 sign/s with `openssl`, though OpenSSL does use a different, more complex algorithm. Another baseline I have is BoringSSL, which performs about 28000 sign/s (again, with a different algorithm). I'll try in the coming days to do a benchmark with Go. In any case, I think we'll still definitely be below that.
Thanks a lot for your answers.
> FLUSH+RELOAD
Do you by chance have developed or used a tool to figure out whether this code is vulnerable? Of course, reading code (and/or assembly) leads to a certain amount of assurance. But then, you'd need to do that for each architecture, OCaml compiler and C compiler that you're interested in. (I remember Eqaf received a lot of assembly reading, and then a new OCaml compiler just did something differently.)
To move forward: did you look into `perf` output to figure out what are the expensive operations in an ECDSA sign operation? (Maybe looking at #109 with that perf information will result in what to re-do in C.)

> Do you by chance have developed or used a tool to figure out whether this code is vulnerable? [...]
I did not (and I'm not confident that if I did, my efforts would be representative of a "good" attack), so I'm indeed relying on reading code for now. I have some assurance that it is constant-time as expected, since the operations that differ depending on the input are extremely localized (the table selection), and always access every value of the tables in the same order. I've not yet looked at assembly output for the relevant operations.
I opened https://github.com/mirage/mirage-crypto/pull/192 for the speed ECDSA additions (should ECDH and X25519/Curve25519 be put in that tool as well?)
Great! Let's take the opportunity to include ECDH and X25519 as well, yes
> [...] should we get that one merged before this one (looks like it needs some rebasing and going through the comments, maybe another round of review)?
I think it would be the easiest path, yes. I'll open a new PR addressing the comments, and let's do a new round of review.
> Did you look into perf output to figure out what are the expensive operations in an ECDSA sign operation (maybe looking at https://github.com/mirage/mirage-crypto/issues/109 with that perf information will result in what to re-do in C)?
Yes! By far the most expensive operation is the scalar multiplication (83% of the time spent), with inversion (already in C) taking up much of the rest. Within the basic primitives, addition is the costliest, so any algorithm that reduces the number of additions performs very well.
I did try to rewrite some of the functions in https://github.com/mirage/mirage-crypto/issues/109 in C for comparison (`Dsa.sign` in particular), but the improvement was very marginal.
> ECDH and X25519
I'm working on that right now.
FYI, a quick rewrite in C of the algorithm (+ the precomputation) yields a performance of about 8500 sign/s for P-256 with `bytes`. I think this justifies the rewrite :)
Nice! only a factor of 7 left ;)
The bytes EC commit has been rebased by @palainp https://github.com/palainp/mirage-crypto/commit/559a176f44e2ea0c2e70e7d19e77f232d2e42bae -- maybe a good starting point.
I think that specific commit should be OK (at least the tests are green) :) But I failed somewhere further along when I tried to remove more Bigarrays :(
> Nice! only a factor of 7 left ;)
I got to run some benchmarks on the Go crypto library. It yielded 48000 signs/s, but I found out that for common CPU architectures, it "cheats" and uses a custom ASM implementation for P256 (completely ignoring the `fiat` primitives).

Forcing the use of the `fiat`-based windowed algorithm, we get about 13000 sign/s as a baseline, which makes our results here look much more reasonable :)
For reference:
```
BenchmarkSign/P256     15625    76744 ns/op    2400 B/op    34 allocs/op
BenchmarkSign/P384      5571   212753 ns/op    2608 B/op    34 allocs/op
BenchmarkSign/P521      1995   600929 ns/op    2976 B/op    35 allocs/op
BenchmarkVerify/P256    5703   209437 ns/op     416 B/op    12 allocs/op
BenchmarkVerify/P384    1800   664876 ns/op     592 B/op    12 allocs/op
BenchmarkVerify/P521     582  2057598 ns/op     912 B/op    12 allocs/op
```
@Firobe thanks for putting these numbers into context. The question is about the aim here. On the one hand, I'm not keen to adopt huge amounts of assembly code into the repository to get more performance. On the other hand, we should be honest about it: the amount of time spent on this library is less than e.g. on OpenSSL. Its performance won't be on par with it.
Maybe we should have a paragraph about that in the README, together with how benchmarks can be (re)produced (`openssl speed ecdsap256`).
Gathering your numbers from above (on P256 sign operations per second), on your computer (architecture? OS?):
other implementations:
The numbers are right (for Go crypto, the table, which is from their official benchmark, lists nanoseconds per operation, so 76744 ns per sign is 1 / (76744 / 10^9) ~= 13030 sign/s). The first column is the number of iterations performed. All benchmarks were run on an AMD Ryzen 7 3700X with Arch Linux (and always on a single core).
In this PR I was careful to select an algorithm that's not too complex to implement, review and maintain. I wholeheartedly agree that the review and maintenance effort of bringing more complex algorithms (like wNAF) or assembly while maintaining security is too much, and we should be honest about the tradeoff.
> In this PR I was careful to select an algorithm that's too complex to implement, review and maintain.
Yes, highly appreciated. I suspect you meant that's not too complex. Thanks for the clarification of go. And thanks for your work on this, I'm pretty sure we'll have this merged and released rather sooner than later.
Now that #146 is merged, I rebased this PR, and pushed the rewrite in C.
A few considerations:
* a function `force_precomputation` that does what its name says
* `Make_scalar.to_octets` to avoid the string reversing (the algorithm is adapted to expect a big-endian scalar)
* […] `force_precomputation` with G, to avoid passing it to subsequent calls where it's not useful), or pre-computing the tables entirely in OCaml, at a small single-time performance cost. Yet another solution is to generate the tables in advance and hardcode them in C. What do you think?

New benchmarks on my machine:
`main`:

```
* [ecdsa-sign]
P224: 1922.156 ops per second (17488 iters in 9.098)
P256: 1837.106 ops per second (16644 iters in 9.060)
P384: 557.228 ops per second (5584 iters in 10.021)
P521: 167.405 ops per second (1668 iters in 9.964)
Ed25519: 21103.479 ops per second (200803 iters in 9.515)
```

This branch:

```
* [ecdsa-sign]
P224: 9088.188 ops per second (92250 iters in 10.151)
P256: 8289.074 ops per second (83333 iters in 10.053)
P384: 2905.143 ops per second (29325 iters in 10.094)
P521: 567.907 ops per second (5524 iters in 9.727)
Ed25519: 21143.520 ops per second (205761 iters in 9.732)
```
Thanks for your rebase.
> the pre-computed tables are stored as (static) globals. This means the RAM (half a megabyte, see above) is pre-allocated, even if the values are not pre-computed. If that's not something we want, I can switch to heap allocation
>
> pre-computing the tables in C means knowing about the generator point of the curve in C, which wasn't readily available before.
Hmm, so what about the approach we use for curve25519_tables.h (taken from BoringSSL where "make_curve25519_tables.py" is provided)?
If we use pre-computed tables, and have the memory pre-allocated, why not separate the table computation from the runtime? The tables won't change over time. We could write a small OCaml program using the Mirage_crypto_ec API which emits the table(s) as C code? This way we wouldn't need to carry around the generator into C, and would not need to mess around with lazy initialization. WDYT?
In my opinion, these tables would then be part of the repository and release -- and maybe there are rules in the ec/native Makefile to (re-)generate the tables!?
> If we use pre-computed tables, and have the memory pre-allocated, why not separate the table computation from the runtime?
I'm in favor of that, the only drawback being that this increases executable size (vs. uninitialized memory). This makes the OCaml API the source of truth. And indeed, the tables won't change enough (or at all) to justify re-generating them as part of the build phase.
Should we expose just enough primitives in the public API to allow table generation in a separate program, or rather just an opaque function computing the tables internally and returning them? :thinking:
I think a single function is sufficient. Exposing the internals would likely mean to expose lots of types and internal stuff that may change one day -- for no good purpose.
I just pushed a new commit for hard-coded tables, which should be ready for review! I also addressed @palainp's comment (and replaced tabs in _stubs.c files with spaces, if that's OK).
It seems I had wrongly assumed it was easy to obtain the 32-bit tables from the 64-bit ones (i.e. that the field element representation is the same in memory). It seems true for P256 and P384, but not P224/P521; e.g. for P224, G_x is represented in 32-bit as

```
bc905227, 6018bfaa, f22fe220, f96bec04, 6dd3af9b, a21b5e60, 92f5b516
```

while in 64-bit:

```
bc9052266d0a4aea, 852597366018bfaa, 6dd3af9bf96bec05, 00000000a21b5e60
```
So, I have to either figure out how/if it's possible to correctly use the 32-bit interface from a 64-bit host, or include the table generation as part of the compilation process (so only the table for the host architecture is generated, but that implies potentially separating part of the code into a separate library (or conditional compilation shenanigans) to avoid the circular dependency between `mirage-crypto-ec` and the tables).

Or, keep the code but generate the 32-bit table from a 32-bit machine, and accept that the table is not re-generable from a 64-bit host.

Edit: or serialize to octets instead and de-serialize at startup, but that's hardly different from computing the whole table at startup, and I'm not even sure the serialization is compatible between 32- and 64-bit.
From my point of view, it would be best to be able to generate the tables on a 64bit system. On the other hand, since they'll be generated once and only when we add new curves we'll need additional tables, it is fine to generate them on a 32bit system as well. [Also, it's not clear to me how many moons will pass until we remove the 32bit support here. So far I haven't seen many users, but quite some burden on maintenance.]
Thanks for your work, as said - if you can pre-compute and add the 32bit tables to make CI pass, that'd be neat.
Alright, thank you for your thoughts. I think we're good now. Both tables are in the tree, and `gen_tables` can generate both, if executed from the correct architecture. It's still reasonably easy to generate the 32-bit tables on a 64-bit host, by using a 32-bit OCaml switch (I used `4.11.2+32bit` for example).
Thanks @Firobe I can confirm a big speed improvement (5x to 9x on current head vs this pr)!
Left column: current head; right column: this PR.

```
* [ecdsa-generate]                                           * [ecdsa-generate]
P224: 1458.572 ops per second (12667 iters in 8.685)         P224: 11276.635 ops per second (99009 iters in 8.780)
P256: 1465.086 ops per second (11918 iters in 8.135)         P256: 10154.861 ops per second (95969 iters in 9.451)
P384: 355.013 ops per second (3546 iters in 9.988)           P384: 2926.304 ops per second (28360 iters in 9.691)
P521: 122.818 ops per second (1229 iters in 10.007)          P521: 922.291 ops per second (9030 iters in 9.791)
Ed25519: 16974.709 ops per second (172413 iters in 10.157)   Ed25519: 17102.883 ops per second (161290 iters in 9.431)
* [ecdsa-sign]                                               * [ecdsa-sign]
P224: 1101.906 ops per second (10981 iters in 9.965)         P224: 4708.182 ops per second (46816 iters in 9.944)
P256: 1026.197 ops per second (10250 iters in 9.988)         P256: 4340.524 ops per second (43591 iters in 10.043)
P384: 320.084 ops per second (3201 iters in 10.000)          P384: 1558.038 ops per second (15634 iters in 10.034)
P521: 97.783 ops per second (974 iters in 9.961)             P521: 317.579 ops per second (3198 iters in 10.070)
Ed25519: 8371.510 ops per second (83333 iters in 9.954)      Ed25519: 8432.891 ops per second (84317 iters in 9.999)
* [ecdsa-verify]                                             * [ecdsa-verify]
P224: 406.745 ops per second (4070 iters in 10.006)          P224: 929.569 ops per second (9278 iters in 9.981)
P256: 379.167 ops per second (3799 iters in 10.019)          P256: 863.855 ops per second (8668 iters in 10.034)
P384: 114.519 ops per second (1143 iters in 9.981)           P384: 264.519 ops per second (2641 iters in 9.984)
P521: 40.434 ops per second (404 iters in 9.992)             P521: 94.140 ops per second (940 iters in 9.985)
Ed25519: 5550.770 ops per second (54288 iters in 9.780)      Ed25519: 5553.066 ops per second (55493 iters in 9.993)
* [ecdh-secret]                                              * [ecdh-secret]
P224: 1193.372 ops per second (11938 iters in 10.004)        P224: 6963.960 ops per second (65876 iters in 9.460)
P256: 1110.876 ops per second (11101 iters in 9.993)         P256: 6453.460 ops per second (61050 iters in 9.460)
P384: 338.256 ops per second (3380 iters in 9.992)           P384: 2084.828 ops per second (20703 iters in 9.930)
P521: 118.926 ops per second (1194 iters in 10.040)          P521: 738.680 ops per second (7534 iters in 10.199)
X25519: 9282.841 ops per second (92764 iters in 9.993)       X25519: 9385.173 ops per second (92936 iters in 9.902)
* [ecdh-share]                                               * [ecdh-share]
P224: 1264.362 ops per second (11933 iters in 9.438)         P224: 1192.979 ops per second (11936 iters in 10.005)
P256: 1109.354 ops per second (11059 iters in 9.969)         P256: 1114.436 ops per second (11150 iters in 10.005)
P384: 336.237 ops per second (3385 iters in 10.067)          P384: 338.432 ops per second (3378 iters in 9.981)
P521: 119.199 ops per second (1165 iters in 9.774)           P521: 120.097 ops per second (1200 iters in 9.992)
X25519: 9362.748 ops per second (93457 iters in 9.982)       X25519: 9444.634 ops per second (90909 iters in 9.625)
```
Thanks for your work. I added two comments, apart from that this is ready to be merged.
Thank you for your review! Your comments should be addressed. If they are, feel free to squash/merge :)
Thanks a lot! I'll merge when CI is happy (or only shows temporary failures).
Using a sliding window method with pre-computed multiples of the generator point, we can obtain far better performance for the special case where G = P in the scalar multiplication kP. By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm should leak no more information about its inputs than the current Montgomery ladder.
This change is at the cost of an algorithm that is slightly less simple than the current one, but I'd say it is still relatively easy to audit, in particular for constant-time behavior. In particular, this is not a NAF-based algorithm. The other cost is the pre-computation phase, which should be negligible for long-running services, and similar to the old algorithm for one-shot computations. It also does need more memory to store the values.
As mentioned in the code, the actual implementation is largely inspired by Go's own crypto library.
All tests seem to pass.
Benchmark
On a simple benchmark repeatedly signing messages, and then generating keys, over 10_000 iterations each, with OCaml 4.14.1.
Cstruct
Current `main` branch:

On this branch:
Bytes
I tried @dinosaure's branch replacing internal `cstruct` by `bytes`, which turns out to greatly improve performance as well on this benchmark (and visibly relieves the GC).

On the PR's base branch:
A merge of this branch and the PR's:
Purely on P-256 signatures, the combination of Cstruct replacement and this new algorithm increases the performance by a factor of 6 :)
Further work for more performance