Closed: Firobe closed this 9 months ago.
Thanks for your PR. I'm positive about having this integrated into mirage-crypto. There are a couple of questions, though:
> constant-time behavior
What is the threat model in your mind? A remote attacker? Local side-channel attacks (memory caches)?
> relatively easy to audit
At least for me this will need some time, esp. with arrays and reference cells being introduced.
> The other cost is the pre-computation phase, which should be negligible for long-running services, and similar to the old algorithm for one-shot computations.

Since mirage-crypto-ec is used all over the place, for long-running services as well as short-lived applications, would you mind measuring this pre-computation phase?
> It also does need more memory to store the values.

Could you measure this as well? As mentioned above, mirage-crypto-ec is not only used in long-running services.

Would it be worth considering a lazily evaluated pre-computation of the tables? To decrease memory usage and pre-computation time, esp. in applications where e.g. P-384 or P-224 are not used? Do you have any thoughts about that?
> On a simple benchmark repeatedly signing messages

Is this pre-computation phase part of your benchmarks (i.e. did you start the timer at program execution start or once initialization was complete)?
I found a patch that I developed quite some time ago which may be useful https://github.com/hannesm/mirage-crypto/commit/c512a0a4eeb3d1511011affe423e4880431a497b - which adds ECDSA (sign & verify) to the bench/speed.ml.
Also, do you have a baseline? Either "go's EC implementation" or OpenSSL would be great to compare against your numbers on your CPU (`openssl speed ecdsap256`). Comparing the main branch of mirage-crypto with openssl in P256 sign, I observe on my laptop a discrepancy of factor 42 (openssl is 42 times as fast: 35389 sign/s vs 842.345).
Thanks for the comments!
> What is the threat model in your mind? A remote attacker? Local side-channel attacks (memory caches)?
The threat model here is local side-channel attacks, such as FLUSH+RELOAD used here or here.
> At least for me this will need some time, esp. with arrays and reference cells being introduced.
I wonder if it would be more efficient to first rewrite it in C, see how that performs, and rather review that (since it's slightly easier to review the timing security without the OCaml abstractions). In any case, I plan to do that in the coming days, so maybe you want to hold off review until I come back with results.
> Since mirage-crypto-ec is used all over the place, for long-running services as well as short-lived applications, would you mind measuring this pre-computation phase?
I consistently get the following pre-computation times on my CPU (on the `bytes` branch):
| | P-256 | P-224 | P-384 | P-521 |
| --- | --- | --- | --- | --- |
| Time (s) | 0.001 | 0.001 | 0.003 | 0.011 |
This is consistent with the size of the pre-computed tables for each curve (see below).
> Could you measure this as well? As mentioned above, mirage-crypto-ec is not only used in long-running services.
The current algorithm stores `15 * fe_length` points per curve (so, for example, 480 points for P256). Each point is 3 field elements of size `fe_length` on the heap, so excluding array/struct overhead, that's:
| | P-256 | P-224 | P-384 | P-521 |
| --- | --- | --- | --- | --- |
| Memory (KB) | 46 | 46 | 103 | 233 |
All in all roughly half a megabyte if we precompute everything.
> Would it be worth considering a lazily evaluated pre-computation of the tables? To decrease memory usage and pre-computation time, esp. in applications where e.g. P-384 or P-224 are not used? Do you have any thoughts about that?
That's a good idea! I'll add a commit to that effect, with maybe a new API call to force pre-computation (per curve) for applications that want predictable/consistent performance during other operations.
> Is this pre-computation phase part of your benchmarks (i.e. did you start the timer at program execution start or once initialization was complete)?
The pre-computation was indeed part of the timer, but quite hidden by the number of iterations. Here's what it looks like with a varying number of iterations, lazily pre-computing P256 only:
| | 1 sign | 10 signs | 100 signs | 1000 signs | 10000 signs |
| --- | --- | --- | --- | --- | --- |
| sign/s | 1850 | 4630 | 5540 | 5720 | 5930 |
So even on a single sign, we're still (slightly) above the old algorithm even on the cstruct -> `bytes` branch, if we pre-compute only what we use.
> I found a patch that I developed quite some time ago which may be useful hannesm@c512a0a - which adds ECDSA (sign & verify) to the bench/speed.ml.
That's great! Here's what I get with your patch (with `bytes`):
```
* [ecdsa-sign]
P224: 6269.457 ops per second (63211 iters in 10.082)
P256: 5693.522 ops per second (57012 iters in 10.013)
P384: 2372.992 ops per second (18155 iters in 7.651)
P521: 502.243 ops per second (5199 iters in 10.352)
```
I'm much in favor of including it in the main branch.
> Also, do you have a baseline? Either "go's EC implementation" or OpenSSL would be great to compare against your numbers on your CPU (`openssl speed ecdsap256`). Comparing the main branch of mirage-crypto with openssl in P256 sign, I observe on my laptop a discrepancy of factor 42 (openssl is 42 times as fast: 35389 sign/s vs 842.345).
I have a similar discrepancy: 57347 sign/s with `openssl`, though OpenSSL does use a different, more complex algorithm. Another baseline I have is BoringSSL, which performs about 28000 sign/s (again, with a different algorithm). I'll try in the coming days to do a benchmark with Go. In any case, I think we'll still definitely be below that.
Thanks a lot for your answers.
> FLUSH+RELOAD
Do you by chance have developed or used a tool to figure out whether this code is vulnerable? Of course, reading code (and/or assembly) leads to a certain amount of assurance. But then, you'd need to do that for each architecture, OCaml compiler and C compiler that you're interested in. (I remember Eqaf received a lot of assembly reading, and then a new OCaml compiler just did something differently.)
To move forward: did you look into `perf` output to figure out what are the expensive operations in an ECDSA sign operation? (Maybe looking at #109 with that perf information will result in what to re-do in C.)

> Do you by chance have developed or used a tool to figure out whether this code is vulnerable? [...]
I did not (and I'm not confident that if I did, my efforts would be representative of a "good" attack), so I'm indeed relying on reading code for now. I have some assurance that it is constant-time as expected, since the operations that differ depending on the input are extremely localized (the table selection), and always access every value of the tables in the same order. I've not yet looked at assembly output for the relevant operations.
I opened https://github.com/mirage/mirage-crypto/pull/192 for the speed ECDSA additions (should ECDH and X25519/Curve25519 be put in that tool as well?)
Great! Let's take the opportunity to include ECDH and X25519 as well, yes
> [...] should we get that one merged before this one (looks like it needs some rebasing and going through the comments, maybe another round of review)?
I think it would be the easiest path, yes. I'll open a new PR addressing the comments, and let's do a new round of review.
> Did you look into perf output to figure out what are the expensive operations in an ECDSA sign operation (maybe looking at https://github.com/mirage/mirage-crypto/issues/109 with that perf information will result in what to re-do in C)?
Yes! By far the most expensive operation is the scalar multiplication (83% of the time spent), with inversion (already in C) taking up much of the rest. Within the basic primitives, addition is the costliest, so any algorithm that reduces the number of additions performs very well.
I did try to rewrite some of the functions in https://github.com/mirage/mirage-crypto/issues/109 in C for comparison (`Dsa.sign` in particular), but the improvement was very marginal.
> ECDH and X25519
I'm working on that right now.
FYI, a quick rewrite in C of the algorithm (+ the precomputation) yields a performance of about 8500 sign/s for P-256 with `bytes`. I think this justifies the rewrite :)
Nice! only a factor of 7 left ;)
The bytes EC commit has been rebased by @palainp https://github.com/palainp/mirage-crypto/commit/559a176f44e2ea0c2e70e7d19e77f232d2e42bae -- maybe a good starting point.
I think that specific commit should be OK (at least the tests are green) :) But I failed somewhere further along when I tried to remove more Bigarrays :(
> Nice! only a factor of 7 left ;)
I got to run some benchmarks on the Go crypto library. It yielded 48000 signs/s, but I found out that for common CPU architectures, it "cheats" and uses a custom ASM implementation for P256 (completely ignoring the `fiat` primitives).

Forcing the use of the `fiat`-based windowed algorithm, we get about 13000 sign/s as a baseline, which makes our results here look much more reasonable :)
For reference:
```
BenchmarkSign/P256     15625    76744 ns/op    2400 B/op    34 allocs/op
BenchmarkSign/P384      5571   212753 ns/op    2608 B/op    34 allocs/op
BenchmarkSign/P521      1995   600929 ns/op    2976 B/op    35 allocs/op
BenchmarkVerify/P256    5703   209437 ns/op     416 B/op    12 allocs/op
BenchmarkVerify/P384    1800   664876 ns/op     592 B/op    12 allocs/op
BenchmarkVerify/P521     582  2057598 ns/op     912 B/op    12 allocs/op
```
@Firobe thanks for putting these numbers into context. The question is about the aim here. On the one hand, I'm not keen to adopt huge amounts of assembly code into the repository to get more performance. On the other hand, we should be honest about it: the amount of time spent on this library is less than e.g. on OpenSSL. Its performance won't be on par with it.
Maybe we should have a paragraph about that in the README, together with how benchmarks can be (re)produced (`openssl speed ecdsap256`).
Gathering your numbers from above (on P256 sign operations per second), on your computer (architecture? OS?):
other implementations:
The numbers are right (for Go crypto, the table, which is from their official benchmark, lists nanoseconds per operation, so 76744 ns per sign is 1 / (76744 / 10^9) ~= 13030 sign/s). The first column is the number of iterations performed. All benchmarks were run on an AMD Ryzen 7 3700X with Arch Linux (and always on a single core).
In this PR I was careful to select an algorithm that's not too complex to implement, review and maintain. I wholeheartedly agree that the review and maintenance effort of bringing more complex algorithms (like wNAF) or assembly while maintaining security is too much, and we should be honest about the tradeoff.
> In this PR I was careful to select an algorithm that's too complex to implement, review and maintain.
Yes, highly appreciated. I suspect you meant that's not too complex. Thanks for the clarification of go. And thanks for your work on this, I'm pretty sure we'll have this merged and released rather sooner than later.
Now that #146 is merged, I rebased this PR, and pushed the rewrite in C.
A few considerations:
* a function `force_precomputation` that does what its name says
* `Make_scalar.to_octets` to avoid the string reversing (the algorithm is adapted to expect a big-endian scalar)
* […] `force_precomputation` with G, to avoid passing it to subsequent calls where it's not useful), or pre-computing the tables entirely in OCaml, at a small single-time performance cost. Yet another solution is to generate the tables in advance and hardcode them in C. What do you think?

New benchmarks on my machine:
`main`:

```
* [ecdsa-sign]
P224: 1922.156 ops per second (17488 iters in 9.098)
P256: 1837.106 ops per second (16644 iters in 9.060)
P384: 557.228 ops per second (5584 iters in 10.021)
P521: 167.405 ops per second (1668 iters in 9.964)
Ed25519: 21103.479 ops per second (200803 iters in 9.515)
```

This branch:

```
* [ecdsa-sign]
P224: 9088.188 ops per second (92250 iters in 10.151)
P256: 8289.074 ops per second (83333 iters in 10.053)
P384: 2905.143 ops per second (29325 iters in 10.094)
P521: 567.907 ops per second (5524 iters in 9.727)
Ed25519: 21143.520 ops per second (205761 iters in 9.732)
```
Thanks for your rebase.
> the pre-computed tables are stored as (static) globals. This means the RAM (half a megabyte, see above) is pre-allocated, even if the values are not pre-computed. If that's not something we want, I can switch to heap allocation
>
> pre-computing the tables in C means knowing about the generator point of the curve in C, which wasn't readily available before.
Hmm, so what about the approach we use for curve25519_tables.h (taken from BoringSSL where "make_curve25519_tables.py" is provided)?
If we use pre-computed tables, and have the memory pre-allocated, why not separate the table computation from the runtime? The tables won't change over time. We could write a small OCaml program using the Mirage_crypto_ec API which emits the table(s) as C code? This way we wouldn't need to carry around the generator into C, and would not need to mess around with lazy initialization. WDYT?
In my opinion, these tables would then be part of the repository and release -- and maybe there are rules in the ec/native Makefile to (re-)generate the tables!?
> If we use pre-computed tables, and have the memory pre-allocated, why not separate the table computation from the runtime?
I'm in favor of that, the only drawback being that this increases executable size (vs. uninitialized memory). This makes the OCaml API the source of truth. And indeed, the tables won't change enough (or at all) to justify re-generating them as part of the build phase.
Should we expose just enough primitives in the public API to allow table generation in a separate program, or rather just an opaque function computing the tables internally and returning them? :thinking:
I think a single function is sufficient. Exposing the internals would likely mean to expose lots of types and internal stuff that may change one day -- for no good purpose.
I just pushed a new commit for hard-coded tables, which should be ready for review! I also addressed @palainp's comment (and replaced tabs in _stubs.c files with spaces, if that's OK).
It seems I had wrongly assumed it was easy to obtain the 32-bit tables from the 64-bit ones (i.e. that the field element representation is the same in memory). It seems true for P256 and P384, but not P224/P521; e.g. for P224, G_x is represented in 32-bit as

```
bc905227, 6018bfaa, f22fe220, f96bec04, 6dd3af9b, a21b5e60, 92f5b516
```

while in 64-bit:

```
bc9052266d0a4aea, 852597366018bfaa, 6dd3af9bf96bec05, 00000000a21b5e60
```
So, I have to either figure out how/if it's possible to correctly use the 32-bit interface from a 64-bit host, or include the table generation as part of the compilation process (so only the table for the host architecture is generated, but that implies potentially separating part of the code into a separate library (or conditional compilation shenanigans) to avoid the circular dependency between `mirage-crypto-ec` and the tables).

Or, keep the code but generate the 32-bit table from a 32-bit machine, and accept that the table is not re-generable from a 64-bit host.

Edit: or serialize to octets instead and de-serialize at startup, but that's hardly different from computing the whole table at startup, and I'm not even sure the serialization is compatible between 32- and 64-bit.
From my point of view, it would be best to be able to generate the tables on a 64bit system. On the other hand, since they'll be generated once and only when we add new curves we'll need additional tables, it is fine to generate them on a 32bit system as well. [Also, it's not clear to me how many moons will pass until we remove the 32bit support here. So far I haven't seen many users, but quite some burden on maintenance.]
Thanks for your work, as said - if you can pre-compute and add the 32bit tables to make CI pass, that'd be neat.
Alright, thank you for your thoughts. I think we're good now. Both tables are in the tree, and `gen_tables` can generate both, if executed from the correct architecture. It's still reasonably easy to generate the 32-bit tables on a 64-bit host, by using a 32-bit OCaml switch (I used `4.11.2+32bit` for example).
Thanks @Firobe I can confirm a big speed improvement (5x to 9x on current head vs this pr)!
Left column: current head; right column: this PR.

```
* [ecdsa-generate]                                           * [ecdsa-generate]
P224: 1458.572 ops per second (12667 iters in 8.685)         P224: 11276.635 ops per second (99009 iters in 8.780)
P256: 1465.086 ops per second (11918 iters in 8.135)         P256: 10154.861 ops per second (95969 iters in 9.451)
P384: 355.013 ops per second (3546 iters in 9.988)           P384: 2926.304 ops per second (28360 iters in 9.691)
P521: 122.818 ops per second (1229 iters in 10.007)          P521: 922.291 ops per second (9030 iters in 9.791)
Ed25519: 16974.709 ops per second (172413 iters in 10.157)   Ed25519: 17102.883 ops per second (161290 iters in 9.431)
* [ecdsa-sign]                                               * [ecdsa-sign]
P224: 1101.906 ops per second (10981 iters in 9.965)         P224: 4708.182 ops per second (46816 iters in 9.944)
P256: 1026.197 ops per second (10250 iters in 9.988)         P256: 4340.524 ops per second (43591 iters in 10.043)
P384: 320.084 ops per second (3201 iters in 10.000)          P384: 1558.038 ops per second (15634 iters in 10.034)
P521: 97.783 ops per second (974 iters in 9.961)             P521: 317.579 ops per second (3198 iters in 10.070)
Ed25519: 8371.510 ops per second (83333 iters in 9.954)      Ed25519: 8432.891 ops per second (84317 iters in 9.999)
* [ecdsa-verify]                                             * [ecdsa-verify]
P224: 406.745 ops per second (4070 iters in 10.006)          P224: 929.569 ops per second (9278 iters in 9.981)
P256: 379.167 ops per second (3799 iters in 10.019)          P256: 863.855 ops per second (8668 iters in 10.034)
P384: 114.519 ops per second (1143 iters in 9.981)           P384: 264.519 ops per second (2641 iters in 9.984)
P521: 40.434 ops per second (404 iters in 9.992)             P521: 94.140 ops per second (940 iters in 9.985)
Ed25519: 5550.770 ops per second (54288 iters in 9.780)      Ed25519: 5553.066 ops per second (55493 iters in 9.993)
* [ecdh-secret]                                              * [ecdh-secret]
P224: 1193.372 ops per second (11938 iters in 10.004)        P224: 6963.960 ops per second (65876 iters in 9.460)
P256: 1110.876 ops per second (11101 iters in 9.993)         P256: 6453.460 ops per second (61050 iters in 9.460)
P384: 338.256 ops per second (3380 iters in 9.992)           P384: 2084.828 ops per second (20703 iters in 9.930)
P521: 118.926 ops per second (1194 iters in 10.040)          P521: 738.680 ops per second (7534 iters in 10.199)
X25519: 9282.841 ops per second (92764 iters in 9.993)       X25519: 9385.173 ops per second (92936 iters in 9.902)
* [ecdh-share]                                               * [ecdh-share]
P224: 1264.362 ops per second (11933 iters in 9.438)         P224: 1192.979 ops per second (11936 iters in 10.005)
P256: 1109.354 ops per second (11059 iters in 9.969)         P256: 1114.436 ops per second (11150 iters in 10.005)
P384: 336.237 ops per second (3385 iters in 10.067)          P384: 338.432 ops per second (3378 iters in 9.981)
P521: 119.199 ops per second (1165 iters in 9.774)           P521: 120.097 ops per second (1200 iters in 9.992)
X25519: 9362.748 ops per second (93457 iters in 9.982)       X25519: 9444.634 ops per second (90909 iters in 9.625)
```
Thanks for your work. I added two comments, apart from that this is ready to be merged.
Thank you for your review! Your comments should be addressed. If they are, feel free to squash/merge :)
Thanks a lot! I'll merge when CI is happy (or only shows temporary failures).
Using a sliding window method with pre-computed multiples of the generator point, we can obtain far better performance for the special case where G = P in the scalar multiplication kP. By using a safe selection algorithm for pre-computed values and no branches in the main loop, the algorithm should leak no more information about its inputs than the current Montgomery ladder.
This change is at the cost of an algorithm that is slightly less simple than the current one, but I'd say it is still relatively easy to audit, in particular for constant-time behavior. In particular, this is not a NAF-based algorithm. The other cost is the pre-computation phase, which should be negligible for long-running services, and similar to the old algorithm for one-shot computations. It also does need more memory to store the values.
As mentioned in the code, the actual implementation is largely inspired by Go's own crypto library.
All tests seem to pass.
Benchmark
On a simple benchmark repeatedly signing messages, and then generating keys, over 10_000 iterations each, with OCaml 4.14.1.
Cstruct
Current `main` branch:

On this branch:
Bytes
I tried @dinosaure's branch replacing internal `cstruct` by `bytes`, which turns out to greatly improve performance as well on this benchmark (and visibly relieves the GC).

On the PR's base branch:
A merge of this branch and the PR's:
Purely on P-256 signatures, the combination of Cstruct replacement and this new algorithm increases the performance by a factor of 6 :)
Further work for more performance