Syscall support for elliptic curve operations

samkim-crypto commented 3 years ago

Problem

Currently, running elliptic curve operations on chain in Solana seems to be difficult with 200k BPF instruction counts. I experimented with Curve25519 operations and it seems like I am getting the following numbers: Addition: ~15k Decompress: 300k Multiplication: 3.4M The code I used to get these numbers: https://github.com/samkim-crypto/ristretto-bpf-count.

I played around with tweaking the algorithms by removing non-vital security mechanisms (constant time execution) and also adding in some precomputations to shave off instruction counts, but the numbers are just too big to run these on chain.

Proposed Solution

A cryptographic application would generally require multiple invocations of these operations, so some reliance on syscalls appears to be inevitable.

I wanted to propose and start a discussion on supporting certain elliptic curve operations such as the above operations as syscalls. I can run additional tests, provide more details on how reusable certain EC operations are, etc.

jackcmay commented 3 years ago

Thanks Sam, I think this is a good start. We'd like to find out how much time these operations take natively.

samkim-crypto commented 3 years ago

I ran some criterion benchmarks on my 2019 Macbook Pro and I am getting the following numbers: Addition: 193.76 ns Decompression: 3.8474 us Multiplication: 49.159 us

The curve25519-dalek library actually provides multiple backends for these operations, and I tested with both their u64 (above numbers) and simd backend, but it appears that I am getting similar numbers on my machine.

samkim-crypto commented 3 years ago

Ultimately, the most important operation for my particular application is the multiscalar multiplication operation which is a batched version of multiple addition and multiplication operations above. But there are multiple ways (algorithmically) to compute the multiscalar multiplication with different trade-offs in performance, so it was unclear to me what factors I should consider in the context of a syscall...

jon-chuang commented 3 years ago

To give some context, BPF programs in the form of Serum that are JIT compiled to x86 have been observed to have a total execution time ranging from 1-4ms.

> I need to clarify how much of this is in the form of context switches. I also need some clarification as to whether the compute budget of 200_000 bpf opcodes applies to individual instructions or are transaction wide. I believe they are the former. > > My guess is that the reason for Serum's long execution times are because the ops are memory bound. > > Skimming the Serum code, Serum's execution resource use is centered around a loop within the context of a single instruction, that does order matching, which I would hazard to guess is a memory-bound operation (`find_bbo`->`find_min_max`, which does a tree walk). As far as I could gather, Serum's state size exceeds 10MB, so the walk would probably cause cache shuffling.

In principle, this suggests that such syscalls would be quite reasonable, since they are primarily compute-bound and thus easy to upper bound in the cost with a suite of benchmarks, and as the numbers you provided suggest, a 1ms target time budget could probably fit an MSM of 200 group elements.

@samkim-crypto would you mind sharing how many group<->field elements you'd like to participate in the MSM? As a matter of curiosity, if it's not a secret, what kind of cryptography are you intending to deploy that requires an MSM?

ZK proofs based on pairing-based crypto are also popular in the blockchain space. Will you be investigating this as well? I think optimised implementations for pairings in popular curves take anywhere from 0.7ms - 2ms. Which still seem reasonable.

samkim-crypto commented 3 years ago

Yes, I am experimenting with some variants of Zether (https://eprint.iacr.org/2019/191) or MimbleWimble (https://eprint.iacr.org/2018/1039) for private transactions. All of these protocols require rangeproof validation(s), where each verification would require a multiscalar multiplication of group/field elements of about ~2.5x the number of bits to represent the range:

32-bit numbers: ~80
64-bit numbers: ~160

From a pure efficiency point of view, it would be faster to do the entire MSM (80/160 group-field pairs) all in one go, but it can of course be divided into smaller chunks to be invoked separately and then combined.

Supporting pairings via syscalls would definitely be nice, and it would definitely make a whole new range of cryptographic protocols possible. But right now, even operations on vanilla (pairing free) elliptic curves on chain seem difficult to run (without a syscall), which rules out pretty much any kind of crypto application on chain. I think it makes sense to start with simple pairing free curves and then explore pairing friendly curves subsequently.

jon-chuang commented 3 years ago

Hmm. That's per check, or a batch of range checks? It's fairly expensive - 800us per 64-bit range check.

More compute intensive txs may have to wait for the flexible tx cost model that's being developed (so, not a hard upper bound of 200000 compute units)

Is there a reason why you're not going for Zexe?

Have you heard of plookup? It's a fairly cheap way to do batched range proofs, at a verification cost dominated by 2 pairings.

samkim-crypto commented 3 years ago

That is actually per check, which can be quite expensive without MSM's. On my machine, I am getting the following numbers for fixed-base MSM's:

64 MSM: ~0.4 ms (~0.5 ms for variable-base)
128 MSM: ~0.85 ms (~1 ms for variable-base)
256 MSM: ~1.7 ms (~2 ms for variable-base) which I thought is not too bad compared to pairing friendly curves.

I am aware of Zexe and plookup though I have not experimented with actual code. I was working under the assumption that a multiplication in pairing friendly curves are ~2-3x and a pairing operation is ~8-10x of a multiplication operation in a pairing free elliptic curve, so in the end, I was thinking that using a pairing-free curve and relying on Bulletproofs verification would be comparable in performance to using a pairing-friendly curve and relying on a pairing-based range proof. Of course, this should be tested with actual code, which I will do so today.

My logic is that if the ultimate performance of transactions based on pairing-free vs pairing-friendly curves are comparable, then using a pairing-free protocol is the way to go as it is conceptually simpler, does not require any trusted/sophisticated setup, and is generally more efficient on the client-side proof generation.

samkim-crypto commented 3 years ago

Currently playing around with pairing implementations, but a quick benchmark with bls12-381 on my machine shows:

G1 mult: 350 us
G2 mult: 1.2 ms
Pairing: 1.4 ms

So the assumption that arithmetic operations on pairing friendly curves are a constant factor slower than on pairing free curves seem to be valid. For the case of 64-bit range-proofs, I don't yet see significant savings from relying on 2 pairings as opposed to relying on 128~256 MSMs on a pairing free curve, though pairings may provide better batching techniques for proof verification.

This will of course be a different story for >>128-bit range-proofs or for proofs over general circuit constraints as needed in more sophisticated applications like rollups, which may be a consideration when considering syscall support...

jon-chuang commented 3 years ago

How do you envision rollups could play a role in Solana, though?

> I suppose that compressing very long-running compute or memory-intensive computations could be one use case. solana's TX propagation model doesn't really like long-running computations, so I'm not sure it will ever accept computations spanning more than a few ms. Long-running TXs slow down an entire batch, whether of leader block construction/propagation or replay - and a long-running batch may cause linearisation congestion due to account deps - including those outside of the long-running tx. > > On a second note, rollups unfortunately have linear state. What would be truly interesting to me would be to somehow parallelise rollup state updates into a DAG. There was a recent [paper](https://arxiv.org/pdf/2105.11827.pdf) by Novi (Diem's wallet arm) that parallelises the mempool itself into a DAG. The objective in that scenario was to reduce the network bandwidth required to propagate TXs. In the case of a DAG-partitioned rollup, I suppose the purpose would be to parallelise the overall proof production and reduce the latency of state updates by allowing machines to safely generate proofs concurrently.

My logic is that if the ultimate performance of transactions based on pairing-free vs pairing-friendly curves are comparable, then using a pairing-free protocol is the way to go as it is conceptually simpler, does not require any trusted/sophisticated setup, and is generally more efficient on the client-side proof generation.

Yeah, I agree with that. Perhaps for generic circuits Halo2 could be your starting point - they implement a plookup/plonk circuit builder. Halo is the logarithmic version of bulletproofs, achieved via proof accumulation. Not sure if range proofs are available as a gadget.

Anw, just throwing out ideas.

The number I'd be more interested in is bpf execution time, though. Casting compute-intensive routines as precompiles/syscalls seems only a temporary measure to get around the 200000 compute unit limit. Whereas in the long-term, if JIT-compiled bpf versions perform decently, they could work just as well (although I agree this could be less efficient and result in duplication across applications on chain). Nevertheless, as a datapoint, would you be able to bench the above, both interpreted and jit compiled? An example of how to do this is here

Gunning for a standardised set of precompiles::crypto_primitives probably requires a design proposal to spell out how it ought to be organised, maintained and extended. An important metric here is probably binary size - though for crypto this in theory should be small, syscall programs must live in memory so it's at least worth finding out, especially if one intends to support many different curves, and possibly additional operations like pairings.

zhvng commented 2 years ago

Any updates on this? Would love to deploy ZK snarks on Solana.

solana-labs / solana

Syscall support for elliptic curve operations #17991

Problem

Proposed Solution