
Research zkSNARKs blocker: Benchmark and optimize proof time #7

Status: Open. Opened by oskarth 4 years ago.

oskarth commented 4 years ago

Problem

Proving time for Semaphore (https://github.com/kobigurk/semaphore) zkSNARKs using Circom, Groth16 and snarkjs is currently far too long: it takes on the order of ~10 minutes to generate a proof. With websnark it is likely to take ~30s, which might still be too long. We should experiment with native code on mobile here.

Acceptance criteria

While we can potentially use precomputed proofs to some extent, proving needs to be on the order of a second to be feasible for e.g. spam-protection signalling of normal messages. This means a performance improvement of a bit more than two orders of magnitude is necessary. Alternatively, the raw performance requirement can be somewhat lower if we get precomputed proofs and batching to work smoothly.

Details

See https://github.com/vacp2p/research/blob/master/zksnarks/semaphore/src/hello.js#L118 for proof generation. This uses snarkjs and Circom and runs through Node, which is likely one of the slower environments. Additionally, the number of constraints for the SNARK is around ~60k.
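
For context, here is a minimal proof-generation sketch using the current snarkjs API (`groth16.fullProve`), which differs from the older interface hello.js was written against; the file names are placeholders, not files in this repo.

```javascript
// Minimal Groth16 proof-generation sketch with the modern snarkjs API.
// Assumes the circuit has already been compiled to a .wasm witness generator
// and a .zkey proving key has been produced; file names are placeholders.
const snarkjs = require("snarkjs");

async function prove(input) {
  // fullProve computes the witness and the Groth16 proof in one call.
  const { proof, publicSignals } = await snarkjs.groth16.fullProve(
    input,                  // circuit inputs, e.g. identity secret + Merkle path
    "semaphore.wasm",       // placeholder: compiled witness generator
    "semaphore_final.zkey"  // placeholder: proving key from the setup
  );
  return { proof, publicSignals };
}

// prove({ /* circuit inputs */ }).then(console.log);
```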

Reducing the problem to a simpler circuit (Pedersen commitment, 5 rounds; see https://ethresear.ch/t/benchmark-circom-vs-bellman-wasm-in-chrome-on-mobile/5261), we have previous benchmarks of Circom and Bellman in different environments (Node/WASM/native):

I.e. WASM is roughly 50x faster and native roughly 500x faster than Node (see the numbers below). There appears to be work in progress and room for improvement on the WASM benchmarks.

The Circom and Bellman benchmark circuits both have around ~3.5k constraints, i.e. roughly 5% of our original SNARK.

```
MacBook Core i7 @ 2.6GHz
circom (node.js):        53685ms
circom (chrome):         53266ms
circom (websnark):       1408ms
bellman (webassembly):   1151ms
bellman (rust, 1 core):  235ms
bellman (rust, 8 cores): 85ms

iPhone XS
bellman (webassembly):   1126ms
```

For our initial problem, this would mean roughly:

3.5k -> 60k constraints, i.e. roughly 17x more

> circom (node.js):        53685ms
> circom (websnark):       1408ms
> bellman (rust, 8 cores): 85ms

=> node ~15m => wasm ~30s => native ~2s
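
As a sanity check on this arithmetic, a throwaway sketch of the extrapolation (assuming proving time scales roughly linearly with the number of constraints, which is only a first-order approximation):

```javascript
// Rough extrapolation from the ~3.5k-constraint benchmark circuit to our ~60k
// constraints, assuming proving time scales roughly linearly with constraints.
const scale = 60000 / 3500; // ~17x
const benchmarkMs = {
  "circom (node.js)": 53685,
  "circom (websnark)": 1408,
  "bellman (rust, 8 cores)": 85,
};
for (const [env, ms] of Object.entries(benchmarkMs)) {
  console.log(`${env}: ~${((ms * scale) / 1000).toFixed(1)}s`);
}
// => ~920.3s (~15m), ~24.1s, ~1.5s respectively
```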

Possible solutions

1. Use a better environment (WASM or native).

This is a no-brainer: either use websnark with the existing Circom SNARKs, or compile Bellman to WASM.

WASM might not be enough, though, especially to get to sub-second proving, and it requires more optimization.

Native would be useful, especially for resource-restricted devices and for spam protection. Additional benchmarks on mobile would be good to have. A big issue is deployment: currently native would mean something like Bellman, which right now doesn't have a library like Circom for writing SNARKs. This seems to be an ongoing effort:

> We currently work on Circom -> Bellman export so that we can combine the ease of use of the Circom language with the performance and portability of Bellman. The goal is to make a toolchain that allows you to write SNARKs in Circom and get super fast .wasm + JS bindings as the compilation result, hiding the complexity of Bellman under the hood.

2. Decrease number of constraints.

60k could possibly be reduced quite a lot, but this requires additional research (e.g. Merkle tree structure, number of rounds for various hash functions, etc.). If the constraint count could be halved, then websnark might be enough.

3. Use a different type of ZKP

Other types of ZKP come with lower proving time, but with other trade-offs (such as proof size). From my current POV, the main reasons to explore this would be proving time and security assumptions (cryptographic assumptions and trusted setup).

4. Precompute proofs

It's possible we can precompute proofs for e.g. spam protection. This would mean you can signal instantly and then 'save up' messages. How this would work in practice requires further research.

5. Prover as a service?

I don't know what this would look like, but perhaps proof generation could be outsourced. This appears to be StarkWare's business model if I understand it correctly: https://starkware.co/the-road-ahead-in-2019/

6. Batching of signals

A bit similar to the precompute case, but it might be doable to take all messages in a ~30s window and batch them together. In the human chat case this would impact UX, but it is worth exploring.

Note

The benchmarks in the problem description and on ethresear.ch might be incorrect, so they are worth checking again.

barryWhiteHat commented 4 years ago

Semaphore currently uses the Blake2 hash function (for quantum forward secrecy) and the Pedersen hash (collision resistant; a more conservative choice than MiMC). We also use an EdDSA signature for a familiar crypto UX; this could be replaced with knowledge of a preimage as the signature.

An example of a minimal circuit that lacks the quantum forward secrecy is https://github.com/peppersec/tornado-mixer/; it is possibly a better benchmark candidate.

You would need to add a Shamir secret share of the leaf, but that applies to both.

mmaller commented 4 years ago

Regarding Point 4: Precompute proofs

Using Groth16 (and possibly any other ZKP system that doesn't rely on random oracles), I believe this is possible provided that small changes to the input do not drastically change the witness values.

For example, consider two very small Merkle trees, Hash(Hash(a), Hash(b)) and Hash(Hash(a), Hash(c)). The witness components required to verify that Hash(a) is computed correctly are the same for both, so the prover's work in verifying this value does not necessarily need to be repeated.

In Groth16, I believe some of the work for computing the first two proof elements A and B can be precomputed for these types of circuits. Regarding the third proof element C, anything related to the output of the FFT will need to be recomputed from scratch every time (so a guesstimated 1/8 of the total work cannot be aided by precomputation, not including FFT costs).
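
For reference (an editorial addition using the standard notation of the Groth16 paper, not something stated in this thread), the Groth16 prover computes

$$
\begin{aligned}
A &= \alpha + \sum_{i=0}^{m} a_i\,u_i(\tau) + r\delta, \qquad
B = \beta + \sum_{i=0}^{m} a_i\,v_i(\tau) + s\delta,\\
C &= \frac{\sum_{i=\ell+1}^{m} a_i\big(\beta u_i(\tau) + \alpha v_i(\tau) + w_i(\tau)\big) + h(\tau)\,t(\tau)}{\delta} + sA + rB - rs\delta,
\end{aligned}
$$

where the $a_i$ are the witness values. The sums over $a_i$ are multi-scalar multiplications whose terms for unchanged witness components could in principle be cached across proofs, while $h(\tau)t(\tau)$ comes out of the FFT over the full witness and has to be recomputed each time, which matches the estimate above.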

kobigurk commented 4 years ago

Some of these are avenues that definitely can be explored.

About reducing the number of constraints: we use some methods that are possibly safer, though they come at a cost - we use Pedersen hashes for commitments and Blake2s for nullifier computation. If you believe that algebraic hash functions such as MiMC are strong enough to provide sufficient hiding and post-quantum security, and not reveal anything about the hashes' inputs, you could use those instead and lower the number of constraints considerably.

About WASM vs JS: I'm not sure you'll get to subsecond with the current tooling, either JS or WASM. That said, a native Zcash spend proof in Bellman takes on a good laptop, as far as I remember, about a second. I've also done some benchmarks in the past with single-threaded Bellman: https://github.com/kobigurk/wasm_proof. With websnark, the situation is better since it's multi-threaded and doesn't have the overhead of Rust -> WASM, though I don't have numbers.

As to prover as a service, this isn't trivial. In the StarkWare case it makes sense - there's no privacy involved. The StarkWare prover, roughly, aggregates a bunch of in-the-clear transactions and proves it was done correctly. Anyone could do it, since there's no secret data involved. In this case, the user has private data they have to input to the proof. There could be interesting methods to explore here, though it's not as direct as just offloading.

Regarding proof precomputation, I agree with @mmaller that some of A and B, and even parts of C (e.g., some of the sA + rB part of it), can be precomputed. One part that is moderately expensive and doesn't seem amenable to precomputation is the H coefficients, which is where the FFT happens.

I think your best bet would be to use multicore Bellman on native mobile. I believe this would be by far the most efficient method and would also alleviate the need to download circuit.json, as its equivalent can be efficiently compiled into the program binary. The proving key might also be more efficiently represented, though I'd have to look at it again to be sure. proving_key.bin is already binary, though maybe points are decompressed or there are zero points saved, etc.

It would require the rewrite of the Semaphore circuit in Bellman. With the gadgets already existing in Bellman and sapling-crypto, it shouldn't be a huge undertaking, though not trivial.

barryWhiteHat commented 4 years ago

> It would require the rewrite of the Semaphore circuit in Bellman. With the gadgets already existing in Bellman and sapling-crypto, it shouldn't be a huge undertaking, though not trivial.

Could we use the circom importer that you wrote for the trusted setup to bypass the need to rewrite?

barryWhiteHat commented 3 years ago

It's currently taking us 0.672 seconds to make a proof; our estimate for mobile is 2.6 seconds.

mratsim commented 3 years ago

Regarding WASM: looking at the available backends in https://github.com/kobigurk/wasm_proof, I see that it's using the Rust pairing crate: https://github.com/kobigurk/wasm_proof/tree/db43012b/pairing

Unfortunately I expect this backend to be significantly slower than the state of the art. In particular for WASM, Herumi's MCL can be used (https://github.com/herumi/mcl#how-to-build-for-wasmwebassembly), or Barretenberg (warning: GPL): https://github.com/AztecProtocol/barretenberg

MCL is also undergoing a security audit at the moment sponsored by the Ethereum Foundation.

For instance, assuming you use BN254 SNARKs, here are a couple of implementation issues that will slow down the code by over 2x:

I can't comment on the implementation of the ZK-SNARKs on top of the pairing primitives as I didn't implement them (yet), but AFAIK parallel multi-scalar point multiplication seemed quite important - in particular Pippenger's bucket method (see the sketch below and https://ethresear.ch/t/simple-guide-to-fast-linear-combinations-aka-multiexponentiations/7238).
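
For intuition, here is a self-contained sketch of Pippenger's bucket method. This is an editorial illustration, not code from any of the libraries mentioned; to keep it runnable the group operation is plain BigInt addition, whereas a real prover would use elliptic-curve point addition and tune the window size to the input size.

```javascript
// Sketch of Pippenger's bucket method for a multi-scalar multiplication
// sum_i s_i * P_i. The "group operation" here is BigInt addition so the example
// is self-contained; a real prover would use elliptic-curve point addition.
function pippenger(scalars, points, windowBits = 4) {
  const w = BigInt(windowBits);
  const numBuckets = (1 << windowBits) - 1; // digit 0 needs no bucket
  const maxBits = Math.max(...scalars.map((s) => s.toString(2).length));
  const numWindows = Math.ceil(maxBits / windowBits);
  let result = 0n;
  // Process scalar windows from most significant to least significant.
  for (let win = numWindows - 1; win >= 0; win--) {
    result <<= w; // "double" the accumulator windowBits times
    const shift = BigInt(win) * w;
    const buckets = new Array(numBuckets).fill(0n);
    // Drop each point into the bucket selected by its scalar's current digit.
    for (let i = 0; i < scalars.length; i++) {
      const digit = Number((scalars[i] >> shift) & BigInt(numBuckets));
      if (digit !== 0) buckets[digit - 1] += points[i]; // group add
    }
    // Fold buckets: sum_d (d * bucket_d) via a running suffix sum.
    let running = 0n;
    let windowSum = 0n;
    for (let d = numBuckets - 1; d >= 0; d--) {
      running += buckets[d];
      windowSum += running;
    }
    result += windowSum;
  }
  return result;
}

// Quick self-check against the naive sum.
const scalars = [123456789n, 987654321n, 42n];
const points = [3n, 5n, 7n];
const naive = scalars.reduce((acc, s, i) => acc + s * points[i], 0n);
console.log(pippenger(scalars, points) === naive); // true
```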

barryWhiteHat commented 3 years ago

Do we have to use WASM for this? Can we not use the native Bellman prover, since it does not need to run in a browser?

That is what we used in our estimates above, so I was hoping it would just be these times directly.

oskarth commented 3 years ago

We want to be able to run in the browser as well, at least eventually. For mobile and desktop, I believe Bellman native can be used as is (with a wrapper in Nim, for example). Unless there are potential compatibility issues from using different backends for mobile vs. WASM etc.? I haven't looked into this or thought about it in detail.


2.6s on mobile seems a tad high, especially if this means higher variance for low-end devices. As you said before @barryWhiteHat, more benchmarks for mobile would be useful. Though this can probably be complemented with the precomputed proofs and batching ideas?

barryWhiteHat commented 3 years ago

> We want to be able to run in the browser as well, at least eventually. For mobile and desktop, I believe Bellman native can be used as is (with a wrapper in Nim, for example). Unless there are potential compatibility issues from using different backends for mobile vs. WASM etc.? I haven't looked into this or thought about it in detail.

That should not be a problem.

> 2.6s on mobile seems a tad high, especially if this means higher variance for low-end devices. As you said before @barryWhiteHat, more benchmarks for mobile would be useful. Though this can probably be complemented with the precomputed proofs and batching ideas?

I think this idea may not work, because the work needed to combine proofs is much bigger than the work we do in the whole circuit. So if we want to batch proofs together, it ends up being more expensive.

barryWhiteHat commented 3 years ago

We have some preliminary benchmarks on mobile.

Our constraint sizes are for Merkle depths 32, 24, and 16 (2^16 ≈ 65k users, 2^24 ≈ 16.7M users).

iPhone SE (2016): depth 32: 1.615s, depth 24: 1.381s, depth 16: 0.925s
iPhone 11: depth 32: 0.438s, depth 24: 0.365s, depth 16: 0.247s

We are confident we can get this below 1 second on these devices. What is the minimum-power device you want to support? Also, how many users do you want to support right now?

@jacqueswww perhaps you have a good idea on this.

barryWhiteHat commented 3 years ago

More benchmarks, thanks to @kilic.

Results were generated with browserstack.com.

| Device | Security Level | Merkle Depth | Prover Time (s) |
| --- | --- | --- | --- |
| Pixel 4 | 128 | 16 | 0.541 |
| Pixel 4 | 128 | 24 | 0.795 |
| Pixel 4 | 128 | 32 | 0.917 |
| Pixel 4 | 80 | 16 | 0.464 |
| Pixel 4 | 80 | 24 | 0.542 |
| Pixel 4 | 80 | 32 | 0.749 |
| Pixel 2 | 128 | 16 | 0.62 |
| Pixel 2 | 128 | 24 | 0.986 |
| Pixel 2 | 128 | 32 | 1.061 |
| Pixel 2 | 80 | 16 | 0.492 |
| Pixel 2 | 80 | 24 | 0.624 |
| Pixel 2 | 80 | 32 | 0.842 |
| Galaxy S8 | 128 | 16 | 0.538 |
| Galaxy S8 | 128 | 24 | 0.775 |
| Galaxy S8 | 128 | 32 | 0.808 |
| Galaxy S8 | 80 | 16 | 0.412 |
| Galaxy S8 | 80 | 24 | 0.551 |
| Galaxy S8 | 80 | 32 | 0.677 |
| iPhone 11 | 128 | 16 | 0.247 |
| iPhone 11 | 128 | 24 | 0.364 |
| iPhone 11 | 128 | 32 | 0.447 |
| iPhone 11 | 80 | 16 | 0.200 |
| iPhone 11 | 80 | 24 | 0.249 |
| iPhone 11 | 80 | 32 | 0.354 |
| iPhone 8 | 128 | 16 | 0.304 |
| iPhone 8 | 128 | 24 | 0.458 |
| iPhone 8 | 128 | 32 | 0.553 |
| iPhone 8 | 80 | 16 | 0.252 |
| iPhone 8 | 80 | 24 | 0.336 |
| iPhone 8 | 80 | 32 | 0.465 |
| iPhone SE | 128 | 16 | 0.921 |
| iPhone SE | 128 | 24 | 1.391 |
| iPhone SE | 128 | 32 | 1.584 |
| iPhone SE | 80 | 16 | 0.758 |
| iPhone SE | 80 | 24 | 0.948 |
| iPhone SE | 80 | 32 | 1.322 |

jacqueswww commented 3 years ago

I reckon Android 7/8 with 1GB of RAM, and the iPhone 6 I think is the equivalent (iPhones are only relevant for some countries, but iPhones stay in the hands of users much longer). That is what you should use as your minimal requirement for maximum reach worldwide.

From the above benchmarks, the speed looks like it will be manageable on these devices, which is great, as long as it also works on lower-end (1GB RAM) devices.

For an interesting view, check the device composition for an app listed on the Google Play store for the Southern African region (the app is targeted at the low- to mid-income market segment). [image]

oskarth commented 3 years ago

Awesome benchmarks @barryWhiteHat and @kilic!

And thanks for jumping in @jacqueswww :)

There are some specific architectures and versions the Core app wants to support, but I can't find the resource for this right now. I recall that a standard model used for testing is/was Galaxy S6 and iPhone 6s, but this might have changed. In general, Android is likely to be more of a problem than iOS.

@cammellos speaking for Core:

1. What is the current minimum set of devices the app aims to support?
2. For an anti-spam solution such as this, is sub-second/~1s to generate a message an acceptable target from a UX POV? This is before potential precomputed proofs, etc.

oskarth commented 3 years ago

Question on the Merkle tree size: can this easily be upgraded, e.g. start with depth 16 and later upgrade to depth 32 with backwards compatibility?

barryWhiteHat commented 3 years ago

> Question on the Merkle tree size: can this easily be upgraded, e.g. start with depth 16 and later upgrade to depth 32 with backwards compatibility?

Yes, this can be done easily: just upgrade the user app and replace the SNARK proving key. The plan is to optimize the prover more as the tree grows.
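
This isn't spelled out in the thread, but for intuition, here is a hypothetical sketch (an editorial illustration; the SHA-256 stand-in hash and all names are assumptions, not Semaphore's actual construction) of one common way a tree can be deepened while keeping old leaves valid: treat the old root as the leftmost depth-16 subtree of the new depth-32 tree, with the remaining siblings set to precomputed empty-subtree roots.

```javascript
// Hypothetical sketch of a backwards-compatible Merkle depth upgrade.
// SHA-256 stands in for the circuit's actual hash (Pedersen/MiMC/etc.).
const crypto = require("crypto");

const hash = (left, right) =>
  crypto.createHash("sha256").update(Buffer.concat([left, right])).digest();

// roots[d] is the root of an empty subtree of depth d.
function emptyRoots(maxDepth, zeroLeaf) {
  const roots = [zeroLeaf];
  for (let d = 1; d <= maxDepth; d++) roots.push(hash(roots[d - 1], roots[d - 1]));
  return roots;
}

// Lift an old root at depth oldDepth to the root of a newDepth tree whose left
// spine contains the old tree and whose right siblings are all empty subtrees.
function liftRoot(oldRoot, oldDepth, newDepth, zeroLeaf) {
  const empties = emptyRoots(newDepth, zeroLeaf);
  let node = oldRoot;
  for (let d = oldDepth; d < newDepth; d++) node = hash(node, empties[d]);
  return node;
}

const zero = Buffer.alloc(32);          // zero leaf
const oldRoot = crypto.randomBytes(32); // stand-in for an existing depth-16 root
console.log(liftRoot(oldRoot, 16, 32, zero).toString("hex"));
```

Under a construction like this, old membership paths stay valid once extended with the extra empty-sibling hashes, which fits the "upgrade the app and replace the proving key" approach described above.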

cammellos commented 3 years ago

Thanks for moving this forward and for the benchmarks, really insightful stuff.

> What is the current minimum set of devices the app aims to support?

The lowest-end device we support is the Galaxy Note 4. As I understand it, the app is barely usable on it, but you can transact and have a limited number of chats open.

> For an anti-spam solution such as this, is sub-second/~1s to generate a message an acceptable target from a UX POV? This is before potential precomputed proofs, etc.

1 second is fairly high in terms of responsiveness, as we need to add network delay on top, but it is not a deal breaker for some solutions; it all depends on the UX imposition on the users and how the solution is used in the app.

barryWhiteHat commented 3 years ago

> 1 second is fairly high in terms of responsiveness, as we need to add network delay on top, but it is not a deal breaker for some solutions; it all depends on the UX imposition on the users and how the solution is used in the app.

This happens per message that is sent, so it ends up being a little bit of lag in the chat. But it was my understanding that the overall network lag was enough that you did not really notice this. What would be a more acceptable number?

Also, this varies with the device; with newer devices we are down to 2/10ths of a second. We also plan to optimize this further with an ARM assembly implementation of the prover. Sub-second is more of a starting point.

cammellos commented 3 years ago

> This happens per message that is sent, so it ends up being a little bit of lag in the chat. But it was my understanding that the overall network lag was enough that you did not really notice this. What would be a more acceptable number?

Hard to say. Currently it takes roughly under a second for a message to show on the recipient side, so it feels pretty much instant. Ideally building a message takes 0.25-0.5s, so overall we stay around 1s including network latency. But if it's a CPU-expensive operation, battery usage is likely going to be the driving factor on slow devices (though that's just my guess).

Regardless, I would not worry too much about latency at this point; 1 second might be OK and is definitely not a show stopper. What matters more is how we are going to use this and what we can build around it at the product level (from Core's point of view, that is).

The way I see it, having a 1-second delay for messages is more acceptable UX-wise than, for example, asking users to stake/deposit some funds to send messages. So when thinking about a comprehensive spam solution, the UX degradation might have to be addressed and made palatable somewhere else to give users a buy-in, at which point the delay in sending messages will be absolutely fine.

We can always have a free-for-all, user-run, permissionless network (as it is now), and have other parts of the network that are spam-free but require deposits/staking etc., at the cost of higher latency, but that give access to something valuable to the user (curated content, anonymous voting, etc.).

Thanks again for the input!

mratsim commented 3 years ago

Note that 1 second is with the current state of optimization. From @kilic's code at https://github.com/kilic/bls12-381, here are the main optimizations done/available:

So if we can reach 1s now this is likely a reasonable worst-case baseline.

Furthermore, if we reach the limits of compute, there might be other avenues to explore: can proofs be aggregated? Or can a market even be created with SNT so that lighter nodes delegate proving to heavier nodes for a fee?

barryWhiteHat commented 3 years ago

The short-term plan is what we do until we hit the limits of compute.

Medium term, I would like to move to PLONK, which allows us to use custom constraints to express the same circuit with far fewer constraints; I hope this would reduce the proving time a lot.

Long term, we could even invest in some custom cryptography that does what RLN does, but it seems this might be hard to do in a way that lets slashing happen quickly. I know @kobigurk has some ideas on this, and @mmaller would be able to advise on this as well.

kilic commented 3 years ago

@jacqueswww, @barryWhiteHat I have results from the iPhone 6 for various cases.

| Device | Security Level | Merkle Depth | Prover Time (s) |
| --- | --- | --- | --- |
| iPhone 6 | 128 | 16 | 1.143 |
| iPhone 6 | 128 | 24 | 1.885 |
| iPhone 6 | 128 | 32 | 2.554 |
| iPhone 6 | 80 | 16 | 0.924 |
| iPhone 6 | 80 | 24 | 1.326 |
| iPhone 6 | 80 | 32 | 1.627 |

barryWhiteHat commented 3 years ago

Here is our write-up about incentives for RLN: https://ethresear.ch/t/rln-incentives-for-p2p-networks/8085

At the moment we are trying to decide between reducing proving time on mobile and moving forward with building the incentive contracts.

On your side, is the iPhone 6 proving time of ~1.3 seconds a problem? I feel that as a lower bound this is really great, and perhaps you all or the broader community can build a faster prover that we can use here and in a bunch of other projects.

mratsim commented 3 years ago

Looking into Rust implementations, @jon-chuang has done great optimizations to Zexe over the past months, for example this one: https://github.com/scipr-lab/zexe/pull/296.

I expect Zexe is currently the fastest Rust SNARKs library.

Also, given the flurry of activity I see on highly optimized SNARKs (see also https://github.com/ConsenSys/gnark), there might be enough people and research/implementation directions to justify a common Discord channel?

kobigurk commented 3 years ago

> Looking into Rust implementations, @jon-chuang has done great optimizations to Zexe over the past months, for example this one: scipr-lab/zexe#296.
>
> I expect Zexe is currently the fastest Rust SNARKs library.
>
> Also, given the flurry of activity I see on highly optimized SNARKs (see also https://github.com/ConsenSys/gnark), there might be enough people and research/implementation directions to justify a common Discord channel?

I can confirm this is very fast. On BW6, with batch inversion, GLV and optimized assembly you get 3-4x.

jon-chuang commented 3 years ago

Do note that most of the optimisations only apply to scalar multiplications and not MSMs. Currently, our MSMs for BN254 are roughly on par with Barretenberg, lagging by around 5%.