status-im / nim-blscurve

Nim implementation of the BLS signature scheme (Boneh-Lynn-Shacham) over the Barreto-Lynn-Scott curve BLS12-381
Apache License 2.0

Parallel and Serial batch verification #101

Closed · mratsim closed this 3 years ago

mratsim commented 3 years ago

This adds batched multi-signature verification to the library, with both a serial and an OpenMP-parallel backend. The algorithm chosen is recursive divide-and-conquer instead of the usual parallel for loop, which makes it easily portable to any threadpool that supports task spawning.

Disambiguations

Unfortunately, it is very easy to get confused between BLS signatures, signature aggregation and multi-signatures.

BLS signature level 1

We have a triplet (pubkey, message, signature) to verify. This is the base case, verify: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L108-L114
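
For reference, a minimal usage sketch of this base case. It assumes the keyGen/sign/verify procs exported by the library; exact signatures may differ slightly:

```nim
import blscurve

# Illustration only: an all-zero seed is insecure.
let ikm = newSeq[byte](32)

var
  pk: PublicKey
  sk: SecretKey
# Assumed helper: keyGen(ikm, publicKey, secretKey) derives a keypair (IETF draft KeyGen).
doAssert keyGen(ikm, pk, sk)

let msg = [byte 1, 2, 3]
let sig = sk.sign(msg)

# Level 1: one (pubkey, message, signature) triplet.
doAssert pk.verify(msg, sig)
```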

BLS signature level 2

We have N public keys verifying the same message (for example, attesting to the same block). This is fastAggregateVerify: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L230-L253

As shown in the benchmark, there is no use in parallelizing this case: by exploiting pairing curve properties, verifying an aggregate signature against 128 public keys only takes an extra 5.6% over a single verification.
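
A minimal sketch of this case, under the same assumptions as the level-1 example (keyGen/sign are assumed helpers; fastAggregateVerify and aggregate are the procs linked in this PR):

```nim
import blscurve

# N validators sign the same message (e.g. all attesting to the same block).
let msg = [byte 1, 2, 3]

var
  pubkeys: seq[PublicKey]
  aggSig: Signature

for i in 0 ..< 3:
  var
    pk: PublicKey
    sk: SecretKey
    ikm = newSeq[byte](32)
  ikm[0] = byte(i)                 # distinct (insecure) seed per validator
  doAssert keyGen(ikm, pk, sk)     # assumed keyGen(ikm, pubkey, seckey) helper
  pubkeys.add pk
  let sig = sk.sign(msg)
  if i == 0:
    aggSig = sig
  else:
    aggSig.aggregate(sig)          # aggregate(var Signature, Signature)

# Level 2: one aggregate signature, N public keys, a single shared message.
doAssert fastAggregateVerify(pubkeys, msg, aggSig)
```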

BLS signature level 3

We have N public keys verifying N distinct messages (not counting deposit signatures, which may be invalid). Furthermore, we can apply this to B blocks to collect N*B signature sets.

These are the aggregate(var Signature, Signature) and aggregateVerify procedures: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/blst/blst_min_pubkey_sig_core.nim#L141-L186 https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L153-L156 https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L177-L179

Those appear in the spec https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/beacon-chain.md#bls-signatures but are never used there; since we stayed almost literally with the spec, we did not use them either.
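
A sketch of level 3 under the same assumptions as the earlier examples; the (publicKeys, messages, signature) argument order of aggregateVerify follows the spec, and the exact message container type here is an assumption:

```nim
import blscurve

var
  pubkeys:  seq[PublicKey]
  messages: seq[seq[byte]]
  aggSig:   Signature

for i in 0 ..< 3:
  var
    pk: PublicKey
    sk: SecretKey
    ikm = newSeq[byte](32)
  ikm[0] = byte(i)
  doAssert keyGen(ikm, pk, sk)       # assumed keyGen(ikm, pubkey, seckey) helper

  let msg = @[byte(i), 0xDE, 0xAD]   # each key signs a *different* message
  let sig = sk.sign(msg)

  pubkeys.add pk
  messages.add msg
  if i == 0: aggSig = sig
  else: aggSig.aggregate(sig)        # aggregate(var Signature, Signature)

# Level 3: N public keys, N distinct messages, one aggregated signature.
doAssert aggregateVerify(pubkeys, messages, aggSig)
```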

The estimated performance improvements on single-block verification for the serial and parallel implementations are documented in the code at https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/blst/blst_min_pubkey_sig_core.nim#L401-L422
In summary, by batching, even on a single processor we get 2x faster single-block verification. If batching 10 blocks, instead of a naive cost of 1000 the cost becomes 20*10 (64-bit blinding) + 50 (Miller loop) + 50 (final exponentiation) = 300, a 3.33x acceleration factor.
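
Spelling out that arithmetic with the same abstract cost units (this is just the numbers above, not library code):

```nim
# Abstract cost units from the comment linked above.
const
  millerLoop = 50       # one Miller loop
  finalExp   = 50       # one final exponentiation
  batchedSet = 20       # one signature set with 64-bit blinding

let blocks = 10

# Naive: every block pays a full pairing check (Miller loop + final exp).
let naive = blocks * (millerLoop + finalExp)               # 10 * 100 = 1000

# Batched: each set only pays the blinded accumulation; the Miller loop
# and final exponentiation are paid once for the whole batch.
let batched = blocks * batchedSet + millerLoop + finalExp  # 200 + 100 = 300

echo naive.float / batched.float                           # ~3.33x acceleration
```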

The performance improvements are superlinear and increase with both the number of processors and the workload.

Batching signatures

The existing aggregateVerify functions are impractical as we need to save state for batching. Instead we can pass a collector object to collect all (publicKey(s), message, signature) triplets (called signature sets here, as in Lighthouse and Prysm) in a block, and then run verification once everything has been collected.

We actually do not care which verification fails: if any fails we need to drop the block anyway (unless some precise attacker accountability on invalid signatures is introduced).

This PR implements such a collector object, BatchBLSVerifier, with a serial and a parallel backend using OpenMP.
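
A purely hypothetical usage sketch of the collector; only the BatchBLSVerifier name comes from this PR, the add/batchVerify proc names below are assumptions:

```nim
# Hypothetical sketch only: the proc names below are assumptions, not the PR's API.
import blscurve

var batcher: BatchBLSVerifier       # collector from this PR (serial/OpenMP backends)

# While processing a block, collect instead of verifying inline:
#   batcher.add(pubkey, message, signature)      # level 1 triplet
#   batcher.add(pubkeys, message, signature)     # level 2: N pubkeys, same message
#
# Then verify everything at once; a single bool is enough, since any
# invalid set means the whole block is dropped:
#   let ok = batcher.batchVerify()               # or a parallel variant
```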

Parallel implementation

Contrary to the BLST example, which uses a simple parallel for loop followed by a serial for loop to merge the partial pairings, we use a recursive divide-and-conquer strategy. This doesn't change the cost of computing the partial pairings but significantly improves merging them (logarithmic vs linear depth), and merging partial pairings is a costly operation.
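
To illustrate the merge strategy only (PartialPairing and mergeWith below are placeholders, not the library's types): the pairwise split gives a logarithmic-depth merge tree, and each half can be spawned as an independent task.

```nim
# Placeholder types/procs, for illustration of the merge strategy only.
type PartialPairing = object

proc mergeWith(a: var PartialPairing, b: PartialPairing) =
  discard # stands in for the costly pairing accumulation in the real code

proc mergeReduce(partials: openArray[PartialPairing], lo, hi: int): PartialPairing =
  ## Merge partials[lo..hi] with logarithmic depth instead of a linear chain.
  ## The two halves are independent and can be spawned as threadpool tasks.
  if lo == hi:
    return partials[lo]
  let mid = (lo + hi) div 2
  result = mergeReduce(partials, lo, mid)          # left half (spawn-able)
  let right = mergeReduce(partials, mid + 1, hi)   # right half (spawn-able)
  result.mergeWith(right)                          # log2(n) merges on the critical path
```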

Reference merging code: https://github.com/supranational/blst/blob/7cda6fa09bfa9d789bd30b31dc1ae91656ee2f88/bindings/rust/src/lib.rs#L756-L759

As an example, Pyrmont at 2 weeks old had about 100000 blocks. A linear merge on my machine has an estimated cost of 0.3 s per 20 blocks, which leads to 100000/20 * 0.3 = 1500 s spent; 20 weeks would be 15000 s, i.e. about 4.1 hours.

With a divide-and-conquer approach on a 6-core CPU, as in the code example, we save half of that: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_batch_verifier.nim#L242-L245

https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_batch_verifier.nim#L204-L262

Nimbus unknowns

  1. An RFC will be submitted to discuss the following details:

    • Architecture change in Nimbus process_block to allow batching of all signatures within a block. This only affects consensus.
    • Changing sync and/or introducing batch_process_blocks to batch all signatures of many blocks.
  2. We will likely need 1 or 2 nodes in our fleet to test this PR on Pyrmont, as well as the consequent Nimbus refactoring, which will likely be complex to rebase and will live in a branch for a long time.

mratsim commented 3 years ago

Compiling with sanitizers

```sh
nim c -d:openmp -d:danger --cc:clang \
  --passC:"-fsanitize=undefined" --passC:"-fsanitize=address" \
  --passL:"-fsanitize=address" --passL:"-fsanitize=undefined" \
  --outdir:build -r --verbosity:0 --hints:off --warnings:off \
  --passC:"-g" benchmarks/bench_all.nim
```

arnetheduck commented 3 years ago

So, to move forward with this, I'd go with splitting out the batch verification into a separate PR (it looks like a completely orthogonal concern), then we can rework nbc to work with batch verification; once that's done, we can proceed with the rest.

I guess it's not possible to do batch deserialization as well?

mratsim commented 3 years ago

On splitting the BatchVerifier into a separate PR, this can be done, however:

mratsim commented 3 years ago

Benchmark on a Pi4 running 32-bit Raspbian: this shows that with a low core count (4 cores) we already get a 4x speedup on just 6 pubkeys/messages.

mratsim commented 3 years ago

Benchmarks on x86, raw, with no attempt to tune the partitioning depending on the ratio of work to cores.

Investigating the slowdown with only 6 signatures: it seems that if we set the number of threads to the number of physical cores (18 here, versus 36 logical threads due to hyperthreading), there is no slowdown even if there are far more threads than work. Command: OMP_NUM_THREADS=18 nimble bench

My suspicion is that hyperthreading is hurting performance here. On the other hand, if there is more work than threads, hyperthreading is beneficial.

mratsim commented 3 years ago

Actually, I was repeating the benchmark only 1 time (an artifact from a debugging session), so the measured time was unreliable; there is no need for tuning.