status-im / nim-blscurve

Nim implementation of the BLS signature scheme (Boneh-Lynn-Shacham) over the Barreto-Lynn-Scott curve BLS12-381
Apache License 2.0

Parallel and Serial batch verification #101

Closed · mratsim closed this 3 years ago

mratsim commented 3 years ago

This adds batched multi-signature verification to the library, with both a serial and an OpenMP-parallel backend. The algorithm chosen is recursive divide-and-conquer instead of the usual parallel for loop, which makes it easily portable to any threadpool that supports task spawning.

Disambiguations

Unfortunately, it is very easy to get confused between BLS signatures, signature aggregation and multi-signatures.

BLS signature level 1

We have a triplet (pubkey, message, signature) to verify. This is the base case, verify: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L108-L114
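
For reference, a minimal usage sketch of this base case. It assumes the keyGen/sign/verify procs exported by the library; exact signatures may differ slightly:

```nim
import blscurve

# Illustration only: an all-zero seed is insecure.
let ikm = newSeq[byte](32)

var
  pk: PublicKey
  sk: SecretKey
# Assumed helper: keyGen(ikm, publicKey, secretKey) derives a keypair (IETF draft KeyGen).
doAssert keyGen(ikm, pk, sk)

let msg = [byte 1, 2, 3]
let sig = sk.sign(msg)

# Level 1: one (pubkey, message, signature) triplet.
doAssert pk.verify(msg, sig)
```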

BLS signature level 2

We have N public keys verifying the same message (for example, attesting to the same block). This is fastAggregateVerify: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L230-L253

As shown in the benchmark, there is no use in parallelizing this case: by exploiting pairing curve properties, verifying an aggregate signature against 128 public keys only takes an extra 5.6% over a single verification.
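
A minimal sketch of this case, under the same assumptions as the level-1 example (keyGen/sign are assumed helpers; fastAggregateVerify and aggregate are the procs linked in this PR):

```nim
import blscurve

# N validators sign the same message (e.g. all attesting to the same block).
let msg = [byte 1, 2, 3]

var
  pubkeys: seq[PublicKey]
  aggSig: Signature

for i in 0 ..< 3:
  var
    pk: PublicKey
    sk: SecretKey
    ikm = newSeq[byte](32)
  ikm[0] = byte(i)                 # distinct (insecure) seed per validator
  doAssert keyGen(ikm, pk, sk)     # assumed keyGen(ikm, pubkey, seckey) helper
  pubkeys.add pk
  let sig = sk.sign(msg)
  if i == 0:
    aggSig = sig
  else:
    aggSig.aggregate(sig)          # aggregate(var Signature, Signature)

# Level 2: one aggregate signature, N public keys, a single shared message.
doAssert fastAggregateVerify(pubkeys, msg, aggSig)
```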

BLS signature level 3

We have N public keys verifying N distinct messages (not counting deposit signatures, which may be invalid). Furthermore, we can apply this to B blocks to collect N*B signature sets.

These are the aggregate(var Signature, Signature) and aggregateVerify procedures: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/blst/blst_min_pubkey_sig_core.nim#L141-L186 https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L153-L156 https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_sig_min_pubkey.nim#L177-L179

Those appear in the spec https://github.com/ethereum/eth2.0-specs/blob/v1.0.0/specs/phase0/beacon-chain.md#bls-signatures but are never used there; since we stayed almost literally with the spec, we did not use them either.
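
A sketch of level 3 under the same assumptions as the earlier examples; the (publicKeys, messages, signature) argument order of aggregateVerify follows the spec, and the exact message container type here is an assumption:

```nim
import blscurve

var
  pubkeys:  seq[PublicKey]
  messages: seq[seq[byte]]
  aggSig:   Signature

for i in 0 ..< 3:
  var
    pk: PublicKey
    sk: SecretKey
    ikm = newSeq[byte](32)
  ikm[0] = byte(i)
  doAssert keyGen(ikm, pk, sk)       # assumed keyGen(ikm, pubkey, seckey) helper

  let msg = @[byte(i), 0xDE, 0xAD]   # each key signs a *different* message
  let sig = sk.sign(msg)

  pubkeys.add pk
  messages.add msg
  if i == 0: aggSig = sig
  else: aggSig.aggregate(sig)        # aggregate(var Signature, Signature)

# Level 3: N public keys, N distinct messages, one aggregated signature.
doAssert aggregateVerify(pubkeys, messages, aggSig)
```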

The estimated performance improvements on single-block verification for the serial and parallel implementations are documented in the code at https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/blst/blst_min_pubkey_sig_core.nim#L401-L422
In summary, by batching, even on a single processor we get 2x faster single-block verification. If batching 10 blocks, instead of a naive cost of 1000 the cost becomes 20*10 (64-bit blinding) + 50 (Miller loop) + 50 (final exponentiation) = 300, a 3.33x acceleration factor.
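
Spelling out that arithmetic with the same abstract cost units (this is just the numbers above, not library code):

```nim
# Abstract cost units from the comment linked above.
const
  millerLoop = 50       # one Miller loop
  finalExp   = 50       # one final exponentiation
  batchedSet = 20       # one signature set with 64-bit blinding

let blocks = 10

# Naive: every block pays a full pairing check (Miller loop + final exp).
let naive = blocks * (millerLoop + finalExp)               # 10 * 100 = 1000

# Batched: each set only pays the blinded accumulation; the Miller loop
# and final exponentiation are paid once for the whole batch.
let batched = blocks * batchedSet + millerLoop + finalExp  # 200 + 100 = 300

echo naive.float / batched.float                           # ~3.33x acceleration
```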

The performance improvements are superlinear and increase with both the number of processors and the workload.

Batching signatures

The existing aggregateVerify functions are impractical as we need to save state for batching. Instead we can pass a collector object to collect all (publicKey(s), message, signature) triplets (called signature sets here, as in Lighthouse and Prysm) in a block, and then run verification once everything has been collected.

We actually do not care which verification fails: if any fails we need to drop the block anyway (unless some precise attacker accountability on invalid signatures is introduced).

This PR implements such a collector object, BatchBLSVerifier, with a serial and a parallel backend using OpenMP.
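
A purely hypothetical usage sketch of the collector; only the BatchBLSVerifier name comes from this PR, the add/batchVerify proc names below are assumptions:

```nim
# Hypothetical sketch only: the proc names below are assumptions, not the PR's API.
import blscurve

var batcher: BatchBLSVerifier       # collector from this PR (serial/OpenMP backends)

# While processing a block, collect instead of verifying inline:
#   batcher.add(pubkey, message, signature)      # level 1 triplet
#   batcher.add(pubkeys, message, signature)     # level 2: N pubkeys, same message
#
# Then verify everything at once; a single bool is enough, since any
# invalid set means the whole block is dropped:
#   let ok = batcher.batchVerify()               # or a parallel variant
```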

Parallel implementation

Contrary to the BLST example, which uses a simple parallel for loop followed by a serial for loop to merge the partial pairings, we use a recursive divide-and-conquer strategy. This doesn't change the cost of computing the partial pairings but significantly improves merging them (logarithmic vs linear depth), and merging partial pairings is a costly operation.
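
To illustrate the merge strategy only (PartialPairing and mergeWith below are placeholders, not the library's types): the pairwise split gives a logarithmic-depth merge tree, and each half can be spawned as an independent task.

```nim
# Placeholder types/procs, for illustration of the merge strategy only.
type PartialPairing = object

proc mergeWith(a: var PartialPairing, b: PartialPairing) =
  discard # stands in for the costly pairing accumulation in the real code

proc mergeReduce(partials: openArray[PartialPairing], lo, hi: int): PartialPairing =
  ## Merge partials[lo..hi] with logarithmic depth instead of a linear chain.
  ## The two halves are independent and can be spawned as threadpool tasks.
  if lo == hi:
    return partials[lo]
  let mid = (lo + hi) div 2
  result = mergeReduce(partials, lo, mid)          # left half (spawn-able)
  let right = mergeReduce(partials, mid + 1, hi)   # right half (spawn-able)
  result.mergeWith(right)                          # log2(n) merges on the critical path
```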

Reference merging code: https://github.com/supranational/blst/blob/7cda6fa09bfa9d789bd30b31dc1ae91656ee2f88/bindings/rust/src/lib.rs#L756-L759

As an example, Pyrmont at 2 weeks old had about 100000 blocks. A linear merge on my machine has an estimated cost of 0.3 s per 20 blocks, which leads to 100000/20 * 0.3 = 1500 s spent; 20 weeks would be 15000 s, i.e. about 4.1 hours.

With a divide-and-conquer approach on a 6-core CPU, as in the code example, we save half of that: https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_batch_verifier.nim#L242-L245

https://github.com/status-im/nim-blscurve/blob/b0a1d149e97142649bd99c4acebcf551b7b05dda/blscurve/bls_batch_verifier.nim#L204-L262

Nimbus unknowns

  1. An RFC will be submitted to discuss the following details:

    • Architecture change in Nimbus process_block to allow batching of all signatures within a block. This only affects consensus.
    • Changing sync and/or introducing batch_process_blocks to batch all signatures of many blocks.
  2. We will likely need 1 or 2 nodes in our fleet to test this PR on Pyrmont, as well as the consequent Nimbus refactoring, which will likely be complex to rebase and will live in a branch for a long time.

mratsim commented 3 years ago

Compiling with sanitizers

```sh
nim c -d:openmp -d:danger --cc:clang \
  --passC:"-fsanitize=undefined" --passC:"-fsanitize=address" \
  --passL:"-fsanitize=address" --passL:"-fsanitize=undefined" \
  --outdir:build -r --verbosity:0 --hints:off --warnings:off \
  --passC:"-g" benchmarks/bench_all.nim
```

arnetheduck commented 3 years ago

So, to move forward with this, I'd go with splitting out the batch verification into a separate PR (it looks like a completely orthogonal concern), then we can rework nbc to work with batch verification; once that's done, we can proceed with the rest.

I guess it's not possible to do batch deserialization as well?

mratsim commented 3 years ago

On splitting the BatchVerifier into a separate PR, this can be done, however:

mratsim commented 3 years ago

Benchmark on a Pi4 running 32-bit Raspbian: this shows that with a low core count (4 cores) we already get a 4x speedup on just 6 pubkeys/messages.

mratsim commented 3 years ago

Benchmarks on x86, raw, with no attempt to tune the partitioning depending on the ratio of work to cores.

Investigating the slowdown with only 6 signatures: it seems that if we set the number of threads to the number of physical cores (18 here, versus 36 logical threads due to hyperthreading), there is no slowdown even if there are far more threads than work. Command: OMP_NUM_THREADS=18 nimble bench

My suspicion is that hyperthreading is hurting performance here. On the other hand, if there is more work than threads, hyperthreading is beneficial.

mratsim commented 3 years ago

Actually, I was repeating the benchmark only 1 time (an artifact from a debugging session), so the measured time was unreliable; there is no need for tuning.