paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

`approval-voting`: implement parallel processing of signature checks. #731

Open sandreim opened 1 year ago

sandreim commented 1 year ago

I've been experimenting with using a thread pool to handle VRF signature checks, which appear to be the most expensive operation we do in approval voting. After running some benchmarks, I got these results on an AMD EPYC 7601 32-Core Processor:

check/no-pool           time:   [208.94 ms 209.10 ms 209.31 ms]
                        thrpt:  [4.7777 Kelem/s 4.7825 Kelem/s 4.7861 Kelem/s]
check/pool_size_1       time:   [267.76 ms 271.14 ms 276.34 ms]
                        thrpt:  [3.6187 Kelem/s 3.6881 Kelem/s 3.7346 Kelem/s]
check/pool_size_2       time:   [162.28 ms 163.93 ms 165.20 ms]
                        thrpt:  [6.0532 Kelem/s 6.1001 Kelem/s 6.1621 Kelem/s]
check/pool_size_4       time:   [111.01 ms 112.44 ms 113.99 ms]
                        thrpt:  [8.7728 Kelem/s 8.8934 Kelem/s 9.0084 Kelem/s]
check/pool_size_8       time:   [84.792 ms 85.514 ms 85.961 ms]
                        thrpt:  [11.633 Kelem/s 11.694 Kelem/s 11.794 Kelem/s]

I expect this change to work very well with https://github.com/paritytech/polkadot-sdk/issues/732 because it will allow us to multiplex all the CPU intensive work of the subsystem to multiple CPU cores, improving our current single threaded design.

Important note: The number of blocking threads used needs to be bounded, and we would also need an upper queue limit at which we apply backpressure.

burdges commented 1 year ago

We do have batch verification for VRFs in https://github.com/w3f/schnorrkel/blob/master/src/vrf.rs#L536, which likely saves 40% and works across multiple signers, but slightly increases gossip overhead by 32 bytes per message. I even have an unimplemented variant that avoids this 32-byte overhead.

We could merge all the tranche zero VRFs by the same signer too. We've two options:

  1. We keep the individual outputs for RelayVrfModulo but produce only one signature/proof using https://github.com/w3f/schnorrkel/blob/master/src/vrf.rs#L433. This permits authorities to semi-secretly refuse assignments, but each additional assignment has some marginal cost; savings of maybe 75%. These marginal costs do not stack with batch verification.
  2. We just do only one RelayVrfModulo per authority and derive multiple assignments from the output. We'll need an explicit bitfield if we want authorities to refuse some assignments, but this saves us maybe 95%, and this stacks with batch verification.

I doubt being secretive about refused assignments matters much. I doubt either 1 or 2 helps RelayVrfDelay much, but we should tune parameters so that RelayVrfModulo represents maybe 90% or 85% of assignments. Batch verification helps RelayVrfDelay just fine.

All told, we should save over 80% by doing 2, double checking parameters, and maybe doing batch verifications.

sandreim commented 1 year ago

This issue (as well as https://github.com/paritytech/polkadot-sdk/issues/732) is focused on improving performance from an engineering point of view, i.e. solving the bottleneck of a single-threaded approach to processing the distribution and import of assignments and votes.

> We do have batch verification for VRFs in https://github.com/w3f/schnorrkel/blob/master/src/vrf.rs#L536, which likely saves 40% and works across multiple signers, but slightly increases gossip overhead by 32 bytes per message. I even have an unimplemented variant that avoids this 32-byte overhead.

IIUC in our case we have a single signer, and this would mean that we could batch its own RelayVrfDelay assignments for the same tranche (different candidates). Is my understanding correct?

> We could merge all the tranche zero VRFs by the same signer too. We've two options:
>
>   1. We keep the individual outputs for RelayVrfModulo but produce only one signature/proof using https://github.com/w3f/schnorrkel/blob/master/src/vrf.rs#L433. This permits authorities to semi-secretly refuse assignments, but each additional assignment has some marginal cost; savings of maybe 75%. These marginal costs do not stack with batch verification.
>   2. We just do only one RelayVrfModulo per authority and derive multiple assignments from the output. We'll need an explicit bitfield if we want authorities to refuse some assignments, but this saves us maybe 95%, and this stacks with batch verification.
>
> I doubt being secretive about refused assignments matters much. I doubt either 1 or 2 helps RelayVrfDelay much, but we should tune parameters so that RelayVrfModulo represents maybe 90% or 85% of assignments. Batch verification helps RelayVrfDelay just fine.
>
> All told, we should save over 80% by doing 2, double checking parameters, and maybe doing batch verifications.

Option 2 sounds very good to me, but I am not a cryptography guy. Can you detail the pros and cons of having RelayVrfModulo represent 85% of assignments in tranche 0?

I will create a ticket for further discussion of these improvements.

burdges commented 1 year ago

Answered in the other thread.

sandreim commented 1 year ago

FWIW, we could go even further by sharding the state and input by (BlockHash, CandidateIndex) and having 2-4 workers that truly work in parallel when importing assignments/votes. We would need to query each worker to be able to respond to GetApprovalSignaturesForCandidate and ApprovedAncestor subsystem messages.

rphmeier commented 1 year ago

Yeah, I think something along those lines is possible. I don't remember all the details, but I think candidates have to be specifically approved under each fork, right? If so, we can shard by (BlockHash, CandidateIndex) without any issues, except contended DB access.

sandreim commented 1 year ago

Since assignments can also claim more than one candidate, (BlockHash, ValidatorIndex) makes more sense as the sharding key. Yes, we track approval of candidates across forks. The DB is structured into BlockEntries and CandidateEntries containing ApprovalEntry. To handle multiple core assignments (as of https://github.com/paritytech/polkadot/pull/6782), assignments from the same validator are duplicated in all the CandidateEntries they claim, so we cannot really shard these per candidate. IMO it should be easy for each worker to have its own DB storage, so I assume there is no additional contention.

I expect more latency when handling ApprovedAncestor and GetApprovalSignaturesForCandidate messages, but we would just use unbounded sends to the workers to prioritise these queries over imports.

burdges commented 1 year ago

We approve candidates under each fork because assignment VRFs are seeded by relay chain VRFs. We could, however, move assignments and votes across forks when relay chain block producers equivocate, which may be useful.

You might've bigger fish to fry after you merge the tranche 0 assignments, but conversely all those delay assignments add up quickly whenever many no-shows happen.

At a high level, we process gossip messages containing assignments and votes, which result in database writes and deduplication checks, and then our approvals loop reads this database. AFAIK we should not spend too much time in the approvals loop itself, so assignment VRF signatures could be checked by workers who then push valid assignments into a queue for insertion into the database. At the extreme this could be made non-blocking, no?