paritytech / polkadot

Polkadot Node Implementation
GNU General Public License v3.0
7.12k stars 1.58k forks source link

Slow dispute coordinator #5359

Closed eskimor closed 2 years ago

eskimor commented 2 years ago

As can be seen on ToF boards, the dispute coordinator is quite slow (>5s for handling some messages).

Investigation:

Amount of messages

On versi we do have a message rate in the ballpark of 60 to <100/s.

Interestingly, we handle them at approximately the same rate, which results in the channel almost never getting filled, better shown here.

So where do those long ToFs come from? Can they even be correct?

sandreim commented 2 years ago

Interestingly, we handle them at approximately the same rate, which results in the channel almost never getting filled, better shown here.

I think we can't be really sure how often it gets fully filled, as the metrics are sampled once every 30s. The rate you mention is averaged over 1 hour, so it can really be bursty and we just never see that because of the same polling interval .

If we zoom in we can see it can go to 150/s (2m average).

Screenshot 2022-04-21 at 18 52 22
eskimor commented 2 years ago

Yeah, I checked shorter polling intervals as well, in any case receives seem to keep up with sends, which is interesting given the rate and the ToF - adding some metrics.

eskimor commented 2 years ago

Some performance debugging: https://github.com/paritytech/polkadot/issues/4793

eskimor commented 2 years ago

Summary: db access is most likely the culprit, both read & write.

eskimor commented 2 years ago

One strategy I was thinking of, was batch vote imports together. Especially in approval voting there are many votes for the same candidate. If we collect votes per candidate before doing the actual import for those collected votes, instead of each one individually, that could speed things up by a factor (almost 30).

eskimor commented 2 years ago

Longer term, we should also see whether this preemptive vote import in the current form is really necessary. A node could for example only keep track of its own votes in the normal case and only distribute them to other nodes via dispute-distribution in case of an actual dispute. Like instead of doing nothing if already voted in backing/approval - just send your backing/approval vote and participate in dispute-distribution.

We might or might not keep the vote import from chain (chain scraping) as this should be rather efficient.

I specifically marked this longer term, as this is no change to be done lightly as there are some details and security implications that must be taken into account, but at the same time it would be nice if the disputes system won't cause any noticeable load if there are no disputes happening and definitely it should not be an obstacle in scaling.

burdges commented 2 years ago

I've forgotten how the pre-emptive import works, so can you remind me?

Are you sure batching votes together speed them up? It'd save 40% of the signature verification time if we pre-batch enough Schnorr signatures, so 45 μs becomes 30 μs, but not sure what the extra complexity costs.

We cannot afaik pre-batch disputes, which may or may not matter, but even batching valid votes costs latency, and maybe adds no-shows.

eskimor commented 2 years ago

I've forgotten how the pre-emptive import works, so can you remind me?

Both backing and approval voting are sending all votes to the dispute coordinator all the time, so it can keep track of them on disk. This way it has backing and approval votes ready when an actual dispute happens for the candidate. This is quite costly, considering that most candidates will never get disputed.

Are you sure batching votes together speed them up? It'd save 40% of the signature verification time if we pre-batch enough Schnorr signatures, so 45 μs becomes 30 μs, but not sure what the extra complexity costs.

Yes batching up will be a noticeable speedup, but it is another kind of batching than you have in mind. Right now when importing votes in the dispute coordinator (which does not involve any signature checking) we are reading all existing votes, adding the one imported vote and then writing all the votes again - this is quite costly. If we batch up, it would be like: Read all votes, add 30, write back.

We cannot afaik pre-batch disputes, which may or may not matter, but even batching valid votes costs latency, and maybe adds no-shows.

The batching only concerns the import in the dispute coordinator, so it would have no effect on no-shows.