paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

Avoid immediately writing assignments and approvals to disk in approval-voting #841

Open rphmeier opened 2 years ago

rphmeier commented 2 years ago

in the vein of: https://github.com/paritytech/polkadot/issues/3437

[image: approval-distribution message processing times]

We can see empirically that approval-distribution takes around 100-150ms to process a significant proportion of its messages. It processes messages one at a time. Duplicate or irrelevant messages are handled quickly, but new unique messages are sent to the approval-voting subsystem to be imported. This import process blocks on writing to the underlying database.

We could substantially improve the performance of the approval-distribution gossip subsystem by ensuring that importing assignments and approvals only ever writes to memory, not disk. We can introduce a buffer and a background flushing task which writes to disk every 10-15 seconds as necessary. We should ensure that locally-generated assignments and approvals are fully written to disk before distribution, and we should aim to flush to disk on shutdown.
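As a rough illustration only, here is a minimal sketch of that write path, assuming a hypothetical blocking `Database` handle and illustrative types rather than the real approval-voting backend: gossiped imports only touch an in-memory buffer, a background task flushes it periodically, and locally-generated messages force a flush before distribution.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Placeholder for a batch of DB operations (assignments/approvals to persist).
type WriteOps = Vec<Vec<u8>>;

/// Hypothetical blocking database handle (stand-in, not the real backend).
trait Database: Send + Sync + 'static {
    fn write(&self, ops: WriteOps);
}

struct BufferedBackend<D: Database> {
    db: Arc<D>,
    buffer: Arc<Mutex<WriteOps>>,
}

impl<D: Database> BufferedBackend<D> {
    fn new(db: Arc<D>, flush_interval: Duration) -> Self {
        let buffer: Arc<Mutex<WriteOps>> = Arc::new(Mutex::new(Vec::new()));

        // Background flushing task: every `flush_interval`, move the buffered
        // ops out and write them to disk in a single transaction.
        {
            let buffer = Arc::clone(&buffer);
            let db = Arc::clone(&db);
            thread::spawn(move || loop {
                thread::sleep(flush_interval);
                let ops = std::mem::take(&mut *buffer.lock().unwrap());
                if !ops.is_empty() {
                    db.write(ops);
                }
            });
        }

        BufferedBackend { db, buffer }
    }

    /// Fast path for gossiped assignments/approvals: memory only, no disk I/O.
    fn import(&self, op: Vec<u8>) {
        self.buffer.lock().unwrap().push(op);
    }

    /// For locally-generated assignments/approvals: persist before distributing.
    fn import_and_flush(&self, op: Vec<u8>) {
        let mut buf = self.buffer.lock().unwrap();
        buf.push(op);
        let ops = std::mem::take(&mut *buf);
        self.db.write(ops);
    }
}
```

A real implementation would additionally need to flush on shutdown and handle crash recovery (re-deriving or re-receiving anything that was only in memory), which the sketch omits.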

rphmeier commented 2 years ago

@sandreim tried out a basic change in paritytech/polkadot#5235 to only flush once every 1000 assignments/approvals are written and it didn't help much. That seems to disprove the original thesis that we were getting backpressured by DB flushes on every message.

So where do these high ToFs (message time-of-flight in the subsystem's bounded queue) come from?

sandreim commented 2 years ago

> @sandreim tried out a basic change in paritytech/polkadot#5235 to only flush once every 1000 assignments/approvals are written and it didn't help much. That seems to disprove the original thesis that we were getting backpressured by DB flushes on every message.

The change I tested (https://github.com/paritytech/polkadot/pull/5236/commits/6d29474d4dc67610c168bd1e0e028670920a0a19) improved the db transaction metrics, but had no overall effect on the ToFs.
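For reference, the count-based flushing tested there amounts to roughly the following pattern (a sketch with hypothetical names, not the actual diff): accumulate write ops in memory and only hit the database once a threshold of buffered assignments/approvals is reached, instead of on every import.

```rust
struct CountingFlusher {
    pending: Vec<Vec<u8>>, // serialized assignments/approvals awaiting persistence
    threshold: usize,      // e.g. 1000, as in the experiment above
}

impl CountingFlusher {
    fn import(&mut self, op: Vec<u8>, write_to_db: impl FnOnce(Vec<Vec<u8>>)) {
        self.pending.push(op);
        if self.pending.len() >= self.threshold {
            // One database transaction per `threshold` imports.
            write_to_db(std::mem::take(&mut self.pending));
        }
    }
}
```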

> So where do these high ToFs come from?

I just checked how many messages are in the approval distribution bounded channel, and it looks pretty bad across all validators:

[Screenshot 2022-04-14 at 18:44:23]

sandreim commented 2 years ago

I don't really think these tell the full story, as we only see the values at each metric scrape, so anything bursty that happens in between is never visible here. It might be even worse...

sandreim commented 2 years ago

Similar investigation for dispute-coordinator: https://github.com/paritytech/polkadot/issues/5359

alexggh commented 1 year ago

Since this is an old issue, I did a bit of digging on versi_v1_10, and it seems the problem persists. This is how the ToF looks for the approval-distribution bounded queue:

[Screenshot 2023-05-25 at 18:42:31]

I dug a bit into the metrics and indeed it seems there are situations where approval-distribution takes longer because it is waiting on the approval-voting subsystem, see polkadot_parachain_time_awaiting_approval_voting:

[chart: awaiting_approval_voting_bucket]

And it seems to be correlated with approval-voting flushing its db operations, polkadot_parachain_time_approval_db_transaction. See how the histogram values look pretty much the same:

[chart: polkadot_parachain_time_approval_db_transaction]

So, I think it is worth retrying the idea of moving the db flushes out of the main loop and seeing whether that optimization has any impact.

sandreim commented 1 year ago

Interesting, there are samples above 1s. How many of them are there? (It's hard to tell from the color coding.) One difference compared to the last investigation is the size of Versi.

burdges commented 1 year ago

An mmap supposedly ensures updates wind up written eventually, even if you never flush manually: https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination

I've heard Windows works similarly, but I'm not sure about macOS. We have many subsystems which could exploit this, given we now assume Linux: assignments, approval votes, grandpa votes, etc.
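A tiny sketch of the behaviour described above, using the third-party `memmap2` crate purely for illustration (nothing the node currently uses is implied): writes through a shared file mapping are picked up by the kernel's dirty pages and written back eventually, even if the process never calls flush/msync.

```rust
use std::fs::OpenOptions;

use memmap2::MmapOptions;

fn main() -> std::io::Result<()> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("approvals.bin")?;
    file.set_len(4096)?;

    // Safety: the file must not be truncated or resized by others while mapped.
    let mut map = unsafe { MmapOptions::new().map_mut(&file)? };

    // Write through memory; note there is no explicit `map.flush()` (msync) here.
    map[..5].copy_from_slice(b"hello");

    // On Linux the dirty pages are written back by the kernel eventually,
    // even if the process exits before flushing (per the linked discussion).
    Ok(())
}
```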

alexggh commented 1 year ago

> Interesting, there are samples above 1s. How many of them are there? (It's hard to tell from the color coding.) One difference compared to the last investigation is the size of Versi.

There are just a few of them; it might be that some individual machines are slower. I'm trying to figure that out from the dashboards.

> An mmap supposedly ensures updates wind up written eventually, even if you never flush manually: https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination

I haven't dug into it, but that seems to depend on how our database is implemented. I think we first need to profile where we spend most of our time in approval-voting. Flushing is one of the possible culprits, but it might be that even the reads are slowish (and I see a lot of reads happening on the paths where we process a message from the approval-distribution subsystem).

burdges commented 1 year ago

We do reread records of all approval announcements and votes for every unapproved candidate every second (well, currently every 1/2 second), but doing so should not involve any cryptography, not even hashing, just checking in-memory data structures. In fact, there is no reason for an authenticated db here anyways, so no hashing to maintain a root hash or whatever.

I suppose mmap always requires working directly with serialized structures, like using offsets instead of real pointers or whatever, so yeah that's some complexity.

alexggh commented 1 year ago

Looking a bit more closely, it seems we actually go to the database for reads every time an event is received in approval-voting. The reason for that is, I think, a bug/unintended consequence of our usage of OverlayedBackend here.

So, because we instantiate a fresh OverlayedBackend on every iteration of the main processing loop, we never hit the caching strategy from here.
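To make the effect concrete, here is a much-simplified sketch (hypothetical `Backend`/`OverlayCache` types, not the real `OverlayedBackend`) of why re-creating the overlay on each loop iteration defeats its read cache:

```rust
use std::collections::HashMap;

/// Hypothetical persistent backend; `load` goes to disk.
trait Backend {
    fn load(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Simplified stand-in for the overlay: caches reads in memory for its lifetime.
struct OverlayCache<'a, B: Backend> {
    inner: &'a B,
    cached: HashMap<Vec<u8>, Option<Vec<u8>>>,
}

impl<'a, B: Backend> OverlayCache<'a, B> {
    fn new(inner: &'a B) -> Self {
        Self { inner, cached: HashMap::new() }
    }

    fn load(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        if let Some(cached) = self.cached.get(key) {
            return cached.clone();
        }
        let value = self.inner.load(key);
        self.cached.insert(key.to_vec(), value.clone());
        value
    }
}

fn main_loop<B: Backend>(db: &B, events: impl Iterator<Item = Vec<u8>>) {
    for event in events {
        // A fresh overlay per iteration means `cached` is always empty here,
        // so every read below reaches the underlying database.
        let mut overlay = OverlayCache::new(db);
        let _entry = overlay.load(&event);
    }
}
```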

sandreim commented 1 year ago

Yeah, but likely this is still cached in memory at the DB/page cache level.

alexggh commented 1 year ago

Something weird has been happening in Versi over the past 6h: I see some high ToF for the approval-distribution subsystem, but without it being correlated with high ToF for the approval-voting subsystem. Any theories why that might happen?

I would have expected to see high ToF for the approval-voting subsystem as well, if that's what slows it down.

Looking at a single node, this is how approval-distribution looks:

[Screenshot 2023-06-02 at 14:10:37]

This is how approval-voting looks:

[Screenshot 2023-06-02 at 14:11:15]

Additionally, looking at the tracing for importing assignments and approvals, the time both operations take is on the order of microseconds.

So, I'm starting to think that the slow-down we are seeing in approval-distribution is not triggered by approval-voting being slow, but by some other part of the logic, potentially the network-bridge-tx subsystem when we run send_assignments_batched & send_approvals_batched.

Just looking at the rate of messages sent to each of the subsystems: approval-distribution seems to be receiving a steady flow of 3k messages per second, and approval-voting around 500, so there is a delta of 2.5k messages per second which don't have to go to approval-voting, and that might be where the slow-down manifests itself.

[Screenshot 2023-06-02 at 15:33:56]

Any thoughts/metrics I could look at to validate these assumptions?