rphmeier opened this issue 2 years ago
@sandreim tried out a basic change in paritytech/polkadot#5235 to only flush once every 1000 assignments/approvals are written and it didn't help much. That seems to disprove the original thesis that we were getting backpressured by DB flushes on every message.
So where do these high ToFs come from?
The change I tested (https://github.com/paritytech/polkadot/pull/5236/commits/6d29474d4dc67610c168bd1e0e028670920a0a19) improved the db transaction metrics, but had no overall effect on the ToFs.
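For reference, a minimal sketch of the batched-flush idea tested there, with made-up `Backend`/`Transaction` types standing in for the real approval-voting database interface:

```rust
/// Hypothetical write-side wrapper: stage every transaction, but only pay the
/// flush cost once every `flush_threshold` writes (e.g. 1000 assignments/approvals).
struct BatchedBackend<B: Backend> {
    inner: B,
    buffered: usize,
    flush_threshold: usize,
}

impl<B: Backend> BatchedBackend<B> {
    fn write(&mut self, tx: Transaction) {
        self.inner.stage(tx);
        self.buffered += 1;
        if self.buffered >= self.flush_threshold {
            self.inner.flush();
            self.buffered = 0;
        }
    }
}

// Stand-ins for the real database interface.
trait Backend {
    fn stage(&mut self, tx: Transaction);
    fn flush(&mut self);
}
struct Transaction;

// Trivial in-memory backend just so the sketch runs.
struct MemBackend { staged: usize, flushes: usize }
impl Backend for MemBackend {
    fn stage(&mut self, _tx: Transaction) { self.staged += 1; }
    fn flush(&mut self) { self.flushes += 1; }
}

fn main() {
    let mut db = BatchedBackend {
        inner: MemBackend { staged: 0, flushes: 0 },
        buffered: 0,
        flush_threshold: 1000,
    };
    for _ in 0..2500 {
        db.write(Transaction);
    }
    // 2500 writes, threshold 1000 => only 2 flushes instead of 2500.
    println!("flushes: {}", db.inner.flushes);
}
```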
I just checked how many messages are in the approval-distribution bounded channel, and it looks pretty bad across all validators:
I don't really think these tell the full story, as we only see the values at each metric scrape, so anything bursty that goes on in between is never seen here. It might be even worse...
Similar investigation for dispute-coordinator: https://github.com/paritytech/polkadot/issues/5359
Being an old issue, I did a bit of digging on versi_v1_10; it seems the problem persists. This is how the ToF looks for the approval-distribution bounded queue:
I dug a bit into the metrics, and indeed it seems there are situations where approval-distribution takes longer because it is waiting on the approval-voting subsystem, see polkadot_parachain_time_awaiting_approval_voting.
And it seems to be correlated with approval-voting flushing its db operations, polkadot_parachain_time_approval_db_transaction. See how the histogram values look pretty much the same.
So, I think it is worth retrying to move the db flushes out of the main loop and seeing whether that optimization has any impact.
Interesting, there are samples above 1s. How many of them are there? (It's hard to tell from the color coding.) One difference compared to the last investigation is the size of Versi.
An mmap supposedly ensures updates wind up written eventually, even if you never flush manually: https://stackoverflow.com/questions/5902629/mmap-msync-and-linux-process-termination
I've heard Windows works similarly, but I'm not sure about macOS. We have many subsystems which could exploit this, given we now assume Linux: assignments, approval votes, grandpa votes, etc.
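A minimal sketch of relying on that behaviour, assuming the memmap2 crate (this is not how the node's database works today, just an illustration):

```rust
use memmap2::MmapMut;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    // A fixed-size backing file for the mapping.
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("approvals.bin")?;
    file.set_len(4096)?;

    // Map it read-write. Dirty pages are written back by the kernel
    // eventually, even if we never call flush()/msync() ourselves (the
    // behaviour described in the StackOverflow answer above).
    let mut map = unsafe { MmapMut::map_mut(&file)? };
    map[..4].copy_from_slice(&1u32.to_le_bytes());

    // An explicit flush is only needed when durability must be guaranteed at
    // a specific point, e.g. before distributing locally-generated assignments.
    map.flush()?;
    Ok(())
}
```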
There are just a few samples above 1s; it might be that some individual machines are simply slower. I'm trying to figure that out from the dashboards.
I haven't dug into the mmap approach, but that seems to depend on how our database is implemented. I think we first need to profile where we spend most of our time in approval-voting; flushing is one of the possible culprits, but it might be that even the reads are slowish (and I see a lot of reads happening on the paths where we process a message from the approval-distribution subsystem).
We do reread records of all approval announcements and votes for every unapproved candidate every second (well, currently every 1/2 second), but doing so should not involve any cryptography, not even hashing, just checking in-memory data structures. In fact, there is no reason for an authenticated db here anyway, so no hashing to maintain a root hash or whatever.
I suppose mmap always requires working directly with serialized structures, like using offsets instead of real pointers or whatever, so yeah that's some complexity.
Looked a bit more closely, and it seems we actually go to the database for reads every time an event is received in approval-voting; the reason for that is, I think, a bug/unintended consequence of our usage of OverlayedBackend here.
So, because we instantiate a new OverlayedBackend on every iteration of the main processing loop, we never hit the caching strategy from here.
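To illustrate the pattern with simplified, hypothetical types (not the actual approval-voting code): if the overlay that carries the read cache is recreated for every incoming message, every read falls through to the database again.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real database and overlay types.
struct Db;
impl Db {
    fn read(&self, _key: &str) -> Option<Vec<u8>> {
        // Imagine this hitting disk (or at best the OS page cache).
        None
    }
}

/// Read-through cache over the database, similar in spirit to an OverlayedBackend.
struct Overlay<'a> {
    db: &'a Db,
    cache: HashMap<String, Vec<u8>>,
}

impl<'a> Overlay<'a> {
    fn new(db: &'a Db) -> Self {
        Self { db, cache: HashMap::new() }
    }

    fn get(&mut self, key: &str) -> Option<Vec<u8>> {
        if let Some(v) = self.cache.get(key) {
            return Some(v.clone()); // only hit if the overlay lives long enough
        }
        let v = self.db.read(key);
        if let Some(v) = &v {
            self.cache.insert(key.to_owned(), v.clone());
        }
        v
    }
}

fn main() {
    let db = Db;
    for _msg in 0..3 {
        // A fresh overlay per message: the cache starts empty on every
        // iteration, so the same keys are read from the database again.
        let mut overlay = Overlay::new(&db);
        let _block_entry = overlay.get("block_entry");
    }
}
```

A longer-lived cache kept outside the per-message overlay (or an overlay reused across iterations) would let those repeated reads hit memory instead.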
Yeah, but likely this is still cached in memory at the DB/page cache level.
Something weird has been happening on Versi in the past 6h: I see some high ToF for approval-distribution-subsystem, but without it being correlated with high ToF for approval-voting-subsystem. Any theories why that might happen? I would have expected to see high ToF for approval-voting-subsystem as well, if that's what slows it down.
Looking at a single node, this is what approval-distribution looks like.
This is what approval-voting looks like.
Additionally, looking at the tracing for importing assignments and approvals, the time both operations take is on the order of microseconds.
So, I'm starting to think that the slow-down we are seeing in approval-distribution is not triggered by approval-voting being slow, but by some other piece of logic or subsystem, potentially by network-bridge-tx-subsystem when we run send_assignments_batched & send_approvals_batched.
Just looking at the rate of messages sent to each of the subsystems, approval-distribution-subsystem seems to be receiving a steady flow of 3k messages per second, and approval-voting around 500, so there is a delta of 2.5k messages which doesn't have to go to approval-voting and which might be where the slow-down manifests itself.
Any thoughts/metrics I could look at to validate these assumptions?
in the vein of: https://github.com/paritytech/polkadot/issues/3437
We can see empirically that approval-distribution takes around 100-150ms to process a significant proportion of its messages. It processes messages one at a time. Duplicate or irrelevant messages are handled quickly, but new unique messages are sent to the approval-voting subsystem to be imported. This import process blocks on writing to the underlying database.
We could improve the performance of the approval-distribution gossip subsystem substantially by ensuring that import of assignments and approvals only ever writes to memory, not disk. We can introduce a buffer and a background flushing task which writes to disk every 10-15 seconds as necessary. We should ensure that locally-generated assignments and approvals are fully written to disk before distribution, and we should aim to flush to disk on shutdown.
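A rough sketch of that buffer-plus-background-flusher idea, using plain std threads and channels rather than the actual subsystem/database machinery (all names below are illustrative, not the real approval-voting API):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Illustrative message type; in practice this would carry the serialized
/// assignment/approval database transaction.
enum DbWrite {
    Batch(Vec<u8>),
    /// Used for locally-generated assignments/approvals that must be durable
    /// before distribution; carries an ack channel to wait on.
    FlushNow(mpsc::Sender<()>),
}

fn spawn_flusher(rx: mpsc::Receiver<DbWrite>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        let flush_interval = Duration::from_secs(10);
        let mut pending: Vec<Vec<u8>> = Vec::new();
        let mut last_flush = Instant::now();
        loop {
            match rx.recv_timeout(flush_interval) {
                Ok(DbWrite::Batch(tx)) => pending.push(tx),
                Ok(DbWrite::FlushNow(ack)) => {
                    flush_to_disk(&mut pending);
                    last_flush = Instant::now();
                    let _ = ack.send(());
                }
                Err(mpsc::RecvTimeoutError::Timeout) => {}
                // Channel closed: flush whatever is buffered and exit (shutdown path).
                Err(mpsc::RecvTimeoutError::Disconnected) => {
                    flush_to_disk(&mut pending);
                    return;
                }
            }
            if !pending.is_empty() && last_flush.elapsed() >= flush_interval {
                flush_to_disk(&mut pending);
                last_flush = Instant::now();
            }
        }
    })
}

// Placeholder for a single batched database commit of everything buffered.
fn flush_to_disk(pending: &mut Vec<Vec<u8>>) {
    pending.clear();
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = spawn_flusher(rx);
    tx.send(DbWrite::Batch(vec![1, 2, 3])).unwrap();
    drop(tx); // closing the channel triggers the final flush on shutdown
    handle.join().unwrap();
}
```

The FlushNow variant is the hook for the "locally-generated assignments and approvals must be durable before distribution" requirement: the sender would block on the ack before gossiping them.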