paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Investigate slow parachains #4569

Open ggwpez opened 1 month ago

ggwpez commented 1 month ago

Discussions in the Fellowship chat did not lead to any actionable outcome. Not sure whom to ping here or what the expected behaviour is.

Polkadot parachains are not producing as many blocks as they should. You can use this script to plot the skew (but it needs adjustments per chain).
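The core of the skew calculation is roughly the following; this is a hypothetical reconstruction of what the script does (the input format, names and the 12s target are assumptions, not the actual code):

```python
# Sketch of the clock-skew calculation: compare each block's timestamp
# against an ideal clock ticking at the target block time.
# (Hypothetical input: a list of per-block timestamps in seconds.)

TARGET_BLOCK_TIME = 12  # seconds for parachains; 6 for the relay chain

def cumulative_skew(timestamps, target=TARGET_BLOCK_TIME):
    """Per block, how far the chain lags behind an ideal clock (positive = late)."""
    start = timestamps[0]
    return [ts - (start + i * target) for i, ts in enumerate(timestamps)]

# Plotting the result (e.g. with matplotlib) gives the kind of skew chart shown below.
```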

The relay itself is fine: [screenshot: relay chain clock skew plot, 2024-05-22]

But PAH is missing 30 days over the year: [screenshot: Polkadot Asset Hub clock skew plot, 2024-05-22]

And Collectives: [screenshot: Collectives clock skew plot, 2024-05-22]

bkchr commented 1 month ago

I think we are still not at rock-stable 12s blocks for parachains, which is what this plot also reveals. @sandreim are you monitoring this as well?

sandreim commented 1 month ago

Yes indeed: https://grafana.teleport.parity.io/goto/TA1mEjPIR?orgId=1 (some older data). Currently the introspector deployment is down; it will be fixed ASAP tomorrow. (CC @AndreiEres)

[screenshot from 2024-05-26 20:06]
sandreim commented 1 month ago

Regarding why this is happening, AFAIR parachains do not always get their candidates backed (collations not reaching the backing group fast enough, or at all), but I expect this to be much better once they switch to async backing.

sandreim commented 1 month ago

Here is one that did, but yeah, not rock solid. I think we should invest some time into understanding what is causing them to miss slots:

[screenshot from 2024-05-26 20:13]
alexggh commented 1 month ago

But PAH is missing 30 days over the year:

Using such a long period is tricky, because you have to exclude all past (already solved) incidents on both Polkadot and the collator set, which also contribute to this skew.

I think we are still not at rock-stable 12s blocks for parachains, which is what this plot also reveals.

Running on a network fully under our control, where validators and collators have perfect connectivity, gets us to almost stable 12s and 6s (almost, because availability cores are always cleared at session boundaries, so you are going to miss a slot there).

Our real networks are not in that perfect state, so blocks could be missed for a variety of reasons (not saying all of these actually happen):

  1. Collators did not advertise their block in time.
  2. Collators can't connect to the validators in their backing group. (For example, if I run https://github.com/lexnv/subp2p-explorer I am only able to reach around 86% of the validator set; collators connecting to validators might have the same success rate.)
    Discovered 287/331 authorities
  3. Valid statements weren't gossiped in time and did not reach the block author.
  4. One or more collators not producing blocks for whatever reason (they are behind or have some other issue).
  5. Unknown or known bugs (https://github.com/paritytech/polkadot-sdk/issues/3590).

Regarding why this is happening, AFAIR parachains do not always get their candidates backed (collations not reaching the backing group fast enough, or at all), but I expect this to be much better once they switch to async backing.

Yes, switching to async backing should alleviate some of the problems above, but I don't think all of them.

I think we should invest some time into understanding what is causing them to miss slots

I will try to allocate some time in the following weeks, but I wouldn't expect a quick resolution on this one.

KarimJedda commented 1 week ago

Some more context/data for this, hoping it's helpful:

Polkadot BridgeHub

[screenshot from 2024-06-27: Polkadot BridgeHub block times]

Took 24 seconds: https://bridgehub-polkadot.subscan.io/block/2883345
Took 18 seconds: https://bridgehub-polkadot.subscan.io/block/2883344

Kusama

[screenshot from 2024-06-27: Kusama block times]

Took 12 seconds: https://kusama.subscan.io/block/23792945

alexggh commented 1 week ago

Thank you @KarimJedda. I spent some time on this one as well. The approach I took was a modified version of the script presented in https://github.com/paritytech/polkadot-sdk/issues/4569#issue-2315380369: I looked at all blocks that Parity collators built for proposal on polkadot-asset-hub and checked two invariants, that they were included in the relay chain and that they were built no more than 12s after their parent. From around 5000 blocks produced between the 19th and 23rd of June, I found the following failure modes, in order of impact/number of encounters (the invariant check is sketched below).
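For illustration, the invariant check boils down to something like the sketch below; the ParaBlock structure and its fields are hypothetical placeholders for data that first has to be collected from the chain, not the actual script:

```python
# Sketch of the invariant check over collator-built blocks, assuming the
# per-block timestamps and relay-chain inclusion status were collected
# beforehand. The ParaBlock structure and field names are hypothetical.
from dataclasses import dataclass

EXPECTED_BLOCK_TIME = 12  # seconds under sync backing

@dataclass
class ParaBlock:
    hash: str
    timestamp: int         # block timestamp, seconds
    parent_timestamp: int  # parent block timestamp, seconds
    included: bool         # whether it was seen as included on the relay chain

def check_invariants(blocks):
    """Split the blocks into the two failure modes described above."""
    not_included = [b for b in blocks if not b.included]
    late = [b for b in blocks
            if b.timestamp - b.parent_timestamp > EXPECTED_BLOCK_TIME]
    return not_included, late

# not_included, late = check_invariants(collator_built_blocks)
# print(f"{100 * len(late) / len(collator_built_blocks):.2f}% built late")
```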

  1. In 1.6% (80) of cases the difference between the block built by a Parity collator and its parent was 18s instead of 12s. This is happening because of two reasons:
    • 1a) In 0.94% (57) of cases the parent parachain block misses a slot at backing, and because our collators wait for the parent to be backed + included on the relay chain, we lose an extra 6s. With the way our sync backing protocol is designed this is expected to happen occasionally; when the chain is migrated to async backing I think this will be mitigated. Some parachain blocks where this is happening: 0x88521946d587b2ffbdcf12ce9aad07d480f6f0d6efbf32fb1ad908b8ecb0b82d, 0x0441f65d8aa11632bb2139cdc1c1ad320623e23b19772d842ba38e1e23698472 and 0x2170af48b1ca8c499a04169bc624744a2e0d82e22de2667ae555f1f0d5fa7c63

If you add all those numbers up, at least for this snapshot the average block time for Parity's collators on polkadot-asset-hub is 12.12s. That's just our collators; the other collators might not be as well behaved, but it does correlate roughly with the polkadot-asset-hub clock skew over the past 10 days. [screenshot: Asset Hub clock skew, 15-25 June]

Conclusions

Looking at all the data I have available, I don't think we have any serious problem on Polkadot. With sync backing, a perfect 12s block time is something you only hit under perfect conditions (validators, collators and network); the ~2% delta we see is explained by the fact that a real network is not perfect. This will be slightly mitigated by the switch to async backing, but even there I don't think a perfect 6s will be achieved, although there is a bit more redundancy.

Some thoughts

Moving forward, at these margins (single-digit divergence from the maximum theoretical performance), I think a closer monitoring system needs to be implemented, so that misbehaving validators/collators can always be found and notified, or kicked out of the active set. It would be great if we had some programmable tooling that could answer the following questions:

  1. Which validator should have produced the missing relay chain block?
  2. Which collator should have produced the missing parachain block? (See the sketch after this list.)
  3. Stats about the performance of each validator and collator on these dimensions and more.
  4. What are the P2P/DHT addresses of the misbehaving validators/collators?
  5. Who are the backing validators for a given parachain block?
  6. Where on the link collator -> backing validators -> block author does the parachain block end up being delayed or skipped altogether?
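For question 2, a minimal sketch, assuming the parachain uses a plain round-robin Aura schedule (author index = slot modulo the number of authorities) and that the session's authority list and the slot duration are already known; the helper and its inputs are illustrative only:

```python
# Hypothetical helper: which collator should have authored the slot that
# covers a given (missed) timestamp, under a plain round-robin Aura schedule.

SLOT_DURATION = 12  # seconds under sync backing; 6 once async backing lands

def expected_author(missed_timestamp, authorities, slot_duration=SLOT_DURATION):
    """Map a wall-clock timestamp to its Aura slot and the expected author."""
    slot = missed_timestamp // slot_duration
    return slot, authorities[slot % len(authorities)]

# Example (hypothetical authority list):
# slot, who = expected_author(1719487200, session_authorities)
```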

All of this data exists either on-chain, in the logs, or in the DHT, but it is not aggregated in a way that lets you easily query it to get a full view of the network.

For example, https://apps.turboflakes.io/?chain=polkadot#/insights is a good start towards what I'm talking about, but it lacks programmability and is missing a lot of the information needed for the above questions.

burdges commented 1 week ago

Collators can't connect to the validators in their backing group. (For example, if I run https://github.com/lexnv/subp2p-explorer I am only able to reach around 86% of the validator set; collators connecting to validators might have the same success rate.)

Any idea why this happened?

It's likely prioritization on the validator side? Any idea if all the validators are talking to one another though? 287/331 is way above what we'd notice in availability, but maybe we could have some query that says "validator x never puts validator y into its availability bitfields". Over a long enough time window that would say x cannot really talk to y, or at least their connection is very slow.
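A very rough sketch of what such a query could look like, once bitfields and backing groups have been joined per relay-chain block; the input shape here is assumed for illustration, not an existing API:

```python
# Count, per (observer, backer) pair, how often the observer's signed
# availability bitfield covers a core whose candidate was backed by that
# backer. Pairs whose ratio stays near zero over a long window suggest the
# observer cannot really talk to (or fetch chunks near) that backer.
# Hypothetical input: (observer, backer, bit_set) tuples assembled from
# on-chain bitfields and backing data.
from collections import defaultdict

def bitfield_coverage(observations):
    seen = defaultdict(int)
    covered = defaultdict(int)
    for observer, backer, bit_set in observations:
        seen[(observer, backer)] += 1
        if bit_set:
            covered[(observer, backer)] += 1
    return {pair: covered[pair] / n for pair, n in seen.items()}

# Pairs with a ratio close to 0 over thousands of blocks are suspicious.
```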

alexggh commented 1 week ago

Collators can't connect to the validators in their backing group. (For example, if I run https://github.com/lexnv/subp2p-explorer I am only able to reach around 86% of the validator set; collators connecting to validators might have the same success rate.)

Any idea why this happened?

It's likely prioritization on the validator side? Any idea if all the validators are talking to one another though? 287/331 is way above what we'd notice in availability.

When I wrote that comment it was just a theory, but looking at Parity's collators for the blocks in https://github.com/paritytech/polkadot-sdk/issues/4569#issuecomment-2194842358, there doesn't seem to be any connectivity issue; candidates are always advertised. Given that the backing group size is 5 on Polkadot, I think we've got enough redundancy to alleviate any problems there.

Connecting to 287 out of 331 authorities is what subp2p-explorer running on my local machine reports. The current theory I discussed with @lexnv is that the unreachable ones are behind a NAT/firewall and can't be reached from the outside. The good thing is that this does not mean they are not connected to other validators at all, because normally at each session change any two validators A & B will each try to connect to the other, so even if A can't reach B, B will be able to reach A.

Prioritization could be an explanation as well.

"validator x never puts validator y into its availability bitfields" Over a long enojugh time would say x cannot really talk to y, or at least their connection is very slow..

Yeah, that's one of those things we should have tooling for and monitor if we want to reach the maximum parachain block throughput.

bkchr commented 1 week ago

notified them or kicked them out of the active set,

If you cannot prove that they have done any misbehavior, they will not get kicked out of the set. I mean, you can create social pressure by having a list of shame, but that's it.

alexggh commented 1 week ago

notified them or kicked them out of the active set,

If you cannot prove that they have done any misbehavior, they will not get kicked out of the set. I mean, you can create social pressure by having a list of shame, but that's it.

Yes, that's what I was thinking when I said kick them out. At the end of the day our protocols are designed to keep working with up to 1/3 of validators not doing their job, or doing it badly, but that does not mean the network will run at full throttle. Ideally, the points system would be hooked into all parts of our protocols, and under-performing validators would be kicked out by not getting rewards and consequently not being nominated anymore. But even with that, I still think we need automated tooling to pinpoint problems with the health of the network, so that we can help honest validators fix their nodes.

burdges commented 1 week ago

Just FYI, there is an availability tit-for-tat game in my rewards proposal which would actually reduce rewards for nodes behind NATs, so that's quite nice.