On our Alphanet, where we also observe the bug, we can see some important changes in the Validator <-> Collator peer system:
Looking more deeply into the connections, here are 2 of the validators (modified to have a 500 collation peer set). The surprising part is that they establish 100+ connections almost instantly and then remove them.
I've done some tests, and I suspect the nodes are getting overloaded with connections at critical times. Using an unoptimized binary makes the results worse as well. That would explain why the Kusama network is doing worse than Polkadot at the moment.
It might be related to some optimizations @eskimor has done around collation networking
Yeah, we should try a branch with those commits removed.
5b74841ea76195fb804a02c0336d05d9a2d3091b , 3ae71a552256273c6a82c3824451fb73423b2232, 7311a3aa9f83730a079b0899c34bb813dd96c4e1 all seem like viable culprits.
https://github.com/paritytech/polkadot/pull/4640/files#r805223564 Looks suspicious.
I pushed a branch reverting this fix to https://github.com/paritytech/polkadot/tree/rh-revert-connectivity-fix if you'd like to try it out.
@rphmeier Ok I'm trying it now, I'll see if this can be deployed now or will have to wait tomorrow morning
@rphmeier Can it be applied only to validators, or do the collators also need it?
@crystalin Specifically the collators need it. This should only affect collator logic.
That might be an argument against this being the underlying cause, because AFAIK it began even before the collators were upgraded.
@rphmeier Yes, it began while we were still on 0.9.13 collators. It also began precisely when Kusama validators started to use v0.9.16. Also, we can't easily upgrade all our collators to a custom version; Purestake only controls 8 collators out of 80.
We noticed increased block times since Kusama nodes upgraded to 0.9.16. We released Karura with 0.9.16 two days ago and saw block times increase further, but we are unsure whether it is related to our upgrade or to more Kusama nodes having upgraded.
Thank you @xlc. We also observed a small increase on some of our networks when the collators upgraded to support v0.9.16, but it was very small compared to the increase in block time since Kusama upgraded to v0.9.16.
(This is on Moonbeam/Polkadot, where some collators upgraded to support v0.9.16.) The spike is the upgrade; there is a small increase in the average after it.
@rphmeier I tried to look more at the charts. I confirm the connection spikes happen when the validator gets into the parachain core set. I initially thought it could have been triggered by other parachains migrating to v0.9.16 on Kusama, but I don't think that is the case anymore. One thing I'd like to verify, but don't have the full data for, is whether the established validator/collator connections started to rise/spike when the validators were upgraded on Kusama (to confirm whether it is a relay or parachain issue). Let me know if you can check that.
We can confirm that we observe a similar issue in Calamari.
https://github.com/paritytech/substrate/compare/polkadot-v0.9.15...polkadot-v0.9.16
These are all Substrate changes from that period
https://github.com/paritytech/polkadot/compare/v0.9.15...v0.9.16
These are all Polkadot changes.
It doesn't look like libp2p changed at all; both versions' Cargo.lock files point to v0.40.0.
It seems that we're looking for stuff that is relevant to the validator nodes and excluding the runtime. That narrows things down a decent bit.
substrate-executor stack guards

@rphmeier I've double checked again today, but on our Alphanet the degradation didn't start with the relay client upgrade to v0.9.16 but the day after, when we upgraded our collators to v0.9.16. This contradicts what was observed on Moonriver/Kusama (where the degradation happened 1 week before we upgraded collators to v0.9.16).
I'm not sure what could trigger those 2 scenarios. Perhaps some networking condition that was triggered on Kusama earlier due to its size, and was then reinforced when parachain clients upgraded to v0.9.16 (which would be what triggered it on Alphanet).
@rphmeier is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)
> is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)
I don't think so.
@rphmeier Also, are there any logs on our validators (we could enable debug for all modules for a short amount of time) that would help identify the issue?
Can we get a graph on Grafana indicating the amount of reorgs?
| Day (Feb) | Count |
| --- | --- |
| 07 | 1511 |
| 08 | 1720 |
| 09 | 1846 |
| 10 | 1776 |
| 11 | 1739 |
| 12 | 1736 |
| 13 | 1690 |
| 14 | 1716 |
Those are counts per day for alphanet. The relay v0.9.16 was deployed on Feb 09, and the parachain v0.9.16 was deployed on Feb 10
Graph of all re-orgs from the collators on alphanet. Nothing seems unusual.
@rphmeier I realize that the re-org message actually only applies in some cases; the problematic one, where multiple blocks of the same height (best) are received, does not trigger a reorg:
Feb 15 01:50:49 moonzone alphanet[715669]: 2022-02-15 01:50:49 [🌗] 💤 Idle (50 peers), best: #1703428 (0x7662…d0a3), finalized #1703428 (0x7662…d0a3), ⬇ 2.6kiB/s ⬆ 3.4kiB/s
Feb 15 01:50:49 moonzone alphanet[715669]: 2022-02-15 01:50:49 [Relaychain] 💤 Idle (49 peers), best: #3765514 (0x6a92…ce67), finalized #3765511 (0xc386…804b), ⬇ 45.5kiB/s ⬆ 42.0kiB/s
Feb 15 01:50:50 moonzone alphanet[715669]: 2022-02-15 01:50:50 [🌗] ✨ Imported #1703429 (0x45fe…8833)
Feb 15 01:50:51 moonzone alphanet[715669]: 2022-02-15 01:50:51 [🌗] ✨ Imported #1703429 (0x8031…1937)
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] ✨ Imported #3765515 (0xecaf…b642)
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] Could not retrieve session info from rolling session window relay_parent=0xecaf63d126f6be576673de500912b82f8838505294d4acaf1d6097af218cb642
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [🌗] 💤 Idle (50 peers), best: #1703428 (0x7662…d0a3), finalized #1703428 (0x7662…d0a3), ⬇ 293.7kiB/s ⬆ 1.2MiB/s
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] 💤 Idle (49 peers), best: #3765515 (0xecaf…b642), finalized #3765512 (0x13e4…e778), ⬇ 63.1kiB/s ⬆ 30.1kiB/s
As you can see here, there are 2 `Imported #1703429` lines with different hashes, which usually prevents the parachain from getting its block produced/included in time. I'll see if I can count them.
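For reference, here is a rough sketch of one way to count them from the journal output (illustrative only, not the exact script; it assumes the `✨ Imported #<height> (<hash>)` format shown above):

```rust
// Count heights where more than one distinct block hash was imported, reading
// journald output from stdin. Relay-chain import lines are skipped so only
// parachain forks are counted.
use std::collections::{HashMap, HashSet};
use std::io::{self, BufRead};

fn main() {
    // height -> distinct hashes imported at that height
    let mut seen: HashMap<u64, HashSet<String>> = HashMap::new();

    let stdin = io::stdin();
    for line in stdin.lock().lines().flatten() {
        if line.contains("[Relaychain]") || !line.contains("Imported #") {
            continue;
        }
        // Everything after "Imported #" looks like "1703429 (0x8031…1937)".
        let rest = match line.split("Imported #").nth(1) {
            Some(rest) => rest,
            None => continue,
        };
        let mut parts = rest.split_whitespace();
        let height: u64 = match parts.next().and_then(|h| h.parse().ok()) {
            Some(h) => h,
            None => continue,
        };
        let hash = parts.next().unwrap_or("").to_string();
        seen.entry(height).or_default().insert(hash);
    }

    let forked = seen.values().filter(|hashes| hashes.len() > 1).count();
    println!("heights with more than one imported block: {}", forked);
}
```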
EDIT: I computed the number of relay imports per day and it didn't change either:
| Day (Feb) | Relay imports |
| --- | --- |
| 07 | 15413 |
| 08 | 18184 |
| 09 | 18283 |
| 10 | 18374 |
| 11 | 18329 |
| 12 | 18261 |
| 13 | 18244 |
| 14 | 18228 |
Here are the block times for statemine - on the left is before the upgrade of the relay chain and statemine runtimes, and on the right is after. It's pretty clear that block production has taken a hit:
(There's probably a better graph out there but this is the one I have to hand)
We'd like to include https://github.com/paritytech/polkadot/pull/4914 into the next release (and collators as well), even though it doesn't explain all of the issues we're observing here.
Adding additional information.
We can't reproduce it with small networks (<20 collators). Here are the selected-collator counts for the affected networks (there are more collators, but those are the ones producing blocks):
Here is Moonbeam block time over the last few days. It has also degraded: (spikes are just node reboots, to be ignored)
It matches the collators deploying v0.9.16 support. We are asking them to revert to the previous version (v0.9.13) to see if it has an impact (until runtime 9160 is deployed on Polkadot).
Reverting many (not all) collators to v0.19.2 (using substrate v0.9.13) had an initial impact, but block times came back to the same degradation level as before: (the vertical bar is when our nodes downgraded)
We are testing reducing the active collators from 100 to 30 on Alphanet to see if there is an impact.
Waiting a bit more. This is Moonbeam data
Reducing the Alphanet selected collators from 100 to 30 didn't change the block time degradation.
We've tweaked one of our internal networks (stagenet) to have 50 collators (with v0.9.16 support), but with only 8 selected to produce blocks (always the same). (The relay is 0.9.16 with the Westend 9160 runtime, composed of 8 validators.)
The block time increased: (Block time. The spikes are simply the deployment of the 46 added collators)
That means that even when collators are not producing blocks, they impact the block time (@rphmeier).
@crystalin Just to make absolutely sure we're measuring the same thing, can you please share the Grafana query you're using to produce that "Block Time [60m]" graph?
It looks like the rollback of that PR seems to improve block times on westmint (unless I have messed up the query). This is looking at the last 7 days on westmint. The PR rolls back a noisy warning as well so the right hand warning graph makes it clear when the rolled back PR was introduced.
From the data on Rococo, we can see a degradation in the collator protocol, with `Collation wasn't advertised to any validator` logs increasing on collators. Inclusion times seem unaffected.
Our data suggests that applying https://github.com/paritytech/polkadot/pull/4925 with https://github.com/paritytech/polkadot/pull/4925#discussion_r807000553 to validators and https://github.com/paritytech/cumulus/pull/990 to collators will bring the parablock times to the previous value.
> It looks like the rollback of that PR seems to improve block times on westmint (unless I have messed up the query). This is looking at the last 7 days on westmint. The PR rolls back a noisy warning as well so the right hand warning graph makes it clear when the rolled back PR was introduced.
Hey, the query is `sum(1/rate(substrate_block_height{pod=~"$workload",chain=~"(moonbeam-|)$network(net|)",endpoint="$chain",status="best"}[60m])) by (pod)`
Thank you @Mitchell-Grimes. So it confirms the rollback to the previous version is working for Moonbeam/Polkadot. But I still have doubts about why this would work for Moonriver/Kusama, as the degradation there started many days before we upgraded our collators.
Just as an additional data point: these are the block production times on kintsugi. You can see a big increase on Feb 8, even though we did not upgrade our nodes until Feb 11th.
(graph created with `avg(5*60/increase(substrate_block_height{kubernetes_namespace="interbtc-kusama",status="finalized"}[5m]))`)
Based on the data and experiments so far we have 2 further tests planned to confirm the fix:
I dug into Statemine's performance. There are sharp differences observable in the behavior of the collators.
The green line is the number of advertisements made by collators. The yellow line is the number of advertised collations requested by validators. And the blue line is the number of collations generated.
This graph shows the rate of progression of the statemine chain. The artifacts around Feb 8 and Feb 10 represent places where the metric format changed and I had to stitch the graphs together with a few queries.
We can see that prior to 0.9.16 on Feb 4, statemine produced around 0.074 blocks/s, or about 1 block every 13.5 seconds. This is close to the ideal of 12 seconds, with a few issues.
From that point until around Feb 11, statemine gradually degraded to around 0.05 blocks/s, and then continued to degrade over time to around 0.045 blocks/s on Feb 28. After Feb 28, block production rates have recovered to around 0.048.
Overall, we're looking at a reduction in block production rate of around 35%-40%.
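For reference, the arithmetic behind those figures (using the rates above):

$$
\frac{1}{0.074\ \mathrm{blocks/s}} \approx 13.5\ \mathrm{s/block},\qquad \frac{1}{0.045\ \mathrm{blocks/s}} \approx 22\ \mathrm{s/block},\qquad \frac{0.074 - 0.045}{0.074} \approx 39\%.
$$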
We can see a sharp jump up to ~90% after statemine collators were upgraded to 0.9.16 on Feb 10. Before then, half of advertisements were redundant. With that in mind, the real ratio of requests:advertisements was about 100-105% prior to Feb 4. This indicates that something is causing validators to be 10-15% less likely on average to request an advertised collation.
Here is a normalized version of the first graph where the advertisements and requests graphs are normalized by a factor of 5, to account for the fact that Kusama validator groups have 5 validators in them. So this represents "advertisements/requests per validator in group" as compared to collations, as opposed to "total advertisements/requests".
Again, we see the same drop-off: prior to 0.9.16 on Feb 10, only half of advertisements were requested, and this is because only half of advertisements were made to validators that would act on them.
Ignoring the extra advertisements to validators in an irrelevant group, the number of advertisements per validator per collation has degraded from around 2/3 to 1/2, and the number of advertisements per validator has degraded by 1/2 in absolute terms. This might indicate issues for collators in either discovering or connecting to validators.
The green line assumes 5 validators per core and the yellow assumes 10, which accounts for the advertisements to 2 groups.
Comparing the left side of the yellow line to the right side of the green line, we can see that the absolute amount of advertisements per validator per collation has dropped from around .65 to .55, or a ~15% drop-off (this is a more accurate claim than the 2/3->1/2 claim made above). Also, this is worryingly far from the expected '1' in either case. This has dropped off further after the upgrade of statemine collators on Feb 28.
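For clarity, the normalization used for these per-validator figures is (with n = 5 for the green line and n = 10 for the yellow one, which accounts for advertising to 2 groups):

$$
\text{advertisements per validator per collation} = \frac{\text{advertisements}}{n \times \text{collations}},\qquad \frac{0.65 - 0.55}{0.65} \approx 15\%.
$$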
Open questions: what proportion of the reduction in collation production is explained by more time spent in availability distribution?
I had a bit of time, so I took a sample of the amount of time spent in availability on Kusama (this was on Moonriver), before 0.9.16 and from earlier today, and found that this changed from 1.0569 blocks to 1.8742 blocks, which is a 77.3% increase. This explains 6 × 0.773 ≈ 4.64s of reduced block time.
Block times have changed from around 1 block per 13.5s to 1 block per 21s, for a difference of 7.5s, so additional time spent in availability explains about 62% of the performance degradation.
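Spelling out that estimate with the figures above:

$$
6\,\mathrm{s} \times 0.773 \approx 4.64\,\mathrm{s},\qquad 21\,\mathrm{s} - 13.5\,\mathrm{s} = 7.5\,\mathrm{s},\qquad \frac{4.64}{7.5} \approx 62\%.
$$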
Availability cores being occupied for 77.3% more blocks is also possibly a reasonable explanation for a reduction in the number of collations produced, as collators don't produce collations on top of relay parents where the core is occupied.
UPDATE: Versi collators seem to be down: https://grafana.parity-mgmt.parity.io/goto/LdoXPYL7k?orgId=1. We are producing blocks, so I imagine that the logs are hooked up to the wrong collators. This might be exacerbating the issue we are observing or even causing a completely different issue.
After aligning with the Purestake team, we were able to reproduce lapses in block times on Versi.
Originally, we added 200ms latency to the chaos network for Versi and immediately experienced large lapses in parachain block times. In addition, relay chain block times were regularly 12s and spiked to 18s at times.
The largest lapse in parachain blocks was approximately 3 mins.
In order to test our hypothesis further, we removed the latency entirely and observed the relay chain block times. To our surprise, we still observed similar relay chain block times (though more infrequently than previously).
Specifically, we can observe large lapses in bitfield-signing, as can be confirmed here: https://grafana.parity-mgmt.parity.io/goto/eAnc-LYnz?orgId=1 (flip the results order to observe large time spans of `false` availability for the first core).
Looking into the composition of the bitfield array, the computation of the availability boolean for a particular core evaluates to `false` here if and only if one of the following holds:

- the core is not `Occupied` (it is `Free` or `Scheduled`), in which case the availability store is not queried at all, or
- the core is `Occupied`, but the `QueryChunkAvailability` response indicates we do not hold our chunk for the occupied candidate.
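To make that concrete, here is a simplified sketch of the bit-derivation logic (this is not the actual subsystem code; the `CoreState` type and `query_chunk_availability` helper are stand-ins for the real runtime API and availability-store query):

```rust
// Simplified sketch of how each bit of the signed availability bitfield is derived
// per core, matching the two conditions described above.

enum CoreState {
    Free,
    Scheduled,
    // An occupied core carries the candidate hash whose availability is tracked.
    Occupied { candidate_hash: [u8; 32] },
}

/// Stand-in for the `QueryChunkAvailability` message to the availability store:
/// do we hold our erasure chunk for this candidate? Hard-coded for illustration.
fn query_chunk_availability(_candidate_hash: &[u8; 32], _validator_index: u32) -> bool {
    true
}

/// One bit per availability core: `true` only if the core is `Occupied` *and* we
/// hold our chunk for the occupied candidate. A `Free` or `Scheduled` core always
/// yields `false` and never triggers an availability-store query, which is why no
/// TRACE line shows up for those cores.
fn construct_availability_bitfield(cores: &[CoreState], validator_index: u32) -> Vec<bool> {
    cores
        .iter()
        .map(|core| match core {
            CoreState::Occupied { candidate_hash } => {
                query_chunk_availability(candidate_hash, validator_index)
            }
            CoreState::Free | CoreState::Scheduled => false,
        })
        .collect()
}

fn main() {
    let cores = vec![
        CoreState::Occupied { candidate_hash: [0u8; 32] },
        CoreState::Scheduled,
        CoreState::Free,
    ];
    // Prints [true, false, false]: only an occupied core can ever be reported available.
    println!("{:?}", construct_availability_bitfield(&cores, 0));
}
```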
Example (flip results order)
We observe large spurts of the `signing-bitfield` log line reporting `false` for the first core (the one Moonbeam is assigned to) before updating to `true` as the core gets scheduled and the candidate becomes available.
On the other hand, if we verify the `TRACE` statements in this same time-frame (see here) for the bitfield-signing subsystem, we observe no trace lines whatsoever.
Therefore, we must conclude that the core is not Occupied (it is either Scheduled or Free) during the large lapses in parachain blocks, as we would otherwise expect to see a log line for the `QueryChunkAvailability` response.
More importantly, we observe no candidate backing or candidate inclusion throughout this outage.
Looking closely, we observe that candidate `0x239621fd41d37d845adfbfae4690a523c55bf2e335a4c113204502ac83c344ec` was the last block to get backed before the outage, and it actually got included prior to the outage at 18:38:43.
This indicates that there may be a problem in
If the core is not occupied, it stands to reason that the collators are in some way misinterpreting this information and not producing collations even though the core is being reported as free for all validators.
During this time, we observe large bursts of approval-distribution logs noting that an assignment from a peer is out of view: (https://grafana.parity-mgmt.parity.io/goto/st34YLLnz?orgId=1).
An important observation from this analysis is that the long relay-chain block times were not fixed by removing the chaos network latency.
A noteworthy mention is that one of the validators seems to have crashed at some point (but is now running fine): https://grafana.parity-mgmt.parity.io/goto/-yrMYYL7k?orgId=1
^ Average time-in-flight for intra-node messages being sent to the approval-distribution subsystem. This appears to cause a cascading effect, slowing down other networking subsystems.
^ Nodes are receiving massive amounts of intra-node messages on the approval-distribution subsystem, far beyond the expected amounts. These probably originate in the network, but might originate in the approval-voting subsystem. Nodes are sending tons of messages to the network-bridge.
^ This shows that the number of notifications received on the network corresponds very closely to the approval-distribution subsystem, indicating that the messages originate in the network. The gap represents messages from other subsystems.
Optimizing approval-distribution will probably help a lot.
Can we close this issue now?
Well, I don't think the issue is fully resolved. @rphmeier, currently on Moonriver the average block time is around 14s (as opposed to 12.5s on Moonbeam), so I think there is still room for improvement. What do you think?
FYI: This is what it looks like over the last 90 days:
Going to close this; the next round of improvements will come with asynchronous backing.
We are observing a pretty impactful degradation on Moonriver since the deployment of client v0.9.16 on Kusama. Additionally, we upgraded Moonriver a few days after to support v0.9.16, but it had no impact.
We have been able to reproduce this on our public Alphanet (20 controlled validators, on v0.9.16). Here are the logs for a given block not being included, #1690751: alphanet-block-degradation-block-1690751.log
Here is the graph showing the average block time over the last 7 days: (the spike is only the restart of our nodes; ignore it)