On our Alphanet, where we also observe the bug, we can see some important changes in the Validator <-> Collator peer system:
Looking more deeply into the connections, here are 2 of the validators (modified to have a 500 collation peer set). The surprising part is that they establish 100+ connections almost instantly and then remove them.
I've done some tests, and I suspect the nodes are getting overloaded with connections at critical times. Using an unoptimized binary makes the results worse as well. That would explain why the Kusama network is doing worse than Polkadot at the moment.
It might be related to some optimizations @eskimor has done around collation networking
Yeah, we should try a branch with those commits removed.
5b74841ea76195fb804a02c0336d05d9a2d3091b , 3ae71a552256273c6a82c3824451fb73423b2232, 7311a3aa9f83730a079b0899c34bb813dd96c4e1 all seem like viable culprits.
https://github.com/paritytech/polkadot/pull/4640/files#r805223564 Looks suspicious.
I pushed a branch reverting this fix to https://github.com/paritytech/polkadot/tree/rh-revert-connectivity-fix if you'd like to try it out.
@rphmeier Ok I'm trying it now, I'll see if this can be deployed now or will have to wait tomorrow morning
@rphmeier Can it be applied only to validators, or do the collators also need it?
@crystalin Specifically the collators need it. This should only affect collator logic.
That might be an argument against this being the underlying cause, because AFAIK it began even before the collators were upgraded.
@rphmeier Yes, it began while we were still on 0.9.13 collators. It also began precisely when Kusama validators started to use v0.9.16. Also, we can't easily upgrade all our collators to a custom version; Purestake only controls 8 collators out of 80.
We noticed increased block times since Kusama nodes upgraded to 0.9.16. We released Karura with 0.9.16 two days ago and saw block times increase further, but we are unsure whether it is related to our upgrade or to more Kusama nodes having upgraded.
Thank you @xlc. We also observed a small increase on some of our networks when the collators upgraded to support v0.9.16, but it was very small compared to the increase in block time since Kusama upgraded to v0.9.16.
(This is on Moonbeam/Polkadot, where some collators upgraded to support v0.9.16.) The spike is the upgrade; there is a small increase in the average after it.
@rphmeier I tried to look more at the charts. I confirm the connection spikes happen when the validator gets into the parachain core set. I initially thought it could have been triggered by other parachains migrating to v0.9.16 on Kusama, but I don't think that is the case anymore. One thing I'd like to verify, but don't have the full data for, is whether the established validator/collator connections started to rise/spike when the validators were upgraded on Kusama (to confirm whether it is a relay or parachain issue). Let me know if you can check that.
We can confirm that we observe a similar issue in Calamari.
https://github.com/paritytech/substrate/compare/polkadot-v0.9.15...polkadot-v0.9.16
These are all Substrate changes from that period
https://github.com/paritytech/polkadot/compare/v0.9.15...v0.9.16
These are all Polkadot changes.
It doesn't look like libp2p changed at all; both versions' Cargo.lock files point to v0.40.0.
It seems that we're looking for stuff that is relevant to the validator nodes and excluding the runtime. That narrows things down a decent bit.
substrate-executor stack guards

@rphmeier I've double checked again today, but on our Alphanet the degradation didn't start with the relay client upgrade to v0.9.16 but the day after, when we upgraded our collators to v0.9.16. This contradicts what was observed on Moonriver/Kusama (where the degradation happened 1 week before we upgraded collators to v0.9.16).
I'm not sure what could trigger those 2 scenarios. Perhaps some networking condition that was triggered on Kusama earlier due to its size, and was then reinforced when parachain clients upgraded to v0.9.16 (which would be what triggered it on Alphanet).
@rphmeier is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)
> is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)
I don't think so.
@rphmeier Also, are there any logs on our validators (we could enable debug for all modules for a short amount of time) that would help identify the issue?
Can we get a graph on Grafana indicating the amount of reorgs?
| Day (Feb) | Count |
| --- | --- |
| 07 | 1511 |
| 08 | 1720 |
| 09 | 1846 |
| 10 | 1776 |
| 11 | 1739 |
| 12 | 1736 |
| 13 | 1690 |
| 14 | 1716 |
Those are counts per day for alphanet. The relay v0.9.16 was deployed on Feb 09, and the parachain v0.9.16 was deployed on Feb 10
Graph of all re-orgs from the collators on alphanet. Nothing seems unusual.
@rphmeier I realize that the re-org message actually only applies in some cases; the problematic one, where multiple blocks of the same height (best) are received, does not trigger a reorg:
Feb 15 01:50:49 moonzone alphanet[715669]: 2022-02-15 01:50:49 [🌗] 💤 Idle (50 peers), best: #1703428 (0x7662…d0a3), finalized #1703428 (0x7662…d0a3), ⬇ 2.6kiB/s ⬆ 3.4kiB/s
Feb 15 01:50:49 moonzone alphanet[715669]: 2022-02-15 01:50:49 [Relaychain] 💤 Idle (49 peers), best: #3765514 (0x6a92…ce67), finalized #3765511 (0xc386…804b), ⬇ 45.5kiB/s ⬆ 42.0kiB/s
Feb 15 01:50:50 moonzone alphanet[715669]: 2022-02-15 01:50:50 [🌗] ✨ Imported #1703429 (0x45fe…8833)
Feb 15 01:50:51 moonzone alphanet[715669]: 2022-02-15 01:50:51 [🌗] ✨ Imported #1703429 (0x8031…1937)
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] ✨ Imported #3765515 (0xecaf…b642)
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] Could not retrieve session info from rolling session window relay_parent=0xecaf63d126f6be576673de500912b82f8838505294d4acaf1d6097af218cb642
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [🌗] 💤 Idle (50 peers), best: #1703428 (0x7662…d0a3), finalized #1703428 (0x7662…d0a3), ⬇ 293.7kiB/s ⬆ 1.2MiB/s
Feb 15 01:50:54 moonzone alphanet[715669]: 2022-02-15 01:50:54 [Relaychain] 💤 Idle (49 peers), best: #3765515 (0xecaf…b642), finalized #3765512 (0x13e4…e778), ⬇ 63.1kiB/s ⬆ 30.1kiB/s
As you can see here, there are 2 `Imported #1703429` lines with different hashes, which usually prevents the parachain from getting its block produced/included in time. I'll see if I can count them.
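For reference, here is a rough sketch of one way to count them from the journal output (illustrative only, not the exact script; it assumes the `✨ Imported #<height> (<hash>)` format shown above):

```rust
// Count heights where more than one distinct block hash was imported, reading
// journald output from stdin. Relay-chain import lines are skipped so only
// parachain forks are counted.
use std::collections::{HashMap, HashSet};
use std::io::{self, BufRead};

fn main() {
    // height -> distinct hashes imported at that height
    let mut seen: HashMap<u64, HashSet<String>> = HashMap::new();

    let stdin = io::stdin();
    for line in stdin.lock().lines().flatten() {
        if line.contains("[Relaychain]") || !line.contains("Imported #") {
            continue;
        }
        // Everything after "Imported #" looks like "1703429 (0x8031…1937)".
        let rest = match line.split("Imported #").nth(1) {
            Some(rest) => rest,
            None => continue,
        };
        let mut parts = rest.split_whitespace();
        let height: u64 = match parts.next().and_then(|h| h.parse().ok()) {
            Some(h) => h,
            None => continue,
        };
        let hash = parts.next().unwrap_or("").to_string();
        seen.entry(height).or_default().insert(hash);
    }

    let forked = seen.values().filter(|hashes| hashes.len() > 1).count();
    println!("heights with more than one imported block: {}", forked);
}
```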
EDIT: I computed the number of relay imports per day and it didn't change either:
| Day (Feb) | Relay imports |
| --- | --- |
| 07 | 15413 |
| 08 | 18184 |
| 09 | 18283 |
| 10 | 18374 |
| 11 | 18329 |
| 12 | 18261 |
| 13 | 18244 |
| 14 | 18228 |
Here are the block times for statemine - on the left is before the upgrade of the relay chain and statemine runtimes, and on the right is after. It's pretty clear that block production has taken a hit:
(There's probably a better graph out there but this is the one I have to hand)
We'd like to include https://github.com/paritytech/polkadot/pull/4914 into the next release (and collators as well), even though it doesn't explain all of the issues we're observing here.
Adding additional information.
We can't reproduce it with small networks (<20 collators). Here are the selected-collator counts for the affected networks (there are more collators, but those are the ones producing blocks):
Here is Moonbeam block time over the last few days. It has also degraded: (spikes are just node reboots, to be ignored)
It matches the collators deploying v0.9.16 support. We are asking them to revert to the previous version (v0.9.13) to see if it has an impact (until runtime 9160 is deployed on Polkadot).
Reverting many (not all) collators to v0.19.2 (using substrate v0.9.13) had an initial impact, but block times came back to the same degradation level as before: (the vertical bar is when our nodes downgraded)
We are testing reducing the active collators from 100 to 30 on Alphanet to see if there is an impact.
Waiting a bit more. This is Moonbeam data
Reducing the Alphanet selected collators from 100 to 30 didn't change the block time degradation.
We've tweaked one of our internal networks (stagenet) to have 50 collators (with v0.9.16 support), but with only 8 selected to produce blocks (always the same). (The relay is 0.9.16 with the Westend 9160 runtime, composed of 8 validators.)
The block time increased: (Block time. The spikes are simply the deployment of the 46 added collators)
That means that even when collators are not producing blocks, they impact the block time (@rphmeier).
@crystalin Just to make absolutely sure we're measuring the same thing, can you please share the Grafana query you're using to produce that "Block Time [60m]" graph?
It looks like the rollback of that PR seems to improve block times on westmint (unless I have messed up the query). This is looking at the last 7 days on westmint. The PR rolls back a noisy warning as well so the right hand warning graph makes it clear when the rolled back PR was introduced.
From the data on Rococo, we can see a degradation in the collator protocol, with `Collation wasn't advertised to any validator` logs increasing on collators. Inclusion times seem unaffected.
Our data suggests that applying https://github.com/paritytech/polkadot/pull/4925 with https://github.com/paritytech/polkadot/pull/4925#discussion_r807000553 to validators and https://github.com/paritytech/cumulus/pull/990 to collators will bring the parablock times to the previous value.
> It looks like the rollback of that PR seems to improve block times on westmint (unless I have messed up the query). This is looking at the last 7 days on westmint. The PR rolls back a noisy warning as well so the right hand warning graph makes it clear when the rolled back PR was introduced.
Hey, the query is `sum(1/rate(substrate_block_height{pod=~"$workload",chain=~"(moonbeam-|)$network(net|)",endpoint="$chain",status="best"}[60m])) by (pod)`
Thank you @Mitchell-Grimes. So it confirms the rollback to the previous version is working for Moonbeam/Polkadot. But I still have doubts about why this would work for Moonriver/Kusama, as the degradation there started many days before we upgraded our collators.
Just as an additional data point: these are the block production times on kintsugi. You can see a big increase on Feb 8, even though we did not upgrade our nodes until Feb 11th.
(graph created with `avg(5*60/increase(substrate_block_height{kubernetes_namespace="interbtc-kusama",status="finalized"}[5m]))`)
Based on the data and experiments so far we have 2 further tests planned to confirm the fix:
I dug into Statemine's performance. There are sharp differences observable in the behavior of the collators.
The green line is the number of advertisements made by collators. The yellow line is the number of advertised collations requested by validators. And the blue line is the number of collations generated.
This graph shows the rate of progression of the statemine chain. The artifacts around Feb 8 and Feb 10 represent places where the metric format changed and I had to stitch the graphs together with a few queries.
We can see that prior to 0.9.16 on Feb 4, statemine produced around 0.074 blocks/s, or about 1 block every 13.5 seconds. This is close to the ideal of 12 seconds, with a few issues.
From that point until around Feb 11, statemine gradually degraded to around 0.05 blocks/s, and then continued to degrade over time to around 0.045 blocks/s on Feb 28. After Feb 28, block production rates have recovered to around 0.048.
Overall, we're looking at a reduction in block production rate of around 35%-40%.
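For reference, the arithmetic behind those figures (using the rates above):

$$
\frac{1}{0.074\ \mathrm{blocks/s}} \approx 13.5\ \mathrm{s/block},\qquad \frac{1}{0.045\ \mathrm{blocks/s}} \approx 22\ \mathrm{s/block},\qquad \frac{0.074 - 0.045}{0.074} \approx 39\%.
$$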
We can see a sharp jump up to ~90% after statemine collators were upgraded to 0.9.16 on Feb 10. Before then, half of advertisements were redundant. With that in mind, the real ratio of requests:advertisements was about 100-105% prior to Feb 4. This indicates that something is causing validators to be 10-15% less likely on average to request an advertised collation.
Here is a normalized version of the first graph where the advertisements and requests graphs are normalized by a factor of 5, to account for the fact that Kusama validator groups have 5 validators in them. So this represents "advertisements/requests per validator in group" as compared to collations, as opposed to "total advertisements/requests".
Again, we see the same drop-off: prior to 0.9.16 on Feb 10, only half of advertisements were requested, and this is because only half of advertisements were made to validators that would act on them.
Ignoring the extra advertisements to validators in an irrelevant group, the number of advertisements per validator per collation has degraded from around 2/3 to 1/2, and the number of advertisements per validator has degraded by 1/2 in absolute terms. This might indicate issues for collators in either discovering or connecting to validators.
The green line assumes 5 validators per core and the yellow assumes 10, which accounts for the advertisements to 2 groups.
Comparing the left side of the yellow line to the right side of the green line, we can see that the absolute amount of advertisements per validator per collation has dropped from around .65 to .55, or a ~15% drop-off (this is a more accurate claim than the 2/3->1/2 claim made above). Also, this is worryingly far from the expected '1' in either case. This has dropped off further after the upgrade of statemine collators on Feb 28.
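For clarity, the normalization used for these per-validator figures is (with n = 5 for the green line and n = 10 for the yellow one, which accounts for advertising to 2 groups):

$$
\text{advertisements per validator per collation} = \frac{\text{advertisements}}{n \times \text{collations}},\qquad \frac{0.65 - 0.55}{0.65} \approx 15\%.
$$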
Open questions: what proportion of the reduction in collation production is explained by more time spent in availability distribution?
I had a bit of time, so I took a sample of the amount of time spent in availability on Kusama (this was on Moonriver), before 0.9.16 and from earlier today, and found that this changed from 1.0569 blocks to 1.8742 blocks, which is a 77.3% increase. This explains 6 × 0.773 ≈ 4.64s of reduced block time.
Block times have changed from around 1 block per 13.5s to 1 block per 21s, for a difference of 7.5s, so additional time spent in availability explains about 62% of the performance degradation.
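Spelling out that estimate with the figures above:

$$
6\,\mathrm{s} \times 0.773 \approx 4.64\,\mathrm{s},\qquad 21\,\mathrm{s} - 13.5\,\mathrm{s} = 7.5\,\mathrm{s},\qquad \frac{4.64}{7.5} \approx 62\%.
$$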
Availability cores being occupied for 77.3% more blocks is also possibly a reasonable explanation for a reduction in the number of collations produced, as collators don't produce collations on top of relay parents where the core is occupied.
UPDATE: Versi collators seem to be down: https://grafana.parity-mgmt.parity.io/goto/LdoXPYL7k?orgId=1. We are producing blocks, so I imagine that the logs are hooked up to the wrong collators. This might be exacerbating the issue we are observing or even causing a completely different issue.
After aligning with the Purestake team, we were able to reproduce lapses in block times on Versi.
Originally, we added 200ms latency to the chaos network for Versi and immediately experienced large lapses in parachain block times. In addition, relay chain block times were regularly 12s and spiked to 18s at times.
The largest lapse in parachain blocks was approximately 3 mins.
In order to test our hypothesis further, we removed the latency entirely and observed the relay chain block times. To our surprise, we still observed similar relay chain block times (though more infrequently than previously).
Specifically, we can observe large lapses in bitfield-signing, as can be confirmed here: https://grafana.parity-mgmt.parity.io/goto/eAnc-LYnz?orgId=1 (flip the results order to observe large time spans of `false` availability for the first core).
Looking into the composition of the bitfield array, the computation of the availability boolean for a particular core evaluates to `false` here if and only if one of the following holds:

- the core is not `Occupied` (it is `Free` or `Scheduled`), in which case the availability store is not queried at all, or
- the core is `Occupied`, but the `QueryChunkAvailability` response indicates we do not hold our chunk for the occupied candidate.
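To make that concrete, here is a simplified sketch of the bit-derivation logic (this is not the actual subsystem code; the `CoreState` type and `query_chunk_availability` helper are stand-ins for the real runtime API and availability-store query):

```rust
// Simplified sketch of how each bit of the signed availability bitfield is derived
// per core, matching the two conditions described above.

enum CoreState {
    Free,
    Scheduled,
    // An occupied core carries the candidate hash whose availability is tracked.
    Occupied { candidate_hash: [u8; 32] },
}

/// Stand-in for the `QueryChunkAvailability` message to the availability store:
/// do we hold our erasure chunk for this candidate? Hard-coded for illustration.
fn query_chunk_availability(_candidate_hash: &[u8; 32], _validator_index: u32) -> bool {
    true
}

/// One bit per availability core: `true` only if the core is `Occupied` *and* we
/// hold our chunk for the occupied candidate. A `Free` or `Scheduled` core always
/// yields `false` and never triggers an availability-store query, which is why no
/// TRACE line shows up for those cores.
fn construct_availability_bitfield(cores: &[CoreState], validator_index: u32) -> Vec<bool> {
    cores
        .iter()
        .map(|core| match core {
            CoreState::Occupied { candidate_hash } => {
                query_chunk_availability(candidate_hash, validator_index)
            }
            CoreState::Free | CoreState::Scheduled => false,
        })
        .collect()
}

fn main() {
    let cores = vec![
        CoreState::Occupied { candidate_hash: [0u8; 32] },
        CoreState::Scheduled,
        CoreState::Free,
    ];
    // Prints [true, false, false]: only an occupied core can ever be reported available.
    println!("{:?}", construct_availability_bitfield(&cores, 0));
}
```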
Example (flip results order)
We observe large spurts of the `signing-bitfield` log line reporting `false` for the first core (the one Moonbeam is assigned to) before updating to `true` as the core gets scheduled and the candidate becomes available.
On the other hand, if we verify the `TRACE` statements in this same time-frame (see here) for the bitfield-signing subsystem, we observe no trace lines whatsoever.
Therefore, we must conclude that the core is not Occupied (it is either Scheduled or Free) during the large lapses in parachain blocks, as we would otherwise expect to see a log line for the `QueryChunkAvailability` response.
More importantly, we observe no candidate backing or candidate inclusion throughout this outage.
Looking closely, we observe that candidate `0x239621fd41d37d845adfbfae4690a523c55bf2e335a4c113204502ac83c344ec` was the last block to get backed before the outage, and it actually got included prior to the outage at 18:38:43.
This indicates that there may be a problem in
If the core is not occupied, it stands to reason that the collators are in some way misinterpreting this information and not producing collations even though the core is being reported as free for all validators.
During this time, we observe large bursts of approval-distribution logs noting that an assignment from a peer is out of view: (https://grafana.parity-mgmt.parity.io/goto/st34YLLnz?orgId=1).
An important observation from this analysis is that the long relay-chain block times were not fixed by removing the chaos network latency.
A noteworthy mention is that one of the validators seems to have crashed at some point (but is now running fine): https://grafana.parity-mgmt.parity.io/goto/-yrMYYL7k?orgId=1
^ Average time-in-flight for intra-node messages being sent to the approval-distribution subsystem. This appears to cause a cascading effect, slowing down other networking subsystems.
^ Nodes are receiving massive amounts of intra-node messages on the approval-distribution subsystem, far beyond the expected amounts. These probably originate in the network, but might originate in the approval-voting subsystem. Nodes are sending tons of messages to the network-bridge.
^ This shows that the number of notifications received on the network corresponds very closely to the approval-distribution subsystem, indicating that the messages originate in the network. The gap represents messages from other subsystems.
Optimizing approval-distribution will probably help a lot.
Can we close this issue now?
Well, I don't think the issue is fully resolved. @rphmeier, currently on Moonriver the average block time is around 14s (as opposed to 12.5s on Moonbeam), so I think there is still room for improvement. What do you think?
FYI: This is what it looks like over the last 90 days:
Going to close this; the next round of improvements will come with asynchronous backing.
We are observing a pretty impactful degradation on Moonriver since the deployment of client v0.9.16 on Kusama. Additionally, we upgraded Moonriver a few days after to support v0.9.16, but it had no impact.
We have been able to reproduce this on our public Alphanet (20 controlled validators, on v0.9.16). Here are the logs for a given block not being included, #1690751: alphanet-block-degradation-block-1690751.log
Here is the graph showing the average block time over the last 7 days: (the spike is only the restart of our nodes; ignore it)