paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

[Root cause] Finality lag and slow parachain block production immediately after runtime upgrade #5738

Closed: alexggh closed this issue 4 days ago

alexggh commented 1 week ago

What happened

Immediately after the runtime was updated to 1.3.0, which was enacted at block, finality started lagging and parachain blocks weren't being produced as usual.

The reason for that was that a significant number of validators were crashing with the following error:

Sep 17 08:03:13 dot1 taskset[688103]: 2024-09-17 08:03:13 Overseer exited with error err=Generated(SubsystemStalled("candidate-backing-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))

Sep 17 08:04:13 dot1 taskset[688103]: 2024-09-17 08:04:13 Task "polkadot-runtime-api-request" (Group: runtime-api) was still running with 3 instances after waiting 60 seconds to finish.
Sep 17 08:04:13 dot1 taskset[688103]: Error:
Sep 17 08:04:13 dot1 taskset[688103]:    0: Other: Essential task failed.
Sep 17 08:04:13 dot1 taskset[688103]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Sep 17 08:04:13 dot1 taskset[688103]: Run with RUST_BACKTRACE=full to include source snippets.

After a restart the validators worked as expected, but it took around 40 minutes for finality to catch up:

(Screenshot: 2024-09-17 at 16:13:09)

Additionally, because parachain collators froze or crashed as well, some of them had to be restarted. For example, on asset-hub the collators that are still not producing any blocks and need manual intervention can be seen here: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Fdot-rpc.stakeworld.io%2Fassethub#/collators

Root cause

All failure modes seem to be the same: the candidate-backing subsystem does not process signals, and after 64 blocks (the size of the subsystem signal channel) the overseer decides that it is stuck, so everything is terminated and restarted.

Overseer exited with error err=Generated(SubsystemStalled("candidate-backing-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
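
For context, the stall detection works roughly as sketched below. This is a minimal illustration, not the actual overseer code, assuming a bounded signal channel whose capacity matches the 64 blocks mentioned above: once a subsystem stops draining its signals, the channel fills up and the overseer gives up on it.

// Minimal sketch of the stall detection described above; not the real
// overseer implementation, and the names here are simplified.
use tokio::sync::mpsc;

const SIGNAL_CHANNEL_CAPACITY: usize = 64; // assumption: matches the "64 blocks" above

#[derive(Debug)]
enum OverseerSignal {
    ActiveLeaves(u32), // roughly one signal per imported block (simplified)
}

fn route_signal(
    tx: &mpsc::Sender<OverseerSignal>,
    signal: OverseerSignal,
) -> Result<(), String> {
    // try_send fails once the channel holds 64 unprocessed signals, which is
    // the point where the real overseer reports SubsystemStalled and shuts
    // the node down.
    tx.try_send(signal)
        .map_err(|_| "SubsystemStalled: candidate-backing-subsystem".to_string())
}

fn main() {
    let (tx, _rx) = mpsc::channel::<OverseerSignal>(SIGNAL_CHANNEL_CAPACITY);
    // Nobody drains _rx, standing in for a subsystem blocked on a runtime
    // API call; the 65th signal is the one that trips stall detection.
    for block in 0..=SIGNAL_CHANNEL_CAPACITY as u32 {
        if route_signal(&tx, OverseerSignal::ActiveLeaves(block)).is_err() {
            eprintln!("subsystem declared stalled at signal {block}");
        }
    }
}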

The candidate-backing subsystem seems to be waiting on calls to the runtime-api subsystem, which are taking a very long time:

Sep 17 08:04:13 dot1 taskset[688103]: 2024-09-17 08:04:13 Task "polkadot-runtime-api-request" (Group: runtime-api) was still running with 3 instances after waiting 60 seconds to finish.

This data points to the runtime call fetch_claim_queue, a new runtime call that the node started issuing after the runtime upgrade, because the ParachainHost API was bumped to pub const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11.
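
For reference, the node-side gating behind that constant is roughly the sketch below (simplified; the real check sits in the runtime-api machinery and differs in detail): the new request is only issued once the runtime reports ParachainHost API version 11 or higher, which is why fetch_claim_queue first showed up exactly at the upgrade.

// Simplified sketch of the API-version gating described above.
const CLAIM_QUEUE_RUNTIME_REQUIREMENT: u32 = 11; // value quoted above

/// Whether the node may call the new claim_queue runtime API for a block,
/// given the ParachainHost API version the runtime reports there.
fn claim_queue_available(parachain_host_api_version: u32) -> bool {
    parachain_host_api_version >= CLAIM_QUEUE_RUNTIME_REQUIREMENT
}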

A probable explanation for what happened is that we were affected by this limitation, https://github.com/paritytech/polkadot-sdk/issues/64, combined with the fact that the new runtime API claim_queue uses storage data that is created by a migration (parachains_configuration::migration::v12::MigrateToV12) included in the same runtime.

So the first time claim_queue got called it used garbage data, which made it take a very long time, and the candidate-backing subsystem got stuck waiting on the API calls. That also explains why, after a restart, things recovered on the following blocks and worked as expected.
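
To spell out the corruption mode: SCALE decoding is not self-describing, so when the new runtime reads not-yet-migrated configuration bytes with its new layout, the decode can succeed and silently yield misaligned, absurd values rather than an error, which is consistent with the HostConfiguration dump later in this thread. A self-contained sketch with hypothetical, much smaller field layouts (using parity-scale-codec):

// Hypothetical layouts, far smaller than the real HostConfiguration; the
// point is only that decoding old-layout bytes with a new layout can
// succeed and produce garbage instead of an error.
use parity_scale_codec::{Decode, Encode};

#[derive(Encode)]
struct OldConfig {
    max_code_size: u32, // removed in the "new" layout below
    group_rotation_frequency: u32,
    lookahead: u32,
    num_cores: u32,
}

#[derive(Decode, Debug)]
struct NewConfig {
    group_rotation_frequency: u32,
    lookahead: u32,
    num_cores: u32,
}

fn main() {
    let old_bytes = OldConfig {
        max_code_size: 10 * 1024 * 1024,
        group_rotation_frequency: 10,
        lookahead: 3,
        num_cores: 5,
    }
    .encode();

    // Decodes successfully, but every field is shifted by one position:
    // group_rotation_frequency comes out as 10_485_760 and the rest are
    // similarly misaligned, the same kind of nonsense as the dump below.
    let garbled = NewConfig::decode(&mut &old_bytes[..]).unwrap();
    println!("{garbled:?}");
}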

An additional data point is that the subsystem was declared stalled by the overseer exactly 64 blocks after the block containing the runtime upgrade was imported.

Testing with chopsticks at block 22572435 confirmed that claim_queue takes a very long time to complete because of the corrupted ParachainHost configuration.

Conclusion

The root cause is fairly well understood and verified, so no further impact is expected.

Action items

ordian commented 1 week ago

Make the subsystems more robust so they don't fail hard on this type of error: if the subsystem hadn't crashed and had just moved on to the next block, we wouldn't have seen any impact.

Apart from fixing #64, we could add timeouts for runtime APIs that are (slightly) lower than the subsystem stall timeout. Although this sounds brittle, it might be a bit better than letting the node hang and shut down.
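
A minimal sketch of that idea, assuming a tokio-style async context; the constant and helper below are hypothetical, not existing polkadot-sdk APIs:

use std::time::Duration;
use tokio::time::timeout;

// Hypothetical value, chosen (slightly) below the overseer's subsystem
// stall timeout so a subsystem can give up on one block and move on
// before it gets declared stalled.
const RUNTIME_API_TIMEOUT: Duration = Duration::from_secs(30);

async fn with_runtime_api_timeout<T>(
    fut: impl std::future::Future<Output = T>,
) -> Result<T, &'static str> {
    // Note: this only unblocks the calling subsystem; the underlying wasm
    // execution keeps running, which ties into the point about wasmtime
    // aborts raised below.
    timeout(RUNTIME_API_TIMEOUT, fut)
        .await
        .map_err(|_| "runtime API request timed out")
}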

burdges commented 1 week ago

Aside from issues that caused this, we've a back pressure limitation there too right? If we've 1/3rd bad backers and 1/3rd bad relay chain block producers, then we'd still have 1/9th the candidates of a full chain, plus whatever we consider essential, so that's considerable throughput.

Availability bitfields could be subject to back pressure too, maybe? I suppose that's a topic for another thread.

bkchr commented 1 week ago

Although this sounds brittle, it might be a bit better than letting the node hang and shut down.

Generally the runtime is seen as trusted code. Also, aborting execution is still not that easy to do with wasmtime.

alexggh commented 1 week ago

Aside from issues that caused this, we've a back pressure limitation there too right? If we've 1/3rd bad backers and 1/3rd bad relay chain block producers, then we'd still have 1/9th the candidates of a full chain, plus whatever we consider essential, so that's considerable throughput.

In this case I estimate that almost all the validators restarted at the same time. The recovery itself was slower than expected because of two things:

  1. Parachain collators got stuck as well for the same reason, so parachains would not produce blocks.
  2. When all nodes got restarted, all of them had to go through DHT peer discovery (which takes on the order of tens of minutes), PVF recompilation, and probably other restart work I'm missing, so it took a while until everything was in place for nodes to finalize blocks.

kianenigma commented 1 week ago

Thanks for the great writeup!

Q1:

Why wasn't this a problem on the Kusama runtime upgrade? Corruption happened on Kusama as well at block 24,786,390, but we got lucky/unlucky with the corruption and the claim_queue API returned faster, so there was no impact.

I didn't quite get why Kusama survived? The same migration is there as well.

Q2: Why does unmigrated data cause a runtime API to take an arbitrarily long amount of time? If the data is unmigrated, I would expect the API to still return immediately, and perhaps return None or Default or Err or something. But I can't see why it would cause it to just take a long time to respond?

sandreim commented 1 week ago

Q2: Why does unmigrated data cause a runtime API to take an arbitrarily long amount of time? If the data is unmigrated, I would expect the API to still return immediately, and perhaps return None or Default or Err or something. But I can't see why it would cause it to just take a long time to respond?

The claim_queue runtime API ends up using the corrupted host configuration in this piece of code: https://github.com/paritytech/polkadot-sdk/blob/ba38d31d8e7c12025e4de385c2a4d6372ee7c23b/polkadot/runtime/parachains/src/scheduler.rs#L611

Check out the values used by that code in the HostConfiguration snippet below.

"HostConfiguration"{
   "approval_voting_params":"ApprovalVotingParams"{
      "max_approval_coalesce_count":655360
   },
   "scheduler_params":"SchedulerParams"{
      "group_rotation_frequency":131072,
      "paras_availability_period":83951616,
      "max_validators_per_core":"None",
      "lookahead":100663296,
      "num_cores":1476395008,
      "max_availability_timeouts":50331650,
      "on_demand_queue_max_size":1493172224,
      "on_demand_target_queue_utilization":0%,
      "on_demand_fee_variability":50.331648%,
      "on_demand_base_fee":2658455997759531942378630683232829440,
      "ttl":134217728
   }
}

On Kusama we had different values there which made the API run faster.

alexggh commented 1 week ago

I didn't quite get why Kusama survived? The same migration is there as well.

Most likely it matters what the garbage is; on Kusama the API still returns, and you can test it with chopsticks on Kusama.

On Polkadot it gets stuck because of those really high values @sandreim posted here: https://github.com/paritytech/polkadot-sdk/issues/5738#issuecomment-2358156679; lookahead and num_cores translate to a lot of loops.
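
To put rough numbers on that, here is an illustrative sketch (not the real scheduler code linked above): even a single unit of work per core and per lookahead slot amounts to roughly 1.5e17 iterations with the corrupted Polkadot values, which effectively never completes, while the different Kusama values happened to let the API return.

// Illustrative only: not the real scheduler logic.
fn claim_queue_iterations(num_cores: u64, lookahead: u64) -> u64 {
    num_cores.saturating_mul(lookahead)
}

fn main() {
    // Values taken from the HostConfiguration dump earlier in this thread.
    let n = claim_queue_iterations(1_476_395_008, 100_663_296);
    println!("~{n} iterations"); // ~1.5e17, effectively a hang
}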

Polkadot-Forum commented 1 week ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2024-09-17-polkadot-finality-lag-slow-parachain-production-immediately-after-runtime-upgrade-post-mortem/10057/1

alexggh commented 4 days ago

Closing this now; the root cause has been found and the follow-up items have their own issues: