paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Make subsystems more robust on runtime-apis calls taking long time/hanging #5818

Open alexggh opened 4 days ago

alexggh commented 4 days ago

The postmortem in https://github.com/paritytech/polkadot-sdk/issues/5738 showed that a node can crash and restart if a runtime API call hangs. The danger is that when one API hangs or takes a long time, the behaviour is the same on all nodes; in this case, all nodes crashed and restarted at the same time.

That's not good for the network, so we should explore ideas for reducing the blast radius. One possible method is to time out runtime API calls and make sure the subsystems gracefully handle this type of error.
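To make the timeout idea concrete, here is a minimal sketch using only the Rust standard library. It is not how the subsystems call the runtime today: `call_with_timeout` and `CallError` are hypothetical names invented for illustration, and a real implementation would more likely sit in the async code path. The point is only that the caller gets an error value it can handle instead of blocking forever.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch; not a polkadot-sdk type.
#[derive(Debug, PartialEq)]
enum CallError {
    /// The call did not finish within the allowed time.
    Timeout,
    /// The worker thread ended without producing a result (e.g. it panicked).
    WorkerDied,
}

/// Run `call` on a separate thread and wait at most `limit` for its result.
fn call_with_timeout<T, F>(call: F, limit: Duration) -> Result<T, CallError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails if the caller already gave up; that is fine here.
        let _ = tx.send(call());
    });
    match rx.recv_timeout(limit) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(CallError::Timeout),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(CallError::WorkerDied),
    }
}
```

A caller would match on `Err(CallError::Timeout)` and degrade (skip the work, retry later, log) rather than stall. Note that this sketch only stops waiting: the spawned thread keeps running after the timeout.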

One thing to take into consideration is that even if the subsystem call timed out, the runtime could still have that API running in the background and burning CPU time, so we need to make sure we gracefully cancel or kill tasks that are no longer needed.
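As a rough illustration of the cancellation concern, the sketch below pairs a deadline with a shared cancel flag, again using only the standard library. `long_job` and `run_with_cancel` are hypothetical stand-ins, not polkadot-sdk functions. This is cooperative cancellation: it only works if the job checks the flag between units of work. Whether a hanging runtime call can be interrupted this way is an open question that this issue does not settle.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical long-running job that checks a cancel flag between work units.
/// Returns the number of units completed before finishing or being cancelled.
fn long_job(cancel: &AtomicBool, units: u32) -> u32 {
    let mut done = 0;
    while done < units && !cancel.load(Ordering::Relaxed) {
        thread::sleep(Duration::from_millis(10)); // stand-in for one unit of work
        done += 1;
    }
    done
}

/// Run the job with a deadline.
/// `Ok(n)`: the job finished in time after `n` units.
/// `Err(n)`: the deadline passed; the job was cancelled after `n` units.
fn run_with_cancel(units: u32, limit: Duration) -> Result<u32, u32> {
    let cancel = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&cancel);
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let done = long_job(&flag, units);
        let _ = tx.send(done); // receiver may be gone; ignore the error
        done
    });
    match rx.recv_timeout(limit) {
        Ok(done) => Ok(done),
        Err(_) => {
            // Deadline passed: ask the worker to stop, then join it so we
            // know it is no longer consuming CPU. For simplicity this sketch
            // panics if the worker itself panicked.
            cancel.store(true, Ordering::Relaxed);
            Err(worker.join().expect("worker panicked"))
        }
    }
}
```

Joining after setting the flag is the important part: the caller does not just stop waiting, it confirms the background work has actually ended.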