paritytech / polkadot

Polkadot Node Implementation
GNU General Public License v3.0
7.13k stars 1.58k forks

Westend validators crash with v0.9.38 update #6759

Closed aardbol closed 1 year ago

aardbol commented 1 year ago

Node: validator
DB: RocksDB
OS: official polkadot container image in Kubernetes

westend-validator-d-0 log:

2023-02-22 09:08:46.023 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-22 09:08:46.023 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("chain-selection-subsystem"))
2023-02-22 09:08:46.023 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
2023-02-22 09:08:46.023 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="pvf-checker-subsystem" err=FromOrigin { origin: "pvf-checker", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-22 09:08:46.024 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-02-22 09:08:46.023 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-22 09:08:46.081 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-22 09:09:46.024 ERROR                 main sc_cli::runner: Detected running(potentially stalled) tasks on shutdown:
2023-02-22 09:09:46.024 ERROR                 main sc_cli::runner: Task "candidate-backing-subsystem" (Group: candidate-backing) was still running after waiting 60 seconds to finish.
Error:
   0: Other: Essential task failed.

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                 ⋮ 1 frame hidden ⋮
   2: polkadot::main::h87a6103b330d8302
      at <unknown source file>:<unknown line>
   3: std::sys_common::backtrace::__rust_begin_short_backtrace::h3487996e71d20555
      at <unknown source file>:<unknown line>
   4: main<unknown>
      at <unknown source file>:<unknown line>
   5: __libc_start_main<unknown>
      at <unknown source file>:<unknown line>
   6: _start<unknown>
      at <unknown source file>:<unknown line>

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
Run with RUST_BACKTRACE=full to include source snippets.

westend-validator-d-1 log:

2023-02-20 15:08:12.037  INFO tokio-runtime-worker substrate: ✨ Imported #14726419 (0x8984…53c8)
2023-02-20 15:08:12.410 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-20 15:08:12.410  WARN tokio-runtime-worker parachain::availability-store: err=ContextChannelClosed
2023-02-20 15:08:12.411 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-20 15:08:12.411 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="pvf-checker-subsystem" err=FromOrigin { origin: "pvf-checker", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-20 15:08:12.413 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="availability-distribution-subsystem" err=FromOrigin { origin: "availability-distribution", source: IncomingMessageChannel(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-20 15:08:12.411 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
2023-02-20 15:08:12.410 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("chain-selection-subsystem"))
2023-02-20 15:08:12.410 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-20 15:08:12.413 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
2023-02-20 15:08:12.413 ERROR tokio-runtime-worker parachain::bitfield-distribution: Failed to receive a message from Overseer, exiting err=Generated(Context("Signal channel is terminated and empty."))
2023-02-20 15:08:12.411 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-20 15:08:12.411 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-20 15:08:12.413 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="availability-recovery-subsystem" err=FromOrigin { origin: "availability-recovery", source: Generated(Context("Signal channel is terminated and empty.")) }
2023-02-20 15:08:12.411 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="statement-distribution-subsystem" err=FromOrigin { origin: "statement-distribution", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2023-02-20 15:08:12.411  WARN tokio-runtime-worker grandpa: Encountered error finding best chain containing 0x78a622e30babf7894c1ce7fc7058e144d33fd61fd9b3678bed8a568a673d308b: couldn't find target block: Request for leaves from chain selection got canceled
2023-02-20 15:08:12.414  WARN tokio-runtime-worker grandpa: Could not cast prevote: previously known block 0x78a622e30babf7894c1ce7fc7058e144d33fd61fd9b3678bed8a568a673d308b has disappeared
2023-02-20 15:08:12.414 ERROR tokio-runtime-worker grandpa: GRANDPA voter error: safety invariant has been violated: `voter_commands_rx` was closed.
2023-02-20 15:08:12.414 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `grandpa-voter` failed. Shutting down service.
2023-02-20 15:09:12.413 ERROR                 main sc_cli::runner: Detected running(potentially stalled) tasks on shutdown:
2023-02-20 15:09:12.413 ERROR                 main sc_cli::runner: Task "provisioner-subsystem" (Group: provisioner) was still running after waiting 60 seconds to finish.
2023-02-20 15:09:12.413 ERROR                 main sc_cli::runner: Task "candidate-backing-subsystem" (Group: candidate-backing) was still running after waiting 60 seconds to finish.
Error:
   0: Other: Essential task failed.

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                 ⋮ 1 frame hidden ⋮
   2: polkadot::main::h87a6103b330d8302
      at <unknown source file>:<unknown line>
   3: std::sys_common::backtrace::__rust_begin_short_backtrace::h3487996e71d20555
      at <unknown source file>:<unknown line>
   4: main<unknown>
      at <unknown source file>:<unknown line>
   5: __libc_start_main<unknown>
      at <unknown source file>:<unknown line>
   6: _start<unknown>
      at <unknown source file>:<unknown line>

Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.
Run with RUST_BACKTRACE=full to include source snippets.
bkchr commented 1 year ago

Link to Loki: https://grafana.parity-mgmt.parity.io/explore?orgId=1&left=%7B%22datasource%22:%22P03E52D76DFE188C3%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22P03E52D76DFE188C3%22%7D,%22editorMode%22:%22builder%22,%22expr%22:%22%7Bchain%3D%5C%22westend%5C%22,%20node%3D%5C%22westend-validator-d-node-0%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-1h%22,%22to%22:%22now%22%7D%7D

eskimor commented 1 year ago

Looking at the logs, it apparently keeps crashing. Ok this is really bad. Does it work fine again when downgraded to 0.9.37?

@tdimitrov What are you using on Versi right now? If something older, please put some nodes to 0.9.38 so we can see whether this is reproducible on Versi.

eskimor commented 1 year ago

It does not seem to happen on all nodes - right? There were problem reports on Kusama, but no crashes so far.

Did the crashing start immediately with the upgrade or only later?

aardbol commented 1 year ago

The crashing happened immediately, but there are moments when the nodes run fine for a while and then crash later. For example, validator-d-0 keeps crashing very frequently, while nodes 1, 2 and 3 have kept running since today; I expect them to crash again later today, just as they did yesterday. So the overall crash frequency is much lower.

Only Westend validators seem to be affected. Versi nodes are running v0.9.37 as we speak.

eskimor commented 1 year ago

Also interesting: Rococo is also on 0.9.38 and seems to be running fine.

What about those bootnode errors on Westend? Nodes should not crash because of them, but fixing those bootnodes might still be a good idea.

eskimor commented 1 year ago

We don't seem to have debug logs on those Westend nodes for the parachain target - can we add that please?

aardbol commented 1 year ago

We don't seem to have debug logs on those Westend nodes for the parachain target - can we add that please?

Can you give me specifics on what to configure?

tdimitrov commented 1 year ago

@tdimitrov What are you using on Versi right now? If something older, please put some nodes to 0.9.38 so we can see whether this is reproducible on Versi.

I haven't seen this issue on Versi. I deployed 4b0ae4e8 a few hours ago, which is from Feb 21st. v0.9.38 is 72309a2b2e68413305a56dce1097041309bd29c6, which is from Feb 16th, so Versi is recent enough?

eskimor commented 1 year ago

Add this to logLevels: "parachain=debug". Example in the Versi deployment:

https://gitlab.parity.io/parity/infrastructure/parity-testnet/-/blob/master/versi/validators/helmfile.yaml#L22
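For nodes run outside that helm setup, the same effect can usually be achieved with the node's log flag or environment variable. A minimal sketch, assuming a standard polkadot binary; the chain and validator flags are illustrative, only the `parachain=debug` directive comes from the comment above:

```shell
# Enable debug logging for the parachain log target via the CLI flag
# (other flags shown here are placeholders for your existing setup):
polkadot --validator --chain westend --log parachain=debug

# Equivalently, via the environment without changing the command line:
RUST_LOG=parachain=debug polkadot --validator --chain westend
```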

aardbol commented 1 year ago

MR for enabling parachain debug logs

zedelson commented 1 year ago

Hi, I'm running into this on Rococo as well, running a collator with v0.9.37 configured locally.

aardbol commented 1 year ago

Update:

So, except for node 0 of the d group, all other nodes are running stable on v0.9.38 now for some reason, although that's no guarantee they will remain stable.

parachain debug logs have been enabled for a while now but are not providing any more useful details at first sight.

aardbol commented 1 year ago

The problem seems to be solved in v0.9.39

Edit: Nope, that's not the case. #0 is still crashing.

eskimor commented 1 year ago

There is absolutely nothing useful in parachain debug logs: https://grafana.parity-mgmt.parity.io/goto/u5-bBIf4z?orgId=1

I think we should also enable debug logs for substrate, @bkchr what targets make the most sense to enable?
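The crash logs above already show Substrate-side targets such as grandpa and sc_service emitting messages, so as a hedged illustration (the target list is an assumption drawn from those log lines, not a recommendation from this thread), enabling extra Substrate debug logs could look like:

```shell
# Illustrative only: raise log levels for a few Substrate targets that appear
# in the crash logs (grandpa, sc_service), alongside the existing parachain=debug.
polkadot --validator --chain westend \
  --log parachain=debug,grandpa=debug,sc_service=debug
```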

bkchr commented 1 year ago

There is absolutely nothing useful in parachain debug logs:

Not really true :P https://grafana.parity-mgmt.parity.io/goto/P1HE4HB4z?orgId=1

2023-03-10 02:46:32.917 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
2023-03-10 02:46:32.916 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("chain-selection-subsystem"))

@aardbol do we have a restart timeout of 5 minutes after the node stopped?

aardbol commented 1 year ago

There is absolutely nothing useful in parachain debug logs:

Not really true :P https://grafana.parity-mgmt.parity.io/goto/P1HE4HB4z?orgId=1

2023-03-10 02:46:32.917 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
2023-03-10 02:46:32.916 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("chain-selection-subsystem"))

@aardbol do we have a restart timeout of 5 minutes after the node stopped?

No.

In other news, the validator nodes have been running stable for a while now. Versions v0.9.39 & v0.9.41 are active.

bkchr commented 1 year ago

Closing it for now, please re-open or create a new one if this happens again.