Closed by aardbol 1 year ago
Looking at the logs, it apparently keeps crashing. Ok this is really bad. Does it work fine again when downgraded to 0.9.37?
@tdimitrov What are you using on Versi right now? If something older, please put some nodes to 0.9.38 so we can see whether this is reproducible on Versi.
It does not seem to happen on all nodes - right? There were problem reports on Kusama, but no crashes so far.
Did the crashing start immediately with the upgrade or only later?
The crashing happened immediately, but there are moments when the nodes run fine for a while and then crash later. For example, validator-d-0 keeps crashing very frequently, but as of today nodes 1, 2 and 3 have kept running for a while; I expect them to crash again later today, just like they did yesterday. So the frequency is much lower.
Only Westend validators seem to be affected. Versi nodes are running 0.9.37 as we speak.
Also interesting: Rococo is also on 0.9.38 and seems to be running fine.
What about those bootnode errors on Westend? Nodes should not crash because of them, but fixing those bootnodes might still be a good idea.
We don't seem to have debug logs on those Westend nodes for the parachain target - can we add that please?
Can you give me specifics on what to configure?
@tdimitrov What are you using on Versi right now? If something older, please put some nodes to 0.9.38 so we can see whether this is reproducible on Versi.
I haven't seen this issue on Versi. I deployed 4b0ae4e8 a few hours ago, which is from Feb 21st. v0.9.38 is 72309a2b2e68413305a56dce1097041309bd29c6, which is from Feb 16th, so Versi is recent enough?
Add "parachain=debug" to logLevels; example in the Versi deployment:
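A minimal sketch of what that values entry might look like, assuming a node.logLevels list in the Versi deployment values (the surrounding key layout is an assumption; only the "parachain=debug" entry comes from this thread):

```yaml
# Sketch only: the key structure around logLevels is assumed, not taken from
# the actual Versi deployment; the thread only specifies adding "parachain=debug".
node:
  logLevels:
    - "parachain=debug"
```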
Hi - I'm running into this on Rococo as well, running a collator with v0.9.37 configured locally.
Update:
So, except for node 0 of the d group, all other nodes are now running stably on v0.9.38 for some reason, although that's no guarantee they will remain stable.
Parachain debug logs have been enabled for a while now but are not providing any more useful details at first sight.
The problem seems to be solved in v0.9.39.
Edit: Nope, that's not the case. #0 is still crashing.
There is absolutely nothing useful in parachain debug logs: https://grafana.parity-mgmt.parity.io/goto/u5-bBIf4z?orgId=1
I think we should also enable debug logs for Substrate. @bkchr, what targets make the most sense to enable?
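If Substrate targets do get enabled, they would presumably go into the same logLevels list; a sketch, where the extra targets are purely illustrative and not a recommendation from this thread:

```yaml
# Illustrative only: "sync" and "db" are example Substrate log targets;
# the thread never settles on which targets to enable.
node:
  logLevels:
    - "parachain=debug"
    - "sync=debug"
    - "db=debug"
```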
There is absolutely nothing useful in parachain debug logs:
Not really true :P https://grafana.parity-mgmt.parity.io/goto/P1HE4HB4z?orgId=1
2023-03-10 02:46:32.917 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
2023-03-10 02:46:32.916 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("chain-selection-subsystem"))
@aardbol do we have a restart timeout of 5 minutes after the node stopped?
No.
In other news, the validator nodes have been running stable for a while now. Versions v0.9.39 & v0.9.41 are active.
Closing it for now, please re-open or create a new one if this happens again.
Node: validator
DB: RocksDB
OS: official Polkadot container image in Kubernetes
westend-validator-d-0 log:
westend-validator-d-1 log: