Kintsugi RPC node crashes and then hangs after restart

jonathanudd commented 1 year ago

Describe the bug I have deployed a Kintsugi RPC node which is still not fully synced. It syncs for a while and then the process crashes and when it's restarted it hangs for a long time before starting to sync again.

To Reproduce Steps to reproduce the behavior:

Download binary from releases in this project
Start the service with the following arguments --chain=kintsugi --execution wasm --wasm-execution=compiled --rpc-methods=safe --state-pruning=archive --ws-external --ws-port=9966 --ws-max-connections=1000 --rpc-external --rpc-port=9933 --rpc-cors=all --prometheus-external
Let it run for a while until it crashes. Let it restart and then wait for a time until it starts syncing again. a. The wait time before it starts to sync again have been between 1 to 12 hours.
I've seen this error a lot before crashes [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))

And things similar to this

Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] Essential task `overseer` failed. Shutting down service.
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
Jun 11 23:00:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 23:00:21 Detected running(potentially stalled) tasks on shutdown:
Jun 11 23:00:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 23:00:21 Task "cumulus-consensus" was still running after waiting 60 seconds to finish.
Jun 11 23:00:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 23:00:21 Task "txpool-notifications" (Group: transaction-pool) was still running after waiting 60 seconds to finish.

The last crash though contained more info and is maybe more related to your project

Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: ====================
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: Version: 1.23.2-33fabb23fae
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    0: sp_panic_handler::set::{{closure}}
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/alloc/src/boxed.rs:2032:9
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       std::panicking::rust_panic_with_hook
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/panicking.rs:692:13
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    2: std::panicking::begin_panic_handler::{{closure}}
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/panicking.rs:579:13
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/sys_common/backtrace.rs:137:18
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    4: rust_begin_unwind
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/panicking.rs:575:5
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    5: core::panicking::panic_fmt
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/panicking.rs:64:14
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    6: core::panicking::panic_display
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/panicking.rs:135:5
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    7: core::panicking::panic_str
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/panicking.rs:119:5
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    8: core::option::expect_failed
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/option.rs:1923:5
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    9: core::option::Option<T>::expect
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/option.rs:785:21
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       <std::time::Instant as core::ops::arith::Add<core::time::Duration>>::add
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/time.rs:408:9
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   10: tokio::time::interval::Interval::poll_tick
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   11: libp2p_mdns::behaviour::timer::tokio::<impl futures_core::stream::Stream for libp2p_mdns::behaviour::timer::Timer<tokio::time::interval::Interval>>::poll_next
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   12: <libp2p_mdns::behaviour::Behaviour<P> as libp2p_swarm::behaviour::NetworkBehaviour>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   13: <sc_network::discovery::DiscoveryBehaviour as libp2p_swarm::behaviour::NetworkBehaviour>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   14: <sc_network::behaviour::Behaviour<B,Client> as libp2p_swarm::behaviour::NetworkBehaviour>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   15: libp2p_swarm::Swarm<TBehaviour>::poll_next_event
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   16: <sc_network::service::NetworkWorker<B,H,Client> as core::future::future::Future>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   17: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   18: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   19: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   20: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   21: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   22: tokio::runtime::context::BlockingRegionGuard::block_on
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   23: tokio::runtime::handle::Handle::block_on
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   24: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   25: tokio::runtime::task::core::Core<T,S>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   26: tokio::runtime::task::harness::Harness<T,S>::poll
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   27: tokio::runtime::blocking::pool::Inner::run
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   28: std::sys_common::backtrace::__rust_begin_short_backtrace
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   29: core::ops::function::FnOnce::call_once{{vtable.shim}}
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   30: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/alloc/src/boxed.rs:2000:9
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/alloc/src/boxed.rs:2000:9
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       std::sys::unix::thread::Thread::new::thread_start
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/sys/unix/thread.rs:108:17
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   31: <unknown>
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   32: <unknown>
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: Thread 'tokio-runtime-worker' panicked at 'overflow when adding duration to instant', library/std/src/time.rs:408
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: This is a bug. Please report it at:
Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:         https://github.com/interlay/interbtc/issues/new

Expected behavior The process not crashing and it starting to sync quicker after a crash.

Version / Commit (e.g. 1.0.0): I started out with version version 1.20.0-47bfef0d8b7 to avoid the issue mentioned here https://github.com/interlay/interbtc/issues/978 Then I upgraded to version 1.23.2-33fabb23fae and then version 1.24.0-rc5-76deccd2391 today.

Additional notes The node is almost in sync now. I could provide the whole log file from start until now but it's 262MB and cannot be attached here.

jonathanudd commented 1 year ago

It has not crashed for 24 hours now. Could it be related to how far it has synced? It's starting to get close to fully synced now.

jonathanudd commented 1 year ago

I can also see that the CPU usage is lower the last 24 hours. Could the crashes occur if the process gets limited CPU resources? Are there any recommended hardware specs to run the chain?

sander2 commented 1 year ago

Thank you for the report @jonathanudd . Point number 5 indicates that the problem likely originates upstream, would you be able to submit the report to substrate?

jonathanudd commented 1 year ago

Thank you for the report @jonathanudd . Point number 5 indicates that the problem likely originates upstream, would you be able to submit the report to substrate?

@sander2 They said that those problems should be fixed in later versions of Polkadot. See link to issue above.

interlay / interbtc

Kintsugi RPC node crashes and then hangs after restart #1088