interlay / interbtc

interBTC: Bitcoin Anywhere
https://app.interlay.io/
Apache License 2.0
247 stars 83 forks source link

Kintsugi RPC node crashes and then hangs after restart #1088

Closed jonathanudd closed 2 weeks ago

jonathanudd commented 1 year ago

Describe the bug I have deployed a Kintsugi RPC node which is still not fully synced. It syncs for a while and then the process crashes and when it's restarted it hangs for a long time before starting to sync again.

To Reproduce Steps to reproduce the behavior:

  1. Download binary from releases in this project
  2. Start the service with the following arguments --chain=kintsugi --execution wasm --wasm-execution=compiled --rpc-methods=safe --state-pruning=archive --ws-external --ws-port=9966 --ws-max-connections=1000 --rpc-external --rpc-port=9933 --rpc-cors=all --prometheus-external
  3. Let it run for a while until it crashes. Let it restart and then wait for a time until it starts syncing again. a. The wait time before it starts to sync again have been between 1 to 12 hours.
  4. I've seen this error a lot before crashes [Relaychain] error=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
  5. And things similar to this
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] err=Subsystem(Generated(Context("Signal channel is terminated and empty.")))
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] Overseer exited with error err=Generated(SubsystemStalled("availability-store-subsystem"))
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="candidate-validation-subsystem" err=FromOrigin { origin: "candidate-validation", source: Generated(Context("Signal channel is terminated and empty.")) }
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] Essential task `overseer` failed. Shutting down service.
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="approval-voting-subsystem" err=FromOrigin { origin: "approval-voting", source: Generated(Context("Signal channel is terminated and empty.")) }
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="chain-api-subsystem" err=FromOrigin { origin: "chain-api", source: Generated(Context("Signal channel is terminated and empty.")) }
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="runtime-api-subsystem" err=Generated(Context("Signal channel is terminated and empty."))
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
    Jun 11 22:59:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 22:59:21 [Relaychain] subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Signal channel is terminated and empty.")) }
    Jun 11 23:00:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 23:00:21 Detected running(potentially stalled) tasks on shutdown:
    Jun 11 23:00:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 23:00:21 Task "cumulus-consensus" was still running after waiting 60 seconds to finish.
    Jun 11 23:00:21 juju-7be0f9-1 polkadot[673149]: 2023-06-11 23:00:21 Task "txpool-notifications" (Group: transaction-pool) was still running after waiting 60 seconds to finish.
  6. The last crash though contained more info and is maybe more related to your project
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: ====================
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: Version: 1.23.2-33fabb23fae
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    0: sp_panic_handler::set::{{closure}}
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/alloc/src/boxed.rs:2032:9
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       std::panicking::rust_panic_with_hook
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/panicking.rs:692:13
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    2: std::panicking::begin_panic_handler::{{closure}}
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/panicking.rs:579:13
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/sys_common/backtrace.rs:137:18
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    4: rust_begin_unwind
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/panicking.rs:575:5
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    5: core::panicking::panic_fmt
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/panicking.rs:64:14
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    6: core::panicking::panic_display
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/panicking.rs:135:5
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    7: core::panicking::panic_str
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/panicking.rs:119:5
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    8: core::option::expect_failed
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/option.rs:1923:5
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:    9: core::option::Option<T>::expect
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/core/src/option.rs:785:21
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       <std::time::Instant as core::ops::arith::Add<core::time::Duration>>::add
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/time.rs:408:9
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   10: tokio::time::interval::Interval::poll_tick
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   11: libp2p_mdns::behaviour::timer::tokio::<impl futures_core::stream::Stream for libp2p_mdns::behaviour::timer::Timer<tokio::time::interval::Interval>>::poll_next
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   12: <libp2p_mdns::behaviour::Behaviour<P> as libp2p_swarm::behaviour::NetworkBehaviour>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   13: <sc_network::discovery::DiscoveryBehaviour as libp2p_swarm::behaviour::NetworkBehaviour>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   14: <sc_network::behaviour::Behaviour<B,Client> as libp2p_swarm::behaviour::NetworkBehaviour>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   15: libp2p_swarm::Swarm<TBehaviour>::poll_next_event
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   16: <sc_network::service::NetworkWorker<B,H,Client> as core::future::future::Future>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   17: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   18: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   19: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   20: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   21: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   22: tokio::runtime::context::BlockingRegionGuard::block_on
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   23: tokio::runtime::handle::Handle::block_on
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   24: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   25: tokio::runtime::task::core::Core<T,S>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   26: tokio::runtime::task::harness::Harness<T,S>::poll
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   27: tokio::runtime::blocking::pool::Inner::run
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   28: std::sys_common::backtrace::__rust_begin_short_backtrace
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   29: core::ops::function::FnOnce::call_once{{vtable.shim}}
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   30: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/alloc/src/boxed.rs:2000:9
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/alloc/src/boxed.rs:2000:9
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:       std::sys::unix::thread::Thread::new::thread_start
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:              at rustc/b70baa4f922a1809d79caeaeb902800c3be283b9/library/std/src/sys/unix/thread.rs:108:17
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   31: <unknown>
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:   32: <unknown>
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: Thread 'tokio-runtime-worker' panicked at 'overflow when adding duration to instant', library/std/src/time.rs:408
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]: This is a bug. Please report it at:
    Jun 12 12:41:54 juju-7be0f9-1 polkadot[709357]:         https://github.com/interlay/interbtc/issues/new

Expected behavior The process not crashing and it starting to sync quicker after a crash.

Version / Commit (e.g. 1.0.0): I started out with version version 1.20.0-47bfef0d8b7 to avoid the issue mentioned here https://github.com/interlay/interbtc/issues/978 Then I upgraded to version 1.23.2-33fabb23fae and then version 1.24.0-rc5-76deccd2391 today.

Additional notes The node is almost in sync now. I could provide the whole log file from start until now but it's 262MB and cannot be attached here.

jonathanudd commented 1 year ago

It has not crashed for 24 hours now. Could it be related to how far it has synced? It's starting to get close to fully synced now.

jonathanudd commented 1 year ago

I can also see that the CPU usage is lower the last 24 hours. Could the crashes occur if the process gets limited CPU resources? Are there any recommended hardware specs to run the chain?

sander2 commented 1 year ago

Thank you for the report @jonathanudd . Point number 5 indicates that the problem likely originates upstream, would you be able to submit the report to substrate?

jonathanudd commented 1 year ago

Thank you for the report @jonathanudd . Point number 5 indicates that the problem likely originates upstream, would you be able to submit the report to substrate?

@sander2 They said that those problems should be fixed in later versions of Polkadot. See link to issue above.