paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.81k stars 661 forks source link

Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error. #4934

Closed Anastasiia-Khab closed 2 months ago

Anastasiia-Khab commented 3 months ago

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

Noticed node failure on Polkadot and Kusama with Error:

Version: 1.14.0 0: sp_panic_handler::set::{{closure}} 1: std::panicking::rust_panic_with_hook 2: std::panicking::begin_panic::{{closure}} 3: std::sys_common::backtrace::__rust_end_short_backtrace 4: std::panicking::begin_panic 5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink>::poll_ready 6: futures_util::future::either::if_std::<impl futures_io::if_std::AsyncWrite for futures_util::future::either::Either<A,B>>::poll_write 7: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec>>::poll_ready 8: <libp2p_noise::io::Output as futures_io::if_std::AsyncWrite>::poll_write 9: <yamux::frame::io::Io as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready 10: yamux::connection::Connection::poll_next_inbound 11: <libp2p_core::muxing::boxed::Wrap as libp2p_core::muxing::StreamMuxer>::poll 12: <libp2p_core::muxing::boxed::Wrap as libp2p_core::muxing::StreamMuxer>::poll 13: libp2p_swarm::connection::Connection::poll 14: libp2p_swarm::connection::pool::task::new_for_established_connection::{{closure}} 15: <tracing_futures::Instrumented as core::future::future::Future>::poll 16: tokio::runtime::task::raw::poll 17: tokio::runtime::scheduler::multi_thread::worker::Context::run_task 18: tokio::runtime::scheduler::multi_thread::worker::run 19: tokio::runtime::task::raw::poll 20: std::sys_common::backtrace::rust_begin_short_backtrace 21: core::ops::function::FnOnce::call_once{{vtable.shim}} 22: std::sys::pal::unix::thread::Thread::new::thread_start 23: start_thread 24: clone3 Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/quicksink-0.1.2/src/lib.rs:158 This is a bug. Please report it at: https://github.com/paritytech/polkadot-sdk/issues/new

Steps to reproduce

No response

bkchr commented 3 months ago

Upstream issue: https://github.com/libp2p/rust-libp2p/issues/5471

CC @paritytech/networking

paulormart commented 3 months ago

We are experiencing the same on all our relay westend/kusama/polkadot RPC/Bootnodes since the upgrade to v1.14.0

Jul 05 00:21:30 polkadot-rpc-1 polkadot[7113]: 2024-07-05 00:21:30 🏆 Imported #21506881 (0xe25e…f3d2 → 0xd681…0ebf)
Jul 05 00:21:33 polkadot-rpc-1 polkadot[7113]: 2024-07-05 00:21:33 💤 Idle (43 peers), best: #21506881 (0xd681…0ebf), finalized #21506879 (0xdbe0…86e2), ⬇ 1005.4kiB/s ⬆ 1.3MiB/s
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]: ====================
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]: Version: 1.14.0-364961445b7
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    0: sp_panic_handler::set::{{closure}}
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    1: std::panicking::rust_panic_with_hook
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    2: std::panicking::begin_panic::{{closure}}
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    4: std::panicking::begin_panic
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    6: futures_util::future::either::if_std::<impl futures_io::if_std::AsyncWrite for futures_util::future::either::Either<A,B>>::poll_write
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    7: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_ready
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    8: <libp2p_noise::io::Output<T> as futures_io::if_std::AsyncWrite>::poll_write
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:    9: <yamux::frame::io::Io<T> as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   10: yamux::connection::Connection<T>::poll_next_inbound
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   11: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   12: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   13: libp2p_swarm::connection::Connection<THandler>::poll
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   14: libp2p_swarm::connection::pool::task::new_for_established_connection::{{closure}}
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   15: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   16: tokio::runtime::task::raw::poll
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   17: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   18: tokio::runtime::scheduler::multi_thread::worker::run
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   19: tokio::runtime::task::raw::poll
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   20: std::sys_common::backtrace::__rust_begin_short_backtrace
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   21: core::ops::function::FnOnce::call_once{{vtable.shim}}
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   22: std::sys::pal::unix::thread::Thread::new::thread_start
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   23: <unknown>
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:   24: <unknown>
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]: Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/quicksink-0.1.2/src/lib.rs:158
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]: This is a bug. Please report it at:
Jul 05 00:21:36 polkadot-rpc-1 polkadot[7113]:         https://github.com/paritytech/polkadot-sdk/issues/new
Jul 05 00:21:36 polkadot-rpc-1 systemd[1]: polkadot-rpc-1.service: Main process exited, code=exited, status=1/FAILURE
Jul 05 00:21:36 polkadot-rpc-1 systemd[1]: polkadot-rpc-1.service: Failed with result 'exit-code'.
Jul 05 00:21:36 polkadot-rpc-1 systemd[1]: polkadot-rpc-1.service: Consumed 37min 38.635s CPU time.
Jul 05 00:22:06 polkadot-rpc-1 systemd[1]: polkadot-rpc-1.service: Scheduled restart job, restart counter is at 9.
Jul 05 00:22:07 polkadot-rpc-1 systemd[1]: Started polkadot-rpc-1.service - Polkadot RPC (Archive).
Jul 05 00:22:07 polkadot-rpc-1 polkadot[7500]: 2024-07-05 00:22:07 Parity Polkadot
Jul 05 00:22:07 polkadot-rpc-1 polkadot[7500]: 2024-07-05 00:22:07 ✌️  version 1.14.0-364961445b7
Jul 05 00:22:07 polkadot-rpc-1 polkadot[7500]: 2024-07-05 00:22:07 ❤️  by Parity Technologies <admin@parity.io>, 2017-2024
Jul 05 00:22:07 polkadot-rpc-1 polkadot[7500]: 2024-07-05 00:22:07 📋 Chain specification: Polkadot
Jul 05 00:22:07 polkadot-rpc-1 polkadot[7500]: 2024-07-05 00:22:07 🏷  Node name: turboflakes-rpc-1

Are there any particular flags in the logs we should add to help troubleshoot this?

Below is the substrate_sub_libp2p_peers_count and we can clearly see the frequency of crashes the node acting as a bootnode experienced over the last 48h.

Screenshot 2024-07-05 at 10 36 52

lexnv commented 3 months ago

I was able to reproduce this after approx 34h running libp2p backend:

====================

Version: 1.13.0-4f46b6e9ffd

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
   2: std::panicking::begin_panic::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: std::panicking::begin_panic
   5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
   6: <rw_stream_sink::RwStreamSink<S> as futures_io::if_std::AsyncWrite>::poll_write
   7: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_ready
   8: <libp2p_noise::io::Output<T> as futures_io::if_std::AsyncWrite>::poll_write
   9: <yamux::frame::io::Io<T> as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready
  10: yamux::connection::Connection<T>::poll_next_inbound
  11: <libp2p_yamux::Muxer<C> as libp2p_core::muxing::StreamMuxer>::poll
  12: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
  13: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
  14: libp2p_swarm::connection::pool::task::new_for_established_connection::{{closure}}
  15: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
  16: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
  17: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  18: std::panicking::try
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  21: tokio::runtime::scheduler::multi_thread::worker::Context::run
  22: tokio::runtime::context::set_scheduler
  23: tokio::runtime::context::runtime::enter_runtime
  24: tokio::runtime::scheduler::multi_thread::worker::run
  25: tokio::runtime::task::core::Core<T,S>::poll
  26: tokio::runtime::task::harness::Harness<T,S>::poll
  27: std::sys_common::backtrace::__rust_begin_short_backtrace
  28: core::ops::function::FnOnce::call_once{{vtable.shim}}
  29: std::sys::pal::unix::thread::Thread::new::thread_start
  30: <unknown>
  31: <unknown>

Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.', /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quicksink-0.1.2/src/lib.rs:158

This is a bug. Please report it at:

        https://github.com/paritytech/polkadot-sdk/issues/new

Will have a look into it

Curu24 commented 3 months ago

Hi, getting the same error on some Polkadot and Kusama nodes, but also on parachains like Bridgehub-Polkadot

Jul 04 16:09:39 polkadot-rpc1 polkadot[636343]: 2024-07-04 16:09:39 💤 Idle (18 peers), best: #21502031 (0x2103…8bc7), finalized #21502027 (0x9e4d…abd4), ⬇ 866.9kiB/s ⬆ 983.9kiB/s
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: ====================
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: Version: 1.14.0-364961445b7
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    0: sp_panic_handler::set::{{closure}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    1: std::panicking::rust_panic_with_hook
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    2: std::panicking::begin_panic::{{closure}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    4: std::panicking::begin_panic
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    6: futures_util::future::either::if_std::<impl futures_io::if_std::AsyncWrite for futures_util::future::either::Either<A,B>>::poll_write
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    7: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_ready
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    8: <libp2p_noise::io::Output<T> as futures_io::if_std::AsyncWrite>::poll_write
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    9: <yamux::frame::io::Io<T> as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   10: yamux::connection::Connection<T>::poll_next_inbound
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   11: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   12: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   13: libp2p_swarm::connection::Connection<THandler>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   14: libp2p_swarm::connection::pool::task::new_for_established_connection::{{closure}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   15: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   16: tokio::runtime::task::raw::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   17: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   18: tokio::runtime::scheduler::multi_thread::worker::run
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   19: tokio::runtime::task::raw::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   20: std::sys_common::backtrace::__rust_begin_short_backtrace
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   21: core::ops::function::FnOnce::call_once{{vtable.shim}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   22: std::sys::pal::unix::thread::Thread::new::thread_start
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   23: <unknown>
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   24: <unknown>
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/quicksink-0.1.2/src/lib.rs:158
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: This is a bug. Please report it at:
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:         https://github.com/paritytech/polkadot-sdk/issues/new
Jul 04 16:09:41 polkadot-rpc1 systemd[1]: polkadot.service: Main process exited, code=exited, status=1/FAILURE
Curu24 commented 3 months ago

Hi, getting the same error on some Polkadot and Kusama nodes, but also on parachains like Bridgehub-Polkadot

Jul 04 16:09:39 polkadot-rpc1 polkadot[636343]: 2024-07-04 16:09:39 💤 Idle (18 peers), best: #21502031 (0x2103…8bc7), finalized #21502027 (0x9e4d…abd4), ⬇ 866.9kiB/s ⬆ 983.9kiB/s
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: ====================
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: Version: 1.14.0-364961445b7
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    0: sp_panic_handler::set::{{closure}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    1: std::panicking::rust_panic_with_hook
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    2: std::panicking::begin_panic::{{closure}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    4: std::panicking::begin_panic
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    5: <quicksink::SinkImpl<S,F,T,A,E> as futures_sink::Sink<A>>::poll_ready
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    6: futures_util::future::either::if_std::<impl futures_io::if_std::AsyncWrite for futures_util::future::either::Either<A,B>>::poll_write
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    7: <libp2p_noise::io::framed::NoiseFramed<T,S> as futures_sink::Sink<&alloc::vec::Vec<u8>>>::poll_ready
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    8: <libp2p_noise::io::Output<T> as futures_io::if_std::AsyncWrite>::poll_write
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:    9: <yamux::frame::io::Io<T> as futures_sink::Sink<yamux::frame::Frame<()>>>::poll_ready
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   10: yamux::connection::Connection<T>::poll_next_inbound
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   11: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   12: <libp2p_core::muxing::boxed::Wrap<T> as libp2p_core::muxing::StreamMuxer>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   13: libp2p_swarm::connection::Connection<THandler>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   14: libp2p_swarm::connection::pool::task::new_for_established_connection::{{closure}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   15: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   16: tokio::runtime::task::raw::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   17: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   18: tokio::runtime::scheduler::multi_thread::worker::run
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   19: tokio::runtime::task::raw::poll
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   20: std::sys_common::backtrace::__rust_begin_short_backtrace
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   21: core::ops::function::FnOnce::call_once{{vtable.shim}}
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   22: std::sys::pal::unix::thread::Thread::new::thread_start
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   23: <unknown>
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:   24: <unknown>
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: Thread 'tokio-runtime-worker' panicked at 'SinkImpl::poll_ready called after error.', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/quicksink-0.1.2/src/lib.rs:158
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]: This is a bug. Please report it at:
Jul 04 16:09:40 polkadot-rpc1 polkadot[636343]:         https://github.com/paritytech/polkadot-sdk/issues/new
Jul 04 16:09:41 polkadot-rpc1 systemd[1]: polkadot.service: Main process exited, code=exited, status=1/FAILURE
lexnv commented 3 months ago

The tentative fix published at: https://github.com/libp2p/rust-libp2p/pull/5482

The panic is happening because the quicksink crate panics when polled after an error. The crate has long been unmaintained, and the latest version of libp2p has decided to inline the crate. (our current 0.52.4 still has quicksink as dependency in libp2p)

The fix is wrapping the quicksink and ensures that io::ErrorKind::Other is returned after further polling on errors. This can be done directly at the quicksink level, however I'd to include this in a point release on top of 0.52.4, which requires other patches.

This needs further testing to confirm everything is ok

lexnv commented 3 months ago

Triaging the fix over the weekend

Count      | Level      | Triage report
3356       | warn       | Notification block pinning limit reached. Unpinning block with hash = .*
506        | warn       | 💔 Error importing block .*: .*
314        | warn       | Re-finalized block #.* \(.*\) in the canonical chain, current best finalized is #.*
124        | warn       | Report .*: .* to .*. Reason: .*. Banned, disconnecting.
2          | warn       | Can't listen on .* because: .*
2          | warn       | Failed to run the random write disk benchmark: .*
2          | warn       | Failed to run the sequential write disk benchmark: .*
1          | error      | 🥩 Error: .*. Restarting voter.
1          | warn       | ❌ Error while dialing .*: .*

No panics where encountered