dmitry-markin opened this issue 7 months ago
I see a similar issue has already happened in the past: https://github.com/paritytech/polkadot-sdk/issues/3390
There’s no bandwidth in the bridges team to investigate this right now. I suggest someone from @paritytech/sdk-node take it. (Maybe they can fix it for good, as they have more gossip protocol experience.)
Extra data points here:
2024-04-05 01:21:55.019 ERROR tokio-runtime-worker sc_utils::mpsc: The number of unprocessed messages in channel `mpsc-notification-to-protocol` exceeded 100000.
The channel was created at:
0: sc_utils::mpsc::tracing_unbounded
1: sc_network::protocol::notifications::service::notification_service
2: sc_network::config::NonDefaultSetConfig::new
3: sc_consensus_beefy::communication::beefy_peers_set_config
4: polkadot_service::new_full
5: polkadot_service::build_full
6: polkadot_cli::command::run_node_inner::{{closure}}::{{closure}}
7: tokio::runtime::park::CachedParkThread::block_on
8: tokio::runtime::runtime::Runtime::block_on
9: polkadot_cli::command::run
10: polkadot::main
11: std::sys_common::backtrace::__rust_begin_short_backtrace
12: std::rt::lang_start::{{closure}}
13: std::rt::lang_start_internal
14: main
15: <unknown>
16: __libc_start_main
17: _start
Last message was sent from:
0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
1: alloc::vec::Vec<T,A>::retain
2: sc_network::protocol::notifications::service::ProtocolHandle::report_notification_received
3: sc_network::service::NetworkWorker<B,H>::handle_swarm_event
4: <sc_network::service::NetworkWorker<B,H> as sc_network::service::traits::NetworkBackend<B,H>>::run::{{closure}}
5: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
6: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
7: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
8: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
9: tokio::runtime::park::CachedParkThread::block_on
10: tokio::runtime::task::core::Core<T,S>::poll
11: tokio::runtime::task::harness::Harness<T,S>::poll
12: tokio::runtime::blocking::pool::Inner::run
13: std::sys_common::backtrace::__rust_begin_short_backtrace
14: core::ops::function::FnOnce::call_once{{vtable.shim}}
15: std::sys::pal::unix::thread::Thread::new::thread_start
16: <unknown>
17: <unknown>
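For context on the error itself: the clogged channel is the unbounded notification channel created via sc_utils::mpsc::tracing_unbounded, which logs exactly this ERROR once the number of queued, unprocessed messages crosses the warning threshold passed at creation time (100_000 here). A minimal sketch of that mechanism, assuming the tracing_unbounded(name, queue_size_warning) signature from recent polkadot-sdk versions (details approximate):

```rust
// Sketch only, assuming the tracing_unbounded(name, queue_size_warning) signature.
use sc_utils::mpsc::{tracing_unbounded, TracingUnboundedReceiver, TracingUnboundedSender};

fn notification_channel() -> (TracingUnboundedSender<Vec<u8>>, TracingUnboundedReceiver<Vec<u8>>) {
    // The second argument is the queue-size warning threshold: if the receiver
    // side (here, the BEEFY gossip side) is never polled, messages accumulate
    // and the "exceeded 100000" ERROR from the log above is emitted.
    tracing_unbounded("mpsc-notification-to-protocol", 100_000)
}
```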
Should be fixed by this: https://github.com/paritytech/polkadot-sdk/pull/4015
I think we can do even better and not have the gossip engine running at all while we are syncing/waiting for the pallet. The protocol should only be registered when the node is ready to handle the messages.
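A rough control-flow sketch of that idea (the helper names below are hypothetical, not the actual sc_consensus_beefy API): the gossip engine would only be created, and the protocol only actively consumed, once initial sync is done and the BEEFY pallet is available, so nothing can queue up beforehand.

```rust
// Hypothetical helpers for illustration only; not the real sc_consensus_beefy API.
struct GossipEngineHandle; // stand-in for sc_network_gossip::GossipEngine

async fn wait_for_initial_sync() { /* block until major sync finishes */ }
async fn wait_for_beefy_pallet() { /* block until the BEEFY pallet is detected in the runtime */ }
fn create_gossip_engine() -> GossipEngineHandle { GossipEngineHandle }
async fn run_voting_loop(_engine: GossipEngineHandle) { /* drive BEEFY voting + gossip */ }

async fn run_beefy_worker() {
    // Phase 1: no gossip engine (ideally no registered protocol either),
    // so incoming BEEFY notifications cannot pile up in an unpolled channel.
    wait_for_initial_sync().await;
    wait_for_beefy_pallet().await;

    // Phase 2: bring gossip up only once the node can actually process messages.
    let engine = create_gossip_engine();
    run_voting_loop(engine).await;
}
```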
I tried doing this at some point, but had problems with the tests (spawning multiple BEEFY workers in the same test): when the gossip engine was registered “later”, the workers would not see each other and gossip correctly. Again, this was only a test problem; in the real case, with a single BEEFY worker task per node instance, it worked fine.
I just realized that @dmitry-markin posted two error messages above :see_no_evil: So, yeah, my PR only solves the second one. For the first one we probably need to do what I said above about activating the BEEFY networking only when we need it.
I would be interested in taking a look at this when I have time. I will add myself to the issue, but if anyone else can prioritize it, please feel free to take it.
Looked a bit at this. I thought it was only about creating the GossipEngine at a later stage, but it's more complicated than that. We would need to register/activate the BEEFY gossip protocol after the network starts. This would involve some changes to the networking code.
Anyway, linking the branch with my experiments in order to not forget about it: https://github.com/serban300/polkadot-sdk/tree/beefy-gossip-lazy-init
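To make the “changes to the networking code” part concrete, here is one hypothetical shape such an API could take. sc_network does not expose anything like this today; this is purely an illustration of the direction, not an existing interface:

```rust
// Purely hypothetical trait; sc_network has no dynamic protocol registration today.
use std::pin::Pin;
use futures::Stream;

type ProtocolName = String;  // stand-in for sc_network's protocol name type
type Notification = Vec<u8>; // raw notification payload

trait DynamicNotificationProtocols {
    /// Register a notification protocol after the network worker has started,
    /// returning the stream of notifications received on it.
    fn register_notification_protocol(
        &self,
        name: ProtocolName,
        max_notification_size: u64,
    ) -> Pin<Box<dyn Stream<Item = Notification> + Send>>;

    /// Remove the protocol again, e.g. if BEEFY gossip is no longer needed.
    fn deregister_notification_protocol(&self, name: ProtocolName);
}
```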
Hey @serban300 - any new findings? I see a similar issue when running a subtensor node on version v1.1.4.
Hi @alpay-falconx! No news since the last update. I didn't have the bandwidth to prioritize this issue.
When warp-syncing a kusama node I hit mpsc-notification-to-protocol-2-beefy clogging just after gap sync was finished. In 15 mins mpsc_beefy_gossip_validator was also clogged. Complete logs: beefy_clogging.txt. The node was started as:
./polkadot -d=. --pruning=1000 --sync=warp --chain=kusama 2> stderr.txt
CC @acatangiu