dmitry-markin opened this issue 7 months ago
I see a similar issue has already happened in the past: https://github.com/paritytech/polkadot-sdk/issues/3390
There’s no bandwidth in the bridges team to investigate this right now. I suggest someone from @paritytech/sdk-node take it. (Maybe they can fix it for good, as they have more gossip protocol experience.)
Extra data points here:
2024-04-05 01:21:55.019 ERROR tokio-runtime-worker sc_utils::mpsc: The number of unprocessed messages in channel `mpsc-notification-to-protocol` exceeded 100000.
The channel was created at:
0: sc_utils::mpsc::tracing_unbounded
1: sc_network::protocol::notifications::service::notification_service
2: sc_network::config::NonDefaultSetConfig::new
3: sc_consensus_beefy::communication::beefy_peers_set_config
4: polkadot_service::new_full
5: polkadot_service::build_full
6: polkadot_cli::command::run_node_inner::{{closure}}::{{closure}}
7: tokio::runtime::park::CachedParkThread::block_on
8: tokio::runtime::runtime::Runtime::block_on
9: polkadot_cli::command::run
10: polkadot::main
11: std::sys_common::backtrace::__rust_begin_short_backtrace
12: std::rt::lang_start::{{closure}}
13: std::rt::lang_start_internal
14: main
15: <unknown>
16: __libc_start_main
17: _start
Last message was sent from:
0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
1: alloc::vec::Vec<T,A>::retain
2: sc_network::protocol::notifications::service::ProtocolHandle::report_notification_received
3: sc_network::service::NetworkWorker<B,H>::handle_swarm_event
4: <sc_network::service::NetworkWorker<B,H> as sc_network::service::traits::NetworkBackend<B,H>>::run::{{closure}}
5: sc_service::build_network_future::{{closure}}::{{closure}}::{{closure}}
6: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
7: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
8: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
9: tokio::runtime::park::CachedParkThread::block_on
10: tokio::runtime::task::core::Core<T,S>::poll
11: tokio::runtime::task::harness::Harness<T,S>::poll
12: tokio::runtime::blocking::pool::Inner::run
13: std::sys_common::backtrace::__rust_begin_short_backtrace
14: core::ops::function::FnOnce::call_once{{vtable.shim}}
15: std::sys::pal::unix::thread::Thread::new::thread_start
16: <unknown>
17: <unknown>
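For context on the error itself: the clogged channel is the unbounded notification channel created via sc_utils::mpsc::tracing_unbounded, which logs exactly this ERROR once the number of queued, unprocessed messages crosses the warning threshold passed at creation time (100_000 here). A minimal sketch of that mechanism, assuming the tracing_unbounded(name, queue_size_warning) signature from recent polkadot-sdk versions (details approximate):

```rust
// Sketch only, assuming the tracing_unbounded(name, queue_size_warning) signature.
use sc_utils::mpsc::{tracing_unbounded, TracingUnboundedReceiver, TracingUnboundedSender};

fn notification_channel() -> (TracingUnboundedSender<Vec<u8>>, TracingUnboundedReceiver<Vec<u8>>) {
    // The second argument is the queue-size warning threshold: if the receiver
    // side (here, the BEEFY gossip side) is never polled, messages accumulate
    // and the "exceeded 100000" ERROR from the log above is emitted.
    tracing_unbounded("mpsc-notification-to-protocol", 100_000)
}
```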
Should be fixed by this: https://github.com/paritytech/polkadot-sdk/pull/4015
I think we can do even better and not have the gossip engine running at all while we are syncing/waiting for the pallet. The protocol should only be registered when the node is ready to handle the messages.
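A rough control-flow sketch of that idea (the helper names below are hypothetical, not the actual sc_consensus_beefy API): the gossip engine would only be created, and the protocol only actively consumed, once initial sync is done and the BEEFY pallet is available, so nothing can queue up beforehand.

```rust
// Hypothetical helpers for illustration only; not the real sc_consensus_beefy API.
struct GossipEngineHandle; // stand-in for sc_network_gossip::GossipEngine

async fn wait_for_initial_sync() { /* block until major sync finishes */ }
async fn wait_for_beefy_pallet() { /* block until the BEEFY pallet is detected in the runtime */ }
fn create_gossip_engine() -> GossipEngineHandle { GossipEngineHandle }
async fn run_voting_loop(_engine: GossipEngineHandle) { /* drive BEEFY voting + gossip */ }

async fn run_beefy_worker() {
    // Phase 1: no gossip engine (ideally no registered protocol either),
    // so incoming BEEFY notifications cannot pile up in an unpolled channel.
    wait_for_initial_sync().await;
    wait_for_beefy_pallet().await;

    // Phase 2: bring gossip up only once the node can actually process messages.
    let engine = create_gossip_engine();
    run_voting_loop(engine).await;
}
```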
I tried doing this at some point, but had problems with the tests (spawning multiple BEEFY workers in the same test): when the gossip engine was registered “later”, the workers would not see each other and gossip correctly. Again, this was only a test problem; in the real case, with a single BEEFY worker task per node instance, it worked fine.
I just realized that @dmitry-markin posted two error messages above :see_no_evil: So, yeah, my PR only solves the second one. For the first one we probably need to do what I said above about activating the BEEFY networking only when we need it.
I would be interested in taking a look at this when I have time. I will add myself to the issue, but if anyone else can prioritize it, please feel free to take it.
Looked a bit at this. I thought it was only about creating the GossipEngine at a later stage, but it's more complicated than that. We would need to register/activate the BEEFY gossip protocol after the network starts. This would involve some changes to the networking code.
Anyway, linking the branch with my experiments in order to not forget about it: https://github.com/serban300/polkadot-sdk/tree/beefy-gossip-lazy-init
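To make the “changes to the networking code” part concrete, here is one hypothetical shape such an API could take. sc_network does not expose anything like this today; this is purely an illustration of the direction, not an existing interface:

```rust
// Purely hypothetical trait; sc_network has no dynamic protocol registration today.
use std::pin::Pin;
use futures::Stream;

type ProtocolName = String;  // stand-in for sc_network's protocol name type
type Notification = Vec<u8>; // raw notification payload

trait DynamicNotificationProtocols {
    /// Register a notification protocol after the network worker has started,
    /// returning the stream of notifications received on it.
    fn register_notification_protocol(
        &self,
        name: ProtocolName,
        max_notification_size: u64,
    ) -> Pin<Box<dyn Stream<Item = Notification> + Send>>;

    /// Remove the protocol again, e.g. if BEEFY gossip is no longer needed.
    fn deregister_notification_protocol(&self, name: ProtocolName);
}
```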
Hey @serban300 - any new findings? I see a similar issue when running a subtensor node on version v1.1.4.
Hi @alpay-falconx! No news since the last update. I didn't have the bandwidth to prioritize this issue.
When warp-syncing a kusama node I hit mpsc-notification-to-protocol-2-beefy clogging just after gap sync was finished. In 15 mins mpsc_beefy_gossip_validator was also clogged. Complete logs: beefy_clogging.txt. The node was started as:
./polkadot -d=. --pruning=1000 --sync=warp --chain=kusama 2> stderr.txt
CC @acatangiu