paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.62k stars 564 forks source link

Kusama: Beefy gossip validator is processing messages too slowly #3390

Closed bkchr closed 4 months ago

bkchr commented 4 months ago

Validators on Kusama report the following:

Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]: 2024-02-19 13:42:50 The number of unprocessed messages in channel `mpsc_beefy_gossip_validator` exceeded 10000.
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]: The channel was created at:
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    0: sc_consensus_beefy::start_beefy_gadget::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    1: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    2: tokio::runtime::task::raw::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    3: std::sys_common::backtrace::__rust_begin_short_backtrace
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    4: core::ops::function::FnOnce::call_once{{vtable.shim}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    5: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    6: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    7: std::sys::unix::thread::Thread::new::thread_start
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    8: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    9: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:  Last message was sent from:
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    1: <sc_consensus_beefy::communication::gossip::GossipValidator<B> as sc_network_gossip::validator::Validator<B>>::validate
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    2: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    3: sc_consensus_beefy::wait_for_runtime_pallet::{{closure}}::{{closure}}::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    4: sc_consensus_beefy::start_beefy_gadget::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    5: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    6: tokio::runtime::task::raw::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    7: std::sys_common::backtrace::__rust_begin_short_backtrace
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    8: core::ops::function::FnOnce::call_once{{vtable.shim}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    9: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   11: std::sys::unix::thread::Thread::new::thread_start
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   12: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   13: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:
acatangiu commented 4 months ago

I am guessing this is a non-issue, 1000 validators gossiping, multiple messages getting enqueued without being processed during node restart. Maybe we should increase the limit for this warning on Kusama to 100k instead of 10k.

Besides seeing this shortly after restart, are you experiencing any other issues?

Also, is your node finalizing BEEFY blocks? do you see logs like INFO tokio-runtime-worker beefy: 🥩 Round #21946827 concluded? Or you can also check RPC against your node: RPC::beefy::getFinalizedHead().

paulormart commented 4 months ago

From my side, beside seeing similar message after a restart, all seems to be good, it's finalizing BEEFY blocks:

polkadot[208721]: 2024-02-19 14:42:47 🥩 Round #21946707 concluded, finality_proof: V1(SignedCommitment { ...

burdges commented 4 months ago

Why? If its CPU usage, we do not check the BLS signature when gossiping, right?

acatangiu commented 4 months ago

Why? If its CPU usage, we do not check the BLS signature when gossiping, right?

happens only during node restart while gossiped messages are piling up during restart (BEEFY voter task starts late in the process while network gossip subsystem starts early).

Once node restart process is complete and BEEFY voter/worker task gets crunching, it seems it consumes the pending gossip messages and carries on nicely.

From my side, beside seeing similar message after a restart, all seems to be good, it's finalizing BEEFY blocks

For completeness of sanity checks, please also check node RAM usage over time. The bad scenario to invalidate here is that the gossip messages keep piling up faster than being consumed - in which case the RAM usage would reflect this very visibly over a, say, 24h window.

kureus commented 4 months ago

As Paulo is, seeing finalised BEEFY blocks

Feb 19 16:45:55 ovh-adv2-fireproof polkadot[437166]: 2024-02-19 16:45:55 🥩 Round #21947934 concluded, finality_proof: V1(SignedCommitment { commitment: Commitment { payload: ...

dbpatty commented 4 months ago

I had the same issue with the v1.7.1 and not only at the restart of the node

yayoi-k commented 4 months ago

On both KSM nodes I updated, I received the error "The number of unprocessed messages in channelmpsc_beefy_gossip_validatorexceeded 10000." upon startup of the node after installing v1.7.1.

stakeworld commented 4 months ago

For completeness of sanity checks, please also check node RAM usage over time. The bad scenario to invalidate here is that the gossip messages keep piling up faster than being consumed - in which case the RAM usage would reflect this very visibly over a, say, 24h window.

Ram & cpu seem ok but network usage has doubled since the activation of beefy

burdges commented 4 months ago

happens only during node restart while gossiped messages are piling up during restart (BEEFY voter task starts late in the process while network gossip subsystem starts early).

GRANDPA was designed so you only need to pay attention to the current round. In principle BEEFY should continue this, so maybe this requires some look ahead that realizes older messsages could now be discarded. Alternatively BEEFY rounds could simply be run less frequently, so then even though this happens it winds up irrelevant.

network usage has doubled since the activation of beefy

Do we know if BEEFY caused this? I'd think network usage doubling sounds more like async backing, but not really sure.

stakeworld commented 4 months ago

network usage has doubled since the activation of beefy

Do we know if BEEFY caused this? I'd think network usage doubling sounds more like async backing, but not really sure.

You are right, a lot of things happening recently (runtime upgrade, new version, beefy), so it could also be the async backing kicking in.

acatangiu commented 4 months ago

Fixed by #3435 and will be released with node version 1.8.0.