paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/
1.91k stars 704 forks source link

network/sync: Panic on unexpected generic response protocol #6581

Open lexnv opened 23 hours ago

lexnv commented 23 hours ago

Kusama validator is panics on:

https://github.com/paritytech/polkadot-sdk/blob/3906c578c96d97a8a099a4bdac4685acbe375a7c/substrate/client/network/sync/src/strategy/chain_sync.rs#L640-L646

From the log line it looks like the sync engine no longer takes into account the legacy request-response protocol name.

Related PR:

Version deployed: version 1.16.1-ca8beaed148

Logs

Grafana link.

2024-11-21 13:37:36.937  WARN tokio-runtime-worker sync: Unexpected generic response protocol /ksmcc3/sync/2, strategy key StrategyKey("ChainSync")
====================
Version: 1.16.1-ca8beaed148
   0: sp_panic_handler::panic_hook
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:170:18
      sp_panic_handler::set::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:63:12
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/alloc/src/boxed.rs:2084:9
      std::panicking::rust_panic_with_hook
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:808:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:667:13
   3: std::sys::backtrace::__rust_end_short_backtrace
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/sys/backtrace.rs:168:18
   4: rust_begin_unwind
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/std/src/panicking.rs:665:5
   5: core::panicking::panic_fmt
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:74:14
   6: core::panicking::panic
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panicking.rs:148:5
   7: <sc_network_sync::strategy::chain_sync::ChainSync<B,Client> as sc_network_sync::strategy::SyncingStrategy<B>>::on_generic_response
   8: <sc_network_sync::strategy::polkadot::PolkadotSyncingStrategy<B,Client> as sc_network_sync::strategy::SyncingStrategy<B>>::on_generic_response
   9: sc_network_sync::engine::SyncingEngine<B,Client>::process_response_event
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/network/sync/src/engine.rs:978:5
  10: sc_network_sync::engine::SyncingEngine<B,Client>::run::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/network/sync/src/engine.rs:551:6
  11: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
             at /builds/parity/mirrors/polkadot-sdk/substrate/client/service/src/task_manager/prometheus_future.rs:61:3
  12: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/panic/unwind_safe.rs:297:9
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.31/src/future/future/catch_unwind.rs:37:42
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once

cc @paritytech/networking

lexnv commented 19 hours ago

Warp request protocols and state requests protocols are initializing their handler supporting 2 protocol names:

https://github.com/paritytech/polkadot-sdk/blob/bf20a9ee18f7215210bbbabf79e955c8c35b3360/substrate/client/network/sync/src/block_request_handler.rs#L89-L90

They are supporting both /{}/{}/sync/2 (new genesis-based format) and /{}/sync/2 (legacy format).

We are only checking the new genesis-based format in chain sync

dmitry-markin commented 2 hours ago

We are only checking the new genesis-based format in chain sync

This is fine, as the main protocol name is always used by substrate to refer to the protocol. Legacy protocol name is only used on the wire.

lexnv commented 1 hour ago

Oki doki, thanks for the info 🙏

I believe in that case this is the proper fix:

Looking over the litep2p code, we always provide the negotiated protocol (regardless if it is fallback or main):

let handle1 =  ConfigBuilder::new(ProtocolName::from("/protocol/1/improved"))
            .with_max_size(1024usize)
            .with_fallback_names(vec![ProtocolName::from("/protocol/1")])
            .build();

let handle2 = handle2) = RequestResponseConfig::new(
        ProtocolName::from("/protocol/1") ...;

handle1
        .send_request(peer2, vec![1, 3, 3, 7], DialOptions::Reject);

handle2.next().await.unwrap(),
        RequestResponseEvent::RequestReceived {
            peer: peer1,
            fallback: None, ...):

handle2.send_response(request_id, vec![1, 3, 3, 8]);
    assert_eq!(
        handle1.next().await.unwrap(),
        RequestResponseEvent::ResponseReceived {
            peer: peer2,
            request_id,
            response: vec![1, 3, 3, 8],
            fallback: Some(ProtocolName::from("/protocol/1")),
        }
    );

Step 1: Initialization

Step 2: handle 1 -> handle 2 request

Step3: handle 1 receives message for fallback

Then into substrate, we parse the received requests as follows:

https://github.com/paritytech/polkadot-sdk/blob/7c5224cb01710d0c14c87bf3463cc79e49b3e7b5/substrate/client/network/src/litep2p/shim/request_response/mod.rs#L528-L530

And we forward the protocol fallback if any: https://github.com/paritytech/polkadot-sdk/blob/7c5224cb01710d0c14c87bf3463cc79e49b3e7b5/substrate/client/network/src/litep2p/shim/request_response/mod.rs#L342