paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/
1.78k stars 639 forks source link

litep2p: Kusama validator crashes repeatedly #5595

Open sandreim opened 1 week ago

sandreim commented 1 week ago

Version: 1.15.1-afab63e5a7b

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:220
This is a bug. Please report it at:
    https://github.com/paritytech/polkadot-sdk/issues/new
====================
   0: sp_panic_handler::panic_hook
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:166:18
      sp_panic_handler::set::{{closure}}
             at /builds/parity/mirrors/polkadot-sdk/substrate/primitives/panic-handler/src/lib.rs:62:12
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
      std::panicking::rust_panic_with_hook
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:651:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
   4: rust_begin_unwind
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
   5: core::panicking::panic_fmt
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
   6: core::panicking::panic
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:144:5
   7: litep2p::protocol::transport_service::TransportService::on_connection_closed
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:220:13
      <litep2p::protocol::transport_service::TransportService as futures_core::stream::Stream>::poll_next
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:381:42
   8: futures_util::stream::stream::StreamExt::poll_next_unpin
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/stream/stream/mod.rs:1638:9
      <futures_util::stream::stream::next::Next<St> as core::future::future::Future>::poll
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/stream/stream/next.rs:32:21
      litep2p::protocol::request_response::RequestResponseProtocol::run::{{closure}}::{{closure}}
             at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/macros/select.rs:524:49
      <tokio::future::poll_fn::PollFn<F> as core::future::future::Future>::poll

Full Logs: https://grafana.teleport.parity.io/goto/nlUPrk6SR?orgId=1

sandreim commented 1 week ago

CC @paritytech/networking

alexggh commented 1 week ago

It seems a few debug_asserts that are thrown: https://github.com/paritytech/litep2p/blob/master/src/protocol/transport_service.rs#L237C13-L237C25

alexggh commented 1 week ago

I think the deployed image from here did not have them stripped out: https://github.com/paritytech/devops/issues/3519#issuecomment-2329696367

dmitry-markin commented 1 week ago

For the reference, this is where it panics: https://github.com/paritytech/litep2p/blob/v0.6.2/src/protocol/transport_service.rs#L212-L222

let Some(context) = self.connections.get_mut(&peer) else {
    tracing::warn!(
        target: LOG_TARGET,
        ?peer,
        ?connection_id,
        "connection closed to a non-existent peer",
    );

    debug_assert!(false);
    return None;
};
lexnv commented 1 week ago

Investigating the panics, it seems indeed we are triggering debug_aserts. Indeed, we need to double-check our images are build with --release. These defensive failures are put in place to help us debug issues sooner, in this case, race-conditions and code expectations.

Issues

Connection closed for a nonexistent peer

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/protocol/transport_service.rs:220

Code path: https://github.com/paritytech/litep2p/blob/a27d0072913be67cce904e2e8c465cc37ec5223a/src/protocol/transport_service.rs#L220

Expecting different state for a dial failure report

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/transport/manager/mod.rs:748

Code path: https://github.com/paritytech/litep2p/blob/a27d0072913be67cce904e2e8c465cc37ec5223a/src/transport/manager/mod.rs#L748

Connection closed without primary and secondary connections

Thread 'tokio-runtime-worker' panicked at 'assertion failed: false', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/litep2p-0.6.2/src/transport/manager/mod.rs:858

Code path:

https://github.com/paritytech/litep2p/blob/a27d0072913be67cce904e2e8c465cc37ec5223a/src/transport/manager/mod.rs#L858

Next steps

dmitry-markin commented 1 week ago

We can also use "release with debug info" builds to simplify crash reports analysis. I.e., no debug assertions, but still debugging symbols. ~It doesn't look polkadot-sdk Cargo.toml has such build profile:~ https://github.com/paritytech/polkadot-sdk/blob/49a68132882e58872411c5c0278b13a008b3682b/Cargo.toml#L1371-L1389

EDIT: we don't strip the debug symbols in the release profile (no strip = true), so the debug symbols are included, of course.