paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
8.39k stars 2.65k forks source link

`try-runtime-cli` panics on Polkadot #14006

Closed ggwpez closed 1 year ago

ggwpez commented 1 year ago

On commit Polkadot 3595684b0293439c32dfe563008958c7d0091289 ran:

cargo b -r --bin polkadot  --features try-runtime

RUST_LOG=runtime=debug,remote-ext=trace target/release/polkadot  try-runtime --detailed-log-output --chain polkadot-dev --runtime target/release/wbuild/polkadot-runtime/polkadot_runtime.wasm on-runtime-upgrade --checks=pre-and-post live --uri wss://polkadot-try-runtime-node.parity-chains.parity.io:443

Truncated output:

DEBUG ThreadId(47) remote-ext: Remaining payloads: 36003 Batch request size: 2435    
 INFO main remote-ext: inserting keys progress = 3% [50000 / 1843038]    
DEBUG ThreadId(35) remote-ext: Remaining payloads: 8902 Batch request size: 11787    
DEBUG ThreadId(51) remote-ext: Batch request failed, trying again with smaller batch size. Networking or low-level protocol error: The request body was too large    
DEBUG ThreadId(51) remote-ext: Remaining payloads: 24471 Batch request size: 2679    
DEBUG ThreadId(45) remote-ext: Remaining payloads: 8902 Batch request size: 11787    
DEBUG ThreadId(36) remote-ext: Remaining payloads: 8902 Batch request size: 11787    
 INFO main remote-ext: inserting keys progress = 5% [100000 / 1843038]    
DEBUG ThreadId(32) remote-ext: Remaining payloads: 8902 Batch request size: 11787    
 INFO main remote-ext: inserting keys progress = 8% [150000 / 1843038]    
 INFO main remote-ext: inserting keys progress = 11% [200000 / 1843038]    
 INFO main remote-ext: inserting keys progress = 14% [250000 / 1843038]    
DEBUG ThreadId(48) remote-ext: Remaining payloads: 58582 Batch request size: 1215    
 INFO main remote-ext: inserting keys progress = 16% [300000 / 1843038]    

====================

Version: 0.9.41-3595684b029

   0: sp_panic_handler::set::{{closure}}
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:2001:9
      std::panicking::rust_panic_with_hook
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:696:13
   2: std::panicking::begin_panic::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: std::panicking::begin_panic
   5: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   6: <futures_util::future::future::flatten::Flatten<Fut,<Fut as core::future::future::Future>::Output> as core::future::future::Future>::poll
   7: <futures_util::future::either::Either<A,B> as core::future::future::Future>::poll
   8: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   9: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
  10: hyper::client::client::Client<C,B>::retryably_send_request::{{closure}}
  11: <hyper::client::client::ResponseFuture as core::future::future::Future>::poll
  12: <tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll
  13: <jsonrpsee_http_client::client::HttpClient as jsonrpsee_core::client::ClientT>::batch_request::{{closure}}
  14: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  15: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  16: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  17: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  18: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  19: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  20: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  21: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  22: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  23: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  24: frame_remote_externalities::Builder<B>::get_storage_data_dynamic_batch_size::{{closure}}
  25: tokio::runtime::park::CachedParkThread::block_on
  26: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
  27: tokio::runtime::runtime::Runtime::block_on
  28: std::sys_common::backtrace::__rust_begin_short_backtrace
  29: core::ops::function::FnOnce::call_once{{vtable.shim}}
  30: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys/unix/thread.rs:108:17
  31: <unknown>
  32: <unknown>

Thread '<unnamed>' panicked at 'dispatch dropped without returning error', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329

This is a bug. Please report it at:

        https://github.com/paritytech/polkadot/issues/new

Need this for testing a change. Full log from the CI.

ggwpez commented 1 year ago

It is doing 94 recursions of get_storage_data_dynamic_batch_size on Westend. Should maybe re-write this to use a loop to prevent an overflow.

kianenigma commented 1 year ago

Seems like some of the parallelization added in #12537 is causing more issues than helping :(

ggwpez commented 1 year ago

Seems like some of the parallelization added in https://github.com/paritytech/substrate/pull/12537 is causing more issues than helping :(

Yea not sure how much it helped with the speed, came up in the review here https://github.com/paritytech/substrate/pull/13923#discussion_r1171035228

But the panic is not from your parallel change, but rather also #13923. We should add a test that it does not break in Polkadot :sweat_smile:

liamaharon commented 1 year ago

Hey thanks for raising this.

I had been testing try-runtime against a local Polkadot node on my laptop without any issues, I wasn't aware a CI env existed that needs testing against. I'm looking into this now.

niklasad1 commented 1 year ago

https://github.com/hyperium/hyper/issues/2112

Try creating a new client in the "batch thread" instead of cloning should fix that...

Here https://github.com/paritytech/substrate/blob/b4b21168bfa2d8fbde18b2efb9518124a2d70154/utils/frame/remote-externalities/src/lib.rs#L571 because it starts another tokio runtime in the new thread which might fuck up the hyper background thread/task because it is started on another runtime ^^

ggwpez commented 1 year ago

Yea this still fails, so i cant test https://github.com/paritytech/substrate/pull/13417

niklasad1 commented 1 year ago

The same error after #14030? Then we need what I wrote in the PR

ggwpez commented 1 year ago

Yea looks like the same error

BT

```pre 2023-04-28 08:51:43.840 DEBUG ThreadId(48) remote-ext: Remaining payloads: 57361 Batch request size: 1336 2023-04-28 08:51:44.178 INFO main remote-ext: inserting keys progress = 24% [450000 / 1842910] ==================== Version: 0.9.41-3ec08c0ac34 0: sp_panic_handler::set::{{closure}} 1: as core::ops::function::Fn>::call at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:2001:9 std::panicking::rust_panic_with_hook at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:696:13 2: std::panicking::begin_panic::{{closure}} 3: std::sys_common::backtrace::__rust_end_short_backtrace 4: std::panicking::begin_panic 5: as core::future::future::Future>::poll 6: ::Output> as core::future::future::Future>::poll 7: as core::future::future::Future>::poll 8: as core::future::future::Future>::poll 9: as core::future::future::Future>::poll 10: hyper::client::client::Client::retryably_send_request::{{closure}} 11: ::poll 12: as core::future::future::Future>::poll 13: ::batch_request::{{closure}} 14: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 15: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 16: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 17: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 18: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 19: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 20: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 21: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 22: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 23: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 24: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 25: frame_remote_externalities::Builder::get_storage_data_dynamic_batch_size::{{closure}} 26: tokio::runtime::park::CachedParkThread::block_on 27: tokio::runtime::scheduler::multi_thread::MultiThread::block_on 28: tokio::runtime::runtime::Runtime::block_on 29: std::sys_common::backtrace::__rust_begin_short_backtrace 30: core::ops::function::FnOnce::call_once{{vtable.shim}} 31: as core::ops::function::FnOnce>::call_once at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9 as core::ops::function::FnOnce>::call_once at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs:1987:9 std::sys::unix::thread::Thread::new::thread_start at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/sys/unix/thread.rs:108:17 32: 33: Thread '' panicked at 'dispatch dropped without returning error', /home/ubuntu/.cargo/registry/src/github.com-1ecc6299db9ec823/hyper-0.14.20/src/client/conn.rs:329 ```

Now after a re-run it does work. so probably something flaky…

liamaharon commented 1 year ago

Thanks for the logs. I'm going to prioritise refactoring this to remove the redundant multi-threading, which should resolve this issue.

ggwpez commented 1 year ago

It also happens with TRY_RUNTIME_MAX_THREADS=1 :man_shrugging:

niklasad1 commented 1 year ago

No, the problem is that the runtime that created client is closed when the background task tries to perform the RPC call ^^

The number of threads shouldn't matter