paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
8.39k stars 2.65k forks source link

Finality stalled #5093

Closed xlc closed 4 years ago

xlc commented 4 years ago

https://telemetry.polkadot.io/#list/Acala%20Mandala%20TC2

Running latest Acala, which uses 29cee59229626644c549e36e7921fca42bed68da

sentry_1       | 2020-03-01 11:01:32.393 main-tokio- INFO sc_informant  Reorg from #72155,0x6197…62b2 to #72155,0x5694…605e, common ancestor #72154,0xda43…2a00
sentry_1       | 2020-03-01 11:01:32.393 main-tokio- INFO substrate  Imported #72155 (0x5694…605e)
validator-1_1  | 2020-03-01 11:01:32.577 main-tokio- INFO sc_informant  Reorg from #72155,0x6197…62b2 to #72155,0x5694…605e, common ancestor #72154,0xda43…2a00
validator-1_1  | 2020-03-01 11:01:32.577 main-tokio- INFO substrate  Imported #72155 (0x5694…605e)
validator-2_1  | 2020-03-01 11:01:32.587 main-tokio- INFO sc_informant  Reorg from #72155,0x6197…62b2 to #72155,0x5694…605e, common ancestor #72154,0xda43…2a00
validator-2_1  | 2020-03-01 11:01:32.587 main-tokio- INFO substrate  Imported #72155 (0x5694…605e)
sentry_1       | 2020-03-01 11:01:33.213 main-tokio- INFO substrate  Idle (53 peers), best: #72155 (0x5694…605e), finalized #24937 (0x08d2…66d9), ⬇ 11.2kiB/s ⬆ 22.6kiB/s
sentry_1       |
sentry_1       | ====================
sentry_1       |
sentry_1       | Version: 0.3.1-83c13dc-x86_64-linux-gnu
sentry_1       |
sentry_1       |    0: sp_panic_handler::set::{{closure}}
sentry_1       |    1: std::panicking::rust_panic_with_hook
sentry_1       |              at src/libstd/panicking.rs:475
sentry_1       |    2: std::panicking::begin_panic
sentry_1       |    3: frame_executive::Executive<System,Block,Context,UnsignedValidator,AllModules>::execute_block
sentry_1       |    4: <acala_runtime::Runtime as sp_api::runtime_decl_for_Core::Core<sp_runtime::generic::block::Block<sp_runtime::generic::header::Header<u32,sp_runtime::traits::BlakeTwo256>,sp_runtime::generic::unchecked_extrinsic::UncheckedExtrinsic<<pallet_indices::Module<acala_runtime::Runtime> as sp_runtime::traits::StaticLookup>::Source,acala_runtime::Call,sp_runtime::MultiSignature,(frame_system::CheckVersion<acala_runtime::Runtime>, frame_system::CheckGenesis<acala_runtime::Runtime>, frame_system::CheckEra<acala_runtime::Runtime>, frame_system::CheckNonce<acala_runtime::Runtime>, frame_system::CheckWeight<acala_runtime::Runtime>, orml_oracle::CheckOperator<acala_runtime::Runtime>, module_accounts::ChargeTransactionPayment<acala_runtime::Runtime>)>>>>::execute_block
sentry_1       |    5: sp_api::runtime_decl_for_Core::execute_block_native_call_generator::{{closure}}
sentry_1       |    6: std::panicking::try::do_call
sentry_1       |    7: __rust_maybe_catch_panic
sentry_1       |              at src/libpanic_unwind/lib.rs:78
sentry_1       |    8: std::thread::local::LocalKey<T>::with
sentry_1       |    9: std::thread::local::LocalKey<T>::with
sentry_1       |   10: sp_state_machine::StateMachine<B,H,N,Exec>::execute_aux
sentry_1       |   11: sp_state_machine::StateMachine<B,H,N,Exec>::execute_using_consensus_failure_handler
sentry_1       |   12: <sc_client::call_executor::LocalCallExecutor<B,E> as sc_client_api::call_executor::CallExecutor<Block>>::contextual_call
sentry_1       |   13: <sc_client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
sentry_1       |   14: sp_api::runtime_decl_for_Core::execute_block_call_api_at
sentry_1       |   15: sp_api::Core::execute_block
sentry_1       |   16: <&sc_client::client::Client<B,E,Block,RA> as sp_consensus::block_import::BlockImport<Block>>::import_block
sentry_1       |   17: <sc_finality_grandpa::import::GrandpaBlockImport<BE,Block,Client,SC> as sp_consensus::block_import::BlockImport<Block>>::import_block
sentry_1       |   18: <sc_consensus_babe::BabeBlockImport<Block,Client,Inner> as sp_consensus::block_import::BlockImport<Block>>::import_block
sentry_1       |   19: sp_consensus::import_queue::import_single_block
sentry_1       |   20: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
sentry_1       |   21: futures_util::future::future::chain::Chain<Fut1,Fut2,Data>::poll
sentry_1       |   22: <futures_util::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
sentry_1       |   23: futures_executor::thread_pool::PoolState::work
sentry_1       |   24: std::sys_common::backtrace::__rust_begin_short_backtrace
sentry_1       |   25: std::panicking::try::do_call
sentry_1       |   26: __rust_maybe_catch_panic
sentry_1       |              at src/libpanic_unwind/lib.rs:78
sentry_1       |   27: core::ops::function::FnOnce::call_once{{vtable.shim}}
sentry_1       |   28: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
sentry_1       |              at rustc/f3e1a954d2ead4e2fc197c7da7d71e6c61bad196/src/liballoc/boxed.rs:1022
sentry_1       |   29: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
sentry_1       |              at rustc/f3e1a954d2ead4e2fc197c7da7d71e6c61bad196/src/liballoc/boxed.rs:1022
sentry_1       |       std::sys_common::thread::start_thread
sentry_1       |              at src/libstd/sys_common/thread.rs:13
sentry_1       |       std::sys::unix::thread::Thread::new::thread_start
sentry_1       |              at src/libstd/sys/unix/thread.rs:80
sentry_1       |   30: start_thread
sentry_1       |   31: clone
sentry_1       |
sentry_1       |
sentry_1       | Thread 'import-queue-worker-0' panicked at 'Transaction has a bad signature', /root/.cargo/git/checkouts/substrate-7e08433d4c370a21/29cee59/frame/executive/src/lib.rs:272
sentry_1       |
sentry_1       | This is a bug. Please report it at:
sentry_1       |
sentry_1       |    https://github.com/AcalaNetwork/Acala/issues
sentry_1       |
sentry_1       | 2020-03-01 11:01:33.552 import-queue-worker-0 WARN sc_client::client  Block prepare storage changes error:
sentry_1       | Execution(RuntimePanicked("Transaction has a bad signature"))

I believe the bad signature tx is related to our offchain worker that sending unsigned transaction

https://github.com/AcalaNetwork/Acala/blob/83c13dc2843603958e63f3230645e66220ea7844/modules/cdp_engine/src/lib.rs#L358

This happens on both sentry node and validator nodes.

Log with -l afg=trace

https://gist.github.com/xlc/6b52a0dc4686e704fe5b4f7783a6c83c

Related: https://github.com/AcalaNetwork/Acala/issues/135

bkchr commented 4 years ago

This is not related to Grandpa/Finality.

Looks like your unsigned validation code is not deterministic.

xlc commented 4 years ago

Are you saying this panic is unrelated to grandpa? In that case I have two issues...

How is it possible to have non-determinasic code? The offchain worker code does use a random value from offchain worker API. But the unsigned transaction itself should be deterministic.

xlc commented 4 years ago

Also the panic is constantly happening on a sentry node in our cloud server. But I can't reproduce it with local setup...

bkchr commented 4 years ago

The panic is on block import. That is not related to grandpa.

Is that only happening for one sentry? Or all?

I did not check your code and your validation function.

xlc commented 4 years ago

I only have log access to one sentry, which constantly have this issue. I managed to reproduce it locally once but not again. The validator connecting the troublesome sentry occasionally have this error as well.

I will ask for our validating partners to check their logs.

xlc commented 4 years ago

OK. With some dirty code modifying in cargo cache to insert some additional loggings, I found out 5 out of 8 had prevote and the threshold is 6. Now I know which 5 have voted and know where to check for issues.

It will be good if those information can be exposed somehow (logging or RPC)

xlc commented 4 years ago

Turn out to be some p2p issue. Two of the validators are synced, but their grandpa votes were never received by others. I extract the key out and run put it into a locally run validator and fixes the finality issue.

I think we need better diagnostic about grandpa. Will open another issue for that.

Still no idea about the panic.. Will investigate it bit more and post new issue when we have more solid question.

bkchr commented 4 years ago

If you can not import a block, you can not vote for it.