paritytech / substrate

Substrate: The platform for blockchain innovators
Apache License 2.0
8.39k stars 2.65k forks source link

Dangerously long stack trace when syncing a new node #8950

Closed koute closed 3 years ago

koute commented 3 years ago

I was doing some memory profiling over the weekend, and over the course of 60h of profiling I've got this stack trace that was 5763 frames long:

clone
start_thread
std::sys::unix::thread::Thread::new::thread_start
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
core::ops::function::FnOnce::call_once{{vtable.shim}}
std::sys_common::backtrace::__rust_begin_short_backtrace
tokio::runtime::context::enter
tokio::runtime::blocking::pool::Inner::run
tokio::runtime::task::harness::Harness<T,S>::poll
<std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
tokio::runtime::task::core::Core<T,S>::poll
futures_executor::local_pool::block_on
std::thread::local::LocalKey<T>::with
<tracing_futures::Instrumented<T> as core::future::future::Future>::poll
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
<sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
<futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
<core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
<sc_consensus_babe::BabeVerifier<Block,Client,SelectChain,CAW,CIDP> as sp_consensus::import_queue::Verifier<Block>>::verify::{{closure}}
sp_block_builder::BlockBuilder::check_inherents
sp_block_builder::runtime_decl_for_BlockBuilder::check_inherents_call_api_at
<sc_service::client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
<sc_service::client::call_executor::LocalCallExecutor<Block,B,E> as sc_client_api::call_executor::CallExecutor<Block>>::contextual_call
<kusama_runtime::RuntimeApiImpl<__SR_API_BLOCK__,RuntimeApiImplCall> as sp_api::Core<__SR_API_BLOCK__>>::Core_initialize_block_runtime_api_impl
sp_api::runtime_decl_for_Core::initialize_block_call_api_at
<sc_service::client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
<sc_service::client::call_executor::LocalCallExecutor<Block,B,E> as sc_client_api::call_executor::CallExecutor<Block>>::contextual_call
sp_state_machine::execution::StateMachine<B,H,N,Exec>::execute_using_consensus_failure_handler
sp_state_machine::execution::StateMachine<B,H,N,Exec>::execute_aux
<sc_executor::native_executor::NativeExecutor<D> as sp_core::traits::CodeExecutor>::call
sc_executor::wasm_runtime::RuntimeCache::with_instance
sc_executor::native_executor::WasmExecutor::with_instance::{{closure}}
std::thread::local::LocalKey<T>::with
std::panicking::try
sc_executor_common::wasm_runtime::WasmInstance::call_export
<sc_executor_wasmi::WasmiInstance as sc_executor_common::wasm_runtime::WasmInstance>::call
wasmi::func::FuncInstance::invoke
wasmi::runner::Interpreter::start_execution
wasmi::func::FuncInstance::invoke
<sc_executor_wasmi::FunctionExecutor as wasmi::host::Externals>::invoke_index
<<sp_io::storage::HostFunctions as sp_wasm_interface::HostFunctions>::host_functions::ExtStorageNextKeyVersion1 as sp_wasm_interface::Function>::execute
sp_io::storage::next_key_version_1
tracing::span::Span::in_scope
std::thread::local::LocalKey<T>::with
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
...repeats 5685 times...
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
<sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key
sp_state_machine::trie_backend_essence::TrieBackendEssence<S,H>::next_storage_key_from_root
<trie_db::triedb::TrieDBIterator<L> as trie_db::TrieIterator<L>>::seek
trie_db::iterator::TrieDBNodeIterator<L>::seek_prefix
trie_db::triedb::TrieDB<L>::get_raw_or_lookup
<sp_state_machine::trie_backend_essence::TrieBackendEssence<S,H> as hash_db::HashDBRef<H,alloc::vec::Vec<u8>>>::get
<sp_state_machine::trie_backend_essence::TrieBackendEssence<S,H> as hash_db::HashDB<H,alloc::vec::Vec<u8>>>::get
<sc_client_db::StorageDb<Block> as sp_state_machine::trie_backend_essence::Storage<<<Block as sp_runtime::traits::Block>::Header as sp_runtime::traits::Header>::Hashing>>::get
sc_state_db::StateDb<BlockHash,Key>::get
<sp_database::kvdb::DbAdapter<D> as sp_database::Database<H>>::get
<kvdb_rocksdb::Database as kvdb::KeyValueDB>::get
kvdb_rocksdb::Database::get
rocksdb::db::DB::get_pinned_cf_opt
rocksdb_get_pinned_cf
rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*) [clone .localalias]
rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
rocksdb::MemTableListVersion::Get(rocksdb::LookupKey const&, std::string*, std::string*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, unsigned long*, rocksdb::ReadOptions const&, rocksdb::ReadCallback*, bool*)
rocksdb::MemTableListVersion::GetFromList(std::__cxx11::list<rocksdb::MemTable*, std::allocator<rocksdb::MemTable*>>*, rocksdb::LookupKey const&, std::string*, std::string*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, unsigned long*, rocksdb::ReadOptions const&, rocksdb::ReadCallback*, bool*)
rocksdb::MemTable::Get(rocksdb::LookupKey const&, std::string*, std::string*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, unsigned long*, rocksdb::ReadOptions const&, rocksdb::ReadCallback*, bool*, bool)
rocksdb::MemTable::GetFromTable(rocksdb::LookupKey const&, unsigned long, bool, rocksdb::ReadCallback*, bool*, std::string*, std::string*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*)
rocksdb::(anonymous namespace)::SkipListRep::Get(rocksdb::LookupKey const&, void*, bool (*)(void*, char const*))
rocksdb::SaveValue(void*, char const*)
std::string::_M_replace(unsigned long, unsigned long, char const*, unsigned long)
std::string::_M_mutate(unsigned long, unsigned long, char const*, unsigned long)

The <sp_state_machine::ext::Ext<H,N,B> as sp_externalities::Externalities>::next_storage_key appears 5693 times in total in this stack trace; looks like unbound recursion.

This happened during a fresh sync of a new node on Kusama, with interpreted WASM execution, with freshly compiled polkadot from f0ce0ca1f8b03d5ad3b80ce66793f18bde21cd14.

Probably the same issue as https://github.com/paritytech/substrate/issues/4527 if I'd had to guess.

xlc commented 3 years ago

Maybe relates to https://github.com/paritytech/polkadot/issues/3053#issuecomment-848991368 as well. Something isn't right.

arkpar commented 3 years ago

Two things to fix here:

  1. Externalities::next_storage_key should not be recursive.
  2. It should not be linear time w.r.t. the overlay size. Looks like currently if there's a lot of None items in the overlay, it will iterate them all until it finds an existing key. Instead the overlay should probably maintain two separate BtreeMaps for existing and non-existing key.