opentensor / subtensor

Bittensor Blockchain Layer
The Unlicense
132 stars 147 forks source link

Mainnet nodes not syncing #635

Closed abbaahmad closed 1 month ago

abbaahmad commented 1 month ago

Describe the bug

2 Mainnet nodes stopped syncing at almost the same time. Both are reporting the same errors in their logs:

2024-07-15 09:06:19 StakeRemoved( hotkey:5e512a3020b53706fefeecb6b4e22d31deff7215492e467a752b0f5322704e12 (5ECNUAzQ...), stake_to_be_removed:331655568 )
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: Hash: given=67d89a3c0c0a79038d9247853f62c406133a34e1a0753da7bfc4fcfb3f7c0c08, expected=d4eebe06db6e37283cf2e4cb66240059379340621542e527fe0b72f1267a056f
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: ====================
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: Version: 4.0.0-dev-b4ca34b6845
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    0: sp_panic_handler::set::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2021:9
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:       std::panicking::rust_panic_with_hook
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:783:13
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    2: std::panicking::begin_panic_handler::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:649:13
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    3: std::sys_common::backtrace::__rust_end_short_backtrace
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys_common/backtrace.rs:170:18
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    4: rust_begin_unwind
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    5: core::panicking::panic_fmt
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    6: frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::final_checks
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    7: tracing::span::Span::in_scope
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    8: frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::execute_block
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:    9: node_subtensor_runtime::api::dispatch
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   10: environmental::using
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   11: sc_executor::executor::WasmExecutor<H>::with_instance::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   12: sc_executor::wasm_runtime::RuntimeCache::with_instance
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   13: <sc_executor::executor::NativeElseWasmExecutor<D> as sp_core::traits::CodeExecutor>::call
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   14: sp_state_machine::execution::StateMachine<B,H,Exec>::execute
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   15: <sc_service::client::call_executor::LocalCallExecutor<Block,B,E> as sc_client_api::call_executor::CallExecutor<Block>>::contextual_call
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   16: <sc_service::client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   17: <node_subtensor_runtime::RuntimeApiImpl<__SrApiBlock__,RuntimeApiImplCall> as sp_offchain::OffchainWorkerApi<__SrApiBlock__>>::__runtime_api_internal_call_api_at
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   18: sp_api::Core::execute_block
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   19: <&sc_service::client::client::Client<B,E,Block,RA> as sc_consensus::block_import::BlockImport<Block>>::import_block::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   20: <sc_consensus_grandpa::import::GrandpaBlockImport<BE,Block,Client,SC> as sc_consensus::block_import::BlockImport<Block>>::import_block::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   21: <alloc::boxed::Box<dyn sc_consensus::block_import::BlockImport<B>+Error = sp_consensus::error::Error+core::marker::Send+core::marker::Sync> as sc_consensus::block_import::BlockImport<B>>::import_block::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   22: <core::pin::Pin<P> as core::future::future::Future>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   23: sc_consensus::import_queue::basic_queue::BlockImportWorker<B>::new::{{closure}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   24: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   25: <sc_service::task_manager::prometheus_future::PrometheusFuture<T> as core::future::future::Future>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   26: <futures_util::future::select::Select<A,B> as core::future::future::Future>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   27: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   28: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   29: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   30: tokio::runtime::task::core::Core<T,S>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   31: tokio::runtime::task::harness::Harness<T,S>::poll
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   32: tokio::runtime::blocking::pool::Inner::run
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   33: std::sys_common::backtrace::__rust_begin_short_backtrace
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   34: core::ops::function::FnOnce::call_once{{vtable.shim}}
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   35: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:       <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:       std::sys::unix::thread::Thread::new::thread_start
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:              at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   36: <unknown>
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:   37: <unknown>
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: Thread 'tokio-runtime-worker' panicked at 'Storage root must match that calculated.', /home/ubuntu/.cargo/git/checkouts/polkadot-sdk-cff69157b985ed76/7049c3c/substrate/frame/executive/src/lib.rs:834
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: This is a bug. Please report it at:
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]:         support.anonymous.an
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: 2024-07-15 09:06:19 do_registration( coldkey:2c776103ddc17c19154c2504da041dbdd5274444508a141525607a16b5a5b828 (5D51Riqk...) netuid:9 hotkey:ba0003fa96731cf027482dc42b946ce8266891a70dcf1216492ad1e4640d384e (5GGamWvY...) )
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: 2024-07-15 09:06:19 AxonServed( hotkey:c4c3f946748e07e818922436ec3f2bf270adb21ad740ab92e77b33f56fd1ac21 (5GWhTa4c...) )
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: 2024-07-15 09:06:19 Evicting failed runtime instance error=Runtime panicked: Storage root must match that calculated.
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: 2024-07-15 09:06:19 Block prepare storage changes error: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: 2024-07-15 09:06:19 💔 Error importing block 0xe7bc870615f3baca6e7751d509622201b99296d8e1c4a77b9d4765f424aa4c3a: consensus error: Import failed: Import failed: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
Jul 15 09:06:19 juju-69fe70-0 polkadot[1026]: 2024-07-15 09:06:19 💔 Error importing block 0x4c17d5cbc566fcdc8f6cbc64f8a42fd10627530a85fe5a876212bd0dff169acf: block has an unknown parent

To Reproduce

Run bittensor node with the following configs Version: v1.1.3 Config: --chain /home/polkadot/spec/chain-spec.json --rpc-port 9933 --rpc-max-connections 1000 --rpc-external --rpc-methods Safe --rpc-cors all --prometheus-external --pruning archive --base-path /home/polkadot/.local --in-peers 500 --out-peers 500 --bootnodes /ip4/13.58.175.193/tcp/30333/p2p/12D3KooWDe7g2JbNETiKypcKT1KsCEZJbTzEHCn8hpd4PHZ6pdz5 chain-spec.json points to https://raw.githubusercontent.com/opentensor/subtensor/main/raw_spec_finney.json

Expected behavior

Sync to head without errors

Screenshots

No response

Environment

Ubuntu 22-04

Additional context

No response

tavurth commented 1 month ago
./scripts/run/subtensor.sh -e docker --network mainnet --node-type lite

I'm also seeing this, was running fine for a while, then:

Thread 'tokio-runtime-worker' panicked at 'Storage root must match that calculated.', /root/.cargo/git/checkouts/polkadot-sdk-cff69157b985ed76/7049c3c/substrate/frame/executive/src/lib.rs:834
lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Vendor ID: AuthenticAMD Model name: AMD Ryzen 7 PRO 8700GE w/ Radeon 780M Graphics CPU family: 25 Model: 117 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 Stepping: 2 Frequency boost: enabled CPU max MHz: 6743.9448 CPU min MHz: 1600.0000 BogoMIPS: 7300.25 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonsto p_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse 4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cq m_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pa usefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_r ecov succor smca fsrm flush_l1d

After this error started I can no longer start the subnet node from latest main.

Seems related? https://github.com/paritytech/substrate/issues/5871

open-junius commented 1 month ago

Thanks for you report the issue, will investigate it.

UselessXiaoYao commented 1 month ago

I meet the same issue.

tavurth commented 1 month ago

Solution can be found here:

https://github.com/opentensor/subtensor/issues/655#issuecomment-2259624975

Let's close this as duplicate of that issue?

orriin commented 1 month ago

Duplicate #655