opentensor / subtensor

Bittensor Blockchain Layer
The Unlicense
130 stars 144 forks source link

Thread 'tokio-runtime-worker' panicked at 'Storage root must match that calculated.' #655

Closed alex-drocks closed 1 month ago

alex-drocks commented 1 month ago

Describe the bug

Local subtensor node found out of sync this morning, with an error log like this:

2024-07-22 05:01:38 Evicting failed runtime instance error=Runtime panicked: Storage root must match that calculated.
2024-07-22 05:01:38 Block prepare storage changes error: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
2024-07-22 05:01:41 Report 12D3KooWCqAEBKQ5w9B7KRifD5MooAkDKrVoMTePy5vAD2GPWKLD: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-07-22 05:01:41 :broken_heart: Error importing block 0x10efd9db2309de9477a7574bc5a85676104d58196177440972a1893f70a83cb0: block has an unknown parent

Then restarted it and it was throwing tons of errors like this after the restart:

2024-07-22 14:39:52 StakeRemoved( hotkey:a02cc4cd145f455f02c88fdc7c5f94275bdddbe5472f933caa3f7f4ef1ba6e53 (5FgipPNQ...), stake_to_be_removed:9839315 )
2024-07-22 14:39:52 do_remove_stake( origin:1c7d4b5248f2ae72cb0c5c6c03f274d12ec7efbb1ee34628b0736ad8fd79e26b (5Ci4RCiG...) hotkey:e8498f8ba1ea025cfba7bee38a803c6e272aa02eb2f19de12e7aa26fa381d90c (5HKGpRj4...), stake_to_be_removed:0 )
Hash: given=2ad52ae1bec16b59f330ae22c3a0e74306e3a9829bc92b8a9d4447f6785f3994, expected=aae074ac2f1d2a2e978c18a374a260653093271c438f2173fa9eb0ae90c64d93

====================

Version: 4.0.0-dev-unknown

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
   2: std::panicking::begin_panic_handler::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: <node_subtensor_runtime::Runtime as sp_api::runtime_decl_for_core::CoreV5<sp_runtime::generic::block::Block<sp_runtime::generic::header::Header<u32,sp_runtime::traits::BlakeTwo256>,sp_runtime::generic::unchecked_extrinsic::UncheckedExtrinsic<sp_runtime::multiaddress::MultiAddress<<<sp_runtime::MultiSignature as sp_runtime::traits::Verify>::Signer as sp_runtime::traits::IdentifyAccount>::AccountId,()>,node_subtensor_runtime::RuntimeCall,sp_runtime::MultiSignature,(frame_system::extensions::check_non_zero_sender::CheckNonZeroSender<node_subtensor_runtime::Runtime>,frame_system::extensions::check_spec_version::CheckSpecVersion<node_subtensor_runtime::Runtime>,frame_system::extensions::check_tx_version::CheckTxVersion<node_subtensor_runtime::Runtime>,frame_system::extensions::check_genesis::CheckGenesis<node_subtensor_runtime::Runtime>,frame_system::extensions::check_mortality::CheckMortality<node_subtensor_runtime::Runtime>,node_subtensor_runtime::check_nonce::CheckNonce<node_subtensor_runtime::Runtime>,frame_system::extensions::check_weight::CheckWeight<node_subtensor_runtime::Runtime>,pallet_transaction_payment::ChargeTransactionPayment<node_subtensor_runtime::Runtime>,pallet_subtensor::SubtensorSignedExtension<node_subtensor_runtime::Runtime>,pallet_commitments::CommitmentsSignedExtension<node_subtensor_runtime::Runtime>)>>>>::execute_block
   7: node_subtensor_runtime::api::dispatch
   8: sc_executor::executor::WasmExecutor<H>::with_instance::{{closure}}
   9: <sc_executor::executor::NativeElseWasmExecutor<D> as sp_core::traits::CodeExecutor>::call
  10: sp_state_machine::execution::StateMachine<B,H,Exec>::execute
  11: <sc_service::client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
  12: <node_subtensor_runtime::RuntimeApiImpl<__SrApiBlock__,RuntimeApiImplCall> as sp_api::Core<__SrApiBlock__>>::__runtime_api_internal_call_api_at::{{closure}}
  13: <node_subtensor_runtime::RuntimeApiImpl<__SrApiBlock__,RuntimeApiImplCall> as sp_api::Core<__SrApiBlock__>>::__runtime_api_internal_call_api_at
  14: sp_api::Core::execute_block
  15: <&sc_service::client::client::Client<B,E,Block,RA> as sc_consensus::block_import::BlockImport<Block>>::import_block::{{closure}}
  16: <sc_consensus_grandpa::import::GrandpaBlockImport<BE,Block,Client,SC> as sc_consensus::block_import::BlockImport<Block>>::import_block::{{closure}}
  17: <alloc::boxed::Box<dyn sc_consensus::block_import::BlockImport<B>+Error = sp_consensus::error::Error+core::marker::Sync+core::marker::Send> as sc_consensus::block_import::BlockImport<B>>::import_block::{{closure}}
  18: sc_consensus::import_queue::basic_queue::BlockImportWorker<B>::new::{{closure}}
  19: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
  20: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  21: tokio::runtime::task::raw::poll
  22: std::sys_common::backtrace::__rust_begin_short_backtrace
  23: core::ops::function::FnOnce::call_once{{vtable.shim}}
  24: std::sys::pal::unix::thread::Thread::new::thread_start
  25: start_thread
  26: clone

Thread 'tokio-runtime-worker' panicked at 'Storage root must match that calculated.', /root/.cargo/git/checkouts/polkadot-sdk-cff69157b985ed76/7049c3c/substrate/frame/executive/src/lib.rs:834

This is a bug. Please report it at:

        support.anonymous.an

2024-07-22 14:39:52 do_remove_stake( origin:b6f85d613d4c2b961e171644c4ee2de0998c309539a86b0bcfd883fd05074b5e (5GCcMN81...) hotkey:00420dfed7ec724228e8551d64827ffbb1f00ad39f14c64b36ef1a26c23ad87a (5C53UmH5...), stake_to_be_removed:53921232 )
2024-07-22 14:39:52 StakeRemoved( hotkey:00420dfed7ec724228e8551d64827ffbb1f00ad39f14c64b36ef1a26c23ad87a (5C53UmH5...), stake_to_be_removed:53921232 )
2024-07-22 14:39:52 Evicting failed runtime instance error=Runtime panicked: Storage root must match that calculated.
2024-07-22 14:39:52 Block prepare storage changes error: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
2024-07-22 14:39:52 šŸ’” Error importing block 0x16774dea580287b84a2216cdaf0ff767bee8718a8e0c4b8c5e62788b23137794: consensus error: Import failed: Import failed: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
2024-07-22 14:39:52 do_set_weights( origin:522f432f51287df57711a3a00f6c7a334b0c813ff178c22649f01566ef4a5d48 (5DvTpini...) netuid:7

To Reproduce

Running local subtensor node on ryzen 7950x Debian with this:

cd ~/subtensor/
git fetch --all
git reset --hard origin/main
git checkout main
git pull
docker compose down
docker system prune --all
docker volume rm $(docker volume ls -q --filter name=sub*)
sudo ./scripts/run/subtensor.sh -e docker --network mainnet --node-type lite
# logs
docker logs -f --tail 10 subtensor-mainnet-lite

Expected behavior

No panick error logs

Screenshots

image

Environment

Linux Debian 12

Additional context

Node was up and running for a few days.

This morning at 5am EST, I tried btcli w balance command and saw outdated data so I inspected the subtensor logs.

I had this error in the logs:

2024-07-22 05:01:38 Evicting failed runtime instance error=Runtime panicked: Storage root must match that calculated.
2024-07-22 05:01:38 Block prepare storage changes error: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
2024-07-22 05:01:41 Report 12D3KooWCqAEBKQ5w9B7KRifD5MooAkDKrVoMTePy5vAD2GPWKLD: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-07-22 05:01:41 :broken_heart: Error importing block 0x10efd9db2309de9477a7574bc5a85676104d58196177440972a1893f70a83cb0: block has an unknown parent

After seeing that it was broken (didn't sync to the chain properly afterwards) I just restarted the node and then it had thousands of lines per second containing the error:

2024-07-22 14:39:52 StakeRemoved( hotkey:a02cc4cd145f455f02c88fdc7c5f94275bdddbe5472f933caa3f7f4ef1ba6e53 (5FgipPNQ...), stake_to_be_removed:9839315 )
2024-07-22 14:39:52 do_remove_stake( origin:1c7d4b5248f2ae72cb0c5c6c03f274d12ec7efbb1ee34628b0736ad8fd79e26b (5Ci4RCiG...) hotkey:e8498f8ba1ea025cfba7bee38a803c6e272aa02eb2f19de12e7aa26fa381d90c (5HKGpRj4...), stake_to_be_removed:0 )
Hash: given=2ad52ae1bec16b59f330ae22c3a0e74306e3a9829bc92b8a9d4447f6785f3994, expected=aae074ac2f1d2a2e978c18a374a260653093271c438f2173fa9eb0ae90c64d93

====================

Version: 4.0.0-dev-unknown

   0: sp_panic_handler::set::{{closure}}
   1: std::panicking::rust_panic_with_hook
   2: std::panicking::begin_panic_handler::{{closure}}
   3: std::sys_common::backtrace::__rust_end_short_backtrace
   4: rust_begin_unwind
   5: core::panicking::panic_fmt
   6: <node_subtensor_runtime::Runtime as sp_api::runtime_decl_for_core::CoreV5<sp_runtime::generic::block::Block<sp_runtime::generic::header::Header<u32,sp_runtime::traits::BlakeTwo256>,sp_runtime::generic::unchecked_extrinsic::UncheckedExtrinsic<sp_runtime::multiaddress::MultiAddress<<<sp_runtime::MultiSignature as sp_runtime::traits::Verify>::Signer as sp_runtime::traits::IdentifyAccount>::AccountId,()>,node_subtensor_runtime::RuntimeCall,sp_runtime::MultiSignature,(frame_system::extensions::check_non_zero_sender::CheckNonZeroSender<node_subtensor_runtime::Runtime>,frame_system::extensions::check_spec_version::CheckSpecVersion<node_subtensor_runtime::Runtime>,frame_system::extensions::check_tx_version::CheckTxVersion<node_subtensor_runtime::Runtime>,frame_system::extensions::check_genesis::CheckGenesis<node_subtensor_runtime::Runtime>,frame_system::extensions::check_mortality::CheckMortality<node_subtensor_runtime::Runtime>,node_subtensor_runtime::check_nonce::CheckNonce<node_subtensor_runtime::Runtime>,frame_system::extensions::check_weight::CheckWeight<node_subtensor_runtime::Runtime>,pallet_transaction_payment::ChargeTransactionPayment<node_subtensor_runtime::Runtime>,pallet_subtensor::SubtensorSignedExtension<node_subtensor_runtime::Runtime>,pallet_commitments::CommitmentsSignedExtension<node_subtensor_runtime::Runtime>)>>>>::execute_block
   7: node_subtensor_runtime::api::dispatch
   8: sc_executor::executor::WasmExecutor<H>::with_instance::{{closure}}
   9: <sc_executor::executor::NativeElseWasmExecutor<D> as sp_core::traits::CodeExecutor>::call
  10: sp_state_machine::execution::StateMachine<B,H,Exec>::execute
  11: <sc_service::client::client::Client<B,E,Block,RA> as sp_api::CallApiAt<Block>>::call_api_at
  12: <node_subtensor_runtime::RuntimeApiImpl<__SrApiBlock__,RuntimeApiImplCall> as sp_api::Core<__SrApiBlock__>>::__runtime_api_internal_call_api_at::{{closure}}
  13: <node_subtensor_runtime::RuntimeApiImpl<__SrApiBlock__,RuntimeApiImplCall> as sp_api::Core<__SrApiBlock__>>::__runtime_api_internal_call_api_at
  14: sp_api::Core::execute_block
  15: <&sc_service::client::client::Client<B,E,Block,RA> as sc_consensus::block_import::BlockImport<Block>>::import_block::{{closure}}
  16: <sc_consensus_grandpa::import::GrandpaBlockImport<BE,Block,Client,SC> as sc_consensus::block_import::BlockImport<Block>>::import_block::{{closure}}
  17: <alloc::boxed::Box<dyn sc_consensus::block_import::BlockImport<B>+Error = sp_consensus::error::Error+core::marker::Sync+core::marker::Send> as sc_consensus::block_import::BlockImport<B>>::import_block::{{closure}}
  18: sc_consensus::import_queue::basic_queue::BlockImportWorker<B>::new::{{closure}}
  19: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
  20: <tracing_futures::Instrumented<T> as core::future::future::Future>::poll
  21: tokio::runtime::task::raw::poll
  22: std::sys_common::backtrace::__rust_begin_short_backtrace
  23: core::ops::function::FnOnce::call_once{{vtable.shim}}
  24: std::sys::pal::unix::thread::Thread::new::thread_start
  25: start_thread
  26: clone

Thread 'tokio-runtime-worker' panicked at 'Storage root must match that calculated.', /root/.cargo/git/checkouts/polkadot-sdk-cff69157b985ed76/7049c3c/substrate/frame/executive/src/lib.rs:834

This is a bug. Please report it at:

        support.anonymous.an

2024-07-22 14:39:52 do_remove_stake( origin:b6f85d613d4c2b961e171644c4ee2de0998c309539a86b0bcfd883fd05074b5e (5GCcMN81...) hotkey:00420dfed7ec724228e8551d64827ffbb1f00ad39f14c64b36ef1a26c23ad87a (5C53UmH5...), stake_to_be_removed:53921232 )
2024-07-22 14:39:52 StakeRemoved( hotkey:00420dfed7ec724228e8551d64827ffbb1f00ad39f14c64b36ef1a26c23ad87a (5C53UmH5...), stake_to_be_removed:53921232 )
2024-07-22 14:39:52 Evicting failed runtime instance error=Runtime panicked: Storage root must match that calculated.
2024-07-22 14:39:52 Block prepare storage changes error: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
2024-07-22 14:39:52 šŸ’” Error importing block 0x16774dea580287b84a2216cdaf0ff767bee8718a8e0c4b8c5e62788b23137794: consensus error: Import failed: Import failed: Error at calling runtime api: Execution failed: Runtime panicked: Storage root must match that calculated.
2024-07-22 14:39:52 do_set_weights( origin:522f432f51287df57711a3a00f6c7a334b0c813ff178c22649f01566ef4a5d48 (5DvTpini...) netuid:7
alex-drocks commented 1 month ago

Similar to https://github.com/opentensor/subtensor/issues/635

liho00 commented 1 month ago

same issue here +1

alex-drocks commented 1 month ago

This issue came back again after wiping my folders and using the rust based subtensor instead of docker šŸ˜¦ Node breaks after a certain point šŸ˜¦

I got the problem with a ryzen 7950x Debian using Docker subtensor (original issue) AND then wiped out disk to reinstall with Rust based subtensor also.

Just also found two other servers on different providers local subtensor out of sync.

All same cause

Thanos30 commented 1 month ago

Facing the same problem

tavurth commented 1 month ago

This is a huge problem for us. We've been seeing this on our own subtensor node, and across many validators on the subnet, i.e. this is a bittensor-breaking issue which should be addressed ASAP to maintain the integrity of the entire chain.

So with that said, let's try to be constructive here:

  1. Is there a known workaround?
  2. Is there a known working commit?

Currently, latest main is not runnable on our machines, that's using docker or directly compiling from main...

@distributedstatemachine any ideas?

Thanos30 commented 1 month ago

For anyone still experiencing this issue, I pulled Release v1.1.1, cleaned blockchain data, rebuilt cargo and started the subtensor again and it works fine.

I am using the subtensor only to fetch data from the chain and for no other reason. If you are using the subtensor for any other reason, I can't guarantee this release is gonna work smoothly for you.

Workaround until this issue is fixed.

alex-drocks commented 1 month ago

For anyone still experiencing this issue, I pulled Release v1.1.1, cleaned blockchain data, rebuilt cargo and started the subtensor again and it works fine.

I am using the subtensor only to fetch data from the chain and for no other reason. If you are using the subtensor for any other reason, I can't guarantee this release is gonna work smoothly for you.

Workaround until this issue is fixed.

My node will run for about 24h normally before it encounters the error again when starting from scratch with a clean install. Have yours been working for longer than this?

Thanos30 commented 1 month ago

The past week that's what was happening for me as well, but yesterday it wouldn't even work for a while, it immediately started spamming the same errors.

With the old Release, no issues for 24 hours so far, and the node is syncing fine.

tavurth commented 1 month ago

With the old Release, no issues for 24 hours so far, and the node is syncing fine.

Yep can also confirm 1.1.1 is working for me:

cd subtensor
git fetch --all
git reset --hard 703bdbd178c5d6e64edd4ffcf69aa49d63349290
cargo build --release --features pow-faucet

... Restart

https://github.com/opentensor/subtensor/commit/703bdbd178c5d6e64edd4ffcf69aa49d63349290

orriin commented 1 month ago

Confirming a workaround is to use an old node version, it is fully compatible with the latest runtime. We're using a625834e on a test machine and it syncs fine.

We're continuing to investigate this, thanks for bearing with us.

orriin commented 1 month ago

This is fixed in #714

A node will attempt to use its native runtime in substitute for the on-chain Wasm runtime if all of spec_name, spec_version, and authoring_version are the same between the Wasm and native binaries. This was causing nodes compiled from main to use a runtime slightly different compared to what is on-chain, and write invalid stuff to storage.

In the future, CI should prevent us from ever pushing to main a spec version that matches what is on-chain (cc @sam0x17)