near / stakewars-iv


Node crashing after a simple restart of the service #139

Closed abahmanem closed 1 week ago

abahmanem commented 2 weeks ago

Bug Report

Overview

Please share high level description of the issue/bug you are reporting.

I set up a stateless node a few days ago and it was running fine.

This morning I just ran sudo systemctl restart neard, and now I get this:

Opened a new RocksDB instance. num_instances=1
thread 'main' panicked at chain/client/src/client_actor.rs:168:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block 8WX1DQnSttuk4WTyHPD5oJnrYBAL95hbCDaF2nbX2pgj in shard s1.v3")))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: nearcore::start_with_config_and_synchronization
   4: neard::cli::RunCmd::run::{{closure}}
   5: tokio::task::local::LocalSet::run_until::{{closure}}
   6: neard::cli::NeardCmd::parse_and_run
   7: neard::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aborted (core dumped)

I downloaded the latest snapshot data, but I got the same error.

Affected parties

Who is affected? Validators? Contract developers? Or regular users?

Stateless node pool: abahmane.pool.statelessnet

Impact

What’s the worst outcome of the issue?

Reproduction steps

Please share step by step guideline on how to reproduce the issue.

Run a simple restart: sudo systemctl restart neard

[Optional] Code reference

Please locate the issue in the codebase.

[Optional] Root cause analysis

This section is optional but should be filed to claim additional reward. Please share your analysis on the root cause of the issue.

[Optional] Suggested fix

This section is optional but should be filed to claim additional reward. Please share a recommended long-term/short-term fix for the issue.

telezhnaya commented 2 weeks ago

Thank you for reporting! We'll investigate this.

GO2Pro commented 2 weeks ago

I can confirm the error:

Jun 17 09:55:15 stakewars-iv-h15a neard[143422]: thread 'main' panicked at chain/client/src/client_actor.rs:168:6:
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]: called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block 8WX1DQnSttuk4WTyHPD5oJnrYBAL95hbCDaF2nbX2pgj in shard s1.v3")))
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]: stack backtrace:
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    0: rust_begin_unwind
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    1: core::panicking::panic_fmt
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    2: core::result::unwrap_failed
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    3: nearcore::start_with_config_and_synchronization
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    4: neard::cli::RunCmd::run::{{closure}}
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    5: tokio::task::local::LocalSet::run_until::{{closure}}
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    6: neard::cli::NeardCmd::parse_and_run
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]:    7: neard::main
Jun 17 09:55:15 stakewars-iv-h15a neard[143422]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Jun 17 09:55:15 stakewars-iv-h15a systemd[1]: neard.service: Main process exited, code=dumped, status=6/ABRT
Jun 17 09:55:15 stakewars-iv-h15a systemd[1]: neard.service: Failed with result 'core-dump'.
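When comparing reports from several nodes, it can help to pull the panic location, block hash, and shard out of the log text automatically. Below is a minimal Python sketch that does this for the "No ChunkExtra" message shown above. The function name `parse_panic` and the regular expressions are illustrative assumptions based only on the log lines in this thread; they are not part of neard or its tooling.

```python
import re
from typing import Optional

# Matches the panic header, e.g. "panicked at chain/client/src/client_actor.rs:168:6:"
PANIC_RE = re.compile(r"panicked at (?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+)")

# Matches the storage error text as it appears in the logs in this thread.
# The block hash is assumed to be base58 (no 0, O, I, or l characters).
CHUNK_EXTRA_RE = re.compile(
    r"No ChunkExtra for block (?P<block>[1-9A-HJ-NP-Za-km-z]+) "
    r"in shard (?P<shard>[\w.]+)"
)


def parse_panic(log_text: str) -> Optional[dict]:
    """Return panic location, block hash, and shard, or None if not found.

    Works on raw neard output as well as journald-prefixed lines, because
    both patterns are searched for anywhere in the text.
    """
    panic = PANIC_RE.search(log_text)
    extra = CHUNK_EXTRA_RE.search(log_text)
    if panic is None or extra is None:
        return None
    return {
        "file": panic["file"],
        "line": int(panic["line"]),
        "block": extra["block"],
        "shard": extra["shard"],
    }
```

Feeding it the text of a saved log (for example, a file written from the service journal) returns a small dictionary that is easy to paste into a report or compare between machines.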

The problem appeared right after a neard stop/start; nothing else was changed.

Stateless pool: go2pro.pool.statelessnet

Here are the steps I took to try to fix this error:

  1. Checked out the latest version: git checkout statelessnet_latest
  2. Compiled the latest version: build 1.36.1-730-g4b39f0226
  3. Copied the DB Snapshot
  4. Initialized the working directory, after deleting the files from the .near directory.
  5. Downloaded the config file
  6. Replaced validator_key.json and node_key.json with mine.
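Steps 4 and 6 above (clearing the .near directory and putting your own keys back) are easy to get wrong by hand. Here is a minimal Python sketch of one way to do it. The helper name `reset_home_keep_keys` and the backup location are hypothetical, and the sketch assumes the two key files sit directly in the home directory. It does not download a snapshot or config, and it is not an official procedure.

```python
import shutil
from pathlib import Path
from typing import List

# Key file names as mentioned in step 6 above.
KEY_FILES = ("validator_key.json", "node_key.json")


def reset_home_keep_keys(home: Path, backup_dir: Path) -> List[str]:
    """Empty `home`, then restore the key files; returns the names restored.

    A copy of each key is left in `backup_dir`, so the keys survive even if
    the process is interrupted between the delete and the restore.
    """
    home = home.resolve()
    backup_dir = backup_dir.resolve()
    if backup_dir == home or home in backup_dir.parents:
        raise ValueError("backup_dir must live outside the home directory")

    backup_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    for name in KEY_FILES:
        src = home / name
        if src.is_file():
            shutil.copy2(src, backup_dir / name)
            kept.append(name)

    # Remove everything in the home directory (data, config, old keys).
    for entry in home.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()

    # Put the backed-up keys back into the now-empty home directory.
    for name in kept:
        shutil.copy2(backup_dir / name, home / name)
    return kept
```

This is destructive by design, so stop the neard service first and check the `home` path carefully before running anything like it.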

When I move the keys to another server, the error reproduces exactly.

The error is still present.

abahmanem commented 2 weeks ago

I managed to start the node with this snapshot: 2024-06-17T11:42:04Z

GO2Pro commented 2 weeks ago

managed to start the node with this snapshot : 2024-06-17T11:42:04Z

Thanks, it works, but now I'm stuck on block 119407740.

telezhnaya commented 1 week ago

The network was reset, so this problem should not appear anymore.