near / stakewars-iv

12 stars 9 forks source link

Error after Update to statelessnet-83.0: STATE_SNAPSHOT_KEY and DBNotFoundErr on Epoch Block #64

Closed GO2Pro closed 6 months ago

GO2Pro commented 6 months ago

Bug Report

Overview

After updating to version statelessnet-83.0 (1.36.1-298-g984f6ad71), I've started encountering an error logged as follows:

Mar 15 05:33:20 n5a8180 neard[445982]: 2024-03-15T05:33:20.813150Z ERROR client: Received an invalid block during state sync err=EpochOutOfBounds(EpochId(4LRTbMEFb2ShLqqWY7GC6Nse1CPL88HGQekM2U7FVDWh)) block_hash=8oZXJq5dKSEWWohxnp58v2WuLibSbb43eHwtVCestvW6
Mar 15 05:33:20 n5a8180 neard[445982]: 2024-03-15T05:33:20.815575Z ERROR client: Received an invalid block during state sync err=EpochOutOfBounds(EpochId(4LRTbMEFb2ShLqqWY7GC6Nse1CPL88HGQekM2U7FVDWh)) block_hash=8oZXJq5dKSEWWohxnp58v2WuLibSbb43eHwtVCestvW6
Mar 15 05:33:20 n5a8180 neard[445982]: 2024-03-15T05:33:20.884804Z ERROR client: Received an invalid block during state sync err=EpochOutOfBounds(EpochId(4LRTbMEFb2ShLqqWY7GC6Nse1CPL88HGQekM2U7FVDWh)) block_hash=8oZXJq5dKSEWWohxnp58v2WuLibSbb43eHwtVCestvW6

Upon attempting to restart the service with the commands:

sudo systemctl stop neard
sudo systemctl start neard

I am faced with a different error:

Mar 15 05:36:47 n5a8180 neard[461014]: 2024-03-15T05:36:47.423889Z ERROR runtime: Failed to check if a state snapshot exists err=STATE_SNAPSHOT_KEY
Mar 15 05:36:47 n5a8180 neard[461014]: thread 'main' panicked at chain/client/src/client_actor.rs:219:6:
Mar 15 05:36:47 n5a8180 neard[461014]: called `Result::unwrap()` on an `Err` value: Chain(DBNotFoundErr("epoch block: GJRLyvqkiPnWc6Q1yHunyTntr1G24FebJ5XLTSoDfmyc"))
Mar 15 05:36:47 n5a8180 neard[461014]: stack backtrace:
Mar 15 05:36:47 n5a8180 neard[461014]:    0: rust_begin_unwind
Mar 15 05:36:47 n5a8180 neard[461014]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
Mar 15 05:36:47 n5a8180 neard[461014]:    1: core::panicking::panic_fmt
Mar 15 05:36:47 n5a8180 neard[461014]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
Mar 15 05:36:47 n5a8180 neard[461014]:    2: core::result::unwrap_failed
Mar 15 05:36:47 n5a8180 neard[461014]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
Mar 15 05:36:47 n5a8180 neard[461014]:    3: nearcore::start_with_config_and_synchronization
Mar 15 05:36:47 n5a8180 neard[461014]:    4: neard::cli::RunCmd::run::{{closure}}
Mar 15 05:36:47 n5a8180 neard[461014]:    5: tokio::task::local::LocalSet::run_until::{{closure}}
Mar 15 05:36:47 n5a8180 neard[461014]:    6: neard::cli::NeardCmd::parse_and_run
Mar 15 05:36:47 n5a8180 neard[461014]:    7: neard::main
Mar 15 05:36:47 n5a8180 neard[461014]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Mar 15 05:36:47 n5a8180 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Mar 15 05:36:47 n5a8180 systemd[1]: neard.service: Failed with result 'signal'.

And an expanded version of this error when using RUST_BACKTRACE=full:

Mar 15 07:50:42 n5a8180 neard[579891]: 2024-03-15T07:50:42.561086Z  INFO db: Opened a new RocksDB instance. num_instances=1
Mar 15 07:50:42 n5a8180 neard[579891]: 2024-03-15T07:50:42.593859Z ERROR runtime: Failed to check if a state snapshot exists err=STATE_SNAPSHOT_KEY
Mar 15 07:50:42 n5a8180 neard[579891]: thread 'main' panicked at chain/client/src/client_actor.rs:219:6:
Mar 15 07:50:42 n5a8180 neard[579891]: called `Result::unwrap()` on an `Err` value: Chain(DBNotFoundErr("epoch block: GJRLyvqkiPnWc6Q1yHunyTntr1G24FebJ5XLTSoDfmyc"))
Mar 15 07:50:42 n5a8180 neard[579891]: stack backtrace:
Mar 15 07:50:42 n5a8180 neard[579891]:    0:     0x558e370886f9 - std::backtrace_rs::backtrace::libunwind::trace::hbee8a7973eeb6c93
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/libunwind.rs:104:5
Mar 15 07:50:42 n5a8180 neard[579891]:    1:     0x558e370886f9 - std::backtrace_rs::backtrace::trace_unsynchronized::hc8ac75eea3aa6899
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
Mar 15 07:50:42 n5a8180 neard[579891]:    2:     0x558e370886f9 - std::sys_common::backtrace::_print_fmt::hc7f3e3b5298b1083
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:68:5
Mar 15 07:50:42 n5a8180 neard[579891]:    3:     0x558e370886f9 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hbb235daedd7c6190
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:44:22
Mar 15 07:50:42 n5a8180 neard[579891]:    4:     0x558e35a45960 - core::fmt::rt::Argument::fmt::h76c38a80d925a410
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/rt.rs:142:9
Mar 15 07:50:42 n5a8180 neard[579891]:    5:     0x558e35a45960 - core::fmt::write::h3ed6aeaa977c8e45
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/fmt/mod.rs:1120:17
Mar 15 07:50:42 n5a8180 neard[579891]:    6:     0x558e370545ce - std::io::Write::write_fmt::h78b18af5775fedb5
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/io/mod.rs:1810:15
Mar 15 07:50:42 n5a8180 neard[579891]:    7:     0x558e3708a09e - std::sys_common::backtrace::_print::h5d645a07e0fcfdbb
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:47:5
Mar 15 07:50:42 n5a8180 neard[579891]:    8:     0x558e3708a09e - std::sys_common::backtrace::print::h85035a511aafe7a8
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:34:9
Mar 15 07:50:42 n5a8180 neard[579891]:    9:     0x558e37089947 - std::panicking::default_hook::{{closure}}::hcce8cea212785a25
Mar 15 07:50:42 n5a8180 neard[579891]:   10:     0x558e3708a933 - std::panicking::default_hook::hf5fcb0f213fe709a
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:292:9
Mar 15 07:50:42 n5a8180 neard[579891]:   11:     0x558e3708a933 - std::panicking::rust_panic_with_hook::h095fccf1dc9379ee
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:779:13
Mar 15 07:50:42 n5a8180 neard[579891]:   12:     0x558e3708a410 - std::panicking::begin_panic_handler::{{closure}}::h032ba12139b353db
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:657:13
Mar 15 07:50:42 n5a8180 neard[579891]:   13:     0x558e3708a366 - std::sys_common::backtrace::__rust_end_short_backtrace::h9259bc2ff8fd0f76
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:171:18
Mar 15 07:50:42 n5a8180 neard[579891]:   14:     0x558e3708a35f - rust_begin_unwind
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
Mar 15 07:50:42 n5a8180 neard[579891]:   15:     0x558e356f6e24 - core::panicking::panic_fmt::h784f20a50eaab275
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
Mar 15 07:50:42 n5a8180 neard[579891]:   16:     0x558e356f7392 - core::result::unwrap_failed::h03d8a5018196e1cd
Mar 15 07:50:42 n5a8180 neard[579891]:                                at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
Mar 15 07:50:42 n5a8180 neard[579891]:   17:     0x558e36c3f2c5 - nearcore::start_with_config_and_synchronization::h1a98b2c8819e5539
Mar 15 07:50:42 n5a8180 neard[579891]:   18:     0x558e358d1584 - neard::cli::RunCmd::run::{{closure}}::hac1c7056c17659a9
Mar 15 07:50:42 n5a8180 neard[579891]:   19:     0x558e358a662a - tokio::task::local::LocalSet::run_until::{{closure}}::hb7900b3bd6f853b0
Mar 15 07:50:42 n5a8180 neard[579891]:   20:     0x558e358c302a - neard::cli::NeardCmd::parse_and_run::h09f96011c101ace8
Mar 15 07:50:42 n5a8180 neard[579891]:   21:     0x558e358d5331 - neard::main::h94fe5cfdd150efef
Mar 15 07:50:42 n5a8180 neard[579891]:   22:     0x558e358927b3 - std::sys_common::backtrace::__rust_begin_short_backtrace::hafddc4f1c842a120
Mar 15 07:50:42 n5a8180 neard[579891]:   23:     0x558e358ff45f - main
Mar 15 07:50:42 n5a8180 neard[579891]:   24:     0x7f4b28797083 - __libc_start_main
Mar 15 07:50:42 n5a8180 neard[579891]:   25:     0x558e35856dae - _start
Mar 15 07:50:42 n5a8180 neard[579891]:   26:                0x0 - <unknown>
Mar 15 07:50:42 n5a8180 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Mar 15 07:50:42 n5a8180 systemd[1]: neard.service: Failed with result 'signal'.

Affected parties

Validators

Reproduction steps

Follow all the steps provided in the link starting from:

  1. Build the code with the command:
    cargo build --package neard --features statelessnet_protocol --release
  2. Copy DB Snapshot
    # Command to copy DB Snapshot.
  3. Initialise the working directory
    ~/nearcore/target/release/neard --home ~/.near init --chain-id statelessnet
  4. Replace the config.json file and restore the node_key.json and validator_key.json files.
  5. Start the service with sudo systemctl start neard
  6. Wait for the headers to synchronize until the aforementioned error appears.
    Mar 15 10:32:46 n5a8180 neard[642576]: 2024-03-15T10:32:46.145566Z ERROR client: Received an invalid block during state sync err=DBNotFoundErr("epoch block: 7Heyf6k1A2GW68sJT4wGf6TywKA32K8siq3nVch8Rgzh") block_hash=3yrwKeih9iUFxai4oRMWh4dLixPgkKGNXwMVgfQripob
  7. Restart the service as soon as the error occurs.
    sudo systemctl stop neard
    sudo systemctl start neard
    journalctl -n 100 -f -u neard | ccze -A
Mar 15 10:40:27 n5a8180 neard[649128]: 2024-03-15T10:40:27.022749Z  INFO db: Opened a new RocksDB instance. num_instances=1
Mar 15 10:40:27 n5a8180 neard[649128]: 2024-03-15T10:40:27.061716Z ERROR runtime: Failed to check if a state snapshot exists err=STATE_SNAPSHOT_KEY
Mar 15 10:40:27 n5a8180 neard[649128]: thread 'main' panicked at chain/client/src/client_actor.rs:219:6:
Mar 15 10:40:27 n5a8180 neard[649128]: called `Result::unwrap()` on an `Err` value: Chain(DBNotFoundErr("epoch block: AHDKD22F4fB1YTaEFHoyuTYzAnVvyY1a9ERBcLGVcDYA"))
Mar 15 10:40:27 n5a8180 neard[649128]: stack backtrace:
Mar 15 10:40:27 n5a8180 neard[649128]:    0: rust_begin_unwind
Mar 15 10:40:27 n5a8180 neard[649128]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
Mar 15 10:40:27 n5a8180 neard[649128]:    1: core::panicking::panic_fmt
Mar 15 10:40:27 n5a8180 neard[649128]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
Mar 15 10:40:27 n5a8180 neard[649128]:    2: core::result::unwrap_failed
Mar 15 10:40:27 n5a8180 neard[649128]:              at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/result.rs:1649:5
Mar 15 10:40:27 n5a8180 neard[649128]:    3: nearcore::start_with_config_and_synchronization
Mar 15 10:40:27 n5a8180 neard[649128]:    4: neard::cli::RunCmd::run::{{closure}}
Mar 15 10:40:27 n5a8180 neard[649128]:    5: tokio::task::local::LocalSet::run_until::{{closure}}
Mar 15 10:40:27 n5a8180 neard[649128]:    6: neard::cli::NeardCmd::parse_and_run
Mar 15 10:40:27 n5a8180 neard[649128]:    7: neard::main
Mar 15 10:40:27 n5a8180 neard[649128]: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Mar 15 10:40:27 n5a8180 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Mar 15 10:40:27 n5a8180 systemd[1]: neard.service: Failed with result 'signal'.

After that, the service no longer starts.

Repeated the error several times in a row.

[Optional] Server Configuration

encipher88 commented 6 months ago
cd $HOME
systemctl stop neard
rm -r "$HOME/.near/data/"
cd nearcore
cargo build --package neard --features statelessnet_protocol --release
cp $HOME/nearcore/target/release/neard /usr/local/bin/neard
cd $HOME

mkdir -p ~/.near/data
cd ~/.near/data
aws s3 --no-sign-request cp s3://near-protocol-public/backups/statelessnet/rpc/latest .
latest=$(cat latest)
aws s3 --no-sign-request cp --recursive s3://near-protocol-public/backups/statelessnet/rpc/$latest .

cd $HOME
sed -i 's/"state_sync_enabled": true/"state_sync_enabled": false/' $HOME/.near/config.json

sudo systemctl restart neard
journalctl -u neard -f -o cat
GO2Pro commented 6 months ago

Confirm, the option "state_sync_enabled": false, solves the problem.

VanBarbascu commented 6 months ago

Thanks for reporting this issue. ERROR runtime: Failed to check if a state snapshot exists err=STATE_SNAPSHOT_KEY has nothing to do with the issue.

It is very likely that your node was too far behind and the garbage collection cleared the block that was considered the chain head. State sync may be related to this issue in the sense that the node was in the state sync mode when it was restarted and did not get the chance to update the chain block head after the block was garbage collected. This means that the upgrade did not cause the issue.

In order to check this hypothesis, you can run ./neard --unsafe-fast-startup view-state scan-db-column --column Block and grep for the block in the error message: Chain(DBNotFoundErr("epoch block: <YourBlockHash>")).

In this case, restoring the /data folder is the only solution. @encipher88 already provided the steps.

telezhnaya commented 6 months ago

Should be fixed now @SmartMoveTeam could you please confirm?

GO2Pro commented 6 months ago

Should be fixed now @SmartMoveTeam could you please confirm?

I confirm, everything works after enabling option state_sync_enabled: true and restarting.