near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Validator: failover to secondary node is not working on mainnet #12030

Closed. pznamensky closed this issue 1 week ago.

pznamensky commented 2 months ago

Describe the bug Both failover approaches (restarting the secondary with validator_key.json, and hot swap) are not working on a validator running v2.1.1. The secondary node gets stuck trying to download blocks.

To Reproduce To reproduce the problem, bootstrap a secondary node, wait until it has caught up with all blocks (curl -s localhost:3030/status | jq -C .sync_info), and then follow one of the two approaches described here: https://github.com/near/nearcore/releases/tag/2.0.0. On older versions (1.x) we used the 1st approach (copy over validator_key and node_key), but we've also tried the 2nd one, and neither works. After failing over, the node can't catch up with blocks and logs the following messages:

2024-08-28T14:39:42.108268Z  INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
2024-08-28T14:39:52.108957Z  INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
2024-08-28T14:40:02.109626Z  INFO stats: #126754210 Downloading blocks 0.00% (131 left; at 126754210) 32 peers ⬇ 8.12 MB/s ⬆ 6.34 MB/s 0.00 bps 0 gas/s CPU: 71%, Mem: 8.54 GB

Please see more logs below.
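The "wait until it has caught up" precondition can be sketched as a small check. The JSON below is a hypothetical stand-in for what curl -s localhost:3030/status returns; the field names match neard's real sync_info output:

```shell
# Hypothetical /status payload (block height taken from the logs above).
status='{"sync_info":{"latest_block_height":126754210,"syncing":false}}'

# The failover precondition: the node must report syncing == false.
synced=$(echo "$status" | jq -r '.sync_info.syncing')
echo "synced=$synced"
```

In practice you would replace the hardcoded payload with `curl -s localhost:3030/status` and only proceed with failover once `syncing` is `false`.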

Expected behavior We should be able to perform failover.

Version (please complete the following information): neard v2.1.1, mainnet.

Additional context Here is the log from right after sending a SIGHUP signal to the backup node (see this doc): neard.log

andrei-near commented 1 month ago

Hi @pznamensky. We've just tested failover option 2 with nearcore 2.1.1 internally and it went through as expected. Could you please:

  1. provide the configuration for the primary and secondary nodes
  2. confirm what happens with the primary and secondary nodes' node_key? It should NOT be moved from primary to secondary; doing so can leave nodes unable to sync (other peers see the new IP, but the key is already associated with the primary node's IP address, so they will not allow syncing)

pznamensky commented 1 month ago

Hi @andrei-near, Previously we used to copy node_key over from primary to secondary; however, we've also tried without replacing node_key on the secondary, and that didn't work either.

Here is our config.json (with the public_addrs field redacted), which we use with the "restart" approach. For the "hot reload" approach, tracked_shadow_validator was also added.

Also, I'm not sure if it's related, but in this epoch the secondary node crashes after copying validator_key and restarting neard:

2024-09-06T15:24:20.298880Z  INFO neard: version="2.1.1" build="unknown" latest_protocol=70
2024-09-06T15:24:20.299083Z  WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.299093Z  INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.301007Z  WARN genesis: Skipped genesis validation
2024-09-06T15:24:20.301022Z  INFO config: Validating Genesis config and records. This could take a few minutes...
2024-09-06T15:24:20.301227Z  INFO config: All validations have passed!
2024-09-06T15:24:20.302980Z  INFO neard: Reset the config "/root/.near/log_config.json" because the config file doesn't exist. err=Os { code: 2, kind: NotFound, message: "No such file or directory" }
2024-09-06T15:24:20.303069Z  WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.303076Z  INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.303098Z  INFO neard: Hot loading validator key /root/.near/validator_key.json.
2024-09-06T15:24:20.303140Z  INFO near_o11y::reload: Updated the logging layer according to `log_config.json`
2024-09-06T15:24:20.303162Z  INFO db_opener: Opening NodeStorage path="/root/.near/data" cold_path="none"
2024-09-06T15:24:20.303180Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.325975Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.325990Z  INFO db_opener: The database exists. path=/root/.near/data
2024-09-06T15:24:20.325994Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.542850Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.542881Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.563855Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.563874Z  INFO db: Opened a new RocksDB instance. num_instances=1
thread 'main' panicked at chain/client/src/client_actor.rs:168:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block 7CfkJyEdHnJiZ4kEURa6p8HPEBwDj4eZcqMNKetrV9eD in shard s5.v3")))
stack backtrace:
   0:     0x5641c7fea424 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
   1:     0x5641c68a589b - core::fmt::write::hc090a2ffd6b28c4a
   2:     0x5641c7fb608e - std::io::Write::write_fmt::h8898bac6ff039a23
   3:     0x5641c7febae9 - std::sys_common::backtrace::print::ha96650907276675e
   4:     0x5641c7feb409 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
   5:     0x5641c7fec4e5 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
   6:     0x5641c7febe72 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
   7:     0x5641c7febdc9 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
   8:     0x5641c7febdb6 - rust_begin_unwind
   9:     0x5641c6544072 - core::panicking::panic_fmt::hdc63834ffaaefae5
  10:     0x5641c6544485 - core::result::unwrap_failed::h82b551e0ff2b2176
  11:     0x5641c7b54f03 - nearcore::start_with_config_and_synchronization::haab8ba90a9fdf122
  12:     0x5641c673c498 - neard::cli::RunCmd::run::{{closure}}::h24af1b6de28656ca
  13:     0x5641c6709429 - tokio::task::local::LocalSet::run_until::{{closure}}::h60aa1f7ad8e92db2
  14:     0x5641c672db25 - neard::cli::NeardCmd::parse_and_run::h124534cdc5ba48bd
  15:     0x5641c673fd13 - neard::main::h3bab0c83991e2cc8
  16:     0x5641c66f6253 - std::sys_common::backtrace::__rust_begin_short_backtrace::h1a2f36ec6ad85c72
  17:     0x5641c6769eff - main
  18:     0x7f6c2cb6cd90 - <unknown>
  19:     0x7f6c2cb6ce40 - __libc_start_main
  20:     0x5641c66bd905 - _start
  21:                0x0 - <unknown>
staffik commented 1 month ago

Here is our config.json

Is it config for primary node, failover node, or both?

pznamensky commented 1 month ago

Is it config for primary node, failover node, or both?

That's for both nodes.

staffik commented 1 month ago

It has tracked_shards=[] and it does not have tracked_shadow_validator, so the failover node does not track anything. I think that might be why it crashed (No ChunkExtra ..).

pznamensky commented 1 month ago

Thanks for the update. Unfortunately, with tracked_shadow_validator neard keeps crashing with the same error.

staffik commented 1 month ago

Could you send configs for both nodes and share steps that you took?

pznamensky commented 1 month ago

@staffik sure!

config.json with the public_addrs field redacted. On the secondary node, "tracked_shadow_validator": "<our_pool_id_here>" was also added (and the node restarted).

Steps to perform "hot" failover:

  1. ensure the secondary node is in sync with the network via curl -s localhost:3030/status | jq -C .sync_info
  2. stop neard on the primary server
  3. copy validator_key.json over to the secondary node
  4. send SIGHUP to neard on the secondary node

This results in the following messages and no progress in downloading blocks:

INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...
INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...

If we restart the secondary node instead of sending SIGHUP, it crashes as in this message: https://github.com/near/nearcore/issues/12030#issuecomment-2334389264

PS: this happens on 2.2.1 too

andrei-near commented 1 month ago

@pznamensky, we repeatedly tested the hot failover scenario with 2.2.0 and 2.2.1 and it works as expected. The failover node must be healthy and run with these params set:

.tracked_shards = []
.store.load_mem_tries_for_tracked_shards = true
.tracked_shadow_validator = "$your_pool_name"
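As a config.json fragment, those parameters would look roughly like this (a sketch to merge into the existing config, not a complete file; the pool name is a placeholder):

```json
{
  "tracked_shards": [],
  "tracked_shadow_validator": "$your_pool_name",
  "store": {
    "load_mem_tries_for_tracked_shards": true
  }
}
```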

Steps we run for hot-swap failover:

  1. Copy the validator key to the new node
  2. Stop the neard service on the old validator
  3. Send SIGHUP to the neard process to pick up the validator key: sudo pkill -HUP neard
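The key-copy step can be sketched as below. This is simulated locally with temp dirs standing in for the two hosts' .near directories; in production it would be an scp between machines plus service management commands (shown only as comments, with assumed service names). Note that node_key.json is deliberately left untouched on each node:

```shell
# Temp dirs stand in for primary:/home/near/.near and the backup's .near dir.
primary=$(mktemp -d); secondary=$(mktemp -d)
echo '{"account_id":"example.pool.near"}' > "$primary/validator_key.json"  # placeholder key

# 1) Copy the validator key to the new node (scp in production)
cp "$primary/validator_key.json" "$secondary/validator_key.json"
# 2) Stop neard on the old validator:   sudo systemctl stop neard
# 3) Hot-load the key on the new node:  sudo pkill -HUP neard
test -f "$secondary/validator_key.json" && echo "validator key in place"
```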

I would recommend downloading the freshest snapshot onto the failover node and starting it with the above configuration set.

pznamensky commented 1 month ago

Hi @andrei-near, thanks for checking! We used a clean server with a fresh snapshot and exactly the same steps and config options that you've described. Is there a chance this issue only reproduces for our validator? Also, what can we do to find the root cause? Try debug mode (if any), dump metrics, etc.?

andrei-near commented 2 weeks ago

@pznamensky do you gather metrics from your node somewhere? I would like to check whether the backup node is tracking the active validator's shard. If you open your node metrics at http://$your_backup_node_ip:4040/metrics you should see the following metrics indicating which shard the backup node is tracking:

near_client_tracked_shards{shard_id="0"} 0
near_client_tracked_shards{shard_id="1"} 1
near_client_tracked_shards{shard_id="2"} 0
near_client_tracked_shards{shard_id="3"} 0
near_client_tracked_shards{shard_id="4"} 0
near_client_tracked_shards{shard_id="5"} 0
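Filtering that endpoint down to the tracked shard can be done with a one-line grep. The scrape below is a hardcoded stand-in for `curl -s http://$your_backup_node_ip:4040/metrics`, using the sample values above; the line ending in " 1" is the shard being tracked:

```shell
# Hypothetical metrics scrape; in production pipe curl output into the grep.
metrics='near_client_tracked_shards{shard_id="0"} 0
near_client_tracked_shards{shard_id="1"} 1
near_client_tracked_shards{shard_id="2"} 0'

# Keep only metrics whose value is 1, i.e. the tracked shard(s).
echo "$metrics" | grep ' 1$'
```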

Another option would be to run the backup node with RPC configs:

.tracked_shards = [111]
.store.load_mem_tries_for_tracked_shards = false

and change them to the following only once the validator key has been moved to the node:

.tracked_shards = []
.store.load_mem_tries_for_tracked_shards = true

The downside is that this requires a neard restart.
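That config flip can be scripted with jq. A minimal sketch, using a hardcoded stand-in for the existing config.json (in practice, read the real file, write to a temp file, and move it over config.json before restarting neard):

```shell
# Hypothetical minimal config; stands in for the node's real config.json.
cfg='{"tracked_shards":[111],"store":{"load_mem_tries_for_tracked_shards":false}}'

# Flip both settings for validator mode in one jq pass.
echo "$cfg" | jq -c '.tracked_shards = [] | .store.load_mem_tries_for_tracked_shards = true'
```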

pznamensky commented 1 week ago

Hey @andrei-near, We've tried one more time using the hot-reload approach and it worked! The only change we made this time was picking up the latest config.json. I don't know exactly what was wrong with our config, but now everything works fine. Thank you 🙇