Hi @pznamensky. We've just tested failover option 2 with nearcore 2.1.1 internally and it went through as expected. Could you please:
1) provide the configuration for the primary and secondary nodes
2) confirm what happens with the primary and secondary node_key? It should NOT be moved from primary to secondary; moving it can lead to nodes not being able to sync (other peers see the new IP, but the key is already associated with the primary node's IP address, so they will not allow syncing).
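For a quick sanity check, a sketch of comparing the two node keys (the `primary` SSH host and the `~/.near` paths are assumptions, not part of this report):

```bash
# Print the node key on the failover node and on the primary;
# the two ed25519 public keys should differ.
jq -r .public_key ~/.near/node_key.json
ssh primary 'jq -r .public_key ~/.near/node_key.json'
```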
Hi @andrei-near, previously we used to copy over node_key from primary to secondary; however, we've also tried without replacing node_key on the secondary, and that didn't work either.
Here is our config.json with the `public_addrs` field redacted, which we use with the "restart" approach. In the "hot reload" approach, `tracked_shadow_validator` was also added.
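For context, a minimal sketch of how that field is added on the secondary (`<our_pool_id_here>` is the same placeholder used later in this thread, and `~/.near` is an assumed home directory):

```bash
# Add tracked_shadow_validator to the failover node's config.json via jq.
jq '.tracked_shadow_validator = "<our_pool_id_here>"' ~/.near/config.json \
  > /tmp/config.json && mv /tmp/config.json ~/.near/config.json
```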
Also, I'm not sure if it's related, but in this epoch the secondary node crashes after copying validator_key and restarting neard:
```
2024-09-06T15:24:20.298880Z INFO neard: version="2.1.1" build="unknown" latest_protocol=70
2024-09-06T15:24:20.299083Z WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.299093Z INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.301007Z WARN genesis: Skipped genesis validation
2024-09-06T15:24:20.301022Z INFO config: Validating Genesis config and records. This could take a few minutes...
2024-09-06T15:24:20.301227Z INFO config: All validations have passed!
2024-09-06T15:24:20.302980Z INFO neard: Reset the config "/root/.near/log_config.json" because the config file doesn't exist. err=Os { code: 2, kind: NotFound, message: "No such file or directory" }
2024-09-06T15:24:20.303069Z WARN neard: /root/.near/config.json: encountered unrecognised fields: consensus.reduce_wait_for_missing_block, consensus.state_fetch_horizon, store.trie_cache_capacities, store.background_migration_threads
2024-09-06T15:24:20.303076Z INFO config: Validating Config, extracted from config.json...
2024-09-06T15:24:20.303098Z INFO neard: Hot loading validator key /root/.near/validator_key.json.
2024-09-06T15:24:20.303140Z INFO near_o11y::reload: Updated the logging layer according to `log_config.json`
2024-09-06T15:24:20.303162Z INFO db_opener: Opening NodeStorage path="/root/.near/data" cold_path="none"
2024-09-06T15:24:20.303180Z INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.325975Z INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.325990Z INFO db_opener: The database exists. path=/root/.near/data
2024-09-06T15:24:20.325994Z INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.542850Z INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.542881Z INFO db: Opened a new RocksDB instance. num_instances=1
2024-09-06T15:24:20.563855Z INFO db: Closed a RocksDB instance. num_instances=0
2024-09-06T15:24:20.563874Z INFO db: Opened a new RocksDB instance. num_instances=1
thread 'main' panicked at chain/client/src/client_actor.rs:168:6:
called `Result::unwrap()` on an `Err` value: Chain(StorageError(StorageInconsistentState("No ChunkExtra for block 7CfkJyEdHnJiZ4kEURa6p8HPEBwDj4eZcqMNKetrV9eD in shard s5.v3")))
stack backtrace:
0: 0x5641c7fea424 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h1e1a1972118942ad
1: 0x5641c68a589b - core::fmt::write::hc090a2ffd6b28c4a
2: 0x5641c7fb608e - std::io::Write::write_fmt::h8898bac6ff039a23
3: 0x5641c7febae9 - std::sys_common::backtrace::print::ha96650907276675e
4: 0x5641c7feb409 - std::panicking::default_hook::{{closure}}::h215c2a0a8346e0e0
5: 0x5641c7fec4e5 - std::panicking::rust_panic_with_hook::hac8bdceee1e4fe2c
6: 0x5641c7febe72 - std::panicking::begin_panic_handler::{{closure}}::h00d785e82757ce3c
7: 0x5641c7febdc9 - std::sys_common::backtrace::__rust_end_short_backtrace::h1628d957bcd06996
8: 0x5641c7febdb6 - rust_begin_unwind
9: 0x5641c6544072 - core::panicking::panic_fmt::hdc63834ffaaefae5
10: 0x5641c6544485 - core::result::unwrap_failed::h82b551e0ff2b2176
11: 0x5641c7b54f03 - nearcore::start_with_config_and_synchronization::haab8ba90a9fdf122
12: 0x5641c673c498 - neard::cli::RunCmd::run::{{closure}}::h24af1b6de28656ca
13: 0x5641c6709429 - tokio::task::local::LocalSet::run_until::{{closure}}::h60aa1f7ad8e92db2
14: 0x5641c672db25 - neard::cli::NeardCmd::parse_and_run::h124534cdc5ba48bd
15: 0x5641c673fd13 - neard::main::h3bab0c83991e2cc8
16: 0x5641c66f6253 - std::sys_common::backtrace::__rust_begin_short_backtrace::h1a2f36ec6ad85c72
17: 0x5641c6769eff - main
18: 0x7f6c2cb6cd90 - <unknown>
19: 0x7f6c2cb6ce40 - __libc_start_main
20: 0x5641c66bd905 - _start
21: 0x0 - <unknown>
```
Here is our config.json
Is it the config for the primary node, the failover node, or both?
That's for both nodes.
It has `tracked_shards = []` and it does not have `tracked_shadow_validator`, thus the failover node does not track anything. I think that might be why it crashed (`No ChunkExtra ..`).
Thanks for the update. Unfortunately, with `tracked_shadow_validator` set, neard keeps crashing with the same error.
Could you send the configs for both nodes and share the steps that you took?
@staffik sure! Here is the config.json with the redacted `public_addrs` field. Also, `"tracked_shadow_validator": "<our_pool_id_here>"` was added on the secondary node (and the node restarted).
Steps to perform "hot" failover:
1) Make sure the secondary node is in sync: `curl -s localhost:3030/status | jq -C .sync_info`
2) Stop `neard` on the primary server
3) Copy `validator_key.json` to the secondary node
4) Send SIGHUP to `neard` on the secondary node

This results in these messages and no progress in downloading blocks:
```
INFO stats: #126754210 Downloading blocks 0.00% (115 left; at 126754210) 32 peers ⬇ 8.00 MB/s ⬆ 6.72 MB/s 0.00 bps 0 gas/s CPU: 165%, Mem: 8.52 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...
INFO stats: #126754210 Downloading blocks 0.00% (123 left; at 126754210) 32 peers ⬇ 8.05 MB/s ⬆ 6.75 MB/s 0.00 bps 0 gas/s CPU: 89%, Mem: 8.50 GB
...
ERROR near: Unwrap error: Other: Node has not caught up yet
...
```
In case we restart the secondary node instead of sending SIGHUP, it results in a crash as in this message: https://github.com/near/nearcore/issues/12030#issuecomment-2334389264
PS: this happens on 2.2.1 too
@pznamensky, we repeatedly tested the hot failover scenario with 2.2.0 and 2.2.1 and it works as expected. The failover node must be healthy and run with these params set:
`.tracked_shards = []`
`.store.load_mem_tries_for_tracked_shards = true`
`.tracked_shadow_validator = "$your_pool_name"`
Steps we run for hot-swap failover:
1) Copy the validator key onto the new node
2) Stop the neard service on the old validator
3) Send SIGHUP to the neard process to pick up the validator key: `sudo pkill -HUP neard`
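As a rough sketch of those steps (hostnames, paths, and the systemd unit name are assumptions):

```bash
# 1) Copy the validator key onto the failover node (node_key.json stays put)
scp old-validator:~/.near/validator_key.json ~/.near/validator_key.json
# 2) Stop the neard service on the old validator
ssh old-validator 'sudo systemctl stop neard'
# 3) Tell the failover neard to pick up the validator key without a restart
sudo pkill -HUP neard
```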
I would recommend downloading the freshest snapshot on the failover node and starting it with the above configuration set.
Hi @andrei-near, thanks for checking! We use a clean server with a fresh snapshot and exactly the same steps and config options that you've described. Is there a chance that this issue can only be reproduced for our validator? Also, what can we do to find the root cause? Try debug mode (if any), dump metrics, etc.?
@pznamensky do you gather metrics from your node somewhere? I would like to check whether the backup node is tracking the active validator's shard. If you open your node metrics at http://$your_backup_node_ip:4040/metrics you should see the following metrics showing which shards the backup node is tracking:
```
near_client_tracked_shards{shard_id="0"} 0
near_client_tracked_shards{shard_id="1"} 1
near_client_tracked_shards{shard_id="2"} 0
near_client_tracked_shards{shard_id="3"} 0
near_client_tracked_shards{shard_id="4"} 0
near_client_tracked_shards{shard_id="5"} 0
```
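To pull just those lines from the endpoint mentioned above:

```bash
# Show which shards the backup node currently tracks (value 1 = tracked).
curl -s http://localhost:4040/metrics | grep near_client_tracked_shards
```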
Another option would be to run the backup node with RPC configs:
`.tracked_shards = [111]`
`.store.load_mem_tries_for_tracked_shards = false`
and change them to the following only once the validator key has been moved to the node:
`.tracked_shards = []`
`.store.load_mem_tries_for_tracked_shards = true`
The downside is that it will require a neard restart.
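A sketch of that switch using jq (the `~/.near` path and a systemd unit named `neard` are assumptions):

```bash
# Before failover: run the backup with RPC-style tracking.
jq '.tracked_shards = [111] | .store.load_mem_tries_for_tracked_shards = false' \
  ~/.near/config.json > /tmp/config.json && mv /tmp/config.json ~/.near/config.json
# Once validator_key.json has been moved over: switch to the validator
# settings and restart, as noted above.
jq '.tracked_shards = [] | .store.load_mem_tries_for_tracked_shards = true' \
  ~/.near/config.json > /tmp/config.json && mv /tmp/config.json ~/.near/config.json
sudo systemctl restart neard
```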
Hey @andrei-near, we've tried one more time using the hot-reload approach and it worked! The only change we made this time was picking up the latest config.json. I don't know what exactly was wrong with our config, but now everything works fine. Thank you 🙇
Describe the bug
Both failover approaches (restarting the secondary with validator_key.json and hot swap) are not working on a validator running v2.1.1. The secondary node gets stuck trying to download blocks.
To Reproduce
To reproduce the problem, bootstrap a secondary node, wait until it catches up with all the blocks (`curl -s localhost:3030/status | jq -C .sync_info`), and then follow one of the two possible approaches described here: https://github.com/near/nearcore/releases/tag/2.0.0. On older versions (1.x) we used to use the 1st approach (copy over validator_key and node_key), but we've tried the 2nd one too and neither works. After failing over, the node can't catch up with blocks; please see more logs below.
Expected behavior
We should be able to perform failover.
Version (please complete the following information):
Additional context
Here is the log right after sending the SIGHUP signal to the backup node (see this doc): neard.log