near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

[Node Issue]: Validator Migration Giving Errors #12140

Closed: Frixoe closed this issue 1 month ago

Frixoe commented 1 month ago

Contact Details

suryansh@luganodes.com

What happened?

We have 3 machines that were synced using the same config. We were migrating from machine 1 to machine 2, and machine 2 crashed.

So we tried to move the key back to machine 1, but then machine 1 stopped syncing. We resynced machine 2 from the snapshot, and whenever the key is not on either machine, both run properly with no issues. But the moment we restart with the key, the node stops syncing.
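For reference, a key move of this kind amounts to roughly the following; a minimal sketch, assuming the default ~/.near home directory and a systemd unit named neard (actual paths, hostnames, and unit names may differ):

# on the source machine: stop the node and take the validator key out of the home dir
sudo systemctl stop neard
mv ~/.near/validator_key.json /tmp/validator_key.json

# copy the key to the target machine (hostname is illustrative)
scp /tmp/validator_key.json target-machine:~/.near/validator_key.json

# on the target machine: restart the node so it picks up the key
sudo systemctl restart neard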

On machine 3, we did a fresh snapshot download and restarted with the validator key. At first we see an error like this:

WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0

Then after a restart, we see:

WARN chunks: Error processing partial encoded chunk: ChainError(InvalidChunkHeight)

and if we restart again, we see:

ERROR metrics: Error when exporting postponed receipts count DB Not Found Error: BLOCK: AX8wFPoyVoULT9N7hMcVLxJtPBYZj4EkBhbhThKYZ7WN.

But once the validator key is removed and the node is restarted, the node syncs with no issues. Then, if we put the validator key back, it stops syncing.

The validator key does not appear to be in use on any other machine.
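After each restart, the sync state can be checked against the node's status RPC; a minimal sketch, assuming the default RPC port 3030 and jq installed:

curl -s http://localhost:3030/status | jq '.sync_info.syncing, .sync_info.latest_block_height'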

We also saw another error while trying to sync machine 3 from the snapshot (after one restart of neard) with the validator key on it:

Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]: 2024-09-11T08:47:26.647486Z  WARN near_store::db::rocksdb: target="store::db::rocksdb" making a write batch took a very long time, make smaller transactions! elapsed=8.637442105s backtrace=
   0: <near_store::db::rocksdb::RocksDB as near_store::db::Database>::write
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    1: near_store::StoreUpdate::commit
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    2: near_chain::store::ChainStoreUpdate::commit
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    3: near_chain::garbage_collection::<impl near_chain::store::ChainStore>::reset_data_pre_state_sync
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    4: near_client::client_actor::ClientActorInner::run_sync_step
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    5: near_client::client_actor::ClientActorInner::check_triggers
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    6: <actix::address::envelope::SyncEnvelopeProxy<M> as actix::address::envelope::EnvelopeProxy<A>>::handle
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    7: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    8: tokio::runtime::task::raw::poll
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:    9: tokio::task::local::LocalSet::tick
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   10: tokio::task::local::LocalSet::run_until::{{closure}}
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   11: std::sys_common::backtrace::__rust_begin_short_backtrace
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   12: core::ops::function::FnOnce::call_once{{vtable.shim}}
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   13: std::sys::pal::unix::thread::Thread::new::thread_start
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   14: <unknown>
Sep 11 08:47:26 near-mainnet-validator-happy-lamarr-ovh-fr neard[49048]:   15: <unknown>

Then, once the logs start moving, we see entries like this:

WARN client: Dropping tx me=Some(AccountId("validator_contract.pool.near")) tx=SignedTransaction { transaction: V0(TransactionV0 { signer_id: AccountId("0-relay.hot.tg"), public_key: ed25519:HD71jFrShGZfUYY2QqtMgq6Y4wCoVffTqwFJ8ayixXL9, nonce: 114347435947781, receiver_id: AccountId("7254785709.tg"), block_hash: 9au2oGsybmTqUUW17Re9wSgM24vrw94m2mvHp4PUcLs7, actions: [Delegate(SignedDelegateAction { delegate_action: DelegateAction { sender_id: AccountId("7254785709.tg"), receiver_id: AccountId("game.hot.tg"), actions: [NonDelegateAction(FunctionCall(FunctionCallAction { method_name: l2_claim, args: eyJjaGFyZ2VfZ2FzX2ZlZSI6ZmFsc2UsInNpZ25hdHVyZSI6IjI0MWQxZDlhNWYxODE1MDViYzA2ZGMxNTJhZTM3NTNiMGIyMGU3OGM1NGRmZGQ0YzU5MWY4YjEzZDRiMTg4NjAiLCJtaW5pbmdfdGltZSI6Ijk4Mjk2MyIsIm1heF90cyI6IjE3MjYwNDIxNTkyNjA0NzEwNDAifQ==, gas: 300000000000000, deposit: 0 }))], nonce: 120626739001019, max_block_height: 128190301, public_key: ed25519:9NHXSXiyLnU2zV7z44CAVDp1jxqnAuF5Fbyxfj7tGeJM }, signature: ed25519:2yeC1AURAAUG55b5rcGnezjna77tJXaX9pTdtHvCmTCE3V1reyKnoya5SnFyfm8kVmiBtpwtUXfLku8VG2Bq76iN })] }), signature: ed25519:3XUsbK8KDvfcumfyVTLn6yTriY5Mdc5Yigk74wuBMHNQgGWih3ZgNwnni3UmUjbqsQUWSHsi2wLY44NmDLzKTv5W, hash: BmEeoZxr1mUwcqeVUnXaeNyN8gCtBDCxceyYQFNVAWpJ, size: 461 }

Version

neard (release 2.2.0) (build 2.2.0) (rustc 1.79.0) (protocol 71) (db 40)
features: [default, json_rpc, rosetta_rpc]

Node type

RPC (Default)

Are you a validator?

Relevant log output

INFO stats: State 4cupEDQnemFXCa6s9Z3ankCBgj4sXDSnb2SP4KJTX51T[0: parts] 32 peers ⬇ 5.13 MB/s ⬆ 4.08 MB/s 0.00 bps 0 gas/s CPU: 296%, Mem: 22.4 GB
INFO stats: State 4cupEDQnemFXCa6s9Z3ankCBgj4sXDSnb2SP4KJTX51T[0: apply in progress] 31 peers ⬇ 5.27 MB/s ⬆ 4.63 MB/s 0.00 bps 0 gas/s CPU: 306%, Mem: 6.47 GB

Node head info

"CHUNK_TAIL": 127685197
"FINAL_HEAD": Tip { height: 127685193, last_block_hash: AX8wFPoyVoULT9N7hMcVLxJtPBYZj4EkBhbhThKYZ7WN, prev_block_hash: 5eeheX9m3Et51gnLenQh8t6SvECYZ5JkfqTgqwHjs6KK, epoch_id: EpochId(A6faGmnyqHh6gZYa8bX8NjrPuPYqY3eG1TtzDS67GXp), next_epoch_id: EpochId(EuzPvfoXd71sfgPRVoiGME3muncV5h3bhkJMc8bm3CCp) }
"FORK_TAIL": 127484451
"GENESIS_JSON_HASH": 93on1kcuqTXU94zGyGvBm3YYpPqCkaM8bssbxndgbeRX
"GENESIS_STATE_ROOTS": [8EhZRfDTYujfZoUZtZ3eSMB9gJyFo5zjscR12dEcaxGU]
"HEAD": Tip { height: 127685195, last_block_hash: D7rQvNeRWaD1fEZywBgEfRxNLrU1mb8GNKSa8tys46eW, prev_block_hash: 4kGUzZKyw1964LMtQsz9ukgLgDn6uV6KZDQyCviTzFRw, epoch_id: EpochId(A6faGmnyqHh6gZYa8bX8NjrPuPYqY3eG1TtzDS67GXp), next_epoch_id: EpochId(EuzPvfoXd71sfgPRVoiGME3muncV5h3bhkJMc8bm3CCp) }
"HEADER_HEAD": Tip { height: 127791665, last_block_hash: EnYcS4d4CqfXR3A2axXuPn4XcJni8qiLsPHLdUR4XimF, prev_block_hash: 5L2GZk24n36fivKJeqkFZKcx2dWLw9XeFPzhnKHbapXU, epoch_id: EpochId(7TUPSvHkWBZS81zzqHi16C2PheG7dJ6svjyvXeH5vWmk), next_epoch_id: EpochId(EsEQwWtjtURiwejU64TR5CVR37kCv6fWi5nN7W4yVRCs) }
"LARGEST_TARGET_HEIGHT": 127685406
"LATEST_KNOWN": LatestKnown { height: 127791665, seen: 1726043931818204310 }
"STATE_SYNC_DUMP:\0\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"STATE_SYNC_DUMP:\u{1}\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"STATE_SYNC_DUMP:\u{2}\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"STATE_SYNC_DUMP:\u{3}\0\0\0\0\0\0\0": AllDumped { epoch_id: EpochId(4c3AEoBnXPoqPM8cQHxqRfXbq5hm6CpJAv9okUSGMMxZ), epoch_height: 2362 }
"SYNC_HEAD": Tip { height: 13740748, last_block_hash: 69A1wh25GwoD2CzEuQhs8D2goWPXqe1Liu3jq1i5tdMS, prev_block_hash: 8ZdbgiXn3JpfGGdMGByMqVE5GppKYrojjHJjgZajNV8Z, epoch_id: EpochId(EeWh36LxiVaZgQRsyCzyAUBhaL5yACKZSjK2vTobAC4d), next_epoch_id: EpochId(7edSVzdsSoo1ujdy79abYv3ztbfx7WDawdhCgKhK5qjj) }
"TAIL": 127484451

Node upgrade history

We were migrating to a node with the latest version (2.2.0) and started facing this issue.

DB reset history

Multiple times today on all our machines.

staffik commented 1 month ago

Could you share config.json for each machine? Do you move node_key.json too, or just validator_key.json? So after upgrading all machines to 2.2.0, it only works if neither machine runs as a validator?
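For context, a default ~/.near home directory contains roughly the following files; the annotations below are an illustrative sketch of which pieces are per-machine and which carry the validator identity, not an authoritative layout:

ls ~/.near
# config.json         node configuration (per machine)
# genesis.json        chain genesis
# node_key.json       network/peer identity key of this machine
# validator_key.json  validator signing key (the file being migrated)
# data/               RocksDB store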

telezhnaya commented 1 month ago

@Frixoe do you still have this problem?

Frixoe commented 1 month ago

@telezhnaya No, we don't.