Node Issue: freshly set up backup node cannot run with 32GB RAM

wackazong commented 1 month ago

Contact Details

alexander@wackazong.com

Node type

Top 100 Validator

Which network are you running?

mainnet

What happened?

I am currently trying to get a new backup node started. I followed the requirements for the new backup node scenario and downloaded a new snapshot. My machine has 15 Cores, 32GB RAM and four parallel fast SSDs in RAID0.

I start neard with the backup config recommended for scenario 2 (the new one).

  "tracked_shards": [],
  "tracked_shadow_validator": "myvalidatorname.poolv1.near",
  "state_sync_enabled": true,
  "load_mem_tries_for_tracked_shards": true,
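
A quick way to rule out a syntax slip in the config (such as a missing quote) is to load the file and print the four settings quoted above. This is only a minimal sketch, not part of nearcore: `find_key` and `backup_settings` are hypothetical helper names, and the lookup searches nested objects because the fragment does not show where each key sits in the full file.

```python
import json

# Keys taken from the backup-node fragment quoted above.
BACKUP_KEYS = (
    "tracked_shards",
    "tracked_shadow_validator",
    "state_sync_enabled",
    "load_mem_tries_for_tracked_shards",
)

_MISSING = object()  # sentinel, so a stored null/false is not confused with "absent"


def find_key(node, key):
    """Return the first value stored under `key` at any nesting depth, else _MISSING."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return _MISSING
    for child in children:
        found = find_key(child, key)
        if found is not _MISSING:
            return found
    return _MISSING


def backup_settings(config_text):
    """Parse the config JSON and return the backup-related keys (None if absent)."""
    config = json.loads(config_text)  # raises json.JSONDecodeError on a syntax error
    result = {}
    for key in BACKUP_KEYS:
        value = find_key(config, key)
        result[key] = None if value is _MISSING else value
    return result
```

Passing the text of the node's `config.json` (assumed here to live in the neard home directory) to `backup_settings` either raises a `json.JSONDecodeError` that points at the broken position or returns the values the node would be started with.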

First, it downloads headers. As soon as the headers are downloaded, I get lines like the following in the log. I never get a log message saying that any blocks have been downloaded to catch up with the network.

WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0

Then, after about 30 minutes, I get this:

Sep 30 09:12:07 validator-b systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Sep 30 09:12:09 validator-b systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Sep 30 09:12:09 validator-b systemd[1]: neard.service: Failed with result 'oom-kill'.
Sep 30 09:12:09 validator-b systemd[1]: neard.service: Consumed 2h 16min 6.405s CPU time.
Sep 30 09:12:39 validator-b systemd[1]: neard.service: Scheduled restart job, restart counter is at 27.
Sep 30 09:12:39 validator-b systemd[1]: Stopped Run a NEAR protocol node.
Sep 30 09:12:39 validator-b systemd[1]: neard.service: Consumed 2h 16min 6.405s CPU time.
Sep 30 09:12:39 validator-b systemd[1]: Started Run a NEAR protocol node.
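
To see how often this happens, the journal lines can be summarised with a small script. The sketch below assumes the systemd message format shown above; `summarize_oom` is a hypothetical helper name, and the regular expressions are fitted to these sample lines only.

```python
import re

# Matches lines such as:
#   Sep 30 09:12:09 validator-b systemd[1]: neard.service: Failed with result 'oom-kill'.
OOM_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]+) (?P<host>\S+) systemd\[1\]: "
    r"(?P<unit>\S+): Failed with result 'oom-kill'\."
)
# Matches the counter in "... Scheduled restart job, restart counter is at 27."
RESTART_RE = re.compile(r"restart counter is at (?P<count>\d+)\.")


def summarize_oom(lines):
    """Collect OOM-kill timestamps and the last restart counter seen in the lines."""
    oom_times = []
    last_restart = None
    for line in lines:
        oom = OOM_RE.match(line)
        if oom:
            oom_times.append(oom.group("ts"))
        restart = RESTART_RE.search(line)
        if restart:
            last_restart = int(restart.group("count"))
    return {"oom_kills": oom_times, "restart_counter": last_restart}
```

Feeding it the output of `journalctl -u neard.service --no-pager` (split into lines) gives the time of every OOM kill and the most recent restart counter, which makes it easier to tell whether the interval between kills is stable.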

What can I do to help analyse this issue?

Version

2.2.1

Relevant log output

see above

Node head info

2024-09-30T13:28:21.112530Z  WARN genesis: Skipped genesis validation
2024-09-30T13:28:21.112561Z  WARN genesis: Skipped genesis validation
thread 'main' panicked at tools/state-viewer/src/cli.rs:144:55:
called `Result::unwrap()` on an `Err` value: DbDoesNotExist
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: state_viewer::cli::StateViewerSubCommand::run
   4: neard::cli::NeardCmd::parse_and_run
   5: neard::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Aborted

Node upgrade history

Now

DB reset history

Now

telezhnaya commented 1 month ago

Config: https://ctxt.io/2/AAC4C18fFg

telezhnaya commented 1 month ago

wackazong commented 1 month ago

I added generous swap and now the node is downloading blocks.

wackazong commented 1 month ago

I did some memory monitoring while setting up another node from a fresh snapshot. Strangely enough, memory usage never went above 30% on a 32GB machine, even though I had seen the same behaviour on that machine before.
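
For anyone who wants to repeat this kind of monitoring, here is a minimal sketch that reads a process's memory counters from `/proc/<pid>/status` on Linux. The function names are hypothetical. Because periodic sampling can miss a short spike, it also reads `VmHWM`, the kernel's record of the peak resident set size.

```python
import re
from pathlib import Path

# Memory fields of interest in /proc/<pid>/status (values are reported in kB):
#   VmRSS  = current resident set size
#   VmHWM  = peak resident set size ("high water mark")
#   VmSwap = amount of the process swapped out
FIELD_RE = re.compile(r"^(VmRSS|VmHWM|VmSwap):\s+(\d+)\s+kB$", re.MULTILINE)


def parse_status(status_text):
    """Return {field: value in kB} for the memory fields found in the status text."""
    return {name: int(value) for name, value in FIELD_RE.findall(status_text)}


def read_process_memory(pid):
    """Read and parse /proc/<pid>/status for a running process (Linux only)."""
    return parse_status(Path(f"/proc/{pid}/status").read_text())
```

Calling `read_process_memory` with the neard PID at intervals and logging the result shows both current and peak usage, so a brief jump just before an OOM kill is less likely to go unnoticed.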

wackazong commented 1 month ago

Overnight, the second node started showing the error while downloading blocks and subsequently failed:

Oct 01 00:09:50 validator-a neard[626278]: 2024-09-30T22:09:50.794404Z  INFO stats: #129220486 Downloading blocks 37.22% (47816 left; at 129220486) 31 peers ⬇ 7.23 MB/s ⬆ 5.43 MB/s 3.00 bps>
Oct 01 00:09:51 validator-a neard[626278]: 2024-09-30T22:09:51.971708Z  WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:09:51 validator-a neard[626278]: 2024-09-30T22:09:51.978342Z  WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:09:52 validator-a neard[626278]: 2024-09-30T22:09:52.081806Z  WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:10:00 validator-a neard[626278]: 2024-09-30T22:10:00.941000Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0

Oct 01 00:35:23 validator-a neard[626278]: 2024-09-30T22:35:23.339729Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:24 validator-a neard[626278]: 2024-09-30T22:35:24.342663Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:25 validator-a neard[626278]: 2024-09-30T22:35:25.348669Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:26 validator-a neard[626278]: 2024-09-30T22:35:26.355773Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:27 validator-a neard[626278]: 2024-09-30T22:35:27.458409Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:28 validator-a neard[626278]: 2024-09-30T22:35:28.969291Z  WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:29 validator-a systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Failed with result 'oom-kill'.
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Consumed 4h 19min 15.235s CPU time.
Oct 01 00:36:01 validator-a systemd[1]: neard.service: Scheduled restart job, restart counter is at 1.
Oct 01 00:36:01 validator-a systemd[1]: Stopped Run a NEAR protocol node.
Oct 01 00:36:01 validator-a systemd[1]: neard.service: Consumed 4h 19min 15.235s CPU time.

The setup was 32GB of RAM and 128GB of swap.

wackazong commented 1 month ago

I had multiple backup servers running with the same node key. This might have contributed to the issue described here.
