Open wackazong opened 1 month ago
Config: https://ctxt.io/2/AAC4C18fFg
May be related: https://github.com/near/nearcore/issues/11927
I added generous swap and now the node is downloading blocks.
I did some memory monitoring while setting up another node from a fresh snapshot. Strangely enough, memory usage never exceeded 30% on the 32GB machine, even though I had seen the same behaviour on that machine before.
Overnight, the second node started showing the error while downloading blocks and subsequently failed:
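For anyone who wants to repeat the memory monitoring, here is a minimal sketch of how it could be done on Linux by sampling /proc/meminfo. This is not the tooling used above. The function names `parse_meminfo`, `usage_percent` and `sample_once` are made up for illustration, and the numbers are system-wide, not specific to neard.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def usage_percent(info):
    """Return (ram_used_percent, swap_used_percent) for the whole system."""
    ram_used = 100.0 * (1 - info["MemAvailable"] / info["MemTotal"])
    swap_total = info.get("SwapTotal", 0)
    if swap_total:
        swap_used = 100.0 * (1 - info.get("SwapFree", 0) / swap_total)
    else:
        swap_used = 0.0  # no swap configured
    return ram_used, swap_used


def sample_once(path="/proc/meminfo"):
    """Read the file once and return the current usage percentages."""
    with open(path) as f:
        return usage_percent(parse_meminfo(f.read()))
```

Calling `sample_once()` every second or so and logging the result with a timestamp would give a trace to compare against the journal. A coarse sampling interval can miss a short spike, which may be one reason a monitor reports low usage shortly before an OOM kill.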
Oct 01 00:09:50 validator-a neard[626278]: 2024-09-30T22:09:50.794404Z INFO stats: #129220486 Downloading blocks 37.22% (47816 left; at 129220486) 31 peers ⬇ 7.23 MB/s ⬆ 5.43 MB/s 3.00 bps>
Oct 01 00:09:51 validator-a neard[626278]: 2024-09-30T22:09:51.971708Z WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:09:51 validator-a neard[626278]: 2024-09-30T22:09:51.978342Z WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:09:52 validator-a neard[626278]: 2024-09-30T22:09:52.081806Z WARN network: Message dropped because TTL reached 0. msg=RoutedMessageV2 { msg: RoutedMessage { target: PeerId(ed2551>
Oct 01 00:10:00 validator-a neard[626278]: 2024-09-30T22:10:00.941000Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:23 validator-a neard[626278]: 2024-09-30T22:35:23.339729Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:24 validator-a neard[626278]: 2024-09-30T22:35:24.342663Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:25 validator-a neard[626278]: 2024-09-30T22:35:25.348669Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:26 validator-a neard[626278]: 2024-09-30T22:35:26.355773Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:27 validator-a neard[626278]: 2024-09-30T22:35:27.458409Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:28 validator-a neard[626278]: 2024-09-30T22:35:28.969291Z WARN garbage collection: Error in gc: GC Error: block on canonical chain shouldn't have refcount 0
Oct 01 00:35:29 validator-a systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Failed with result 'oom-kill'.
Oct 01 00:35:31 validator-a systemd[1]: neard.service: Consumed 4h 19min 15.235s CPU time.
Oct 01 00:36:01 validator-a systemd[1]: neard.service: Scheduled restart job, restart counter is at 1.
Oct 01 00:36:01 validator-a systemd[1]: Stopped Run a NEAR protocol node.
Oct 01 00:36:01 validator-a systemd[1]: neard.service: Consumed 4h 19min 15.235s CPU time.
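To help compare runs, here is a rough sketch of a script that summarizes a journal excerpt like the one above. It counts the GC error lines and the OOM-related systemd lines, and keeps the last "Downloading blocks" progress value. The function name `summarize` and the key names are made up for illustration, and the patterns are taken only from the log lines in this issue, so other neard versions may word these messages differently.

```python
import re

# Substrings copied from the log excerpt in this issue.
GC_ERROR = "GC Error: block on canonical chain shouldn't have refcount 0"
OOM_MARKERS = ("killed by the OOM killer", "Failed with result 'oom-kill'")
# Matches e.g. "Downloading blocks 37.22% (47816 left; at 129220486)".
PROGRESS = re.compile(
    r"Downloading blocks (\d+(?:\.\d+)?)% \((\d+) left; at (\d+)\)"
)


def summarize(lines):
    """Count GC errors and OOM lines, and keep the last sync progress seen."""
    summary = {"gc_errors": 0, "oom_lines": 0, "last_progress": None}
    for line in lines:
        if GC_ERROR in line:
            summary["gc_errors"] += 1
        if any(marker in line for marker in OOM_MARKERS):
            summary["oom_lines"] += 1
        match = PROGRESS.search(line)
        if match:
            # (percent done, blocks left, current height)
            summary["last_progress"] = (
                float(match.group(1)),
                int(match.group(2)),
                int(match.group(3)),
            )
    return summary
```

The input could be the lines of `journalctl -u neard.service` output saved to a file. Note that `oom_lines` counts matching log lines, not kills. A single kill produces two matching lines in the excerpt above.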
The setup was 32GB of RAM and 128GB of swap.
I had multiple backup servers with the same node key running. This might have affected the issue presented here.
Contact Details
alexander@wackazong.com
Node type
Top 100 Validator
Which network are you running?
mainnet
What happened?
I am currently trying to get a new backup node started. I followed the requirements for the new backup node scenario and downloaded a new snapshot. My machine has 15 cores, 32GB of RAM and four fast SSDs in RAID0.
I start neard with the backup config recommended for scenario 2 (the new one).
First, it downloads headers. Then, as soon as the headers are downloaded, I get these lines in the log. I never get a log message indicating that any blocks have been downloaded to catch up with the network.
Then, after about 30 minutes, I get this
What can I do to help analyse this issue?
Version
Relevant log output
Node head info
Node upgrade history
DB reset history