ksalab opened 2 years ago
After the last update (to commit 78ef2f55857d6118047efccf070ae0f7ddb232ea), the error began to appear more frequently.
I'd like to confirm: a shardnet node of mine constantly restarts after upgrading to the latest commit 78ef2f55857d6118047efccf070ae0f7ddb232ea:
```
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Failed with result 'oom-kill'.
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Consumed 1min 53.846s CPU time.
Aug 04 16:44:45 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 1.
Aug 04 16:44:45 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: Stopped NEARd Daemon Service for Shardnet.
Aug 04 16:44:45 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Consumed 1min 53.846s CPU time.
Aug 04 16:44:46 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: Started NEARd Daemon Service for Shardnet.
Aug 04 16:44:46 ubuntu-s-4vcpu-8gb-fra1-01 neard[1745]: 2022-08-04T16:44:46.144730Z INFO neard: version="trunk" build="1.1.0-2580-g78ef2f558" latest_protocol=100
```
Or
All CPUs seem to be over-utilised, which did NOT happen before upgrading to commit 78ef2f55857d6118047efccf070ae0f7ddb232ea. This node's spec is exactly as recommended in Challenge 2: 4 CPU, 8 GB RAM.
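For reference, per-CPU utilisation can be confirmed from the shell (mpstat comes from the sysstat package, which may need installing first):

```sh
# Per-CPU utilisation, sampled every second.
mpstat -P ALL 1
# Or watch neard's CPU/memory share interactively.
top -p "$(pgrep -x neard)"
```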
@mina86 @bowenwang1996
Looking into it. Nothing obvious so far. The best course at the moment is to roll back to the previous commit, or, if you care to, try bisecting (a sketch of the bisect workflow is below).
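In case it helps, a minimal sketch of a bisect session inside a nearcore checkout; the known-good commit is a placeholder you'd fill in yourself, and the exact build command for shardnet may differ:

```sh
git bisect start
git bisect bad 78ef2f55857d6118047efccf070ae0f7ddb232ea  # first commit seen misbehaving
git bisect good <LAST_GOOD_COMMIT>                       # placeholder: last commit that worked for you
# git now checks out a midpoint commit; rebuild and run the node
# (the shardnet build may need extra feature flags):
cargo build --release -p neard
# ...run the node and watch for OOM kills, then report the verdict:
git bisect good   # or: git bisect bad
# Repeat until git names the first bad commit, then clean up:
git bisect reset
```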
After the update, the node restarts on its own, plus it is not possible to load blocks for this reason, they start loading from the beginning
Can you enable debug logging? Putting a log_config.json file in ~/.near with { "verbose_module": "" } as its content should be enough.
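For example, a minimal sketch (this assumes the default ~/.near home directory; adjust the path if the node runs with a different --home):

```sh
# Create the debug log config in the node's home directory.
cat > ~/.near/log_config.json <<'EOF'
{ "verbose_module": "" }
EOF
# Restart the service so neard picks up the new logging config.
sudo systemctl restart neard
```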
After updating to commit 78ef2f55857d6118047efccf070ae0f7ddb232ea, neard faults and the service periodically restarts.
In syslog I see this: `node01 kernel: [94077.608260] Out of memory: Killed process 621769 (neard) total-vm:9639640kB, anon-rss:3376208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14368kB oom_score_adj:0`
I think there is not enough memory. When I increased the swap file from 4 to 8 GB, the error did not reappear (a sketch of the resize is below).
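For reference, one way to grow a swap file from 4 to 8 GB (a sketch; /swapfile is an assumed path -- check `swapon --show` or /etc/fstab for the real one):

```sh
# Show the current swap file and its size.
swapon --show
# Take the swap file offline, grow it to 8 GB, and re-enable it.
sudo swapoff /swapfile
sudo fallocate -l 8G /swapfile   # assumed path; dd also works here
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```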
My node is currently running build 2580:
```
Aug 4 17:16:29 node01 neard[621769]: 1753496 2y3krntKw4AzJLkWmksG5c4ap8aNc3Tt7723M41txKar Processed in progress for 8904ms orphan for 1255ms missing chunks for 7338ms Chunks:(✔✔X✔))
Aug 4 17:16:29 node01 neard[621769]: 1753495 FLfixbgsgi9NvHgfFAn9A5GPU5HKuGjs6tCGF8f2Q9fj Processed in progress for 3468ms orphan for 3364ms Chunks:(✔✔✔✔))
Aug 4 17:16:29 node01 neard[621769]: 1753492 3jHP45j7EiMfdw5ikZeaar5xXS8U48Pspi5wMex1ZKCi Processed in progress for 3237ms orphan for 1540ms missing chunks for 1494ms Chunks:(✔✔XX))
Aug 4 17:16:29 node01 neard[621769]: 1753491 8YPRXhdFHQeVEUFGk9yzLb5rq9vwDJUmRhd9SHSnhrET Processed in progress for 1560ms orphan for 1540ms Chunks:(✔✔X✔))
Aug 4 17:16:29 node01 neard[621769]: 1753489 EBtohyn2Zpi32FPrMmRk1VnDo787qfPZGi9UwPHqCwZg Processed in progress for 1685ms orphan for 732ms Chunks:(✔✔X✔))
Aug 4 17:16:29 node01 neard[621769]: 1753488 EuibYb2RJS9JTi2WGCQRxz9TFa5ra5M4Zg1wx5QRaqvv Processed in progress for 807ms missing chunks for 613ms Chunks:(✔..X))
Aug 4 17:16:33 node01 kernel: [94073.917545] [UFW BLOCK] IN=eth0 OUT= MAC=96:00:01:74:b5:d7:d2:74:7f:6e:37:e3:08:00 SRC=20.225.168.150 DST=95.217.7.222 LEN=40 TOS=0x00 PREC=0x00 TTL=240 ID=58774 PROTO=TCP SPT=1216 DPT=15481 WINDOW=1024 RES=0x00 SYN URGP=0
Aug 4 17:16:35 node01 kernel: [94075.592574] tmux: server invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=1, oom_score_adj=0
Aug 4 17:16:35 node01 kernel: [94075.592585] CPU: 1 PID: 2182 Comm: tmux: server Not tainted 5.4.0-122-generic #138-Ubuntu
Aug 4 17:16:35 node01 kernel: [94075.592586] Hardware name: Hetzner vServer, BIOS 20171111 11/11/2017
Aug 4 17:16:35 node01 kernel: [94075.592588] Call Trace:
Aug 4 17:16:35 node01 kernel: [94075.592604] dump_stack+0x6d/0x8b
Aug 4 17:16:35 node01 kernel: [94075.592607] dump_header+0x4f/0x1eb
Aug 4 17:16:35 node01 kernel: [94075.592608] oom_kill_process.cold+0xb/0x10
Aug 4 17:16:35 node01 kernel: [94075.592611] out_of_memory+0x1cf/0x4d0
Aug 4 17:16:35 node01 kernel: [94075.592614] __alloc_pages_slowpath+0xd5e/0xe50
Aug 4 17:16:35 node01 kernel: [94075.592617] __alloc_pages_nodemask+0x2d0/0x320
Aug 4 17:16:35 node01 kernel: [94075.592619] alloc_pages_current+0x87/0xe0
Aug 4 17:16:35 node01 kernel: [94075.592621] __get_free_pages+0x11/0x40
Aug 4 17:16:35 node01 kernel: [94075.592622] pgd_alloc+0x37/0x210
Aug 4 17:16:35 node01 kernel: [94075.592624] mm_init+0x1be/0x2b0
Aug 4 17:16:35 node01 kernel: [94075.592626] dup_mm+0x59/0x120
Aug 4 17:16:35 node01 kernel: [94075.592627] copy_process+0x1601/0x1ae0
Aug 4 17:16:35 node01 kernel: [94075.592628] _do_fork+0x89/0x360
Aug 4 17:16:35 node01 kernel: [94075.592631] ? recalc_sigpending+0x1c/0x60
Aug 4 17:16:35 node01 kernel: [94075.592632] ? __set_task_blocked+0x38/0xa0
......
Aug 4 17:16:37 node01 kernel: [94077.608137] [ 455864] 0 455864 1783 466 53248 86 0 tmux: client
Aug 4 17:16:37 node01 kernel: [94077.608138] [ 599047] 0 599047 2377 1264 57344 95 0 nethogs
Aug 4 17:16:37 node01 kernel: [94077.608140] [ 614516] 1000 614516 2071 258 49152 413 0 bash
Aug 4 17:16:37 node01 kernel: [94077.608142] [ 614529] 1000 614529 279718 403 1576960 11618 0 cargo
Aug 4 17:16:37 node01 kernel: [94077.608143] [ 621769] 1000 621769 2409910 844052 14712832 397474 0 neard
Aug 4 17:16:37 node01 kernel: [94077.608168] [ 623168] 0 623168 3301 504 61440 286 0 sshd
Aug 4 17:16:37 node01 kernel: [94077.608170] [ 623169] 112 623169 3043 53 61440 227 0 sshd
Aug 4 17:16:37 node01 kernel: [94077.608172] [ 623349] 1000 623349 141357 15845 1028096 36779 0 rustc
Aug 4 17:16:37 node01 kernel: [94077.608174] [ 625378] 1000 625378 80226 19811 614400 26 0 rustc
Aug 4 17:16:37 node01 kernel: [94077.608175] [ 625664] 0 625664 1809 161 45056 101 0 bash
Aug 4 17:16:37 node01 kernel: [94077.608176] [ 625665] 0 625665 1782 0 49152 0 0 swapoff
Aug 4 17:16:37 node01 kernel: [94077.608178] [ 625668] 0 625668 652 102 40960 0 0 sh
Aug 4 17:16:37 node01 kernel: [94077.608179] [ 625669] 0 625669 652 398 40960 0 0 byobu-status
Aug 4 17:16:37 node01 kernel: [94077.608181] [ 625672] 0 625672 1747 438 53248 0 0 tmux: client
Aug 4 17:16:37 node01 kernel: [94077.608182] [ 625673] 0 625673 9701 3687 110592 4581 0 tmux: server
Aug 4 17:16:37 node01 kernel: [94077.608183] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/neard.service,task=neard,pid=621769,uid=1000
Aug 4 17:16:37 node01 kernel: [94077.608260] Out of memory: Killed process 621769 (neard) total-vm:9639640kB, anon-rss:3376208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14368kB oom_score_adj:0
Aug 4 17:16:37 node01 systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Aug 4 17:16:37 node01 systemd[1]: neard.service: Failed with result 'signal'.
Aug 4 17:17:07 node01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 6.
Aug 4 17:17:07 node01 systemd[1]: Stopped Near node.
```
The OOMs might happen due to the increased amount of RAM needed during initial sync (especially as nodes exchange information about the current network graph, etc.). To work around it, try one of two things:
- increase RAM to 16 GB, OR
- lower the number of nodes that you're connecting to -- this can be done in the config, by lowering these values: "ideal_connections_lo": 30, "ideal_connections_hi": 35 -- you could set them to (for example) 10 and 15 and see if that helps (a sketch follows this list).
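A minimal sketch of that change (it assumes the config is at ~/.near/config.json with these keys under its "network" section, and that jq is installed):

```sh
# Back up the config, then lower the connection targets.
cp ~/.near/config.json ~/.near/config.json.bak
jq '.network.ideal_connections_lo = 10 | .network.ideal_connections_hi = 15' \
  ~/.near/config.json.bak > ~/.near/config.json
# Restart so neard reads the updated values.
sudo systemctl restart neard
```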
I tried your first option with 16 GB RAM and the new git commit: no more signal-fail errors, it seems OK.
Unfortunately, changing these parameters did not help ((( I tried
"ideal_connections_lo": 10,
"ideal_connections_hi": 15,
and then
"ideal_connections_lo": 5,
"ideal_connections_hi": 10,
and the result is identical.
I'll try:
"ideal_connections_lo": 1,
"ideal_connections_hi": 5,
...the result is identical ((("
Good evening. After restarting the node (08/04/2022), this error began to appear. How can it be fixed, or what else can be done?