near / stakewars-iii

Stake Wars: Episode 3 challenges and place to report issues

neard.services: Failed with result 'signal' #81

Open ksalab opened 2 years ago

ksalab commented 2 years ago

Good evening. After I restarted the node (08/04/2022), the following error began to appear. How can I fix it, or what else can be done?

[Image: near_error_2022-08-04 18 15 17]

ksalab commented 2 years ago

After the last update (to commit 78ef2f55857d6118047efccf070ae0f7ddb232ea), the error began to appear more frequently.

Thesephi commented 2 years ago

I can confirm: a shardnet node of mine constantly restarts after upgrading to the latest commit 78ef2f55857d6118047efccf070ae0f7ddb232ea:

Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: A process of this unit has been killed by the OOM killer.
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Failed with result 'oom-kill'.
Aug 04 16:44:15 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Consumed 1min 53.846s CPU time.
Aug 04 16:44:45 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 1.
Aug 04 16:44:45 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: Stopped NEARd Daemon Service for Shardnet.
Aug 04 16:44:45 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: neard.service: Consumed 1min 53.846s CPU time.
Aug 04 16:44:46 ubuntu-s-4vcpu-8gb-fra1-01 systemd[1]: Started NEARd Daemon Service for Shardnet.
Aug 04 16:44:46 ubuntu-s-4vcpu-8gb-fra1-01 neard[1745]: 2022-08-04T16:44:46.144730Z  INFO neard: version="trunk" build="1.1.0-2580-g78ef2f558" latest_protocol=100

Or

[Image: Screen Shot 2022-08-04 at 18 47 01]

All CPUs seem to be over-utilised, which did NOT happen before upgrading to commit 78ef2f55857d6118047efccf070ae0f7ddb232ea:

[Image: Screen Shot 2022-08-04 at 18 51 44]

This node's spec is exactly as recommended in Challenge 2: 4 CPUs, 8 GB RAM.

DDeAlmeida commented 2 years ago

@mina86 @bowenwang1996

mina86 commented 2 years ago

Looking. Nothing obvious so far. The best course at the moment is to roll back to the previous commit or, if you care to, try bisecting.

ssq0-0 commented 2 years ago

After the update, the node restarts on its own. Because of this it cannot finish downloading blocks: they start loading from the beginning each time.

[Image]

mina86 commented 2 years ago

Can you enable debug logging? Putting a log_config.json file in ~/.near with the content { "verbose_module": "" } should be enough.
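For anyone who prefers to script that step, here is a minimal Python sketch that writes the file. The function name `write_log_config` is made up for illustration; the file name, directory, and JSON content are taken from the comment above, and nothing here is verified against neard itself.

```python
import json
from pathlib import Path


def write_log_config(near_home: Path) -> Path:
    """Write log_config.json into the given neard home directory.

    The JSON content is copied from the comment above: an empty
    "verbose_module" string.
    """
    near_home.mkdir(parents=True, exist_ok=True)
    path = near_home / "log_config.json"
    path.write_text(json.dumps({"verbose_module": ""}) + "\n")
    return path


# Usage, assuming the node's home directory is ~/.near as in the comment:
# write_log_config(Path.home() / ".near")
```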

SNSMLN commented 2 years ago

After updating to commit 78ef2f55857d6118047efccf070ae0f7ddb232ea, neard crashes and the service restarts periodically.

In syslog I see this: node01 kernel: [94077.608260] Out of memory: Killed process 621769 (neard) total-vm:9639640kB, anon-rss:3376208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14368kB oom_score_adj:0

I think there is not enough memory. After I increased the swap file from 4 GB to 8 GB, the error did not reappear.
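To read the numbers in that kernel line more easily, a small Python sketch like the following can pull out the fields and convert them from kB to GiB. The regular expression only covers the fields shown in the line quoted above, and the helper name `parse_oom_kill` is hypothetical.

```python
import re

# Matches the "Out of memory: Killed process ..." line as it appears in this thread.
OOM_LINE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)

KB_PER_GIB = 1024 ** 2


def parse_oom_kill(line: str):
    """Return pid, process name and memory figures in GiB, or None if no match."""
    m = OOM_LINE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m["pid"]),
        "comm": m["comm"],
        "total_vm_gib": int(m["total_vm_kb"]) / KB_PER_GIB,
        "anon_rss_gib": int(m["anon_rss_kb"]) / KB_PER_GIB,
    }


line = (
    "node01 kernel: [94077.608260] Out of memory: Killed process 621769 (neard) "
    "total-vm:9639640kB, anon-rss:3376208kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:1000 pgtables:14368kB oom_score_adj:0"
)
info = parse_oom_kill(line)
print(f"{info['comm']} (pid {info['pid']}): "
      f"anon-rss {info['anon_rss_gib']:.2f} GiB, total-vm {info['total_vm_gib']:.2f} GiB")
```

For the line above this gives roughly 3.22 GiB of resident anonymous memory and 9.19 GiB of virtual address space at the moment neard was killed.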

My node is currently running version 2580:

Aug  4 17:16:29 node01 neard[621769]: 1753496 2y3krntKw4AzJLkWmksG5c4ap8aNc3Tt7723M41txKar Processed in progress for 8904ms orphan for 1255ms missing chunks for 7338ms Chunks:(✔✔X✔))
Aug  4 17:16:29 node01 neard[621769]: 1753495 FLfixbgsgi9NvHgfFAn9A5GPU5HKuGjs6tCGF8f2Q9fj Processed in progress for 3468ms orphan for 3364ms  Chunks:(✔✔✔✔))
Aug  4 17:16:29 node01 neard[621769]: 1753492 3jHP45j7EiMfdw5ikZeaar5xXS8U48Pspi5wMex1ZKCi Processed in progress for 3237ms orphan for 1540ms missing chunks for 1494ms Chunks:(✔✔XX))
Aug  4 17:16:29 node01 neard[621769]: 1753491 8YPRXhdFHQeVEUFGk9yzLb5rq9vwDJUmRhd9SHSnhrET Processed in progress for 1560ms orphan for 1540ms  Chunks:(✔✔X✔))
Aug  4 17:16:29 node01 neard[621769]: 1753489 EBtohyn2Zpi32FPrMmRk1VnDo787qfPZGi9UwPHqCwZg Processed in progress for 1685ms orphan for 732ms  Chunks:(✔✔X✔))
Aug  4 17:16:29 node01 neard[621769]: 1753488 EuibYb2RJS9JTi2WGCQRxz9TFa5ra5M4Zg1wx5QRaqvv Processed in progress for 807ms  missing chunks for 613ms Chunks:(✔..X))
Aug  4 17:16:33 node01 kernel: [94073.917545] [UFW BLOCK] IN=eth0 OUT= MAC=96:00:01:74:b5:d7:d2:74:7f:6e:37:e3:08:00 SRC=20.225.168.150 DST=95.217.7.222 LEN=40 TOS=0x00 PREC=0x00 TTL=240 ID=58774 PROTO=TCP SPT=1216 DPT=15481 WINDOW=1024 RES=0x00 SYN URGP=0
Aug  4 17:16:35 node01 kernel: [94075.592574] tmux: server invoked oom-killer: gfp_mask=0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=1, oom_score_adj=0
Aug  4 17:16:35 node01 kernel: [94075.592585] CPU: 1 PID: 2182 Comm: tmux: server Not tainted 5.4.0-122-generic #138-Ubuntu               
Aug  4 17:16:35 node01 kernel: [94075.592586] Hardware name: Hetzner vServer, BIOS 20171111 11/11/2017                                    
Aug  4 17:16:35 node01 kernel: [94075.592588] Call Trace:                                                                                 
Aug  4 17:16:35 node01 kernel: [94075.592604]  dump_stack+0x6d/0x8b                                                                       
Aug  4 17:16:35 node01 kernel: [94075.592607]  dump_header+0x4f/0x1eb                                                                     
Aug  4 17:16:35 node01 kernel: [94075.592608]  oom_kill_process.cold+0xb/0x10                                                             
Aug  4 17:16:35 node01 kernel: [94075.592611]  out_of_memory+0x1cf/0x4d0                                                                  
Aug  4 17:16:35 node01 kernel: [94075.592614]  __alloc_pages_slowpath+0xd5e/0xe50                                                         
Aug  4 17:16:35 node01 kernel: [94075.592617]  __alloc_pages_nodemask+0x2d0/0x320                                                         
Aug  4 17:16:35 node01 kernel: [94075.592619]  alloc_pages_current+0x87/0xe0                                                              
Aug  4 17:16:35 node01 kernel: [94075.592621]  __get_free_pages+0x11/0x40                                                                 
Aug  4 17:16:35 node01 kernel: [94075.592622]  pgd_alloc+0x37/0x210                                                                       
Aug  4 17:16:35 node01 kernel: [94075.592624]  mm_init+0x1be/0x2b0                                                                        
Aug  4 17:16:35 node01 kernel: [94075.592626]  dup_mm+0x59/0x120                                                                          
Aug  4 17:16:35 node01 kernel: [94075.592627]  copy_process+0x1601/0x1ae0                                                                 
Aug  4 17:16:35 node01 kernel: [94075.592628]  _do_fork+0x89/0x360                                                                        
Aug  4 17:16:35 node01 kernel: [94075.592631]  ? recalc_sigpending+0x1c/0x60                                                              
Aug  4 17:16:35 node01 kernel: [94075.592632]  ? __set_task_blocked+0x38/0xa0                                                             

......

Aug  4 17:16:37 node01 kernel: [94077.608137] [ 455864]     0 455864     1783      466    53248       86             0 tmux: client       
Aug  4 17:16:37 node01 kernel: [94077.608138] [ 599047]     0 599047     2377     1264    57344       95             0 nethogs            
Aug  4 17:16:37 node01 kernel: [94077.608140] [ 614516]  1000 614516     2071      258    49152      413             0 bash               
Aug  4 17:16:37 node01 kernel: [94077.608142] [ 614529]  1000 614529   279718      403  1576960    11618             0 cargo              
Aug  4 17:16:37 node01 kernel: [94077.608143] [ 621769]  1000 621769  2409910   844052 14712832   397474             0 neard              
Aug  4 17:16:37 node01 kernel: [94077.608168] [ 623168]     0 623168     3301      504    61440      286             0 sshd               
Aug  4 17:16:37 node01 kernel: [94077.608170] [ 623169]   112 623169     3043       53    61440      227             0 sshd               
Aug  4 17:16:37 node01 kernel: [94077.608172] [ 623349]  1000 623349   141357    15845  1028096    36779             0 rustc              
Aug  4 17:16:37 node01 kernel: [94077.608174] [ 625378]  1000 625378    80226    19811   614400       26             0 rustc              
Aug  4 17:16:37 node01 kernel: [94077.608175] [ 625664]     0 625664     1809      161    45056      101             0 bash               
Aug  4 17:16:37 node01 kernel: [94077.608176] [ 625665]     0 625665     1782        0    49152        0             0 swapoff            
Aug  4 17:16:37 node01 kernel: [94077.608178] [ 625668]     0 625668      652      102    40960        0             0 sh                 
Aug  4 17:16:37 node01 kernel: [94077.608179] [ 625669]     0 625669      652      398    40960        0             0 byobu-status       
Aug  4 17:16:37 node01 kernel: [94077.608181] [ 625672]     0 625672     1747      438    53248        0             0 tmux: client       
Aug  4 17:16:37 node01 kernel: [94077.608182] [ 625673]     0 625673     9701     3687   110592     4581             0 tmux: server       
Aug  4 17:16:37 node01 kernel: [94077.608183] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/neard.service,task=neard,pid=621769,uid=1000
Aug  4 17:16:37 node01 kernel: [94077.608260] Out of memory: Killed process 621769 (neard) total-vm:9639640kB, anon-rss:3376208kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:14368kB oom_score_adj:0
Aug  4 17:16:37 node01 systemd[1]: neard.service: Main process exited, code=killed, status=9/KILL                                         
Aug  4 17:16:37 node01 systemd[1]: neard.service: Failed with result 'signal'.                                                            

Aug  4 17:17:07 node01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 6.                                         
Aug  4 17:17:07 node01 systemd[1]: Stopped Near node.

mm-near commented 2 years ago

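The OOMs might happen due to the increased amount of RAM needed during initial sync (especially as nodes are exchanging information about the current graph, etc.).

To work around it, try one of two things:

  • increase RAM to 16 GB, or
  • lower the number of nodes that you're connecting to. This can be done in the config by lowering these values: "ideal_connections_lo": 30, "ideal_connections_hi": 35. You could set them to, for example, 10 and 15 and see if it helps.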

blntbytk commented 2 years ago

The OOMs might happen due to the increased amount of RAM needed during initial sync (especially as nodes are exchanging information about the current graph, etc.).

To work around it, try one of two things:

  • increase RAM to 16 GB, or
  • lower the number of nodes that you're connecting to. This can be done in the config by lowering these values: "ideal_connections_lo": 30, "ideal_connections_hi": 35. You could set them to, for example, 10 and 15 and see if it helps.

I tried your first option with 16 GB of RAM and the latest code from git. There are no more signal failures, and it seems OK.
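For the second option, a rough Python sketch of changing the two values programmatically might look like this. It assumes the node config is a JSON file; the helper names and the file path are illustrative, and because the exact position of the keys inside the file is not stated in this thread, the function searches nested objects instead of relying on a fixed location.

```python
import json
from pathlib import Path


def set_ideal_connections(config: dict, lo: int, hi: int) -> int:
    """Set ideal_connections_lo/hi wherever they occur; return how many keys changed."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    changed = 0
    for key, value in config.items():
        if key == "ideal_connections_lo":
            config[key] = lo
            changed += 1
        elif key == "ideal_connections_hi":
            config[key] = hi
            changed += 1
        elif isinstance(value, dict):
            # Recurse into nested sections of the config.
            changed += set_ideal_connections(value, lo, hi)
    return changed


def update_config_file(path: Path, lo: int, hi: int) -> int:
    """Load a JSON config, apply the new limits, and write it back."""
    config = json.loads(path.read_text())
    changed = set_ideal_connections(config, lo, hi)
    path.write_text(json.dumps(config, indent=2) + "\n")
    return changed


# Usage (the path is an assumption; adjust it to where your config lives):
# update_config_file(Path.home() / ".near" / "config.json", lo=10, hi=15)
```

A return value of 0 means neither key was found, in which case the file is rewritten without any change to the limits.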

ksalab commented 2 years ago

Unfortunately, changing these parameters did not help. I tried both

"ideal_connections_lo": 10,
"ideal_connections_hi": 15,

and

"ideal_connections_lo": 5,
"ideal_connections_hi": 10,

and the result was identical.

I then tried:

"ideal_connections_lo": 1,
"ideal_connections_hi": 5,

...and the result was identical again.