near / stakewars-iii

Stake Wars: Episode 3 challenges and place to report issues

neard lost connection to the network. #135

Open SNSMLN opened 2 years ago

SNSMLN commented 2 years ago

For some unknown reason, my neard node stopped processing chunks, and the network height it reported stopped growing, as the Grafana screenshot shows. This happened at about 9:50 a.m. I checked syslog and near.log and found nothing at that time. The number of peers did not change significantly (also visible in the Grafana screenshot), yet neard was neither receiving new blocks nor processing them.

Over the following hour, neard used up all available RAM and almost the entire swap file. I restarted it manually at 10:45:07 AM.

near.log contains a message about the process failure:

Aug 29 10:45:12 node01 neard[4185074]: thread 'main' panicked at 'there is no reactor running, must be called from the context of a Tokio 1.x runtime', /home/near/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.18.2/src/runtime/context.rs:54:26
Aug 29 10:45:12 node01 neard[4185074]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

After the restart, it turned out that neard was 1907 blocks behind the network.

neard version (latest): neard (release trunk) (build 1.1.0-2685-gfe435d02c) (rustc 1.63.0) (protocol 101) (db 31)

This situation is very unpleasant for me. My neard has now been synchronizing for more than 2 hours while trying to resume signing chunks.

Screenshots: near lost net 2, near lost net
Attachments: syslog.log, near.log

SNSMLN commented 2 years ago

I think this is not an isolated case. In debug/pages/network_info I have many times seen nodes whose network height is stuck on a random block.
Screenshots: near lost net 3, near lost net 4
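A stall like this can be caught early by watching the reported height and flagging it when it stops advancing. The following is only a minimal sketch: it assumes you already have (timestamp, height) samples from some source such as the metrics behind the Grafana panels, and the function names and the 120-second threshold are hypothetical, not part of neard.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, int]  # (unix timestamp in seconds, block height)


def seconds_since_height_changed(samples: Sequence[Sample]) -> float:
    """Return how long the height has been flat at the end of the series.

    `samples` must be sorted by timestamp. The result is measured from the
    last sample at which the height increased to the final sample.
    """
    if not samples:
        return 0.0
    last_change_ts, last_height = samples[0]
    for ts, height in samples[1:]:
        if height > last_height:
            last_height = height
            last_change_ts = ts
    return samples[-1][0] - last_change_ts


def is_stalled(samples: Sequence[Sample], threshold_s: float = 120.0) -> bool:
    """Hypothetical alert rule: height flat for at least `threshold_s` seconds."""
    return seconds_since_height_changed(samples) >= threshold_s


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    demo = [(0.0, 100), (60.0, 150), (120.0, 150), (240.0, 150)]
    print(seconds_since_height_changed(demo), is_stalled(demo))
```

The threshold is a judgment call: too low and normal block-time jitter triggers it, too high and the node falls far behind before anyone notices.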

SNSMLN commented 2 years ago

The problem recurred on the latest version of neard: { "version": "trunk", "build": "1.1.0-2735-g1897d5144", "rustc_version": "1.63.0" }

My VPS: 8 CPUs, 32 GB RAM + 8 GB swap, 160 GB SSD.

neard began to use all available memory and CPU, and the network height stopped growing. This time I did not restart the service manually. The Linux kernel killed the service at 04:37:12 with an OOM error. I did not find anything in the near log.

kernel: [590242.745574] Out of memory: Killed process 1162105 (neard) total-vm:57164348kB, anon-rss:31193928kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:91388kB oom_score_adj:0
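To put the kernel's numbers in perspective, the kB fields of the OOM line can be converted to GiB. This is a small illustrative sketch; the helper names are made up for the example, and the log line is the one quoted above with the split in "file-rss" repaired.

```python
import re

# OOM-killer line from the report (the "fil e-rss" split is repaired here).
OOM_LINE = (
    "kernel: [590242.745574] Out of memory: Killed process 1162105 (neard) "
    "total-vm:57164348kB, anon-rss:31193928kB, file-rss:0kB, shmem-rss:0kB, "
    "UID:1000 pgtables:91388kB oom_score_adj:0"
)


def parse_oom_kb(line: str) -> dict:
    """Collect every `name:<number>kB` field of an OOM-killer line, in kB."""
    return {name: int(value) for name, value in re.findall(r"([\w-]+):(\d+)kB", line)}


def kb_to_gib(kb: int) -> float:
    """Convert kB (as printed by the kernel, 1 kB = 1024 bytes) to GiB."""
    return kb / (1024 * 1024)


if __name__ == "__main__":
    for name, kb in parse_oom_kb(OOM_LINE).items():
        print(f"{name}: {kb_to_gib(kb):.2f} GiB")
```

The anon-rss value works out to a little under 30 GiB of resident memory, which is consistent with neard having consumed nearly all of the machine's RAM before the kernel killed it.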

In summary, I missed 3 epochs (((. Not very good on an incentivized testnet ((

Here are the full system log and neard.log (12 MB, bz2-compressed): syslog.log, neard.log.bz2.log

Screenshots: near lost net 11, near lost net 12, near lost net 13

mm-near commented 2 years ago

Thanks for filing the issue.

I've looked at the bz2 file that you included and found a potential issue/bug.

It seems that your node took a long time to download the chunks for older blocks:

WaitingForChunks in progress for 5387826ms orphan for 2031597ms missing chunks for **3356211ms** Chunks:(⬇⬇✔⬇⬇))
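The millisecond figures in that line are easier to judge in minutes. A minimal sketch for extracting and converting them follows; the regular expression is written only against the one line quoted above (with the bold markers and the trailing chunk icons dropped), so it is an assumption that other neard log lines follow the same `<label> for <N>ms` shape.

```python
import re

# The quoted log line, without the emphasis markers and the Chunks:(...) suffix.
LOG_LINE = (
    "WaitingForChunks in progress for 5387826ms "
    "orphan for 2031597ms missing chunks for 3356211ms"
)


def parse_durations_ms(line: str) -> dict:
    """Map each `<label> for <N>ms` fragment to its duration in milliseconds."""
    pairs = re.findall(r"([A-Za-z ]+?) for (\d+)ms", line)
    return {label.strip(): int(ms) for label, ms in pairs}


def ms_to_minutes(ms: int) -> float:
    return ms / 60_000


if __name__ == "__main__":
    for label, ms in parse_durations_ms(LOG_LINE).items():
        print(f"{label}: {ms_to_minutes(ms):.1f} min")
```

In other words, the block had been waiting for roughly an hour and a half in total, close to an hour of it with chunks missing. The orphan and missing-chunks durations add up to 5387808 ms, within 18 ms of the in-progress total.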

This was probably caused by a bad/unreliable peer, and the system didn't re-route the request correctly.