near / stakewars-iii

Stake Wars: Episode 3 challenges and place to report issues

Node is getting OOM'ed without a valid reason #55

Closed: freak12techno closed this issue 1 year ago

freak12techno commented 2 years ago

I am running a shardnet full node and have hit this issue twice, each time a few hours after an upgrade: the node uses all the RAM and CPU, then gets OOM-killed. Here are two Grafana graphs. RAM usage:

[image: Grafana graph of RAM usage]

and loadavg1 / cores:

[image: Grafana graph of loadavg1 / cores]

I didn't do anything with the server at that time, so it's unlikely to be something I did. It seems to be either a chain issue or my hardware not being powerful enough.

The logs are almost 100% this (or similar, with the same pattern: <number> <string> in progress for <time>s orphan for <time>s Chunks:(....)):

[image: screenshot of the node logs]
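A rough way to check the "almost 100%" estimate against a saved log is to count how many lines fit the pattern described above. This is a minimal sketch: the regular expression is built only from the pattern text quoted in this comment, and the real log format may differ, so treat it as an assumption.

```python
import re

# Assumption: lines of interest contain
# "in progress for <time>s ... orphan for <time>s", as described above.
# The actual node log format may differ.
STUCK_BLOCK = re.compile(r"in progress for \S+s.*orphan for \S+s")

def stuck_fraction(lines):
    """Return the share of non-empty lines that match the stuck-block pattern."""
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for ln in non_empty if STUCK_BLOCK.search(ln))
    return hits / len(non_empty)
```

It can be called on an open file, for example `stuck_fraction(open("node.log"))`, where `node.log` is a hypothetical file name for the saved log.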

I am using a Contabo VPS S to host the full node. I've already ordered an upgrade, but I wonder if there are other things I've missed that could be causing this. Can you help?

Thanks a lot in advance.

DDeAlmeida commented 2 years ago

@bowenwang1996

mm-near commented 2 years ago

Thanks for the report, @freak12techno.

Do you happen to have a full log somewhere by any chance?

(At first glance, it seems that your node got stuck for a bit: it was trying to request block 1468427 and sent requests to fetch 3 chunks (that's what the 'arrow up' means), but it hasn't received them yet.)

Still, that should not cause the OOM. How much RAM did you give to this job?

freak12techno commented 2 years ago

> Do you happen to have a full log somewhere by any chance?

Unfortunately I don't; it apparently got rotated. However, as I stated above, most of the messages were exactly the same as in the screenshot.

> How much RAM did you give to this job?

I have 8GB of RAM. I guess the node ate all of it and the process was then killed by the system.
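Whether the system really killed the process for lack of memory can usually be confirmed from the kernel log (for example the output of `dmesg` or `journalctl -k`), where the Linux OOM killer records a "Killed process <pid> (<name>)" line. The sketch below scans such lines; the function name is hypothetical, and the process name you should expect to see (likely `neard` for a NEAR node) is an assumption.

```python
import re

# The Linux OOM killer logs lines containing "Killed process <pid> (<name>)".
# Surrounding fields (total-vm, anon-rss, ...) vary by kernel version,
# so only the pid and process name are extracted here.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def find_oom_kills(kernel_log_lines):
    """Return (pid, process_name) for each OOM kill found in kernel log lines."""
    kills = []
    for line in kernel_log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills
```

An empty result does not prove there was no OOM kill, since the kernel log may have been rotated or cleared since the crash.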

freak12techno commented 2 years ago

Okay, I've hit it a third time now. Otherwise it's exactly the same as I reported before. The last 75k lines of logs are here: https://gist.github.com/freak12techno/183f2478a0fdf28f89961b64b0d6cacd

freak12techno commented 2 years ago

This seems critical btw: I cannot maintain stable uptime because my node constantly crashes and consumes all the available resources :(

joesixpack commented 2 years ago

I don't know if it matters, but Contabo doesn't give you 100% physical memory on their VPS. IIRC, half of it is actually disk-based swap at the hypervisor level.

freak12techno commented 2 years ago

UPD: it was apparently fixed after upgrading the Contabo VPS, so that might be the solution. I still don't know the reason behind the issue, though.