near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Validator mainnet node restarted with `memory allocation failed` issue #5931

Open ok-everstake opened 2 years ago

ok-everstake commented 2 years ago

Description: The validator mainnet node has been restarted twice with a `memory allocation failed` issue. Node memory usage for the past 2 days: avg 5.78 GB, max 7.6 GB, with 64 GB on the server.

To Reproduce: Nothing special.

Expected behavior: The validator node runs without unpredictable restarts.

Logs: Error on the first restart:

Dec 20 21:27:31.391 INFO stats: #55625082 4CQ95gS4Ru9vZ8JyQR4YHkab26X6cq4gC9UHe5oEa4dT V/59 34/33/40 peers ⬇ 2.3MiB/s ⬆ 2.0MiB/s 0.80 bps 74.19 Tgas/s CPU: 111%, Mem: 6.7 GiB
memory allocation of 3584 bytes failed

Error on the second restart:

Dec 21 12:22:02.268 INFO stats: #55670205 9WvRTCzVpP6n7GxFGVyLZdJN5cJyZGYWwVaNHuK1EMYS V/52 34/30/40 peers ⬇ 828.6kiB/s ⬆ 2.1MiB/s 0.90 bps 72.72 Tgas/s CPU: 47%, Mem: 7.1 GiB
thread '<unnamed>' panicked at 'TODO creating memory cannot fail: Region("System call failed: Cannot allocate memory (os error 12)")', runtime/near-vm-runner/src/wasmer2_runner.rs:56:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Dec 21 12:22:04.278 ERROR network: Received error sending message to client: Mailbox has closed for Some(ed25519:Faoro1uDYXH2AFBA1eG4C5pQaLrk4vhWu6KW2pXpnaD8@142.132.143.229:24567)
memory allocation of 34563456 bytes failed

Version:

ok-everstake commented 2 years ago

Validator mainnet node has been restarted two more times today (Dec 23). Available memory on the server - min 56% for the past 2 days.

Dec 23 15:13:46.459  INFO stats: #55825362 59GdowyAkXRarsvkXUfUZzJCDc6cA7bLum8ZZ9GspNpy V/63 34/26/40 peers ⬇ 682.8kiB/s ⬆ 1.4MiB/s 0.90 bps 289.36 Tgas/s CPU: 77%, Mem: 7.8 GiB
memory allocation of 897332 bytes failed
Dec 23 17:32:05.006  INFO stats: #55832121 8GxXvZzu9bB3H2o52xMZ7WTBkokP3egeRJqSLZ3nnZrD V/65 33/27/40 peers ⬇ 2.4MiB/s ⬆ 2.1MiB/s 0.80 bps 213.27 Tgas/s CPU: 88%, Mem: 6.8 GiB
memory allocation of 2528 bytes failed
bowenwang1996 commented 2 years ago

@matklad is it possible that wasmer sometimes tries to allocate a lot of memory? It could very well have been something else, but I just want to double check.

kucharskim commented 2 years ago

This occurs every couple of days on several of our nodes:

Jan 03 19:54:49 near03 neard[28265]: thread '<unnamed>' panicked at 'TODO creating memory cannot fail: Region("System call failed: Cannot allocate memory (os error 12)")', runtime/near-vm-runner/src/wasmer2_runner.rs:56:14
kucharskim commented 2 years ago

We are bumping limits up to:

$ grep -e ^Limit /etc/systemd/system/neard.service
LimitDATA=32G
LimitRSS=32G
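
For reference, systemd's LimitDATA= corresponds to the process's RLIMIT_DATA. A minimal standalone sketch for double-checking which limit the running process actually got (assumes the `libc` crate; not part of neard; `prlimit -p <pid>` shows the same thing from the shell):

use std::io;

fn main() -> io::Result<()> {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: `lim` is a valid, writable rlimit struct.
    if unsafe { libc::getrlimit(libc::RLIMIT_DATA, &mut lim) } != 0 {
        return Err(io::Error::last_os_error());
    }
    let show = |v: libc::rlim_t| {
        if v == libc::RLIM_INFINITY { "unlimited".to_string() } else { v.to_string() }
    };
    println!("RLIMIT_DATA soft={} hard={}", show(lim.rlim_cur), show(lim.rlim_max));
    Ok(())
}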
kucharskim commented 2 years ago

...we were at 20GB before, for 9 months or so.

matklad commented 2 years ago

cc @nagisa

This particular line indeed allocates a sizable chunk of memory: 64 MiB. It's always 64 MiB though; it's not variable. It's also serviced directly by mmap, bypassing the allocator. And we do this for every contract call, which happens rather often.

So I am not surprised that this is the line that fails under memory-exhaustion conditions.

I would be surprised, though, if this line were to blame for the relative memory increase: if we had a leak here, I'd expect it to be a really big problem that we'd notice immediately.
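
For illustration, a standalone sketch of the kind of allocation involved (this is not the actual wasmer/nearcore code; `reserve_guest_memory` is a made-up name, the exact flags are illustrative, and it assumes the `libc` crate). When the kernel returns ENOMEM, the error surfaces as the "Cannot allocate memory (os error 12)" panic seen in the logs:

use std::{io, ptr};

fn reserve_guest_memory(len: usize) -> io::Result<*mut u8> {
    // SAFETY: anonymous private mapping; the kernel chooses the address.
    let p = unsafe {
        libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    if p == libc::MAP_FAILED {
        // ENOMEM here is "Cannot allocate memory (os error 12)".
        return Err(io::Error::last_os_error());
    }
    Ok(p.cast())
}

fn main() {
    // One reservation of this size happens per contract call.
    match reserve_guest_memory(64 * 1024 * 1024) {
        Ok(_) => println!("reserved 64 MiB"),
        Err(e) => panic!("creating memory must not fail: {e}"),
    }
}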

nagisa commented 2 years ago

There are situations other than exhaustion of physical memory that can cause an attempt to allocate memory to return ENOMEM. In particular, the failure modes are (paraphrasing the man pages for mmap):

- the process's maximum number of mappings (vm.max_map_count) would be exceeded; or
- the process's RLIMIT_DATA limit would be exceeded.

Here specifically it is important to note that both of these limits pertain to virtual memory rather than resident memory. (As a side note, RLIMIT_RSS has had no effect since Linux 2.4.30.) So both of these failure modes are plausible points of concern. The map count limit in particular is a limit on the number of mapped regions a process can have, and it can be quite low by default:

$ sysctl vm.max_map_count
vm.max_map_count = 65530

So it is possible to receive this error after leaking just 250 MiB worth of pages or so (65,530 single-page mappings × 4 KiB ≈ 256 MiB), which would not necessarily be very apparent using typical monitoring tools.
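
If it helps to see that failure mode in isolation, here is a standalone sketch (assumes the `libc` crate; not nearcore code) that creates one-page anonymous mappings until mmap starts failing. Each mapping is a separate region counted against vm.max_map_count, so at a 4 KiB page size roughly 65k of them (≈256 MiB) is enough to get ENOMEM; alternating protections keeps the kernel from merging adjacent mappings into one region:

use std::{io, ptr};

fn main() {
    let page: usize = 4096;
    for i in 0u64.. {
        // Alternate protections so neighbouring mappings cannot be merged.
        let prot = if i % 2 == 0 {
            libc::PROT_READ | libc::PROT_WRITE
        } else {
            libc::PROT_READ
        };
        let p = unsafe {
            libc::mmap(
                ptr::null_mut(),
                page,
                prot,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
                -1,
                0,
            )
        };
        if p == libc::MAP_FAILED {
            // Typically ENOMEM once vm.max_map_count is reached.
            eprintln!("mmap #{i} failed: {}", io::Error::last_os_error());
            break;
        }
    }
}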

Now I cannot really tell for sure where we may be hitting these cases: the increase could be so quick that none of the usual sampling-based mechanisms would catch the peak, and debugging whether we hold onto memory maps excessively would require a core dump after an OOM crash.


As a quick'n'easy check, is there any chance anybody experiencing this collects virtual memory utilization metrics over time and could share a slice of these metrics around the time of crash here?

As another check, is there any chance anybody experiencing this has coredumps enabled and a coredump of such an occurrence saved? Listing the mapping table for this coredump would be helpful.
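
Short of a coredump, one cheap signal is the live mapping count: each line of /proc/<pid>/maps is one region counted against vm.max_map_count, so sampling it over time would show whether mappings are being leaked. A standalone sketch (std only):

use std::{env, fs};

fn main() {
    // Pass a PID (e.g. of neard) as the first argument, or default to "self".
    let pid = env::args().nth(1).unwrap_or_else(|| "self".to_string());
    let maps = fs::read_to_string(format!("/proc/{pid}/maps"))
        .expect("failed to read /proc/<pid>/maps");
    println!("{} memory mappings", maps.lines().count());
}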

ok-everstake commented 2 years ago

@nagisa I have some graphs, not sure if this helps. For the last 15 days, the drops line up with the times of the crashes/restarts:

- Memory utilization % (Zabbix vm.memory.utilization) [screenshot]
- Node used RAM (Prometheus metrics) [screenshot]

I'll increase the vm.max_map_count limit to test this.

bowenwang1996 commented 2 years ago

@ok-everstake thanks for the graphs! It seems that your node either did not run out of memory, or ran out so quickly that Prometheus was not able to catch it.

gruberx commented 2 years ago

Description: I have a similar problem in StakeWars right now. Commit: https://github.com/near/nearcore/commit/8448ad1ebf27731a43397686103aa5277e7f2fcf, built with `cargo build -p neard --release --features shardnet`. The node is built and running inside Docker. The memory allocated to the container is 16 GB, but the node uses only about 3 GB before the restart. There was nothing like this on https://github.com/near/nearcore/releases/tag/1.28.0-rc.3

To Reproduce: Nothing special

Logs: At various intervals I see in the logs that the node is restarting. It has happened in two different ways; here is an example:

INFO stats: #  988989 52g39XebLjGs3TrrXZuGXutFEfdQecXwxA3GzFMh9RjB Validator | 97 validators 30 peers ⬇ 530 kB/s ⬆ 1.18 MB/s 0.70 bps 147 Tgas/s CPU: 49%, Mem: 3.33 GB
DEBUG stats: EpochId(`11111111111111111111111111111111`) Blocks in progress: 3 Chunks in progress: 0 Orphans: 0
thread '<unnamed>' panicked at 'creating memory must not fail: Region("System call failed: Cannot allocate memory (os error 12)")', runtime/near-vm-runner/src/wasmer2_runner.rs:53:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
INFO neard: version="trunk" build="crates-0.14.0-218-g8448ad1eb" latest_protocol=100
INFO db: Created a new RocksDB instance. num_instances=1
INFO db: Dropped a RocksDB instance. num_instances=0
INFO near: Opening RocksDB database path=/home/xxx/.near/data
INFO db: Created a new RocksDB instance. num_instances=1
INFO near_network::peer_manager::peer_manager_actor: Bandwidth stats total_bandwidth_used_by_all_peers=0 total_msg_received_count=0 max_max_record_num_messages_in_progress=0
INFO stats: #  988996 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 0%, Mem: 300 MB
DEBUG stats: EpochId(`11111111111111111111111111111111`) Blocks in progress: 0 Chunks in progress: 0 Orphans: 0
INFO stats: #  989004 6JVChTLQ9N3uDpp2WZZBSujGfZE8Vp1pqBhFduHboCGd Validator | 97 validators 19 peers ⬇ 271 kB/s ⬆ 504 kB/s 0.70 bps 184 Tgas/s CPU: 135%, Mem: 884 MB

Version:

/nearcore/target/release/neard --version
neard (release trunk) (build crates-0.14.0-218-g8448ad1eb) (rustc 1.62.0) (protocol 100) (db 31)

Additional context: memory usage graph [screenshot]

kucharskim commented 2 years ago

We had that on mainnet yesterday:

Jul 19 13:17:03 near01 neard[431910]: 2022-07-19T13:17:03.924068Z  INFO stats: #70188894 E4Xa2tcnWGFZjmkc53sQ8BFd74SwsdjvLbRaowg1dTa4 Validator | 100 validators 30 peers ⬇ 3.64 MB/s ⬆ 18.9 MB/s 0.70 bps 240 Tgas/s CPU: 167%, Mem: 9.78 GB
Jul 19 13:17:12 near01 neard[431910]: thread '<unnamed>' panicked at 'creating memory must not fail: Region("System call failed: Cannot allocate memory (os error 12)")', runtime/near-vm-runner/src/wasmer2_runner.rs:53:14
Jul 19 13:17:12 near01 neard[431910]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Jul 19 13:17:13 near01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Jul 19 13:17:13 near01 systemd[1]: neard.service: Failed with result 'signal'.
Jul 19 13:17:18 near01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 2.
Jul 19 13:17:18 near01 systemd[1]: Stopped NEAR daemon.
Jul 19 13:17:18 near01 systemd[1]: Starting NEAR daemon...
...
kucharskim commented 2 years ago
root@near01 ~ # prlimit -p `pgrep -x neard`
RESOURCE   DESCRIPTION                               SOFT        HARD UNITS
AS         address space limit                  unlimited   unlimited bytes
CORE       max core file size                           0   unlimited bytes
CPU        CPU time                             unlimited   unlimited seconds
DATA       max data size                      34359738368 34359738368 bytes
FSIZE      max file size                        unlimited   unlimited bytes
LOCKS      max number of file locks held        unlimited   unlimited locks
MEMLOCK    max locked-in-memory address space       65536       65536 bytes
MSGQUEUE   max bytes in POSIX mqueues              819200      819200 bytes
NICE       max nice prio allowed to raise               0           0 
NOFILE     max number of open files                 50000       50000 files
NPROC      max number of processes                 256754      256754 processes
RSS        max resident set size                unlimited   unlimited bytes
RTPRIO     max real-time priority                       0           0 
RTTIME     timeout for real-time tasks          unlimited   unlimited microsecs
SIGPENDING max number of pending signals           256754      256754 signals
STACK      max stack size                         8388608   unlimited bytes
gruberx commented 2 years ago

Clarification for my case (https://github.com/near/nearcore/issues/5931#issuecomment-1189079138): all of this happened after the hard fork; after the end of the first epoch the problem disappeared and neard stopped restarting.

kucharskim commented 2 years ago

This problem still occurs randomly:

Aug 28 04:41:22 near02 neard[717073]: memory allocation of 2074607616 bytes failed
Aug 28 04:41:23 near02 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
kucharskim commented 2 years ago

This was on an RPC / fullnode, not on a validator node:

$ curl -s localhost:3030/status | jq -r .validator_account_id
null
kucharskim commented 2 years ago

I didn't notice what was in the neard logs at the time, but I see this in `dmesg -T`:

[Mon Aug 22 07:16:31 2022] mmap: actix-rt|system (2898): VmData 34575671296 exceed data ulimit 34359738368. Update limits or use boot option ignore_rlimit_data.
kucharskim commented 2 years ago

Ah, please ignore the previous comment: that was a stray LimitDATA=32G in the systemd unit file (the 34359738368-byte data ulimit in that dmesg line is exactly 32 GiB), so the process ran into RLIMIT_DATA rather than out of physical memory. Need to remove it.