prysmaticlabs / prysm

Go implementation of Ethereum proof of stake
https://www.offchainlabs.com
GNU General Public License v3.0

Prysm Killed - Out of Memory #13964

Open tunerooster opened 5 months ago

tunerooster commented 5 months ago

Every so often, at intervals of days to weeks, my beacon-chain process is killed, citing an OOM error. The kernel messages are:

May  7 16:08:50 u59 kernel: Mem-Info:
May  7 16:08:50 u59 kernel: active_anon:3033195 inactive_anon:348918 isolated_anon:96\x0a active_file:44 inactive_file:43 isolated_file:0\x0a unevictable:22 dirty:0 writeback:0\x0a slab_reclaimable:19069 slab_unreclaimable:460114\x0a mapped:90 shmem:29 pagetables:21770\x0a sec_pagetables:0 bounce:0\x0a kernel_misc_reclaimable:0\x0a free:61206 free_pcp:1 free_cma:0
May  7 16:08:50 u59 kernel: Node 0 active_anon:12132524kB inactive_anon:1395672kB active_file:176kB inactive_file:172kB unevictable:88kB isolated(anon):384kB isolated(file):0kB mapped:360kB dirty:0kB writeback:0kB shmem:116kB writeback_tmp:0kB kernel_stack:4800kB pagetables:87080kB sec_pagetables:0kB all_unreclaimable? no
May  7 16:08:50 u59 kernel: Node 0 DMA free:15360kB boost:0kB min:12kB low:24kB high:36kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May  7 16:08:50 u59 kernel: lowmem_reserve[]: 0 1687 15745 15745
May  7 16:08:50 u59 kernel: Node 0 DMA32 free:64640kB boost:0kB min:1720kB low:3444kB high:5168kB reserved_highatomic:12288KB active_anon:1467512kB inactive_anon:190096kB active_file:4kB inactive_file:32kB unevictable:0kB writepending:0kB present:1851064kB managed:1744932kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
May  7 16:08:50 u59 kernel: lowmem_reserve[]: 0 0 14058 14058
May  7 16:08:50 u59 kernel: Node 0 Normal free:164824kB boost:8192kB min:22524kB low:36916kB high:51308kB reserved_highatomic:143360KB active_anon:10665496kB inactive_anon:1204580kB active_file:212kB inactive_file:0kB unevictable:88kB writepending:0kB present:14684160kB managed:14395676kB mlocked:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB
May  7 16:08:50 u59 kernel: lowmem_reserve[]: 0 0 0 0
May  7 16:08:50 u59 kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15360kB
May  7 16:08:50 u59 kernel: Node 0 DMA32: 1668*4kB (UMEH) 628*8kB (UMEH) 348*16kB (UMEH) 159*32kB (UMEH) 109*64kB (UMEH) 56*128kB (UMEH) 32*256kB (UMEH) 23*512kB (UEH) 8*1024kB (ME) 0*2048kB 0*4096kB = 64656kB
May  7 16:08:50 u59 kernel: Node 0 Normal: 7863*4kB (UMEH) 5844*8kB (UMEH) 2387*16kB (UMEH) 999*32kB (UMEH) 244*64kB (UMH) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 164364kB
May  7 16:08:50 u59 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May  7 16:08:50 u59 kernel: 217 total pagecache pages
May  7 16:08:50 u59 kernel: 104 pages in swap cache
May  7 16:08:50 u59 kernel: Free swap  = 0kB
May  7 16:08:50 u59 kernel: Total swap = 16777208kB
May  7 16:08:50 u59 kernel: 4137804 pages RAM
May  7 16:08:50 u59 kernel: 0 pages HighMem/MovableOnly
May  7 16:08:50 u59 kernel: 98812 pages reserved
May  7 16:08:50 u59 kernel: Tasks state (memory values in pages):
May  7 16:08:50 u59 kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May  7 16:08:50 u59 kernel: [   3094]     0  3094     4880      160    53248      320             0 systemd-udevd
May  7 16:08:50 u59 kernel: [   4270]     0  4270      726       12    40960       64             0 dhcpcd
May  7 16:08:50 u59 kernel: [   4521]   123  4521    20794      131    61440       64             0 chronyd
May  7 16:08:50 u59 kernel: [   4550]     0  4550      623       32    45056       32             0 syslogd
May  7 16:08:50 u59 kernel: [   4587]     0  4587     3130      113    61440      128             0 syslog-ng
May  7 16:08:50 u59 kernel: [   4588]     0  4588    95648       49   102400      416             0 syslog-ng
May  7 16:08:50 u59 kernel: [   4619]     0  4619      866       35    40960      192             0 crond
May  7 16:08:50 u59 kernel: [   4674]  1000  4674     1058       64    53248       32             0 run.sh
May  7 16:08:50 u59 kernel: [   4676]  1000  4676  2481291   595646 16015360  1192849             0 geth
May  7 16:08:50 u59 kernel: [   4706]   389  4706   400563     7636   626688     5159             0 grafana
May  7 16:08:50 u59 kernel: [   4785]     0  4785      852       65    49152        0             0 ntpd
May  7 16:08:50 u59 kernel: [   4786]   321  4786      887       96    49152        0             0 ntpd
May  7 16:08:50 u59 kernel: [   4788]   321  4788      852       96    40960        0             0 ntpd
May  7 16:08:50 u59 kernel: [   4814]   430  4814   562347    14559   991232     7347             0 prometheus
May  7 16:08:50 u59 kernel: [   4846]     0  4846     2324       54    53248      192         -1000 sshd
May  7 16:08:50 u59 kernel: [   4887]     0  4887    43575      177    69632      160             0 zed
May  7 16:08:50 u59 kernel: [   5032]  1000  5032     1058       64    45056       32             0 run.sh
May  7 16:08:50 u59 kernel: [   5034]  1000  5034   311179      128   126976      576             0 mev-boost
May  7 16:08:50 u59 kernel: [   5061]  1000  5061     1058       64    45056       64             0 run.sh
May  7 16:08:50 u59 kernel: [   5063]  1000  5063 17997987  2760813 70189056  2983176             0 beacon-chain
May  7 16:08:50 u59 kernel: [   5105]  1000  5105     1058       96    49152       32             0 run.sh
May  7 16:08:50 u59 kernel: [   5107]  1000  5107   617290     3460   327680     2305             0 validator
May  7 16:08:50 u59 kernel: [   5114]     0  5114     1454       96    49152        0             0 agetty
May  7 16:08:50 u59 kernel: [   5115]     0  5115     1454       64    45056        0             0 agetty
May  7 16:08:50 u59 kernel: [   5116]     0  5116     1454       64    49152        0             0 agetty
May  7 16:08:50 u59 kernel: [   5117]     0  5117     1454       96    49152        0             0 agetty
May  7 16:08:50 u59 kernel: [   5118]     0  5118     1454       64    49152        0             0 agetty
May  7 16:08:50 u59 kernel: [   5119]     0  5119     1454       96    49152        0             0 agetty
May  7 16:08:50 u59 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=openrc.validator,mems_allowed=0,global_oom,task_memcg=/openrc.prysm,task=beacon-chain,pid=5063,uid=1000
May  7 16:08:50 u59 kernel: Out of memory: Killed process 5063 (beacon-chain) total-vm:71991948kB, anon-rss:11043252kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:68544kB oom_score_adj:0
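For anyone reading a log like this, the per-process rows in the "Tasks state" table can be pulled apart mechanically. The following is a minimal Go sketch, assuming the column order shown in the table header above (uid, tgid, total_vm, rss, pgtables_bytes, swapents, oom_score_adj, name); the `taskRow` type and `parseTaskLine` function are hypothetical names used only for illustration, and the parser is only meant for rows of that table, not for other kernel lines.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// taskRow holds the memory columns of one "Tasks state" row (values in pages).
type taskRow struct {
	Name     string
	TotalVM  uint64
	RSS      uint64
	Swapents uint64
}

// parseTaskLine parses one row of the OOM task table.
// It assumes the pid is enclosed in the first pair of square brackets and
// that eight whitespace-separated columns follow it.
func parseTaskLine(line string) (taskRow, error) {
	i := strings.Index(line, "]")
	if i < 0 {
		return taskRow{}, fmt.Errorf("no pid bracket in line")
	}
	// Columns after the pid: uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
	f := strings.Fields(line[i+1:])
	if len(f) < 8 {
		return taskRow{}, fmt.Errorf("expected 8 columns, got %d", len(f))
	}
	totalVM, err := strconv.ParseUint(f[2], 10, 64)
	if err != nil {
		return taskRow{}, err
	}
	rss, err := strconv.ParseUint(f[3], 10, 64)
	if err != nil {
		return taskRow{}, err
	}
	swapents, err := strconv.ParseUint(f[5], 10, 64)
	if err != nil {
		return taskRow{}, err
	}
	return taskRow{Name: f[7], TotalVM: totalVM, RSS: rss, Swapents: swapents}, nil
}

func main() {
	line := "May  7 16:08:50 u59 kernel: [   5063]  1000  5063 17997987  2760813 70189056  2983176             0 beacon-chain"
	row, err := parseTaskLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: total_vm=%d rss=%d swapents=%d (pages)\n",
		row.Name, row.TotalVM, row.RSS, row.Swapents)
}
```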

Prysm Version: beacon-chain version Prysm/v5.0.3/38f208d70dc95b12c08403f5c72009aaa10dfe2f. Built at: 2024-04-04 18:31:36+00:00

Geth Version: geth version 1.13.14-stable

OS Release: Linux u59 6.6.14-gentoo #1 SMP PREEMPT_DYNAMIC Sat Feb 3 03:06:01 MST 2024 x86_64 Intel(R) Celeron(R) N5095 @ 2.00GHz GenuineIntel GNU/Linux

The machine has 16GB RAM and 16GB swap. I recently increased swap from 6GB to 16GB, but it appears that is still not enough.

How much swap do you recommend to cover the worst case?

Thanks!

nisdas commented 5 months ago

@tunerooster What flags are you running with? You can try --enable-experimental-state to see if it helps reduce Prysm's resource consumption.

tunerooster commented 5 months ago

beacon-chain --p2p-max-peers=150 --pprof --execution-endpoint=http://localhost:8551 --jwt-secret=/mnt/crypto/.ethereum/keystore/jwt.hex --datadir /mnt/crypto/.ethereum/prysm --suggested-fee-recipient 0x68...faeB

I will add --enable-experimental-state, but it may be a while before it happens again.

Thanks for the suggestion. Do you know if this happens to others?

nisdas commented 5 months ago

There is a case here that happens when syncing: https://github.com/prysmaticlabs/prysm/issues/13963

However, it appears to differ from your case, which seems to be more random.

tunerooster commented 5 months ago

I assume you noticed that prysm was using (or trying to use) almost 72GB. That would mean I would need 56GB of swap space (at least). I know nothing about prysm's memory requirements, but that seems like it could be a bug. My staking machine is maxed out at 16GB real RAM (it's a processor limitation, I guess), but swap space should mitigate this, particularly for the few short times it needs more than 16GB. However, I hesitate to expand swap to, say, 64GB without understanding whether this is a good thing to do.

I am running with the experimental flag now, so I'll report back with any results.

nisdas commented 5 months ago

That is the virtual memory, @tunerooster. From the log, prysm was trying to use about 11GB of physical memory. The virtual memory used by prysm is so large because we use a memory-mapped database (bolt).
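The distinction can be checked against the log itself. The sketch below, in Go, converts the page counts from the beacon-chain row of the task table into kB, assuming a 4 kB page size (an assumption, though it is consistent with the figures the kernel prints). The helper name `pagesToKB` is hypothetical.

```go
package main

import "fmt"

// pageKB is the assumed page size in kB (4 kB is typical on x86_64).
const pageKB = 4

// pagesToKB converts a page count from the "Tasks state" table to kB.
func pagesToKB(pages uint64) uint64 { return pages * pageKB }

func main() {
	// Values copied from the beacon-chain row (pid 5063) in the log.
	const (
		totalVM  = 17997987 // total_vm: mapped address space, including the mmap'd database
		rss      = 2760813  // rss: pages resident in physical RAM
		swapents = 2983176  // swapents: pages swapped out
	)
	fmt.Printf("virtual (total_vm): %d kB\n", pagesToKB(totalVM))
	fmt.Printf("resident (rss):     %d kB\n", pagesToKB(rss))
	fmt.Printf("swapped (swapents): %d kB\n", pagesToKB(swapents))
}
```

The first two results equal the `total-vm:71991948kB` and `anon-rss:11043252kB` values in the final "Out of memory" line, which is why the 72GB figure describes address space rather than RAM in use.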

tunerooster commented 5 months ago

I see. So the mmap'd database is independent of swap but is still counted as part of virtual memory. Thank you for clarifying that for me.

But then was prysm trying to use more than my 32GB of RAM + swap? 11GB would not be a problem. What am I missing? Would adding even more swap help?
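One way to approach this question from the log alone is to add each process's resident pages (rss) to its swapped-out pages (swapents). The Go sketch below does that for the five largest rows of the task table, again assuming 4 kB pages; the `task` type and the function names are hypothetical, and this is only one reading of the numbers, not a diagnosis. Pages of a file-backed mapping such as the database are not written to swap, so they do not appear in swapents.

```go
package main

import "fmt"

// pageKB is the assumed page size in kB.
const pageKB = 4

// totalSwapKB is the "Total swap" value reported in the log.
const totalSwapKB = 16777208

// task holds the rss and swapents columns (in pages) for one process.
type task struct {
	name     string
	rss      uint64
	swapents uint64
}

// tasks lists the five largest rows copied from the "Tasks state" table.
var tasks = []task{
	{"beacon-chain", 2760813, 2983176},
	{"geth", 595646, 1192849},
	{"prometheus", 14559, 7347},
	{"grafana", 7636, 5159},
	{"validator", 3460, 2305},
}

// footprintKB returns RAM plus swap held by one task, in kB.
func footprintKB(t task) uint64 { return (t.rss + t.swapents) * pageKB }

// swapUsedKB sums the swap held by all given tasks, in kB.
func swapUsedKB(ts []task) uint64 {
	var pages uint64
	for _, t := range ts {
		pages += t.swapents
	}
	return pages * pageKB
}

func main() {
	for _, t := range tasks {
		fmt.Printf("%-12s RAM+swap: %d kB\n", t.name, footprintKB(t))
	}
	fmt.Printf("swap held by these tasks: %d of %d kB\n", swapUsedKB(tasks), totalSwapKB)
}
```

By this arithmetic beacon-chain held roughly 23 million kB across RAM and swap, geth about 7 million kB, and these five processes together account for nearly all of the 16GB of swap, which matches the `Free swap  = 0kB` line.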
