solana-labs / solana

RAM usage metric does not seem to be working for SLP (w/ possible hidden OOM) #7859

Closed · mvines closed 4 years ago

mvines commented 4 years ago

I've had reports of at least two SLP validator nodes experiencing OOMs, but the Grafana memory graph isn't working, so it's hard to determine whether others are also seeing unusual memory usage.

Sample dmesg output:

[2296655.524393] solana-banking- invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[2296655.524397] CPU: 7 PID: 22737 Comm: solana-banking- Tainted: G           OE     5.0.0-1026-gcp #27~18.04.1-Ubuntu
[2296655.524398] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[2296655.524398] Call Trace:
[2296655.524405]  dump_stack+0x63/0x85
[2296655.524408]  dump_header+0x57/0x315
[2296655.524411]  ? kvm_sched_clock_read+0x11/0x20
[2296655.524413]  ? sched_clock+0x9/0x10
[2296655.524414]  oom_kill_process+0x254/0x280
[2296655.524415]  out_of_memory+0x11b/0x510
[2296655.524416]  __alloc_pages_slowpath+0xb32/0xe80
[2296655.524418]  __alloc_pages_nodemask+0x2a7/0x2c0
[2296655.524421]  alloc_pages_current+0x6a/0xe0
[2296655.524423]  __page_cache_alloc+0x6a/0xa0
[2296655.524424]  filemap_fault+0x3ab/0x850
[2296655.524427]  ? futex_wait_queue_me+0xd6/0x120
[2296655.524428]  ? filemap_map_pages+0x18f/0x380
[2296655.524430]  ext4_filemap_fault+0x31/0x44
[2296655.524432]  __do_fault+0x57/0x11d
[2296655.524433]  __handle_mm_fault+0xe25/0x1300
[2296655.524435]  handle_mm_fault+0xe3/0x240
[2296655.524437]  __do_page_fault+0x292/0x4d0
[2296655.524442]  ? syscall_exit_probe+0x108/0x130 [falco_probe]
[2296655.524443]  do_page_fault+0x2d/0xf0
[2296655.524445]  ? page_fault+0x8/0x30
[2296655.524446]  page_fault+0x1e/0x30
[2296655.524447] RIP: 0033:0x5555557c830b
[2296655.524451] Code: Bad RIP value.
[2296655.524451] RSP: 002b:00007ffecd9e81c0 EFLAGS: 00010206
[2296655.524452] RAX: 00007fff769959b0 RBX: 00007ffecd9e82a0 RCX: 00007ffff763a1aa
[2296655.524452] RDX: 00007fff74702870 RSI: 0000000000000000 RDI: 00007fff74702870
[2296655.524453] RBP: 00005555561da3d0 R08: 0000000000000000 R09: 00007fff747028a0
[2296655.524453] R10: 00007ffecd9e7f80 R11: 0000000000000206 R12: 00007ffecd9e81d0
[2296655.524454] R13: 00005555561bd6f0 R14: 00007ffecd9e8220 R15: 00005555561bd330
[2296655.524455] Mem-Info:
[2296655.524457] active_anon:7902534 inactive_anon:55 isolated_anon:0
                  active_file:491 inactive_file:803 isolated_file:0
                  unevictable:0 dirty:8 writeback:0 unstable:0
                  slab_reclaimable:143398 slab_unreclaimable:32634
                  mapped:717 shmem:253 pagetables:39524 bounce:0
                  free:59290 free_pcp:733 free_cma:0
[2296655.524459] Node 0 active_anon:31610136kB inactive_anon:220kB active_file:1964kB inactive_file:3212kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:2868kB dirty:32kB writeback:0kB shmem:1012kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[2296655.524459] Node 0 DMA free:15908kB min:32kB low:44kB high:56kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[2296655.524461] lowmem_reserve[]: 0 2940 32122 32122 32122
[2296655.524462] Node 0 DMA32 free:122524kB min:6180kB low:9188kB high:12196kB active_anon:2843152kB inactive_anon:16kB active_file:52kB inactive_file:160kB unevictable:0kB writepending:0kB present:3129332kB managed:3039364kB mlocked:0kB kernel_stack:16kB pagetables:14668kB bounce:0kB free_pcp:896kB local_pcp:264kB free_cma:0kB
[2296655.524464] lowmem_reserve[]: 0 0 29181 29181 29181
[2296655.524465] Node 0 Normal free:98728kB min:141236kB low:171116kB high:200996kB active_anon:28766984kB inactive_anon:204kB active_file:1884kB inactive_file:3844kB unevictable:0kB writepending:0kB present:30408704kB managed:29882108kB mlocked:0kB kernel_stack:7056kB pagetables:143428kB bounce:0kB free_pcp:2144kB local_pcp:1472kB free_cma:0kB
[2296655.524467] lowmem_reserve[]: 0 0 0 0 0
[2296655.524468] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[2296655.524477] Node 0 DMA32: 81*4kB (UE) 84*8kB (UE) 1185*16kB (UE) 815*32kB (UE) 532*64kB (E) 227*128kB (ME) 39*256kB (UE) 2*512kB (UM) 2*1024kB (UM) 0*2048kB 0*4096kB = 122196kB
[2296655.524481] Node 0 Normal: 5349*4kB (UMEH) 2683*8kB (UMEH) 1773*16kB (UMEH) 825*32kB (UEH) 6*64kB (MH) 4*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 98780kB
[2296655.524485] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[2296655.524486] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[2296655.524486] 1726 total pagecache pages
[2296655.524487] 0 pages in swap cache
[2296655.524487] Swap cache stats: add 0, delete 0, find 0/0
[2296655.524488] Free swap  = 0kB
[2296655.524488] Total swap = 0kB
[2296655.524488] 8388507 pages RAM
[2296655.524489] 0 pages HighMem/MovableOnly
[2296655.524489] 154162 pages reserved
[2296655.524489] 0 pages cma reserved
[2296655.524490] 0 pages hwpoisoned
[2296655.524490] Tasks state (memory values in pages):
[2296655.524491] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[2296655.524498] [    536]     0   536   146976     1417  1150976        0             0 systemd-journal
[2296655.524499] [    543]     0   543    61292       94   114688        0             0 lvmetad
[2296655.524504] [    566]     0   566    10988      455   106496        0         -1000 systemd-udevd
[2296655.524536] [   1054]   100  1054    20010      182   172032        0             0 systemd-network
[2296655.524538] [   1081]   101  1081    17734      240   176128        0             0 systemd-resolve
[2296655.524539] [   1212]     0  1212    43000    11705   372736        0             0 salt-minion
[2296655.524541] [   1220]     0  1220   188004      232   143360        0             0 lxcfs
[2296655.524543] [   1227]     0  1227    42707     1950   217088        0             0 networkd-dispat
[2296655.524544] [   1233]   102  1233   100115     3395   323584        0             0 rsyslogd
[2296655.524546] [   1243]     0  1243    69988      190   163840        0             0 accounts-daemon
[2296655.524548] [   1303]     0  1303    15598      178   151552        0             0 systemd-logind
[2296655.524550] [   1324]     0  1324   549217      408   372736        0             0 nscd
[2296655.524551] [   1386]  1003  1386    28485      362    77824        0             0 node_exporter
[2296655.524553] [   1424]   103  1424     9938      151   106496        0          -900 dbus-daemon
[2296655.524555] [   1452]   111  1452    23324       68    73728        0             0 chronyd
[2296655.524556] [   1464]     0  1464     1126       15    45056        0             0 none
[2296655.524558] [   1499]     0  1499    44819     1967   241664        0             0 unattended-upgr
[2296655.524560] [   1546]     0  1546     1156       16    45056        0             0 sshguard-journa
[2296655.524563] [   1547]     0  1547   151485      157   696320        0             0 journalctl
[2296655.524564] [   1548]     0  1548    20474      100    57344        0             0 sshguard
[2296655.524566] [   1594]     0  1594     2041       14    45056        0             0 agetty
[2296655.524568] [   1598]     0  1598    70281      245   172032        0             0 polkitd
[2296655.524569] [   1629]     0  1629     1660       14    45056        0             0 agetty
[2296655.524571] [   1631]     0  1631     1156       24    45056        0             0 sshg-fw
[2296655.524573] [   1778]   112  1778   123133      694   294912        0         -1000 nslcd
[2296655.524574] [   1870]     0  1870    86043     5440   466944        0             0 falco
[2296655.524576] [   1912]     0  1912   119328    18052   540672        0             0 salt-minion
[2296655.524577] [   1915]     0  1915    16956     3776   167936        0             0 google_network_
[2296655.524579] [   1920]     0  1920    16907     3671   167936        0             0 google_clock_sk
[2296655.524580] [   1922]     0  1922    17010     3777   167936        0          -999 google_accounts
[2296655.524582] [   2039]     0  2039    63337    12540   380928        0             0 salt-minion
[2296655.524583] [   2057]     0  2057    16603      182   159744        0         -1000 sshd
[2296655.524584] [   2060]     0  2060     5875       62    77824        0             0 cron
[2296655.524585] [   2064]     0  2064     5074       33    73728        0             0 atd
[2296655.524587] [  15497]     0 15497   473812     3055   356352        0          -900 snapd
[2296655.524588] [  15299]     0 15299    28163      256   253952        0             0 sshd
[2296655.524590] [  15307]  5015 15307    19193      280   176128        0             0 systemd
[2296655.524591] [  15308]  5015 15308    66021      733   274432        0             0 (sd-pam)
[2296655.524592] [  15441]  5015 15441    28163      257   245760        0             0 sshd
[2296655.524593] [  15442]  5015 15442     3735      427    57344        0             0 bash
[2296655.524594] [  15456]     0 15456    17169      125   167936        0             0 sudo
[2296655.524595] [  15478]     0 15478    16459      118   159744        0             0 su
[2296655.524596] [  15479]     0 15479     3413      111    57344        0             0 bash
[2296655.524598] [  15499]     0 15499    16434      112   159744        0             0 su
[2296655.524599] [  15500]  1004 15500     3766      451    57344        0             0 bash
[2296655.524600] [  22117]     0 22117    28163      256   253952        0             0 sshd
[2296655.524602] [  22248]  5015 22248    28163      259   245760        0             0 sshd
[2296655.524603] [  22249]  5015 22249     3735      428    57344        0             0 bash
[2296655.524604] [  22268]     0 22268    17169      125   167936        0             0 sudo
[2296655.524605] [  22276]     0 22276    16459      118   159744        0             0 su
[2296655.524607] [  22277]     0 22277     3442      122    57344        0             0 bash
[2296655.524608] [  22413]  1004 22413 24059467  7831718 152453120        0             0 solana-validato
[2296655.524610] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/solana.service,task=solana-validato,pid=22413,uid=1004
[2296655.524615] Out of memory: Kill process 22413 (solana-validato) score 955 or sacrifice child
[2296655.533713] Killed process 22413 (solana-validato) total-vm:96237868kB, anon-rss:31326872kB, file-rss:0kB, shmem-rss:0kB
[2296656.547821] oom_reaper: reaped process 22413 (solana-validato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
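For anyone checking whether their own node has been hit, a quick way to look for OOM-killer activity is to grep the kernel log. A minimal sketch (not an official tool; it assumes a systemd host and relies on the process name being truncated to `solana-validato`, as in the dmesg output above):

```bash
#!/usr/bin/env bash
# Look for evidence of the OOM killer targeting the validator process.
# Prefers journalctl's kernel ring buffer view; falls back to plain dmesg.
set -euo pipefail

if command -v journalctl >/dev/null 2>&1; then
  journalctl -k --no-pager | grep -iE 'out of memory|oom[-_]kill' | grep -i 'solana-validato' || true
else
  dmesg | grep -iE 'out of memory|oom[-_]kill' | grep -i 'solana-validato' || true
fi
```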
mvines commented 4 years ago

Here's a partial stack trace from one of the SLP validators that appears to show evidence of an OOM as well:

 slots: {929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.723454696Z INFO  solana_core::cluster_info] ReceiveUpdates took: 12 ms  len: 2
[2020-01-21T10:32:43.772950966Z INFO  solana_core::cluster_info] ReceiveUpdates took: 41 ms  len: 1
[2020-01-21T10:32:43.780096885Z ERROR solana_core::cluster_info_repair_listener] Serving repair for slot 929749 to Luna1VCsPBE4hghuHaL9UFgimBB3V6u6johyd7hGXBL. Repairee slots: {929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.784794499Z INFO  solana_runtime::bank] bank frozen: 934779 hash: 4x6j7MxQjwDKxtDGytJjLMCdLUAvoAFCnTXxG4E98Bzy accounts_delta: BankHash 6a7b61ce9f59f3967b50d8613d6b01f63aca1914ef6b617c0e550c1c7ff32d6e signature_count: 44 last_blockhash: C64YnnkxUMLEpdc84PmjTdq8Kwv4TiAtQo3AV77R2WV2
[2020-01-21T10:32:43.785188261Z INFO  solana_runtime::bank] accounts hash slot: 934779 stats: BankHashStats { num_removed_accounts: 44, num_added_accounts: 0, num_lamports_stored: 10051303410401, total_data_len: 45540, num_executable_accounts: 0 }
[2020-01-21T10:32:43.791000183Z WARN  solana_core::replay_stage] 3o43fXxTpndqVsCdMi16WNq6aR9er75P364ZkTHrEQJN slot_weight: 934779 15424587707071582546506452 319932299605495709067931415168 934778
[2020-01-21T10:32:43.795203564Z INFO  solana_core::replay_stage] voting: 934779 319932299605495709067931415168
[2020-01-21T10:32:43.807357083Z INFO  solana_ledger::bank_forks] setting snapshot root: 934748
[2020-01-21T10:32:43.803665925Z INFO  solana_core::cluster_info] ReceiveUpdates took: 8 ms  len: 3
[2020-01-21T10:32:43.807162093Z ERROR solana_core::cluster_info_repair_listener] Serving repair for slot 929749 to ArpeD4LKYgza1o6aR5xNTQX3hxeik8URxWNQVpA8wirV. Repairee slots: {929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.834852993Z INFO  solana_core::cluster_info] ReceiveUpdates took: 17 ms  len: 2
[2020-01-21T10:32:43.867447010Z INFO  solana_core::poh_recorder] reset poh from: FVgxSvaJK41JUr1tK237m1NkQ5xteH1QHwmpsUtjD2CR,59825884,934778 to: C64YnnkxUMLEpdc84PmjTdq8Kwv4TiAtQo3AV77R2WV2,934779
[2020-01-21T10:32:43.867579731Z INFO  solana_core::replay_stage] 3o43fXxTpndqVsCdMi16WNq6aR9er75P364ZkTHrEQJN reset PoH to tick 59825920 (within slot 934779). I am not in the leader schedule yet
[2020-01-21T10:32:43.867601990Z INFO  solana_core::replay_stage] vote bank: Some(934779) reset bank: 934779
[2020-01-21T10:32:43.870812331Z INFO  solana_core::replay_stage] new fork:934780 parent:934779 root:934748
[2020-01-21T10:32:43.921748901Z INFO  solana_core::cluster_info] ReceiveUpdates took: 12 ms  len: 1
[2020-01-21T10:32:43.935683367Z ERROR solana_core::cluster_info_repair_listener] Serving repair for slot 929749 to SPC3m89qwxGbqYdg1GuaoeZtgJD2hYoob6c4aKLG1zu. Repairee slots: {929716, 929717, 929718, 929719, 929720, 929721, 929722, 929723, 929724, 929725, 929726, 929727, 929728, 929729, 929730, 929731, 929732, 929733, 929734, 929735, 929736, 929737, 929738, 929739, 929740, 929741, 929742, 929743, 929744, 929745, 929746, 929747}
[2020-01-21T10:32:43.990848178Z INFO  solana_metrics::metrics] submit response: 204 No Content
[2020-01-21T10:32:44.006634927Z INFO  solana_metrics::metrics] submitting 125 points
[2020-01-21T10:32:44.032318910Z INFO  solana_core::cluster_info] RequestWindowIndex took: 38 ms
thread '<unnamed>' panicked at 'failed to allocate an alternative stack', src/libstd/sys/unix/stack_overflow.rs:134:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
[2020-01-21T10:32:48.784804395Z INFO  solana_metrics::metrics] submitting 979 points
thread '<unnamed>' panicked at 'failed to allocate an alternative stack', src/libstd/sys/unix/stack_overflow.rs:134:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
fatal runtime error: failed to initiate panic, error 5
Aborted

and

Out of memory: Kill process 582 (solana-validato) score 958 or sacrifice child
[13101.676767] Killed process 582 (solana-validato) total-vm:77370672kB, anon-rss:30397572kB, file-rss:0kB, shmem-rss:0kB
[13106.238504] oom_reaper: reaped process 582 (solana-validato), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
mvines commented 4 years ago

Might be tough since this is all in a pile of bash right now, and we're not going back to shipping bash

leoluk commented 4 years ago

This one is fixed, right?

mvines commented 4 years ago

No, it's not fixed yet. In our dev environment, some bash scripts run in the background on each node and report memory usage into InfluxDB.
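Roughly along these lines (a minimal sketch, not the actual dev-environment script; the `memory-stats` measurement name and the `INFLUX_WRITE_URL` endpoint are assumptions):

```bash
#!/usr/bin/env bash
# Sketch of a background memory reporter: samples /proc/meminfo and writes
# a point to InfluxDB via its 1.x HTTP line-protocol API.
# INFLUX_WRITE_URL is assumed, e.g. https://metrics.example.com/write?db=testnet
set -euo pipefail

host="$(hostname)"
while true; do
  mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  mem_used_kb=$((mem_total_kb - mem_avail_kb))

  # Line protocol: measurement,tag=value field=value (integer fields end in "i")
  curl -sS -XPOST "$INFLUX_WRITE_URL" --data-binary \
    "memory-stats,host=${host} total=${mem_total_kb}i,used=${mem_used_kb}i"

  sleep 10
done
```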

mvines commented 4 years ago

I rolled this out on our nodes: https://github.com/solana-labs/cluster/commit/115ef8f92b0abc10565f5180569c1e4e057529bd. I don't see a nice way to deploy this to the entire fleet of nodes, so monitoring our nodes will have to do for now.