Open bhalevy opened 1 year ago
I'm fairly sure scylla-manager-agent is supposed to run in scylla-helper.slice, which should limit its memory usage to 5%, see https://github.com/scylladb/scylla-manager/blob/5420c4d76bc0f40297693ba43aae920472fe35e5/dist/systemd/scylla-manager-agent.service#L20 and https://github.com/scylladb/scylladb/blob/7d35cf8657ab593eebee31c5ae571c43d4a413a0/dist/common/systemd/scylla-helper.slice#L17
But I see in the systemd docs that swap has separate limit knobs. Perhaps swap usage doesn't fall under MemoryMax, but only under MemorySwapMax?
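With cgroup v2, MemoryMax only bounds resident memory; swap consumed by the unit is governed separately by MemorySwapMax. If that is the issue, a drop-in like the following sketch would cap swap as well (the file path and the 512M value are illustrative assumptions, not what the packages ship):

```sh
# Sketch of a drop-in that also caps swap for the helper slice.
# The path and the 512M value are assumptions, not shipped defaults.
sudo mkdir -p /etc/systemd/system/scylla-helper.slice.d
cat <<'EOF' | sudo tee /etc/systemd/system/scylla-helper.slice.d/swap.conf
[Slice]
# MemoryMax bounds RAM for the cgroup; swap is limited separately by MemorySwapMax.
MemorySwapMax=512M
EOF
sudo systemctl daemon-reload
```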
Scylla version (or git commit hash): Scylla Enterprise 2022.2.6
Agent is running in scylla-helper.slice. The MemoryMax parameter is respected, see https://github.com/scylladb/scylla-manager/issues/3298#issuecomment-1438637567. The issue above shows the situation where the manager agent hit the limit and was killed.
Why not set MemorySwapMax, as @michoecho suggested? See https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemorySwapMax=bytes
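To verify what the slice actually enforces and how much the agent's cgroup currently uses, something along these lines should do (standard systemctl properties; cgroup v2 assumed for swap accounting):

```sh
# Show the limits in effect on the slice and the agent's current memory footprint.
systemctl show scylla-helper.slice -p MemoryMax -p MemorySwapMax -p MemoryCurrent
systemctl status scylla-manager-agent.service --no-pager
```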
Maybe as a workaround, but why would the manager agent need over 10GB of memory?
In any case, even after disabling the agent, the node in question kept being hit by oom-kill, apparently due to having too many files; see https://github.com/scylladb/scylla-enterprise/issues/3124#issuecomment-1620045944
@bhalevy, are there agent logs? I see empty files in 13336.zip
I didn't see any agent logs. Maybe @d-helios can help find them
We have logs only for the last ~2-3 days; we will not be able to get logs that are 2 weeks old.
It would still be interesting to see the current footprint of the manager agent and, if it is excessive, also get the respective logs, even if it doesn't trigger the oom-killer at the moment.
Logs for the last 2 days: result.tgz
10.31.188.5/dmesg.txt:kern :warn : [Fri Jun 23 17:25:47 2023] google_guest_ag invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-999
Yes, it shows that scylla-manager-agent definitely blew up.
```
kern :info : [Fri Jun 23 17:25:47 2023] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kern :info : [Fri Jun 23 17:25:47 2023] [ 353] 0 353 70052 4501 86016 0 -1000 multipathd
kern :info : [Fri Jun 23 17:25:47 2023] [ 537] 0 537 60260 679 110592 291 0 accounts-daemon
kern :info : [Fri Jun 23 17:25:47 2023] [ 547] 103 547 1910 652 49152 128 -900 dbus-daemon
kern :info : [Fri Jun 23 17:25:47 2023] [ 557] 0 557 7480 790 90112 2020 0 networkd-dispat
kern :info : [Fri Jun 23 17:25:47 2023] [ 581] 114 581 179665 2115 184320 648 0 node_exporter
kern :info : [Fri Jun 23 17:25:47 2023] [ 619] 0 619 951 459 45056 50 0 atd
kern :info : [Fri Jun 23 17:25:47 2023] [ 690] 0 690 59107 681 98304 288 0 polkitd
kern :info : [Fri Jun 23 17:25:47 2023] [ 820] 0 820 1840 385 49152 35 0 agetty
kern :info : [Fri Jun 23 17:25:47 2023] [ 835] 0 835 1459 373 49152 33 0 agetty
kern :info : [Fri Jun 23 17:25:47 2023] [ 839] 113 839 3256 483 45056 57 0 chronyd
kern :info : [Fri Jun 23 17:25:47 2023] [ 840] 113 840 1174 0 45056 45 0 chronyd
kern :info : [Fri Jun 23 17:25:47 2023] [ 11637] 0 11637 540423 5303 1376256 5278 0 journalbeat
kern :info : [Fri Jun 23 17:25:47 2023] [ 835809] 0 835809 2216 484 61440 234 -1000 systemd-udevd
kern :info : [Fri Jun 23 17:25:47 2023] [ 835916] 107 835916 2438 29 53248 52 0 uuidd
kern :info : [Fri Jun 23 17:25:47 2023] [ 836231] 0 836231 347268 1057 221184 728 0 google_osconfig
kern :info : [Fri Jun 23 17:25:47 2023] [ 836289] 100 836289 6820 586 81920 220 0 systemd-network
kern :info : [Fri Jun 23 17:25:47 2023] [ 836291] 0 836291 305950 895 241664 793 -999 google_guest_ag
kern :info : [Fri Jun 23 17:25:47 2023] [ 836421] 0 836421 4346 605 77824 210 0 systemd-logind
kern :info : [Fri Jun 23 17:25:47 2023] [ 836424] 0 836424 2137 492 53248 72 0 cron
kern :info : [Fri Jun 23 17:25:47 2023] [ 836434] 101 836434 6106 679 86016 939 0 systemd-resolve
kern :info : [Fri Jun 23 17:25:47 2023] [ 836437] 0 836437 119713 685 921600 182 -250 systemd-journal
kern :info : [Fri Jun 23 17:25:47 2023] [ 843465] 0 843465 3046 543 61440 238 -1000 sshd
kern :info : [Fri Jun 23 17:25:47 2023] [ 848130] 104 848130 90942 1844 282624 23252 0 rsyslogd
kern :info : [Fri Jun 23 17:25:47 2023] [ 863855] 114 863855 4295068671 7660647 61607936 0 -950 scylla
kern :info : [Fri Jun 23 17:25:47 2023] [ 863999] 114 863999 701465 900 454656 24516 0 scylla-jmx
kern :info : [Fri Jun 23 17:25:47 2023] [ 897161] 114 897161 3053345 192644 23207936 2678762 0 scylla-manager-
kern :info : [Fri Jun 23 17:25:47 2023] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/scylla.slice/scylla-helper.slice/scylla-manager-agent.service,task=scylla-manager-,pid=897161,uid=114
```
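For context, the rss and swapents columns above are counted in pages (4 KiB on x86_64), so the scylla-manager- line (pid 897161) works out to roughly:

```sh
# Convert the agent's oom report columns from 4 KiB pages to MiB.
echo "rss:      $(( 192644  * 4 / 1024 )) MiB"   # ~752 MiB resident
echo "swapents: $(( 2678762 * 4 / 1024 )) MiB"   # ≈10463 MiB, i.e. ~10.2 GiB in swap
```

In other words, most of the agent's footprint was sitting in swap rather than RAM at the time of the kill.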
@d-helios Is there a chance to get access to prometheus to query metrics from Fri Jun 23 17:25:47 2023, +/- a few hours? The manager agent emits RClone metrics showing how many bytes are transferred, plus there is a standard set of metrics always available:
@karol-kokoszka all metrics are available in prometheus. You can open https://backoffice.prd.dbaas.scyop.net/, choose the required cluster, and get all these results.
Ok, it's scylla-cloud and cluster 13336. Will check it.
Unfortunately this is a cloud environment and we don't collect metrics for scylla-manager-agents, due to this bug: https://github.com/scylladb/scylla-monitoring/issues/1992
Hard to say what's going on without it, as logs are limited.
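Since the agent exposes a Prometheus endpoint itself, the standard process/Go gauges could still be sampled directly on a node even without the monitoring stack collecting them; a rough sketch (localhost:5090 is a placeholder, take the actual prometheus address from the agent's configuration file):

```sh
# Sketch: scrape the agent's metrics endpoint and pick out the standard
# process/Go memory gauges (localhost:5090 is a placeholder address).
curl -s http://localhost:5090/metrics \
  | grep -E '^(process_resident_memory_bytes|go_memstats_(alloc|sys)_bytes)'
```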
Hello @karol-kokoszka, please see https://github.com/scylladb/scylla-enterprise/issues/3121#issuecomment-1646031459
We saw that in https://github.com/scylladb/scylla-enterprise/issues/3121 when the oom-killer was invoked on some of the cluster nodes.
In the 2 cases for which I have the logs, scylla-manager-agent is consuming almost the whole of the 10GB swap space.
server-60788/dmesg-logs.txt:
server-60792/dmesg-logs.txt:
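A quick way to pull the agent's lines back out of those dmesg captures (a sketch, plain grep over the files listed above):

```sh
# Grep the attached dmesg logs for the agent's oom-killer entries.
grep -H 'scylla-manager-' server-60788/dmesg-logs.txt server-60792/dmesg-logs.txt
```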