openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

OOM triggered, suspect ARC to blame #14686

Closed · prakashsurya closed this issue 1 year ago

prakashsurya commented 1 year ago

I'm still investigating things, but I wanted to open an issue with the data I have, since I'm thinking it may be related to https://github.com/openzfs/zfs/pull/14359

We have a system/image that's used as our buildserver to build various artifacts for our product (packages, VM images, etc.).

Since we integrated https://github.com/openzfs/zfs/pull/14359 into our version of ZFS used on these buildserver images, we've been seeing OOM kills, causing build failures.

In the one case that I've been able to inspect, I see the following:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3182        3041         100           0          40           7
Swap:          8191          49        8142
$ cat /proc/meminfo
MemTotal:        3258832 kB
MemFree:          114108 kB

So, the system only has ~3G of RAM.. most of which appears to be used.. but, used by what?

$ ps -e -o pid,rss,vsz,args | sort -n -k2 | tail -n 20
    781   564  74396 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 101:101
  79632   604  14032 sort -n -k2
    786   876  12180 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
    422  1176  19804 /lib/systemd/systemd-udevd
  78964  1284  24420 /lib/systemd/systemd-resolved
    661  1568   8420 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
  79631  1752   8808 ps -e -o pid,rss,vsz,args
    682  2356  17760 /lib/systemd/systemd-logind
  79585  2492 170852 (sd-pam)
  79592  2792   7236 -bash
  79591  3196  13108 sshd: delphix@pts/0
    387  4632  45968 /lib/systemd/systemd-journald
  78963  4664  18588 /usr/bin/python3 /usr/bin/delphix-startup-screen
      1  4900 169900 /sbin/init
  79561  4956  13108 sshd: delphix [priv]
  78961  5148  18800 /usr/bin/python3 /usr/bin/delphix-startup-screen
  79584  6200  18932 /lib/systemd/systemd --user
    688  7996 1424632 /usr/bin/containerd
   2676 11956 1298548 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
  79061 32496 5072372 /usr/bin/telegraf -config /etc/telegraf/telegraf.conf

It doesn't look like the memory is being used by processes...

Looking at the ARC, I see this:

  $ arcstat 1
    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
00:12:48     0       0     0       0     0      0    0   2.7G    99M  -282M
00:12:49     0       0     0       0     0      0    0   2.7G    99M  -282M
00:12:50    16       0     0      16   100      0    0   2.7G    99M  -282M

And here's a snippet from /proc/spl/kstat/zfs/arcstats:

pd                              4    1498642
pm                              4    2153388558
c                               4    104282624
c_min                           4    104282624
c_max                           4    2502782976
size                            4    2914917016
compressed_size                 4    2753806336
uncompressed_size               4    4813766144
overhead_size                   4    11655680
hdr_size                        4    40684288
data_size                       4    2714360832
metadata_size                   4    51101184
dbuf_size                       4    650328
dnode_size                      4    1624000
bonus_size                      4    444800
anon_size                       4    16384

For a nearly 3G RAM system.. using ~2.9G for ARC seems excessive.. particularly, given that OOM killer is kicking in to kill user space processes.. e.g. why doesn't the ARC evict data?

Here's an example OOM message:

Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940005] telegraf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940009] CPU: 1 PID: 2789 Comm: telegraf Kdump: loaded Tainted: P           OE     5.4.0-1097-dx2023032205-8e2e150ee-aws #105
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940010] Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940011] Call Trace:
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940018]  dump_stack+0x6d/0x8b
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940021]  dump_header+0x4f/0x1eb
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940022]  oom_kill_process.cold+0xb/0x10
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940026]  out_of_memory+0x1cf/0x500
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940029]  __alloc_pages_slowpath+0xdde/0xeb0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940032]  __alloc_pages_nodemask+0x2d0/0x320
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940035]  alloc_pages_vma+0x7f/0x220
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940037]  __read_swap_cache_async+0x161/0x230
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940039]  read_swap_cache_async+0x2b/0x70
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940040]  swap_cluster_readahead+0x23f/0x2e0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940044]  ? try_charge+0x709/0x810
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940046]  swapin_readahead+0x5c/0x470
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940047]  ? pagecache_get_page+0x2d/0x300
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940051]  do_swap_page+0x3fb/0x9f0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940053]  __handle_mm_fault+0x73b/0x7a0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940055]  handle_mm_fault+0xca/0x200
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940058]  do_user_addr_fault+0x1f9/0x450
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940061]  ? hrtimer_nanosleep+0xc2/0x1c0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940063]  __do_page_fault+0x58/0x90
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940064]  do_page_fault+0x2c/0xe0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940066]  do_async_page_fault+0x39/0x70
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940069]  async_page_fault+0x34/0x40
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940071] RIP: 0033:0x44789d
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940073] Code: 48 8d 6c 24 30 48 89 44 24 40 90 48 8d 05 7b 5d fd 09 e8 b6 4e fc ff 48 8b 4c 24 40 31 c0 31 d2 eb 03 48 ff c0 48 89 54 24 10 <48> 8b 1d fc 60 f8 09 48 39 05 fd 60 f8 09 0f 8e aa 00 00 00 48 8b
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940074] RSP: 002b:000000c0000a5ef0 EFLAGS: 00010246
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940076] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000124bac858b7
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940076] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000004
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940077] RBP: 000000c0000a5f20 R08: 0000000000000000 R09: 0000000000000000
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940078] R10: 0000000000000000 R11: 0000000000000246 R12: 000000c0000a5930
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940079] R13: 000000c000a00400 R14: 000000c0000064e0 R15: 0000000000000100
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940081] Mem-Info:
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940085] active_anon:62 inactive_anon:217 isolated_anon:0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940085]  active_file:30 inactive_file:28 isolated_file:27
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940085]  unevictable:3 dirty:8 writeback:1 unstable:0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940085]  slab_reclaimable:4803 slab_unreclaimable:62809
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940085]  mapped:59 shmem:10 pagetables:1132 bounce:0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940085]  free:17451 free_pcp:43 free_cma:0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940087] Node 0 active_anon:248kB inactive_anon:868kB active_file:120kB inactive_file:112kB unevictable:12kB isolated(anon):0kB isolated(file):108kB mapped:236kB dirty:32kB writeback:4kB shmem:40kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940088] Node 0 DMA free:12832kB min:320kB low:400kB high:480kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940091] lowmem_reserve[]: 0 2712 3137 3137 3137
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940093] Node 0 DMA32 free:53004kB min:58148kB low:72684kB high:87220kB active_anon:336kB inactive_anon:576kB active_file:196kB inactive_file:172kB unevictable:0kB writepending:20kB present:3129256kB managed:2801576kB mlocked:0kB kernel_stack:1748kB pagetables:2692kB bounce:0kB free_pcp:52kB local_pcp:52kB free_cma:0kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940096] lowmem_reserve[]: 0 0 424 424 424
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940098] Node 0 Normal free:3968kB min:9108kB low:11384kB high:13660kB active_anon:480kB inactive_anon:804kB active_file:248kB inactive_file:220kB unevictable:12kB writepending:16kB present:800768kB managed:441896kB mlocked:12kB kernel_stack:2668kB pagetables:1836kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940101] lowmem_reserve[]: 0 0 0 0 0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940103] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 12800kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940108] Node 0 DMA32: 855*4kB (UME) 6191*8kB (UME) 3*16kB (UM) 1*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 53028kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940114] Node 0 Normal: 323*4kB (UME) 325*8kB (UME) 30*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4372kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940121] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940122] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940122] 95 total pagecache pages
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940124] 29 pages in swap cache
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940125] Swap cache stats: add 166746, delete 166717, find 3815/5884
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940125] Free swap  = 7737852kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940126] Total swap = 8388604kB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940126] 986504 pages RAM
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940127] 0 pages HighMem/MovableOnly
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940127] 171796 pages reserved
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940128] 0 pages cma reserved
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940128] 0 pages hwpoisoned
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940129] Unreclaimable slab info:
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940129] Name                      Used          Total
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940142] nf_conntrack              56KB         71KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940226] zfs_znode_hold_cache          7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940241] zfs_znode_cache          815KB       2565KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940243] zil_zcw_cache              7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940244] zil_lwb_cache             23KB         23KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940327] dmu_buf_impl_t          1213KB       6913KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940330] arc_buf_t                125KB        212KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940331] arc_buf_hdr_t_full      39110KB      39110KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940376] dnode_t                 3285KB       8171KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940389] sa_cache                 283KB        705KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940390] abd_t                  13772KB      13772KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940391] lz4_cache                 64KB         64KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940402] zio_buf_comb_16384       3040KB       4608KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940403] zio_buf_comb_14336        192KB        384KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940404] zio_buf_comb_12288        168KB        240KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940405] zio_buf_comb_10240        168KB        288KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940407] zio_buf_comb_8192        224KB        384KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940408] zio_buf_comb_7168        224KB        384KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940410] zio_buf_comb_6144        128KB        288KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940411] zio_buf_comb_5120        192KB        352KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940413] zio_buf_comb_4096        160KB        288KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940414] zio_buf_comb_3584        224KB        350KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940416] zio_buf_comb_3072        300KB        420KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940417] zio_buf_comb_2560        390KB        510KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940418] zio_buf_comb_2048        352KB        480KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940419] zio_buf_comb_1536        315KB        441KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940431] zio_buf_comb_1024      15384KB      15632KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940432] zio_buf_comb_512       30400KB      30400KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940433] zio_link_cache           127KB        143KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940434] zio_cache                248KB        248KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940523] zfs_btree_leaf_cache      11328KB      32192KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940525] mqueue_inode_cache         15KB         15KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940528] PING                      48KB         48KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940529] RAW                       32KB         32KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940530] tw_sock_TCP               27KB         27KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940531] request_sock_TCP           7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940532] TCP                       61KB         61KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940533] hugetlbfs_inode_cache         30KB         30KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940534] eventpoll_pwq              7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940535] request_queue             61KB         61KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940536] blkdev_ioc                15KB         15KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940537] biovec-max               320KB        352KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940537] biovec-128                64KB         64KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940538] biovec-64                 64KB         64KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940539] khugepaged_mm_slot          7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940540] user_namespace             7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940541] uid_cache                 20KB         20KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940542] skbuff_ext_cache           7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940543] skbuff_fclone_cache         57KB         88KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940544] skbuff_head_cache         68KB         68KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940545] file_lock_cache            7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940545] fsnotify_mark_connector          8KB          8KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940546] net_namespace             57KB         57KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940547] task_delay_info           27KB         27KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940548] taskstats                 15KB         15KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940549] proc_dir_entry           110KB        110KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940549] pde_opener                 7KB          7KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940551] shmem_inode_cache        417KB        417KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940585] kernfs_node_cache       8045KB       8275KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940587] mnt_cache                212KB        232KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940589] filp                      76KB        120KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940590] names_cache              160KB        160KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940592] lsm_file_cache            67KB         67KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940593] uts_namespace             15KB         15KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940595] vm_area_struct           107KB        138KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940595] mm_struct                 63KB         63KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940596] files_cache               47KB         47KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940597] signal_cache             220KB        220KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940598] sighand_cache            371KB        371KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940600] task_struct             1067KB       1189KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940600] cred_jar                  39KB         39KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940603] anon_vma_chain            26KB         68KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940604] anon_vma                  80KB        114KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940605] pid                       32KB         32KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940606] Acpi-Operand              78KB         78KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940607] Acpi-Parse                11KB         11KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940607] Acpi-State                27KB         27KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940608] numa_policy                3KB          3KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940609] trace_event_file         118KB        118KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940610] ftrace_event_field        179KB        179KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940611] pool_workqueue           127KB        136KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940612] task_group                90KB         90KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940616] vmap_area                100KB        196KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940628] kmalloc-8k              1816KB       3648KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940629] kmalloc-4k              4996KB       5044KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940630] kmalloc-2k              2692KB       2692KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940698] kmalloc-1k              7002KB      15088KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940824] kmalloc-512             2701KB      11496KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940825] kmalloc-256              307KB        320KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940826] kmalloc-192              500KB        500KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940828] kmalloc-128              210KB        244KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940867] kmalloc-96               348KB       1449KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.940977] kmalloc-64              1039KB       5252KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941030] kmalloc-32              2499KB       3476KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941031] kmalloc-16               120KB        120KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941032] kmalloc-8                 48KB         48KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941033] kmem_cache_node           84KB         84KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941036] kmem_cache               525KB        574KB
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941037] Tasks state (memory values in pages):
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941037] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941049] [    387]     0   387    11492        4   114688      327          -250 systemd-journal
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941052] [    422]     0   422     4951        5    65536      346         -1000 systemd-udevd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941054] [    471]     0   471     2839        0    53248       55         -1000 auditd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941056] [    638]   102   638     6847        3    73728      238             0 systemd-network
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941058] [    640]   103   640     6104        0    90112      979             0 systemd-resolve
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941060] [    661]   104   661     2105        1    57344      209          -900 dbus-daemon
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941062] [    675]     0   675      965        0    45056       33             0 rngd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941064] [    677]   106   677    56027        0    77824      399             0 rsyslogd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941066] [    682]     0   682     4440        0    77824      350             0 systemd-logind
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941067] [    684]     0   684    59475        0    81920      254             0 zed
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941069] [    688]     0   688   356158        0   307200     3249          -999 containerd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941071] [    719]     0   719     1393        0    49152       62             0 cron
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941073] [    728]     0   728     4732       19    77824     1441             0 delphix-startup
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941075] [    736]     0   736     4707       17    73728     1395             0 delphix-startup
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941077] [    781]   101   781    18599        0    53248      151             0 ntpd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941079] [    786]     0   786     3045        4    61440      227         -1000 sshd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941081] [    948]   105   948     1694        2    53248       51             0 uuidd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941084] [   2675]     0  2675     1723        1    57344       61             0 delphix-telegra
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941085] [   2676]     0  2676   324637        0   339968     4783          -500 dockerd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941087] [   2788]     0  2788  1268093        0   462848     7472             0 telegraf
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941089] [   2808]     0  2808      622        0    40960       21             0 none
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941091] [   2946] 65433  2946     4731        1    73728      340             0 systemd
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941093] [   2949] 65433  2949    26028        5    94208      984             0 (sd-pam)
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941095] [   2999] 65433  2999   859203        0   851968    50810             0 java
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941097] [  16453] 65433 16453      653        0    45056       26             0 sh
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941099] [  16454] 65433 16454      653        0    45056       27             0 sh
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941100] [  16455] 65433 16455     1723        1    53248       65             0 bash
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941102] [  16458]     0 16458     2116        0    53248      110             0 sudo
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941104] [  16459]     0 16459   608915        0   454656    17872             0 java
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941106] [  16496]     0 16496   736176        0   933888    67597             0 java
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941108] [  17841]     0 17841     1724        1    45056       80             0 run-live-build.
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941110] [  17858]     0 17858   197201        0   200704      957             0 aptly
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941111] [  17997]     0 17997      960        1    49152       76             0 lb_build
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941113] [  18012]     0 18012     1724        1    53248       65             0 build
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941115] [  18013]     0 18013      960        1    40960       78             0 lb_build
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941117] [  18014]     0 18014     1372        0    49152       26             0 tee
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941119] [  72928]     0 72928      686        1    45056       77             0 lb_binary
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941121] [  72966]     0 72966      686        1    45056       78             0 lb_binary_chroo
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941123] [  73019]     0 73019     3751       32    65536      474             0 cp
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941124] [  73576] 65433 73576     1369        0    53248       21             0 sleep
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941126] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-65433.slice/session-5.scope,task=java,pid=16496,uid=0
Mar 28 22:14:31 ip-10-110-220-95 kernel: [ 1257.941194] Out of memory: Killed process 16496 (java) total-vm:2944704kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:912kB oom_score_adj:0

Additionally, I've attached a gzipped copy of syslog from the system.. syslog.gz

I'm still actively investigating this issue.. and not certain that https://github.com/openzfs/zfs/pull/14359 is to blame yet.. but things point to it so far.. For example, an image generated prior to that change being integrated does not have these OOM issues.. but an image generated after that change was integrated does have them.. That, coupled with the fact that nearly all 3G of the system's RAM is being used by the ARC, is leading me to suspect the changes to ARC eviction are at play here..

As I get more information, I'll report back.. but I wanted to post the information I have so far, in case somebody more familiar with this code could chime in and maybe help me pinpoint the problem..

CC: @amotin @behlendorf @allanjude

amotin commented 1 year ago

It seems like the ARC was trying to shrink to the minimum, but can't. I have no clue why that could be. It would help if you could provide full kstat and/or arc_summary output. I wonder if the data is somehow non-evictable, or if it is really some math issue.
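
For reference, a minimal sketch of how those diagnostics could be captured on the affected system (the kstat path and the arc_summary tool are the ones named above; the output filenames are arbitrary):

$ cat /proc/spl/kstat/zfs/arcstats > /tmp/arcstats.txt    # full ARC kstats
$ arc_summary > /tmp/arc_summary.txt                      # human-readable summary (ships with OpenZFS)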

prakashsurya commented 1 year ago

I lost access to the system (the build eventually times out, and the system is cleaned up), but here's the arcstats output from when I was able to log in and inspect..

$ cat /proc/spl/kstat/zfs/arcstats
9 1 0x01 147 39984 4026867198 8436358124041
name                            type data
hits                            4    10390029
iohits                          4    178833
misses                          4    365409
demand_data_hits                4    5136518
demand_data_iohits              4    121054
demand_data_misses              4    91780
demand_metadata_hits            4    5162519
demand_metadata_iohits          4    3635
demand_metadata_misses          4    121016
prefetch_data_hits              4    9243
prefetch_data_iohits            4    943
prefetch_data_misses            4    146114
prefetch_metadata_hits          4    81749
prefetch_metadata_iohits        4    53201
prefetch_metadata_misses        4    6499
mru_hits                        4    2336106
mru_ghost_hits                  4    31069
mfu_hits                        4    8053923
mfu_ghost_hits                  4    84648
uncached_hits                   4    0
deleted                         4    469736
mutex_miss                      4    1058
access_skip                     4    1
evict_skip                      4    4147591
evict_not_enough                4    386821
evict_l2_cached                 4    0
evict_l2_eligible               4    75218220544
evict_l2_eligible_mfu           4    16507243520
evict_l2_eligible_mru           4    58710977024
evict_l2_ineligible             4    476856320
evict_l2_skip                   4    0
hash_elements                   4    149534
hash_elements_max               4    149536
hash_collisions                 4    65230
hash_chains                     4    17601
hash_chain_max                  4    4
meta                            4    575429547
pd                              4    1791783
pm                              4    2153388558
c                               4    104282624
c_min                           4    104282624
c_max                           4    2502782976
size                            4    2914759656
compressed_size                 4    2752201216
uncompressed_size               4    4811579904
overhead_size                   4    13123072
hdr_size                        4    40684048
data_size                       4    2713842688
metadata_size                   4    51481600
dbuf_size                       4    649936
dnode_size                      4    1621000
bonus_size                      4    444160
anon_size                       4    2519040
anon_data                       4    2502656
anon_metadata                   4    16384
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    2750262784
mru_data                        4    2708664832
mru_metadata                    4    41597952
mru_evictable_data              4    2708664832
mru_evictable_metadata          4    37024768
mru_ghost_size                  4    92644352
mru_ghost_data                  4    14796800
mru_ghost_metadata              4    77847552
mru_ghost_evictable_data        4    14796800
mru_ghost_evictable_metadata    4    77847552
mfu_size                        4    12542464
mfu_data                        4    2675200
mfu_metadata                    4    9867264
mfu_evictable_data              4    0
mfu_evictable_metadata          4    4100608
mfu_ghost_size                  4    293185024
mfu_ghost_data                  4    261363200
mfu_ghost_metadata              4    31821824
mfu_ghost_evictable_data        4    261363200
mfu_ghost_evictable_metadata    4    31821824
uncached_size                   4    0
uncached_data                   4    0
uncached_metadata               4    0
uncached_evictable_data         4    0
uncached_evictable_metadata     4    0
l2_hits                         4    0
l2_misses                       4    0
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    477
memory_direct_count             4    14008
memory_indirect_count           4    184797
memory_all_bytes                4    3337043968
memory_free_bytes               4    124661760
memory_available_bytes          3    -292468736
arc_no_grow                     4    1
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    23893
arc_meta_used                   4    94880744
arc_dnode_limit                 4    250278297
async_upgrade_sync              4    121467
predictive_prefetch             4    297746
demand_hit_predictive_prefetch  4    34197
demand_iohit_predictive_prefetch 4    123807
prescient_prefetch              4    3
demand_hit_prescient_prefetch   4    3
demand_iohit_prescient_prefetch 4    0
arc_need_free                   4    83321856
arc_sys_free                    4    417130496
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    106036224
amotin commented 1 year ago

According to the small "pd" value, the size of MRU data should be very small, but that is where most of the ARC size resides according to the stats. I'll think more about it, but if you can reproduce the problem, it would help to dump everything possible around the arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e) call, in case some math does not work right there.

prakashsurya commented 1 year ago

but if you can reproduce the problem

I can.. it might take me half a day, but I should be able to reproduce it..

it could help with understanding to dump everything possible around the arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e) call

any chance you could provide me a patch, or open a PR, with the code you'd like me to test with? I can get that done tomorrow (PST).

I appreciate the help!

grwilson commented 1 year ago

@amotin we run our systems with zfs_arc_meta_strategy = 0, which was removed. Is setting zfs_arc_meta_balance = 1 the equivalent? I wonder if that would help us avoid the OOM.
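
For reference, the new tunable can be read and changed at runtime through the module parameters, e.g. (a minimal sketch; the parameter name comes from the question above, and the value shown is only an example):

$ cat /sys/module/zfs/parameters/zfs_arc_meta_balance
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_arc_meta_balance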

behlendorf commented 1 year ago

The allocation which failed is also very surprising. It's a tiny order-0 GFP_HIGHUSER_MOVABLE allocation which should have been able to get a page via direct reclaim from the ARC. The memory_direct_count = 14008 kstat indicates we were running direct reclaim frequently, and arc_c was driven down to arc_c_min, so we sure are trying.

gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

It could be that we're not pruning the Linux dentry and inode caches aggressively enough via arc_prune_async(), allowing pinned metadata buffers to build up. We did consolidate a few calls to arc_prune_async(), but the kstats suggest that isn't the case, given the small number of dnodes and the small metadata buffer size. As @amotin mentioned, I think we're going to need to add some instrumentation to see why nothing is being evicted.
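
A quick way to sanity-check that on a live system is to compare the arc_prune counter with the kernel's dentry and inode counts (a rough sketch; arc_prune appears in the kstat dumps above, and the /proc files are the standard Linux VFS counters):

$ grep '^arc_prune ' /proc/spl/kstat/zfs/arcstats
$ cat /proc/sys/fs/dentry-state   # nr_dentry, nr_unused, ...
$ cat /proc/sys/fs/inode-nr       # nr_inodes, nr_free_inodes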

One experiment you might try is setting zfs_arc_evict_batch_limit = 0 and running echo 3 >/proc/sys/vm/drop_caches. This should drop the ARC and might at least help rule out some kind of fundamental accounting problem.

@grwilson it looks like you must also be increasing the default zfs_arc_max?

prakashsurya commented 1 year ago

One experiment you might try is setting

i.e. reproduce a system to this point, and then run these commands, and see what happens to the ARC? I can work on that..

it looks like you must also be increasing the default zfs_arc_max?

$ cat /sys/module/zfs/parameters/zfs_arc_max
0
prakashsurya commented 1 year ago

I'd say we may not be pruning the Linux dentry and inode caches aggressively enough via arc_prune_async() and allowing pinned metadata buffers to build up.

If this was the case, would the stats show things accounted for in mru_data like I see?

mru_data                        4    2708664832
mru_metadata                    4    41597952
behlendorf commented 1 year ago

reproduce a system to this point, and then run these commands, and see what happens to the ARC?

Right. What we'd expect is for the Linux virtual memory system to apply pressure to the ARC until everything has been reclaimed down to zfs_arc_min. Almost exactly like what should happen when direct reclaim is invoked on a node that's low on memory.

If this was the case, would the stats show things accounted for in mru_data like I see?

That's the thing: I'd expect it to be accounted for as metadata, not data. Which is what makes me think that's not what's going on, unless we're somehow miscategorizing these buffers.
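
A quick way to check how the cached buffers are being categorized is to pull just the per-state data/metadata counters out of arcstats (a sketch; field names as in the dumps above):

$ grep -E '^(anon|mru|mfu|uncached)_(data|metadata) ' /proc/spl/kstat/zfs/arcstats
$ grep -E '_evictable_(data|metadata) ' /proc/spl/kstat/zfs/arcstats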

prakashsurya commented 1 year ago

OK.. I appreciate the input and help so far. I'm trying to reproduce the issue now, so I can try the drop_caches experiment.. haven't hit the OOM yet, but I'll keep kicking off builds until I do..

Additionally, if there are any patches you'd like me to deploy and test with, feel free to send them to me and I can get them into our buildserver image that has the problem. It's not clear to me yet what information would help root-cause this.

prakashsurya commented 1 year ago

I haven't been able to reproduce the issue so far today, after running 30+ builds 🤷 Before, it'd hit relatively frequently, maybe every 3-4 builds...

behlendorf commented 1 year ago

It's always something! I also noticed I made a terrible typo above: you want to set zfs_arc_shrinker_limit = 0, not zfs_arc_evict_batch_limit = 0. If you're ever able to reliably reproduce this, you could also try increasing this value so we try to release more memory from the ARC on every direct reclaim invocation. Here's the relevant comment from arc_shrinker_count(), with a sketch of the corrected experiment after it. Of course, none of this explains why the ARC wouldn't be able to evict the buffers.

        /*
         * This code is reached in the "direct reclaim" case, where the
         * kernel (outside ZFS) is trying to allocate a page, and the system
         * is low on memory.
         *
         * The kernel's shrinker code doesn't understand how many pages the
         * ARC's callback actually frees, so it may ask the ARC to shrink a
         * lot for one page allocation. This is problematic because it may
         * take a long time, thus delaying the page allocation, and because
         * it may force the ARC to unnecessarily shrink very small.
         *
         * Therefore, we limit the amount of data that we say is evictable,
         * which limits the amount that the shrinker will ask us to evict for
         * one page allocation attempt.
         *
         * In practice, we may be asked to shrink 4x the limit to satisfy one
         * page allocation, before the kernel's shrinker code gives up on us.
         * When that happens, we rely on the kernel code to find the pages
         * that we freed before invoking the OOM killer.  This happens in
         * __alloc_pages_slowpath(), which retries and finds the pages we
         * freed when it calls get_page_from_freelist().
         *
         * See also the comment above zfs_arc_shrinker_limit.
         */
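
For clarity, the corrected experiment would look roughly like this (parameter name and paths as above; zfs_arc_shrinker_limit = 0 removes the per-call cap described in the comment, so the shrinker can evict as much as the kernel asks for):

$ echo 0 | sudo tee /sys/module/zfs/parameters/zfs_arc_shrinker_limit
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ grep -E '^(size|c|c_min|c_max) ' /proc/spl/kstat/zfs/arcstats   # did the ARC actually shrink?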
prakashsurya commented 1 year ago

Hit again:

$ cat /proc/spl/kstat/zfs/arcstats
9 1 0x01 147 39984 3953761573 11885741844167
name                            type data
hits                            4    9638365
iohits                          4    189495
misses                          4    340531
demand_data_hits                4    4496576
demand_data_iohits              4    135335
demand_data_misses              4    52141
demand_metadata_hits            4    5048597
demand_metadata_iohits          4    3581
demand_metadata_misses          4    123481
prefetch_data_hits              4    9586
prefetch_data_iohits            4    1383
prefetch_data_misses            4    158403
prefetch_metadata_hits          4    83606
prefetch_metadata_iohits        4    49196
prefetch_metadata_misses        4    6506
mru_hits                        4    2393008
mru_ghost_hits                  4    29075
mfu_hits                        4    7245357
mfu_ghost_hits                  4    37085
uncached_hits                   4    0
deleted                         4    517070
mutex_miss                      4    829
access_skip                     4    1
evict_skip                      4    2189947
evict_not_enough                4    184782
evict_l2_cached                 4    0
evict_l2_eligible               4    69967979520
evict_l2_eligible_mfu           4    10643459072
evict_l2_eligible_mru           4    59324520448
evict_l2_ineligible             4    254993408
evict_l2_skip                   4    0
hash_elements                   4    145916
hash_elements_max               4    145917
hash_collisions                 4    66779
hash_chains                     4    17019
hash_chain_max                  4    4
meta                            4    1012096539
pd                              4    303887742
pm                              4    2243957928
c                               4    105199872
c_min                           4    105199872
c_max                           4    2524796928
size                            4    2929218120
compressed_size                 4    2768971776
uncompressed_size               4    4978176512
overhead_size                   4    12706816
hdr_size                        4    39701440
data_size                       4    2730064896
metadata_size                   4    51613696
dbuf_size                       4    750288
dnode_size                      4    1819000
bonus_size                      4    508480
anon_size                       4    1188352
anon_data                       4    1171968
anon_metadata                   4    16384
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    2768146432
mru_data                        4    2726401024
mru_metadata                    4    41745408
mru_evictable_data              4    2726401024
mru_evictable_metadata          4    36540416
mru_ghost_size                  4    63939584
mru_ghost_data                  4    23429632
mru_ghost_metadata              4    40509952
mru_ghost_evictable_data        4    23429632
mru_ghost_evictable_metadata    4    40509952
mfu_size                        4    12343808
mfu_data                        4    2491904
mfu_metadata                    4    9851904
mfu_evictable_data              4    0
mfu_evictable_metadata          4    3789824
mfu_ghost_size                  4    251945984
mfu_ghost_data                  4    211022336
mfu_ghost_metadata              4    40923648
mfu_ghost_evictable_data        4    211022336
mfu_ghost_evictable_metadata    4    40923648
uncached_size                   4    0
uncached_data                   4    0
uncached_metadata               4    0
uncached_evictable_data         4    0
uncached_evictable_metadata     4    0
l2_hits                         4    0
l2_misses                       4    0
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    57
memory_direct_count             4    3626
memory_indirect_count           4    115746
memory_all_bytes                4    3366395904
memory_free_bytes               4    142213120
memory_available_bytes          3    -278586368
arc_no_grow                     4    1
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    19670
arc_meta_used                   4    94392904
arc_dnode_limit                 4    252479692
async_upgrade_sync              4    135710
predictive_prefetch             4    308677
demand_hit_predictive_prefetch  4    34640
demand_iohit_predictive_prefetch 4    137563
prescient_prefetch              4    3
demand_hit_prescient_prefetch   4    3
demand_iohit_prescient_prefetch 4    0
arc_need_free                   4    68391424
arc_sys_free                    4    420799488
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    104760320

And here's the OOM message from this occurrence:

[ 1257.892709] sudo invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 1257.892712] CPU: 0 PID: 16473 Comm: sudo Kdump: loaded Tainted: P           OE     5.4.0-1097-dx2023032205-8e2e150ee-aws #105
[ 1257.892712] Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017
[ 1257.892713] Call Trace:
[ 1257.892719]  dump_stack+0x6d/0x8b
[ 1257.892722]  dump_header+0x4f/0x1eb
[ 1257.892724]  oom_kill_process.cold+0xb/0x10
[ 1257.892727]  out_of_memory+0x1cf/0x500
[ 1257.892730]  __alloc_pages_slowpath+0xdde/0xeb0
[ 1257.892733]  ? __switch_to+0x397/0x470
[ 1257.892735]  __alloc_pages_nodemask+0x2d0/0x320
[ 1257.892738]  alloc_pages_current+0x87/0xe0
[ 1257.892741]  __page_cache_alloc+0x72/0x90
[ 1257.892742]  pagecache_get_page+0xbf/0x300
[ 1257.892744]  filemap_fault+0x83e/0xa50
[ 1257.892747]  ? mem_cgroup_charge_statistics+0x51/0xe0
[ 1257.892749]  ? xas_find+0x17f/0x1c0
[ 1257.892750]  ? filemap_map_pages+0x24c/0x380
[ 1257.892754]  __do_fault+0x3c/0x170
[ 1257.892755]  do_fault+0x24b/0x640
[ 1257.892757]  __handle_mm_fault+0x4c5/0x7a0
[ 1257.892759]  handle_mm_fault+0xca/0x200
[ 1257.892762]  do_user_addr_fault+0x1f9/0x450
[ 1257.892763]  __do_page_fault+0x58/0x90
[ 1257.892765]  do_page_fault+0x2c/0xe0
[ 1257.892767]  do_async_page_fault+0x39/0x70
[ 1257.892769]  async_page_fault+0x34/0x40
[ 1257.892770] RIP: 0033:0x7f8665da0379
[ 1257.892774] Code: Bad RIP value.
[ 1257.892775] RSP: 002b:00007ffd494b3a70 EFLAGS: 00010202
[ 1257.892776] RAX: 0000000000000000 RBX: 0000562bd28c98a0 RCX: 00007f8665bea166
[ 1257.892777] RDX: 0000000000000000 RSI: 00007ffd494b3920 RDI: 0000000000000002
[ 1257.892777] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[ 1257.892778] R10: 0000000000000008 R11: 0000000000000246 R12: 0000562bd28d99c0
[ 1257.892779] R13: 00007ffd494b3bc0 R14: 0000000000000005 R15: 00007ffd494b3f80
[ 1257.892780] Mem-Info:
[ 1257.892783] active_anon:7 inactive_anon:19 isolated_anon:7
                active_file:2 inactive_file:17 isolated_file:0
                unevictable:3 dirty:9 writeback:1 unstable:0
                slab_reclaimable:4583 slab_unreclaimable:58236
                mapped:0 shmem:1 pagetables:571 bounce:0
                free:18077 free_pcp:67 free_cma:0
[ 1257.892786] Node 0 active_anon:28kB inactive_anon:76kB active_file:8kB inactive_file:68kB unevictable:12kB isolated(anon):28kB isolated(file):0kB mapped:0kB dirty:36kB writeback:4kB shmem:4kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1257.892787] Node 0 DMA free:12940kB min:316kB low:392kB high:468kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1257.892790] lowmem_reserve[]: 0 2680 3165 3165 3165
[ 1257.892791] Node 0 DMA32 free:53128kB min:56956kB low:71192kB high:85428kB active_anon:48kB inactive_anon:48kB active_file:0kB inactive_file:44kB unevictable:0kB writepending:4kB present:3129256kB managed:2775564kB mlocked:0kB kernel_stack:372kB pagetables:936kB bounce:0kB free_pcp:44kB local_pcp:36kB free_cma:0kB
[ 1257.892794] lowmem_reserve[]: 0 0 484 484 484
[ 1257.892796] Node 0 Normal free:6240kB min:10304kB low:12880kB high:15456kB active_anon:0kB inactive_anon:36kB active_file:232kB inactive_file:0kB unevictable:12kB writepending:36kB present:829440kB managed:496572kB mlocked:12kB kernel_stack:2188kB pagetables:1348kB bounce:0kB free_pcp:224kB local_pcp:224kB free_cma:0kB
[ 1257.892799] lowmem_reserve[]: 0 0 0 0 0
[ 1257.892800] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 2*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 12940kB
[ 1257.892807] Node 0 DMA32: 243*4kB (UME) 6213*8kB (UE) 70*16kB (U) 64*32kB (U) 1*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 53908kB
[ 1257.892813] Node 0 Normal: 88*4kB (UME) 547*8kB (UE) 35*16kB (UE) 39*32kB (U) 1*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6600kB
[ 1257.892819] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1257.892820] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1257.892820] 23 total pagecache pages
[ 1257.892822] 1 pages in swap cache
[ 1257.892823] Swap cache stats: add 184846, delete 184845, find 8573/20459
[ 1257.892823] Free swap  = 8333820kB
[ 1257.892824] Total swap = 8388604kB
[ 1257.892824] 993672 pages RAM
[ 1257.892825] 0 pages HighMem/MovableOnly
[ 1257.892825] 171798 pages reserved
[ 1257.892826] 0 pages cma reserved
[ 1257.892826] 0 pages hwpoisoned
[ 1257.892827] Unreclaimable slab info:
[ 1257.892827] Name                      Used          Total
[ 1257.892841] nf_conntrack              48KB         63KB
[ 1257.892913] zfs_znode_hold_cache          7KB          7KB
[ 1257.892923] zfs_znode_cache          330KB       1524KB
[ 1257.892925] zil_zcw_cache              7KB          7KB
[ 1257.892926] zil_lwb_cache             31KB         31KB
[ 1257.892991] dmu_buf_impl_t           949KB       5841KB
[ 1257.892994] arc_buf_t                146KB        252KB
[ 1257.892995] arc_buf_hdr_t_full      38210KB      38210KB
[ 1257.893026] dnode_t                 2382KB       6187KB
[ 1257.893034] sa_cache                 148KB        399KB
[ 1257.893035] abd_t                  13637KB      13637KB
[ 1257.893036] lz4_cache                 64KB         64KB
[ 1257.893043] zio_buf_comb_16384       2256KB       3328KB
[ 1257.893044] zio_buf_comb_14336        160KB        384KB
[ 1257.893045] zio_buf_comb_12288        144KB        216KB
[ 1257.893046] zio_buf_comb_10240        144KB        264KB
[ 1257.893048] zio_buf_comb_8192        128KB        288KB
[ 1257.893049] zio_buf_comb_7168        160KB        320KB
[ 1257.893050] zio_buf_comb_6144        224KB        384KB
[ 1257.893052] zio_buf_comb_5120        224KB        384KB
[ 1257.893053] zio_buf_comb_4096        156KB        288KB
[ 1257.893054] zio_buf_comb_3584        283KB        409KB
[ 1257.893055] zio_buf_comb_3072        360KB        480KB
[ 1257.893056] zio_buf_comb_2560        480KB        600KB
[ 1257.893057] zio_buf_comb_2048        160KB        288KB
[ 1257.893059] zio_buf_comb_1536        189KB        315KB
[ 1257.893074] zio_buf_comb_1024      15229KB      15664KB
[ 1257.893075] zio_buf_comb_512       30248KB      30248KB
[ 1257.893076] zio_link_cache            67KB         83KB
[ 1257.893077] zio_cache                197KB        247KB
[ 1257.893149] zfs_btree_leaf_cache       8264KB      26464KB
[ 1257.893150] mqueue_inode_cache         15KB         15KB
[ 1257.893153] PING                      64KB         64KB
[ 1257.893154] RAW                       32KB         32KB
[ 1257.893155] tw_sock_TCP               31KB         31KB
[ 1257.893156] request_sock_TCP           7KB          7KB
[ 1257.893157] TCP                       61KB         61KB
[ 1257.893158] hugetlbfs_inode_cache         30KB         30KB
[ 1257.893159] eventpoll_pwq              7KB          7KB
[ 1257.893160] request_queue             61KB         61KB
[ 1257.893160] blkdev_ioc                15KB         15KB
[ 1257.893161] biovec-max               352KB        352KB
[ 1257.893162] biovec-128                64KB         64KB
[ 1257.893163] biovec-64                 48KB         48KB
[ 1257.893164] khugepaged_mm_slot          7KB          7KB
[ 1257.893165] user_namespace             7KB          7KB
[ 1257.893166] uid_cache                 16KB         16KB
[ 1257.893166] skbuff_ext_cache           7KB          7KB
[ 1257.893167] skbuff_fclone_cache        112KB        112KB
[ 1257.893168] skbuff_head_cache         20KB         36KB
[ 1257.893169] file_lock_cache           11KB         11KB
[ 1257.893170] fsnotify_mark_connector          8KB          8KB
[ 1257.893171] net_namespace             57KB         57KB
[ 1257.893172] task_delay_info           19KB         19KB
[ 1257.893172] taskstats                 15KB         15KB
[ 1257.893173] proc_dir_entry           114KB        114KB
[ 1257.893174] pde_opener                 7KB          7KB
[ 1257.893176] shmem_inode_cache        402KB        402KB
[ 1257.893214] kernfs_node_cache       7809KB       8068KB
[ 1257.893216] mnt_cache                201KB        221KB
[ 1257.893217] filp                      87KB        124KB
[ 1257.893218] names_cache              160KB        160KB
[ 1257.893219] lsm_file_cache            59KB         59KB
[ 1257.893220] uts_namespace             15KB         15KB
[ 1257.893221] vm_area_struct           214KB        223KB
[ 1257.893221] mm_struct                 63KB         63KB
[ 1257.893222] files_cache               47KB         47KB
[ 1257.893223] signal_cache             171KB        236KB
[ 1257.893224] sighand_cache            332KB        371KB
[ 1257.893226] task_struct              953KB       1067KB
[ 1257.893226] cred_jar                  43KB         43KB
[ 1257.893228] anon_vma_chain           198KB        244KB
[ 1257.893229] anon_vma                 178KB        178KB
[ 1257.893229] pid                       36KB         36KB
[ 1257.893230] Acpi-Operand              74KB         74KB
[ 1257.893230] Acpi-Parse                15KB         15KB
[ 1257.893231] Acpi-State                27KB         27KB
[ 1257.893231] numa_policy                3KB          3KB
[ 1257.893232] trace_event_file         118KB        118KB
[ 1257.893232] ftrace_event_field        183KB        183KB
[ 1257.893233] pool_workqueue           108KB        108KB
[ 1257.893234] task_group                82KB         82KB
[ 1257.893236] vmap_area                126KB        188KB
[ 1257.893244] kmalloc-8k              1432KB       2880KB
[ 1257.893246] kmalloc-4k              4640KB       4736KB
[ 1257.893247] kmalloc-2k              3064KB       3064KB
[ 1257.893304] kmalloc-1k              6077KB      13356KB
[ 1257.893403] kmalloc-512             2262KB       9736KB
[ 1257.893405] kmalloc-256              292KB        308KB
[ 1257.893405] kmalloc-192              492KB        492KB
[ 1257.893407] kmalloc-128              231KB        248KB
[ 1257.893439] kmalloc-96               428KB       1724KB
[ 1257.893553] kmalloc-64               977KB       5440KB
[ 1257.893608] kmalloc-32              2481KB       3784KB
[ 1257.893609] kmalloc-16               120KB        120KB
[ 1257.893609] kmalloc-8                 48KB         48KB
[ 1257.893610] kmem_cache_node           80KB         80KB
[ 1257.893613] kmem_cache               497KB        551KB
[ 1257.893613] Tasks state (memory values in pages):
[ 1257.893614] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 1257.893618] [    379]     0   379    11489        0   110592      341          -250 systemd-journal
[ 1257.893620] [    416]     0   416     4951        0    61440      341         -1000 systemd-udevd
[ 1257.893621] [    468]     0   468     2839        0    49152       57         -1000 auditd
[ 1257.893622] [    635]   102   635     6847        0    73728      241             0 systemd-network
[ 1257.893624] [    663]   104   663     2106        1    49152      210          -900 dbus-daemon
[ 1257.893625] [    673]     0   673      965        0    49152       32             0 rngd
[ 1257.893626] [    674]   106   674    56027        0    81920      450             0 rsyslogd
[ 1257.893627] [    676]     0   676     4407        0    73728      313             0 systemd-logind
[ 1257.893628] [    679]     0   679    59475        0    81920      278             0 zed
[ 1257.893629] [    680]     0   680   356094        0   299008     3258          -999 containerd
[ 1257.893630] [    728]     0   728     1393        0    53248       67             0 cron
[ 1257.893632] [    736]   101   736    18599        0    61440      154             0 ntpd
[ 1257.893633] [    753]     0   753     3045        2    65536      228         -1000 sshd
[ 1257.893634] [    941]   105   941     1694        0    49152       52             0 uuidd
[ 1257.893635] [   2658]     0  2658     1723        0    45056       61             0 delphix-telegra
[ 1257.893636] [   2659]     0  2659   324637        0   327680     4868          -500 dockerd
[ 1257.893637] [   2790]     0  2790      622        0    45056       20             0 none
[ 1257.893638] [   2944] 65433  2944     4731        0    81920      343             0 systemd
[ 1257.893640] [  16468] 65433 16468      653        0    45056       25             0 sh
[ 1257.893641] [  16469] 65433 16469      653        0    45056       31             0 sh
[ 1257.893642] [  16470] 65433 16470     1724        1    49152       67             0 bash
[ 1257.893643] [  16473]     0 16473     2116        0    53248      121             0 sudo
[ 1257.893644] [  17852]     0 17852     1724        1    57344       80             0 run-live-build.
[ 1257.893645] [  18003]     0 18003      960        1    49152       76             0 lb_build
[ 1257.893646] [  18018]     0 18018     1724        1    49152       66             0 build
[ 1257.893647] [  18019]     0 18019      960        1    40960       77             0 lb_build
[ 1257.893648] [  18020]     0 18020     1372        0    45056       27             0 tee
[ 1257.893650] [  76532]     0 76532      686        0    40960       77             0 lb_binary
[ 1257.893651] [  76570]     0 76570      686        1    40960       78             0 lb_binary_chroo
[ 1257.893652] [  76623]     0 76623     3726        0    69632      519             0 cp
[ 1257.893653] [  77171] 65433 77171     1369        0    49152       21             0 sleep
[ 1257.893654] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-65433.slice/session-5.scope,task=cp,pid=76623,uid=0
[ 1257.893660] Out of memory: Killed process 76623 (cp) total-vm:14904kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68kB oom_score_adj:0

Run experiment:

$ sudo su - root
root@ip-10-110-226-245:~# echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
root@ip-10-110-226-245:~# echo 3 >/proc/sys/vm/drop_caches

While waiting for drop_caches to return:

$ arcstat 5
    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
00:48:20     0       0     0       0     0      0    0   2.7G   100M  -269M
00:48:25    73      25   100      47   100      0    0   2.7G   100M  -266M
00:48:30    83      27   100      56   100      0    0   2.7G   100M  -270M
00:48:35    73      26   100      46   100      0    0   2.7G   100M  -266M
00:48:40    83      27   100      56   100      0    0   2.7G   100M  -269M
00:48:45    73      26   100      46   100      0    0   2.7G   100M  -266M
00:48:50   117      27   100      90   100      0    0   2.7G   100M  -266M
00:48:55    78      26    92      52   100      0    0   2.7G   100M  -262M
00:49:00    85      26    96      58   100      0    0   2.7G   100M  -264M
00:49:05    81      26   100      55   100      0    0   2.7G   100M  -263M
00:49:10    83      26   100      56   100      0    0   2.7G   100M  -265M
00:49:15    73      26   100      46   100      0    0   2.7G   100M  -262M
00:49:20    84      27   100      57   100      0    0   2.7G   100M  -265M

Nothing seems to be happening w.r.t. the ARC?

I should have the system available for maybe 10 more hours (before it times out), if there's anything I could collect from it to help?

amotin commented 1 year ago

We run our systems with zfs_arc_meta_strategy = 0, which was removed. Is setting zfs_arc_meta_balance = 1 the equivalent? I wonder if that would help us avoid the OOM.

@grwilson The code behind zfs_arc_meta_strategy was removed completely; there is nothing left of it. I am not sure what your motivation was behind setting that tunable. Setting zfs_arc_meta_balance = 1 would cause metadata to be evicted first if there are any data ghost hits at all, which I really doubt is your goal. zfs_arc_meta_balance should not affect OOMs; it only controls what gets evicted from the ARC under pressure.
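
To illustrate the direction of that effect, here is a deliberately simplified toy model (not the OpenZFS implementation; it just assumes, for illustration, that data ghost hits get weighted more heavily as the balance value shrinks):

/* Toy model only; the real logic lives in arc_evict() and differs. */
#include <stdio.h>
#include <stdint.h>

static const char *
toy_evict_target(uint64_t data_ghost_hits, uint64_t meta_ghost_hits,
    uint64_t balance)
{
	/*
	 * Assumption for illustration: data ghost hits are scaled by
	 * 100 / balance, so a small balance makes any data ghost hit
	 * dominate and push eviction toward metadata.
	 */
	uint64_t data_pressure = data_ghost_hits * 100 / balance;
	uint64_t meta_pressure = meta_ghost_hits;

	return (data_pressure > meta_pressure ? "metadata" : "data");
}

int
main(void)
{
	/* One data ghost hit vs. fifty metadata ghost hits. */
	printf("balance=1:   evict %s first\n", toy_evict_target(1, 50, 1));
	printf("balance=500: evict %s first\n", toy_evict_target(1, 50, 500));
	return (0);
}

With balance = 1 the single data ghost hit wins and metadata is evicted first; with a larger balance it does not.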

rincebrain commented 1 year ago

If it's an EC2 VM, doesn't Amazon let you hibernate the VM? Then you could snapshot the volume and be able to bring it back up, broken like this, very conveniently on demand, as long as you kept the snapshot.

prakashsurya commented 1 year ago

After 6 hours, it's finally able to evict data out of the ARC:

$ arcstat 1
    time  read  ddread  ddh%  dmread  dmh%  pread  ph%   size      c  avail
04:15:24     0       0     0       0     0      0    0   461M   506M   1.9G
04:15:25   153       0     0     153   100      0    0   460M   506M   1.9G
04:15:26     0       0     0       0     0      0    0   460M   506M   1.9G

I have some Grafana monitoring enabled, so I can clearly see that it took 6 hours for this to occur:

[image: Grafana graph of ARC memory over the ~6-hour window]

Hm.. not sure what happened, but I stepped away, and now that I'm able to start poking around again, it looks like the ARC has evicted. I don't know why it couldn't (wouldn't?) do that for ~6 hours..

Also, here are the arcstats at this point, now that it's finally been able to evict:

$ cat /proc/spl/kstat/zfs/arcstats
9 1 0x01 147 39984 3953761573 25074486980481
name                            type data
hits                            4    11624683
iohits                          4    191778
misses                          4    430403
demand_data_hits                4    5138324
demand_data_iohits              4    135905
demand_data_misses              4    133482
demand_metadata_hits            4    6378098
demand_metadata_iohits          4    3624
demand_metadata_misses          4    124047
prefetch_data_hits              4    12446
prefetch_data_iohits            4    2502
prefetch_data_misses            4    166206
prefetch_metadata_hits          4    95815
prefetch_metadata_iohits        4    49747
prefetch_metadata_misses        4    6668
mru_hits                        4    2502908
mru_ghost_hits                  4    29342
mfu_hits                        4    9121775
mfu_ghost_hits                  4    118668
uncached_hits                   4    0
deleted                         4    660487
mutex_miss                      4    53091
access_skip                     4    1
evict_skip                      4    116543957
evict_not_enough                4    18118732
evict_l2_cached                 4    0
evict_l2_eligible               4    85043187712
evict_l2_eligible_mfu           4    20706493952
evict_l2_eligible_mru           4    64336693760
evict_l2_ineligible             4    545715200
evict_l2_skip                   4    0
hash_elements                   4    10717
hash_elements_max               4    148339
hash_collisions                 4    72651
hash_chains                     4    121
hash_chain_max                  4    5
meta                            4    378628870
pd                              4    17354729
pm                              4    2243815708
c                               4    536762624
c_min                           4    105199872
c_max                           4    2524796928
size                            4    483649488
compressed_size                 4    436979200
uncompressed_size               4    756879872
overhead_size                   4    27902976
hdr_size                        4    2934944
data_size                       4    447864320
metadata_size                   4    17017856
dbuf_size                       4    1032528
dnode_size                      4    2292000
bonus_size                      4    665280
anon_size                       4    2321920
anon_data                       4    2305536
anon_metadata                   4    16384
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    146936320
mru_data                        4    140604928
mru_metadata                    4    6331392
mru_evictable_data              4    130033664
mru_evictable_metadata          4    491520
mru_ghost_size                  4    57907712
mru_ghost_data                  4    18083840
mru_ghost_metadata              4    39823872
mru_ghost_evictable_data        4    18083840
mru_ghost_evictable_metadata    4    39823872
mfu_size                        4    315623936
mfu_data                        4    304953856
mfu_metadata                    4    10670080
mfu_evictable_data              4    299189248
mfu_evictable_metadata          4    991232
mfu_ghost_size                  4    41371648
mfu_ghost_data                  4    4413440
mfu_ghost_metadata              4    36958208
mfu_ghost_evictable_data        4    4413440
mfu_ghost_evictable_metadata    4    36958208
uncached_size                   4    0
uncached_data                   4    0
uncached_metadata               4    0
uncached_evictable_data         4    0
uncached_evictable_metadata     4    0
l2_hits                         4    0
l2_misses                       4    0
l2_prefetch_asize               4    0
l2_mru_asize                    4    0
l2_mfu_asize                    4    0
l2_bufc_data_asize              4    0
l2_bufc_metadata_asize          4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_lock_retry            4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_asize                        4    0
l2_hdr_size                     4    0
l2_log_blk_writes               4    0
l2_log_blk_avg_asize            4    0
l2_log_blk_asize                4    0
l2_log_blk_count                4    0
l2_data_to_meta_ratio           4    0
l2_rebuild_success              4    0
l2_rebuild_unsupported          4    0
l2_rebuild_io_errors            4    0
l2_rebuild_dh_errors            4    0
l2_rebuild_cksum_lb_errors      4    0
l2_rebuild_lowmem               4    0
l2_rebuild_size                 4    0
l2_rebuild_asize                4    0
l2_rebuild_bufs                 4    0
l2_rebuild_bufs_precached       4    0
l2_rebuild_log_blks             4    0
memory_throttle_count           4    82
memory_direct_count             4    1041333318
memory_indirect_count           4    7797896
memory_all_bytes                4    3366395904
memory_free_bytes               4    2489212928
memory_available_bytes          3    2068413440
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    19705
arc_meta_used                   4    23942608
arc_dnode_limit                 4    252479692
async_upgrade_sync              4    136280
predictive_prefetch             4    333381
demand_hit_predictive_prefetch  4    40833
demand_iohit_predictive_prefetch 4    138814
prescient_prefetch              4    3
demand_hit_prescient_prefetch   4    3
demand_iohit_prescient_prefetch 4    0
arc_need_free                   4    0
arc_sys_free                    4    420799488
arc_raw_size                    4    0
cached_only_in_progress         4    0
abd_chunk_waste_size            4    11842560
prakashsurya commented 1 year ago

Here's a graph of CPU for that same timeframe as above, for the memory:

[image: CPU usage graph for the same timeframe]

I think the spike in "system" around 00:47:45 is when I started the drop_caches experiment from my prior comment above.

prakashsurya commented 1 year ago

And just in case it's useful to verify: it looks like both of the system's 2 CPUs were busy during that time..

[image: per-CPU usage graph showing both CPUs busy]

prakashsurya commented 1 year ago

Also, FWIW, as I logged back in hours later and started poking around, I ran:

$ sudo BPFTRACE_BTF=/sys/kernel/btf/vmlinux bpftrace -e 'kretprobe:arc_is_overflowing { printf("returned: %u\n", retval); } '
returned: 2
returned: 2
returned: 2
...

so I think arc_is_overflowing() was returning ARC_OVF_SEVERE at that time..
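
For reference, that return value maps onto the ARC overflow levels; here's a tiny standalone sketch of the mapping (the enum is reproduced from my reading of the OpenZFS sources, so double-check it against the exact tree you're running):

#include <stdio.h>

/* Reproduced from memory of arc_impl.h; verify against your tree. */
typedef enum arc_ovf_level {
	ARC_OVF_NONE,		/* 0: ARC is within its target size */
	ARC_OVF_SOME,		/* 1: ARC is slightly overflowed */
	ARC_OVF_SEVERE		/* 2: ARC is severely overflowed */
} arc_ovf_level_t;

int
main(void)
{
	const char *names[] = {
		"ARC_OVF_NONE", "ARC_OVF_SOME", "ARC_OVF_SEVERE"
	};

	/* The bpftrace probe above was returning 2, i.e. ARC_OVF_SEVERE. */
	for (int rv = ARC_OVF_NONE; rv <= ARC_OVF_SEVERE; rv++)
		printf("returned: %d -> %s\n", rv, names[rv]);
	return (0);
}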

Right now, now that the ARC has evicted, it's returning 0:

$ sudo BPFTRACE_BTF=/sys/kernel/btf/vmlinux bpftrace -e 'kretprobe:arc_is_overflowing { printf("returned: %u\n", retval); } '
returned: 0
returned: 0
returned: 0

So, I think at the point where I was able to log back in and poke around, the system hadn't yet evicted, but while I was trying to inspect the system, eviction was able to occur. I don't know if, why, or how my shell activity would have affected ARC eviction, but I wanted to mention it.

amotin commented 1 year ago

@prakashsurya Could you please test this patch: https://github.com/openzfs/zfs/pull/14692 ?

prakashsurya commented 1 year ago

Could you please test this patch: https://github.com/openzfs/zfs/pull/14692 ?

Sure, let me start working on this.. I'll be out part of the day today, so I might not be able to reproduce until tomorrow, but I'll work on this.

Not sure how useful this is (I think it just gives us the same information we already suspected), but here are some graphs of ARC stat info over that same time period, from the same system that I ran the drop_caches experiment on (same system as the graphs above):

[image: graphs of ARC stat info over the same time period]

prakashsurya commented 1 year ago

[image: graph of ARC c, size, and data over time]

We can see, around 21:44, c begins to collapse, but size does not. Then by 21:45, c looks to be about as low as it'll go, and size and data begin to grow. Then at 21:48, size and data hit some limit, and remain high, until eviction can finally do something about it.

Still not clear why it behaved this way.

prakashsurya commented 1 year ago

Correlating those timestamps to the eviction counters is also interesting..

[image: graph of ARC eviction counters over the same timeframe]

prakashsurya commented 1 year ago

Why is evict_skip increasing so much? Presumably, the kernel is asking us to evict and we're skipping eviction for whatever reason; then eventually, the kernel just stops asking us to evict, since we don't actually evict anything the many times it asks us to do so. At that point, we just "sit there", with size well above c, but never actually evict back down.

So, maybe we need to understand why evict_skip is getting triggered? It's been so long since I've been in this code that I'm kind of guessing at this point, but the fact that evict_skip is increasing feels relevant, and I think it correlates with the timestamps and values of size and c.

grwilson commented 1 year ago

@grwilson it looks like you must also be increasing the default zfs_arc_max?

We also run with a larger ARC by default: we use 3/4 of all memory, vs upstream, which uses 5/8.

prakashsurya commented 1 year ago
	/* prefetch buffers have a minimum lifespan */
	if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
	    ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
	    MSEC_TO_TICK(min_lifetime)) {
		ARCSTAT_BUMP(arcstat_evict_skip);
		return (bytes_evicted);
	}

Hm.. could we be filling the ARC with prefetched data, and then not allowing that to be evicted due to this logic?

amotin commented 1 year ago

Hm.. could we be filling the ARC with prefetched data, and then not allowing that to be evicted due to this logic?

Theoretically we could, but for that we would have to be very bad in our prefetch decisions, since the first demand read clears the prefetch flag, allowing the eviction, and even if not, the buffers should be evicted at most 1 or 6 seconds later. Considering that the eviction thread should try again and again, I am not sure what value of evict_skip I would consider suspicious, but I can't say the ~4M I see makes me worry too much.

amotin commented 1 year ago

Could you please test this patch: #14692 ?

Sure, let me start working on this.. I'll be out part of the day today, so I might not be able to reproduce until tomorrow, but I'll work on this.

@prakashsurya Please let me know how it goes and if I should continue searching.

prakashsurya commented 1 year ago

@amotin sorry for the delay, I kicked off 10 builds with your change applied.. let's see how those go..

With that said, I'm skeptical a signed-ness issue is at play here, since I'm seeing arcstat_evict_not_enough being triggered.. perhaps I'm misunderstanding, though.
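
For anyone not familiar with the class of bug being discussed, here's a generic, standalone illustration (not the actual ARC code, and not necessarily what #14692 changes) of how mixing signed and unsigned types in eviction accounting can misbehave:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t target = 1024;		/* bytes we still want evicted */
	uint64_t evicted = 4096;	/* bytes evicted this pass */

	/* Unsigned math wraps instead of going negative. */
	uint64_t remaining_unsigned = target - evicted;
	int64_t remaining_signed = (int64_t)target - (int64_t)evicted;

	printf("unsigned remaining: %llu (looks like a huge amount left)\n",
	    (unsigned long long)remaining_unsigned);
	printf("signed remaining:   %lld (correctly <= 0, we can stop)\n",
	    (long long)remaining_signed);

	/* Signed vs. unsigned comparisons are promoted to unsigned. */
	int64_t progress = -1;
	if (progress < (uint64_t)1) {
		/* never reached: -1 is promoted to UINT64_MAX here */
		printf("progress is below the threshold\n");
	} else {
		printf("comparison misfires: -1 compares as >= 1\n");
	}
	return (0);
}

Either failure mode can make the eviction path think there is far too much, or nothing, left to do, which would be consistent with size staying stuck above c, assuming a bug of that shape exists.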

prakashsurya commented 1 year ago

Do we have any theories as to why evict_skip increases in the graphs above? Is that normal?

amotin commented 1 year ago

@prakashsurya With the MFU data state being persistently beaten into the ground, I am absolutely not surprised by evict_skip and evict_not_enough growing at the same time. I can't quantify their exact values without knowing how many times eviction was called in general, but so far it all fits within my understanding of the problem that should be fixed by the signed-ness PR.

prakashsurya commented 1 year ago

OK.. if it makes sense to you, then that sounds good to me.

I was able to get #14692 applied and ran through about 30 builds; none of them failed. I'd say, let's work to get that integrated.

Then I can work on getting our product builds running those bits and see if we run into any problems.. it looks good so far.

amotin commented 1 year ago

Closing this now that https://github.com/openzfs/zfs/pull/14692 has been merged.

prakashsurya commented 1 year ago

Just to close the loop on our end, we pulled #14692 into our buildserver images, and haven't seen the OOMs yet with that applied.. @amotin thanks for your help!