It seems like the ARC was trying to shrink to the minimum, but couldn't. I have no clue why that could be. It could help if you could provide full kstat and/or arc_summary output. I wonder if the data can somehow be non-evictable, or whether it is really some math issue.
I lost access to the system (build eventually times out, and the system is cleaned up), but here's arcstats from when I was able to log in and inspect..
$ cat /proc/spl/kstat/zfs/arcstats
9 1 0x01 147 39984 4026867198 8436358124041
name type data
hits 4 10390029
iohits 4 178833
misses 4 365409
demand_data_hits 4 5136518
demand_data_iohits 4 121054
demand_data_misses 4 91780
demand_metadata_hits 4 5162519
demand_metadata_iohits 4 3635
demand_metadata_misses 4 121016
prefetch_data_hits 4 9243
prefetch_data_iohits 4 943
prefetch_data_misses 4 146114
prefetch_metadata_hits 4 81749
prefetch_metadata_iohits 4 53201
prefetch_metadata_misses 4 6499
mru_hits 4 2336106
mru_ghost_hits 4 31069
mfu_hits 4 8053923
mfu_ghost_hits 4 84648
uncached_hits 4 0
deleted 4 469736
mutex_miss 4 1058
access_skip 4 1
evict_skip 4 4147591
evict_not_enough 4 386821
evict_l2_cached 4 0
evict_l2_eligible 4 75218220544
evict_l2_eligible_mfu 4 16507243520
evict_l2_eligible_mru 4 58710977024
evict_l2_ineligible 4 476856320
evict_l2_skip 4 0
hash_elements 4 149534
hash_elements_max 4 149536
hash_collisions 4 65230
hash_chains 4 17601
hash_chain_max 4 4
meta 4 575429547
pd 4 1791783
pm 4 2153388558
c 4 104282624
c_min 4 104282624
c_max 4 2502782976
size 4 2914759656
compressed_size 4 2752201216
uncompressed_size 4 4811579904
overhead_size 4 13123072
hdr_size 4 40684048
data_size 4 2713842688
metadata_size 4 51481600
dbuf_size 4 649936
dnode_size 4 1621000
bonus_size 4 444160
anon_size 4 2519040
anon_data 4 2502656
anon_metadata 4 16384
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 2750262784
mru_data 4 2708664832
mru_metadata 4 41597952
mru_evictable_data 4 2708664832
mru_evictable_metadata 4 37024768
mru_ghost_size 4 92644352
mru_ghost_data 4 14796800
mru_ghost_metadata 4 77847552
mru_ghost_evictable_data 4 14796800
mru_ghost_evictable_metadata 4 77847552
mfu_size 4 12542464
mfu_data 4 2675200
mfu_metadata 4 9867264
mfu_evictable_data 4 0
mfu_evictable_metadata 4 4100608
mfu_ghost_size 4 293185024
mfu_ghost_data 4 261363200
mfu_ghost_metadata 4 31821824
mfu_ghost_evictable_data 4 261363200
mfu_ghost_evictable_metadata 4 31821824
uncached_size 4 0
uncached_data 4 0
uncached_metadata 4 0
uncached_evictable_data 4 0
uncached_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_prefetch_asize 4 0
l2_mru_asize 4 0
l2_mfu_asize 4 0
l2_bufc_data_asize 4 0
l2_bufc_metadata_asize 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_log_blk_writes 4 0
l2_log_blk_avg_asize 4 0
l2_log_blk_asize 4 0
l2_log_blk_count 4 0
l2_data_to_meta_ratio 4 0
l2_rebuild_success 4 0
l2_rebuild_unsupported 4 0
l2_rebuild_io_errors 4 0
l2_rebuild_dh_errors 4 0
l2_rebuild_cksum_lb_errors 4 0
l2_rebuild_lowmem 4 0
l2_rebuild_size 4 0
l2_rebuild_asize 4 0
l2_rebuild_bufs 4 0
l2_rebuild_bufs_precached 4 0
l2_rebuild_log_blks 4 0
memory_throttle_count 4 477
memory_direct_count 4 14008
memory_indirect_count 4 184797
memory_all_bytes 4 3337043968
memory_free_bytes 4 124661760
memory_available_bytes 3 -292468736
arc_no_grow 4 1
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 23893
arc_meta_used 4 94880744
arc_dnode_limit 4 250278297
async_upgrade_sync 4 121467
predictive_prefetch 4 297746
demand_hit_predictive_prefetch 4 34197
demand_iohit_predictive_prefetch 4 123807
prescient_prefetch 4 3
demand_hit_prescient_prefetch 4 3
demand_iohit_prescient_prefetch 4 0
arc_need_free 4 83321856
arc_sys_free 4 417130496
arc_raw_size 4 0
cached_only_in_progress 4 0
abd_chunk_waste_size 4 106036224
According to the small "pd" value, the size of MRU data should be very small, but according to the stats that is where most of the ARC size resides. I'll think more about it, but if you can reproduce the problem, it could help with understanding to dump everything possible around the arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e) call in case some math does not work right there.
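For anyone attempting that, here is a rough sketch of the kind of instrumentation being asked for (illustrative only, not a real patch: the local variable names around the arc_evict_impl() calls in arc_evict() may differ, and "e" here is a placeholder for whatever the computed eviction target is called):

	/*
	 * Dump the evictable sizes that feed the eviction math right before
	 * the MFU metadata call; zfs_dbgmsg() output shows up in
	 * /proc/spl/kstat/zfs/dbgmsg (assuming dbgmsg logging is enabled).
	 */
	zfs_dbgmsg("arc_evict: e=%lld mfu_meta=%llu mfu_data=%llu "
	    "mru_meta=%llu mru_data=%llu",
	    (longlong_t)e,
	    (u_longlong_t)zfs_refcount_count(
	    &arc_mfu->arcs_esize[ARC_BUFC_METADATA]),
	    (u_longlong_t)zfs_refcount_count(
	    &arc_mfu->arcs_esize[ARC_BUFC_DATA]),
	    (u_longlong_t)zfs_refcount_count(
	    &arc_mru->arcs_esize[ARC_BUFC_METADATA]),
	    (u_longlong_t)zfs_refcount_count(
	    &arc_mru->arcs_esize[ARC_BUFC_DATA]));
	total_evicted += arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e);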
but if you can reproduce the problem
I can.. it might take me half a day, but I should be able to reproduce it..
it could help with understanding to dump everything possible around the arc_evict_impl(arc_mfu, ARC_BUFC_METADATA, e) call
any chance you could provide me a patch, or open a PR, with the code you'd like me to test with? I can get that done tomorrow (PST).
I appreciate the help!
@amotin we run our systems with zfs_arc_meta_strategy = 0, which was removed. Is the equivalent to setting zfs_arc_meta_balance = 1? I wonder if that would help us avoid the OOM.
The allocation which failed is also very surprising. It's a tiny zero order GFP_KERNEL allocation which should have been able to get a page via direct reclaim from the ARC. The memory_direct_count = 14008 kstat indicates we were running direct reclaim frequently, and arc_c was driven down to arc_c_min, so we sure are trying.
gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
I'd say we may not be pruning the Linux dentry and inode caches aggressively enough via arc_prune_async(), allowing pinned metadata buffers to build up. We did consolidate a few calls to arc_prune_async(), but the kstats suggest that isn't the case given the small number of dnodes and the small metadata buffer size. As @amotin mentioned, I think we're going to need to add some instrumentation to see why nothing is being evicted.
One experiment you might try is setting zfs_arc_evict_batch_limit = 0 and running echo 3 >/proc/sys/vm/drop_caches. This should drop the ARC and might at least help rule out some kind of fundamental accounting problem.
@grwilson it looks like you must also be increasing the default zfs_arc_max?
One experiment you might try is setting
i.e. reproduce a system to this point, and then run these commands, and see what happens to the ARC? I can work on that..
it looks like you must also be increasing the default zfs_arc_max?
$ cat /sys/module/zfs/parameters/zfs_arc_max
0
I'd say we may not be pruning the Linux dentry and inode caches aggressively enough via arc_prune_async() and allowing pinned metadata buffers to build up.
If this was the case, would the stats show things accounted for in mru_data like I see?
mru_data 4 2708664832
mru_metadata 4 41597952
reproduce a system to this point, and then run these commands, and see what happens to the ARC?
Right. What we'd expect is the Linux virtual memory system to apply pressure to the ARC until everything has been reclaimed down to zfs_arc_min. Almost exactly like what should happen when direct reclaim is invoked on a node with low memory.
If this was the case, would the stats show things accounted for in mru_data like I see?
That's the thing, I'd expect it to be accounted for as metadata, not data. Which is what makes me think that's not what's going on, unless we're somehow miscategorizing these buffers.
OK.. I appreciate the input and help so far. I'm trying to reproduce the issue now, so I can try the drop_caches experiment.. haven't hit the OOM yet, but I'll keep kicking off builds until I do..
Additionally, if there's any patches you'd like me to deploy and test with, feel free to send them to me and I can get them into our buildserver image that has the problem. It's not clear to me what information would help root cause so far.
I haven't been able to reproduce the issue so far today, after running like 30+ builds 🤷 Before, it would hit relatively frequently, maybe every 3-4 builds...
It's always something! I also noticed I made a terrible typo above: you want to set zfs_arc_shrinker_limit = 0, not zfs_arc_evict_batch_limit = 0. If you're ever able to reliably reproduce this, you could also try increasing this value so we try to release more memory from the ARC on every direct reclaim invocation. Here's the relevant comment from arc_shrinker_count(). Of course, none of this explains why the ARC wouldn't be able to evict the buffers.
/*
* This code is reached in the "direct reclaim" case, where the
* kernel (outside ZFS) is trying to allocate a page, and the system
* is low on memory.
*
* The kernel's shrinker code doesn't understand how many pages the
* ARC's callback actually frees, so it may ask the ARC to shrink a
* lot for one page allocation. This is problematic because it may
* take a long time, thus delaying the page allocation, and because
* it may force the ARC to unnecessarily shrink very small.
*
* Therefore, we limit the amount of data that we say is evictable,
* which limits the amount that the shrinker will ask us to evict for
* one page allocation attempt.
*
* In practice, we may be asked to shrink 4x the limit to satisfy one
* page allocation, before the kernel's shrinker code gives up on us.
* When that happens, we rely on the kernel code to find the pages
* that we freed before invoking the OOM killer. This happens in
* __alloc_pages_slowpath(), which retries and finds the pages we
* freed when it calls get_page_from_freelist().
*
* See also the comment above zfs_arc_shrinker_limit.
*/
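A simplified sketch of the clamping that comment describes (not the verbatim arc_shrinker_count() body; the signature and the arc_evictable_memory() helper are my assumptions):

static unsigned long
arc_shrinker_count_sketch(struct shrinker *shrink, struct shrink_control *sc)
{
	/* How much the ARC could give back, in pages. */
	int64_t pages = btop(arc_evictable_memory());	/* assumed helper */

	/*
	 * Cap what we report so a single direct-reclaim pass can't ask the
	 * ARC to shrink arbitrarily far. zfs_arc_shrinker_limit = 0 removes
	 * the cap entirely.
	 */
	if (zfs_arc_shrinker_limit != 0)
		pages = MIN(pages, zfs_arc_shrinker_limit);

	return (MAX(pages, 0));
}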
Hit again:
$ cat /proc/spl/kstat/zfs/arcstats
9 1 0x01 147 39984 3953761573 11885741844167
name type data
hits 4 9638365
iohits 4 189495
misses 4 340531
demand_data_hits 4 4496576
demand_data_iohits 4 135335
demand_data_misses 4 52141
demand_metadata_hits 4 5048597
demand_metadata_iohits 4 3581
demand_metadata_misses 4 123481
prefetch_data_hits 4 9586
prefetch_data_iohits 4 1383
prefetch_data_misses 4 158403
prefetch_metadata_hits 4 83606
prefetch_metadata_iohits 4 49196
prefetch_metadata_misses 4 6506
mru_hits 4 2393008
mru_ghost_hits 4 29075
mfu_hits 4 7245357
mfu_ghost_hits 4 37085
uncached_hits 4 0
deleted 4 517070
mutex_miss 4 829
access_skip 4 1
evict_skip 4 2189947
evict_not_enough 4 184782
evict_l2_cached 4 0
evict_l2_eligible 4 69967979520
evict_l2_eligible_mfu 4 10643459072
evict_l2_eligible_mru 4 59324520448
evict_l2_ineligible 4 254993408
evict_l2_skip 4 0
hash_elements 4 145916
hash_elements_max 4 145917
hash_collisions 4 66779
hash_chains 4 17019
hash_chain_max 4 4
meta 4 1012096539
pd 4 303887742
pm 4 2243957928
c 4 105199872
c_min 4 105199872
c_max 4 2524796928
size 4 2929218120
compressed_size 4 2768971776
uncompressed_size 4 4978176512
overhead_size 4 12706816
hdr_size 4 39701440
data_size 4 2730064896
metadata_size 4 51613696
dbuf_size 4 750288
dnode_size 4 1819000
bonus_size 4 508480
anon_size 4 1188352
anon_data 4 1171968
anon_metadata 4 16384
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 2768146432
mru_data 4 2726401024
mru_metadata 4 41745408
mru_evictable_data 4 2726401024
mru_evictable_metadata 4 36540416
mru_ghost_size 4 63939584
mru_ghost_data 4 23429632
mru_ghost_metadata 4 40509952
mru_ghost_evictable_data 4 23429632
mru_ghost_evictable_metadata 4 40509952
mfu_size 4 12343808
mfu_data 4 2491904
mfu_metadata 4 9851904
mfu_evictable_data 4 0
mfu_evictable_metadata 4 3789824
mfu_ghost_size 4 251945984
mfu_ghost_data 4 211022336
mfu_ghost_metadata 4 40923648
mfu_ghost_evictable_data 4 211022336
mfu_ghost_evictable_metadata 4 40923648
uncached_size 4 0
uncached_data 4 0
uncached_metadata 4 0
uncached_evictable_data 4 0
uncached_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_prefetch_asize 4 0
l2_mru_asize 4 0
l2_mfu_asize 4 0
l2_bufc_data_asize 4 0
l2_bufc_metadata_asize 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_log_blk_writes 4 0
l2_log_blk_avg_asize 4 0
l2_log_blk_asize 4 0
l2_log_blk_count 4 0
l2_data_to_meta_ratio 4 0
l2_rebuild_success 4 0
l2_rebuild_unsupported 4 0
l2_rebuild_io_errors 4 0
l2_rebuild_dh_errors 4 0
l2_rebuild_cksum_lb_errors 4 0
l2_rebuild_lowmem 4 0
l2_rebuild_size 4 0
l2_rebuild_asize 4 0
l2_rebuild_bufs 4 0
l2_rebuild_bufs_precached 4 0
l2_rebuild_log_blks 4 0
memory_throttle_count 4 57
memory_direct_count 4 3626
memory_indirect_count 4 115746
memory_all_bytes 4 3366395904
memory_free_bytes 4 142213120
memory_available_bytes 3 -278586368
arc_no_grow 4 1
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 19670
arc_meta_used 4 94392904
arc_dnode_limit 4 252479692
async_upgrade_sync 4 135710
predictive_prefetch 4 308677
demand_hit_predictive_prefetch 4 34640
demand_iohit_predictive_prefetch 4 137563
prescient_prefetch 4 3
demand_hit_prescient_prefetch 4 3
demand_iohit_prescient_prefetch 4 0
arc_need_free 4 68391424
arc_sys_free 4 420799488
arc_raw_size 4 0
cached_only_in_progress 4 0
abd_chunk_waste_size 4 104760320
[ 1257.892709] sudo invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 1257.892712] CPU: 0 PID: 16473 Comm: sudo Kdump: loaded Tainted: P OE 5.4.0-1097-dx2023032205-8e2e150ee-aws #105
[ 1257.892712] Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017
[ 1257.892713] Call Trace:
[ 1257.892719] dump_stack+0x6d/0x8b
[ 1257.892722] dump_header+0x4f/0x1eb
[ 1257.892724] oom_kill_process.cold+0xb/0x10
[ 1257.892727] out_of_memory+0x1cf/0x500
[ 1257.892730] __alloc_pages_slowpath+0xdde/0xeb0
[ 1257.892733] ? __switch_to+0x397/0x470
[ 1257.892735] __alloc_pages_nodemask+0x2d0/0x320
[ 1257.892738] alloc_pages_current+0x87/0xe0
[ 1257.892741] __page_cache_alloc+0x72/0x90
[ 1257.892742] pagecache_get_page+0xbf/0x300
[ 1257.892744] filemap_fault+0x83e/0xa50
[ 1257.892747] ? mem_cgroup_charge_statistics+0x51/0xe0
[ 1257.892749] ? xas_find+0x17f/0x1c0
[ 1257.892750] ? filemap_map_pages+0x24c/0x380
[ 1257.892754] __do_fault+0x3c/0x170
[ 1257.892755] do_fault+0x24b/0x640
[ 1257.892757] __handle_mm_fault+0x4c5/0x7a0
[ 1257.892759] handle_mm_fault+0xca/0x200
[ 1257.892762] do_user_addr_fault+0x1f9/0x450
[ 1257.892763] __do_page_fault+0x58/0x90
[ 1257.892765] do_page_fault+0x2c/0xe0
[ 1257.892767] do_async_page_fault+0x39/0x70
[ 1257.892769] async_page_fault+0x34/0x40
[ 1257.892770] RIP: 0033:0x7f8665da0379
[ 1257.892774] Code: Bad RIP value.
[ 1257.892775] RSP: 002b:00007ffd494b3a70 EFLAGS: 00010202
[ 1257.892776] RAX: 0000000000000000 RBX: 0000562bd28c98a0 RCX: 00007f8665bea166
[ 1257.892777] RDX: 0000000000000000 RSI: 00007ffd494b3920 RDI: 0000000000000002
[ 1257.892777] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
[ 1257.892778] R10: 0000000000000008 R11: 0000000000000246 R12: 0000562bd28d99c0
[ 1257.892779] R13: 00007ffd494b3bc0 R14: 0000000000000005 R15: 00007ffd494b3f80
[ 1257.892780] Mem-Info:
[ 1257.892783] active_anon:7 inactive_anon:19 isolated_anon:7
active_file:2 inactive_file:17 isolated_file:0
unevictable:3 dirty:9 writeback:1 unstable:0
slab_reclaimable:4583 slab_unreclaimable:58236
mapped:0 shmem:1 pagetables:571 bounce:0
free:18077 free_pcp:67 free_cma:0
[ 1257.892786] Node 0 active_anon:28kB inactive_anon:76kB active_file:8kB inactive_file:68kB unevictable:12kB isolated(anon):28kB isolated(file):0kB mapped:0kB dirty:36kB writeback:4kB shmem:4kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1257.892787] Node 0 DMA free:12940kB min:316kB low:392kB high:468kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1257.892790] lowmem_reserve[]: 0 2680 3165 3165 3165
[ 1257.892791] Node 0 DMA32 free:53128kB min:56956kB low:71192kB high:85428kB active_anon:48kB inactive_anon:48kB active_file:0kB inactive_file:44kB unevictable:0kB writepending:4kB present:3129256kB managed:2775564kB mlocked:0kB kernel_stack:372kB pagetables:936kB bounce:0kB free_pcp:44kB local_pcp:36kB free_cma:0kB
[ 1257.892794] lowmem_reserve[]: 0 0 484 484 484
[ 1257.892796] Node 0 Normal free:6240kB min:10304kB low:12880kB high:15456kB active_anon:0kB inactive_anon:36kB active_file:232kB inactive_file:0kB unevictable:12kB writepending:36kB present:829440kB managed:496572kB mlocked:12kB kernel_stack:2188kB pagetables:1348kB bounce:0kB free_pcp:224kB local_pcp:224kB free_cma:0kB
[ 1257.892799] lowmem_reserve[]: 0 0 0 0 0
[ 1257.892800] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 2*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 12940kB
[ 1257.892807] Node 0 DMA32: 243*4kB (UME) 6213*8kB (UE) 70*16kB (U) 64*32kB (U) 1*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 53908kB
[ 1257.892813] Node 0 Normal: 88*4kB (UME) 547*8kB (UE) 35*16kB (UE) 39*32kB (U) 1*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6600kB
[ 1257.892819] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1257.892820] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1257.892820] 23 total pagecache pages
[ 1257.892822] 1 pages in swap cache
[ 1257.892823] Swap cache stats: add 184846, delete 184845, find 8573/20459
[ 1257.892823] Free swap = 8333820kB
[ 1257.892824] Total swap = 8388604kB
[ 1257.892824] 993672 pages RAM
[ 1257.892825] 0 pages HighMem/MovableOnly
[ 1257.892825] 171798 pages reserved
[ 1257.892826] 0 pages cma reserved
[ 1257.892826] 0 pages hwpoisoned
[ 1257.892827] Unreclaimable slab info:
[ 1257.892827] Name Used Total
[ 1257.892841] nf_conntrack 48KB 63KB
[ 1257.892913] zfs_znode_hold_cache 7KB 7KB
[ 1257.892923] zfs_znode_cache 330KB 1524KB
[ 1257.892925] zil_zcw_cache 7KB 7KB
[ 1257.892926] zil_lwb_cache 31KB 31KB
[ 1257.892991] dmu_buf_impl_t 949KB 5841KB
[ 1257.892994] arc_buf_t 146KB 252KB
[ 1257.892995] arc_buf_hdr_t_full 38210KB 38210KB
[ 1257.893026] dnode_t 2382KB 6187KB
[ 1257.893034] sa_cache 148KB 399KB
[ 1257.893035] abd_t 13637KB 13637KB
[ 1257.893036] lz4_cache 64KB 64KB
[ 1257.893043] zio_buf_comb_16384 2256KB 3328KB
[ 1257.893044] zio_buf_comb_14336 160KB 384KB
[ 1257.893045] zio_buf_comb_12288 144KB 216KB
[ 1257.893046] zio_buf_comb_10240 144KB 264KB
[ 1257.893048] zio_buf_comb_8192 128KB 288KB
[ 1257.893049] zio_buf_comb_7168 160KB 320KB
[ 1257.893050] zio_buf_comb_6144 224KB 384KB
[ 1257.893052] zio_buf_comb_5120 224KB 384KB
[ 1257.893053] zio_buf_comb_4096 156KB 288KB
[ 1257.893054] zio_buf_comb_3584 283KB 409KB
[ 1257.893055] zio_buf_comb_3072 360KB 480KB
[ 1257.893056] zio_buf_comb_2560 480KB 600KB
[ 1257.893057] zio_buf_comb_2048 160KB 288KB
[ 1257.893059] zio_buf_comb_1536 189KB 315KB
[ 1257.893074] zio_buf_comb_1024 15229KB 15664KB
[ 1257.893075] zio_buf_comb_512 30248KB 30248KB
[ 1257.893076] zio_link_cache 67KB 83KB
[ 1257.893077] zio_cache 197KB 247KB
[ 1257.893149] zfs_btree_leaf_cache 8264KB 26464KB
[ 1257.893150] mqueue_inode_cache 15KB 15KB
[ 1257.893153] PING 64KB 64KB
[ 1257.893154] RAW 32KB 32KB
[ 1257.893155] tw_sock_TCP 31KB 31KB
[ 1257.893156] request_sock_TCP 7KB 7KB
[ 1257.893157] TCP 61KB 61KB
[ 1257.893158] hugetlbfs_inode_cache 30KB 30KB
[ 1257.893159] eventpoll_pwq 7KB 7KB
[ 1257.893160] request_queue 61KB 61KB
[ 1257.893160] blkdev_ioc 15KB 15KB
[ 1257.893161] biovec-max 352KB 352KB
[ 1257.893162] biovec-128 64KB 64KB
[ 1257.893163] biovec-64 48KB 48KB
[ 1257.893164] khugepaged_mm_slot 7KB 7KB
[ 1257.893165] user_namespace 7KB 7KB
[ 1257.893166] uid_cache 16KB 16KB
[ 1257.893166] skbuff_ext_cache 7KB 7KB
[ 1257.893167] skbuff_fclone_cache 112KB 112KB
[ 1257.893168] skbuff_head_cache 20KB 36KB
[ 1257.893169] file_lock_cache 11KB 11KB
[ 1257.893170] fsnotify_mark_connector 8KB 8KB
[ 1257.893171] net_namespace 57KB 57KB
[ 1257.893172] task_delay_info 19KB 19KB
[ 1257.893172] taskstats 15KB 15KB
[ 1257.893173] proc_dir_entry 114KB 114KB
[ 1257.893174] pde_opener 7KB 7KB
[ 1257.893176] shmem_inode_cache 402KB 402KB
[ 1257.893214] kernfs_node_cache 7809KB 8068KB
[ 1257.893216] mnt_cache 201KB 221KB
[ 1257.893217] filp 87KB 124KB
[ 1257.893218] names_cache 160KB 160KB
[ 1257.893219] lsm_file_cache 59KB 59KB
[ 1257.893220] uts_namespace 15KB 15KB
[ 1257.893221] vm_area_struct 214KB 223KB
[ 1257.893221] mm_struct 63KB 63KB
[ 1257.893222] files_cache 47KB 47KB
[ 1257.893223] signal_cache 171KB 236KB
[ 1257.893224] sighand_cache 332KB 371KB
[ 1257.893226] task_struct 953KB 1067KB
[ 1257.893226] cred_jar 43KB 43KB
[ 1257.893228] anon_vma_chain 198KB 244KB
[ 1257.893229] anon_vma 178KB 178KB
[ 1257.893229] pid 36KB 36KB
[ 1257.893230] Acpi-Operand 74KB 74KB
[ 1257.893230] Acpi-Parse 15KB 15KB
[ 1257.893231] Acpi-State 27KB 27KB
[ 1257.893231] numa_policy 3KB 3KB
[ 1257.893232] trace_event_file 118KB 118KB
[ 1257.893232] ftrace_event_field 183KB 183KB
[ 1257.893233] pool_workqueue 108KB 108KB
[ 1257.893234] task_group 82KB 82KB
[ 1257.893236] vmap_area 126KB 188KB
[ 1257.893244] kmalloc-8k 1432KB 2880KB
[ 1257.893246] kmalloc-4k 4640KB 4736KB
[ 1257.893247] kmalloc-2k 3064KB 3064KB
[ 1257.893304] kmalloc-1k 6077KB 13356KB
[ 1257.893403] kmalloc-512 2262KB 9736KB
[ 1257.893405] kmalloc-256 292KB 308KB
[ 1257.893405] kmalloc-192 492KB 492KB
[ 1257.893407] kmalloc-128 231KB 248KB
[ 1257.893439] kmalloc-96 428KB 1724KB
[ 1257.893553] kmalloc-64 977KB 5440KB
[ 1257.893608] kmalloc-32 2481KB 3784KB
[ 1257.893609] kmalloc-16 120KB 120KB
[ 1257.893609] kmalloc-8 48KB 48KB
[ 1257.893610] kmem_cache_node 80KB 80KB
[ 1257.893613] kmem_cache 497KB 551KB
[ 1257.893613] Tasks state (memory values in pages):
[ 1257.893614] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 1257.893618] [ 379] 0 379 11489 0 110592 341 -250 systemd-journal
[ 1257.893620] [ 416] 0 416 4951 0 61440 341 -1000 systemd-udevd
[ 1257.893621] [ 468] 0 468 2839 0 49152 57 -1000 auditd
[ 1257.893622] [ 635] 102 635 6847 0 73728 241 0 systemd-network
[ 1257.893624] [ 663] 104 663 2106 1 49152 210 -900 dbus-daemon
[ 1257.893625] [ 673] 0 673 965 0 49152 32 0 rngd
[ 1257.893626] [ 674] 106 674 56027 0 81920 450 0 rsyslogd
[ 1257.893627] [ 676] 0 676 4407 0 73728 313 0 systemd-logind
[ 1257.893628] [ 679] 0 679 59475 0 81920 278 0 zed
[ 1257.893629] [ 680] 0 680 356094 0 299008 3258 -999 containerd
[ 1257.893630] [ 728] 0 728 1393 0 53248 67 0 cron
[ 1257.893632] [ 736] 101 736 18599 0 61440 154 0 ntpd
[ 1257.893633] [ 753] 0 753 3045 2 65536 228 -1000 sshd
[ 1257.893634] [ 941] 105 941 1694 0 49152 52 0 uuidd
[ 1257.893635] [ 2658] 0 2658 1723 0 45056 61 0 delphix-telegra
[ 1257.893636] [ 2659] 0 2659 324637 0 327680 4868 -500 dockerd
[ 1257.893637] [ 2790] 0 2790 622 0 45056 20 0 none
[ 1257.893638] [ 2944] 65433 2944 4731 0 81920 343 0 systemd
[ 1257.893640] [ 16468] 65433 16468 653 0 45056 25 0 sh
[ 1257.893641] [ 16469] 65433 16469 653 0 45056 31 0 sh
[ 1257.893642] [ 16470] 65433 16470 1724 1 49152 67 0 bash
[ 1257.893643] [ 16473] 0 16473 2116 0 53248 121 0 sudo
[ 1257.893644] [ 17852] 0 17852 1724 1 57344 80 0 run-live-build.
[ 1257.893645] [ 18003] 0 18003 960 1 49152 76 0 lb_build
[ 1257.893646] [ 18018] 0 18018 1724 1 49152 66 0 build
[ 1257.893647] [ 18019] 0 18019 960 1 40960 77 0 lb_build
[ 1257.893648] [ 18020] 0 18020 1372 0 45056 27 0 tee
[ 1257.893650] [ 76532] 0 76532 686 0 40960 77 0 lb_binary
[ 1257.893651] [ 76570] 0 76570 686 1 40960 78 0 lb_binary_chroo
[ 1257.893652] [ 76623] 0 76623 3726 0 69632 519 0 cp
[ 1257.893653] [ 77171] 65433 77171 1369 0 49152 21 0 sleep
[ 1257.893654] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-65433.slice/session-5.scope,task=cp,pid=76623,uid=0
[ 1257.893660] Out of memory: Killed process 76623 (cp) total-vm:14904kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68kB oom_score_adj:0
Run experiment:
$ sudo su - root
root@ip-10-110-226-245:~# echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
root@ip-10-110-226-245:~# echo 3 >/proc/sys/vm/drop_caches
While waiting for drop_caches to return:
$ arcstat 5
time read ddread ddh% dmread dmh% pread ph% size c avail
00:48:20 0 0 0 0 0 0 0 2.7G 100M -269M
00:48:25 73 25 100 47 100 0 0 2.7G 100M -266M
00:48:30 83 27 100 56 100 0 0 2.7G 100M -270M
00:48:35 73 26 100 46 100 0 0 2.7G 100M -266M
00:48:40 83 27 100 56 100 0 0 2.7G 100M -269M
00:48:45 73 26 100 46 100 0 0 2.7G 100M -266M
00:48:50 117 27 100 90 100 0 0 2.7G 100M -266M
00:48:55 78 26 92 52 100 0 0 2.7G 100M -262M
00:49:00 85 26 96 58 100 0 0 2.7G 100M -264M
00:49:05 81 26 100 55 100 0 0 2.7G 100M -263M
00:49:10 83 26 100 56 100 0 0 2.7G 100M -265M
00:49:15 73 26 100 46 100 0 0 2.7G 100M -262M
00:49:20 84 27 100 57 100 0 0 2.7G 100M -265M
nothing seems to be happening w.r.t. the ARC?
I should have the system available for maybe 10 more hours (before it times out), if there's anything I could collect from it to help..?
we run our systems with zfs_arc_meta_strategy = 0, which was removed. Is the equivalent to setting zfs_arc_meta_balance = 1? I wonder if that would help us avoid the OOM.
@grwilson The code behind zfs_arc_meta_strategy was removed completely; there is nothing left of it. I am not sure what your motivation was behind setting that tunable. Setting zfs_arc_meta_balance = 1 would cause metadata to be evicted first if there are any data ghost hits at all, which I really doubt is your goal. zfs_arc_meta_balance should not affect OOMs, only what gets evicted from the ARC under pressure.
If it's an EC2 VM, doesn't Amazon let you hibernate it? Then you could snapshot the volume and bring the system back up broken like this, very conveniently and on demand, as long as you kept the snapshot.
after 6 hours, it's finally able to evict data out of the ARC:
$ arcstat 1
time read ddread ddh% dmread dmh% pread ph% size c avail
04:15:24 0 0 0 0 0 0 0 461M 506M 1.9G
04:15:25 153 0 0 153 100 0 0 460M 506M 1.9G
04:15:26 0 0 0 0 0 0 0 460M 506M 1.9G
I have some grafana stuff enabled, so I can see clearly that it took 6 hours for this to occur:
Hm.. not sure what happened, but I stepped away, and now that I'm able to start poking around again, it looks like the ARC has evicted.. I don't know why it couldn't (wouldn't?) do that for ~6 hours.. Hm..
Also, here's arcstats at this point now.. after it's finally been able to evict..
$ cat /proc/spl/kstat/zfs/arcstats
9 1 0x01 147 39984 3953761573 25074486980481
name type data
hits 4 11624683
iohits 4 191778
misses 4 430403
demand_data_hits 4 5138324
demand_data_iohits 4 135905
demand_data_misses 4 133482
demand_metadata_hits 4 6378098
demand_metadata_iohits 4 3624
demand_metadata_misses 4 124047
prefetch_data_hits 4 12446
prefetch_data_iohits 4 2502
prefetch_data_misses 4 166206
prefetch_metadata_hits 4 95815
prefetch_metadata_iohits 4 49747
prefetch_metadata_misses 4 6668
mru_hits 4 2502908
mru_ghost_hits 4 29342
mfu_hits 4 9121775
mfu_ghost_hits 4 118668
uncached_hits 4 0
deleted 4 660487
mutex_miss 4 53091
access_skip 4 1
evict_skip 4 116543957
evict_not_enough 4 18118732
evict_l2_cached 4 0
evict_l2_eligible 4 85043187712
evict_l2_eligible_mfu 4 20706493952
evict_l2_eligible_mru 4 64336693760
evict_l2_ineligible 4 545715200
evict_l2_skip 4 0
hash_elements 4 10717
hash_elements_max 4 148339
hash_collisions 4 72651
hash_chains 4 121
hash_chain_max 4 5
meta 4 378628870
pd 4 17354729
pm 4 2243815708
c 4 536762624
c_min 4 105199872
c_max 4 2524796928
size 4 483649488
compressed_size 4 436979200
uncompressed_size 4 756879872
overhead_size 4 27902976
hdr_size 4 2934944
data_size 4 447864320
metadata_size 4 17017856
dbuf_size 4 1032528
dnode_size 4 2292000
bonus_size 4 665280
anon_size 4 2321920
anon_data 4 2305536
anon_metadata 4 16384
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 146936320
mru_data 4 140604928
mru_metadata 4 6331392
mru_evictable_data 4 130033664
mru_evictable_metadata 4 491520
mru_ghost_size 4 57907712
mru_ghost_data 4 18083840
mru_ghost_metadata 4 39823872
mru_ghost_evictable_data 4 18083840
mru_ghost_evictable_metadata 4 39823872
mfu_size 4 315623936
mfu_data 4 304953856
mfu_metadata 4 10670080
mfu_evictable_data 4 299189248
mfu_evictable_metadata 4 991232
mfu_ghost_size 4 41371648
mfu_ghost_data 4 4413440
mfu_ghost_metadata 4 36958208
mfu_ghost_evictable_data 4 4413440
mfu_ghost_evictable_metadata 4 36958208
uncached_size 4 0
uncached_data 4 0
uncached_metadata 4 0
uncached_evictable_data 4 0
uncached_evictable_metadata 4 0
l2_hits 4 0
l2_misses 4 0
l2_prefetch_asize 4 0
l2_mru_asize 4 0
l2_mfu_asize 4 0
l2_bufc_data_asize 4 0
l2_bufc_metadata_asize 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_log_blk_writes 4 0
l2_log_blk_avg_asize 4 0
l2_log_blk_asize 4 0
l2_log_blk_count 4 0
l2_data_to_meta_ratio 4 0
l2_rebuild_success 4 0
l2_rebuild_unsupported 4 0
l2_rebuild_io_errors 4 0
l2_rebuild_dh_errors 4 0
l2_rebuild_cksum_lb_errors 4 0
l2_rebuild_lowmem 4 0
l2_rebuild_size 4 0
l2_rebuild_asize 4 0
l2_rebuild_bufs 4 0
l2_rebuild_bufs_precached 4 0
l2_rebuild_log_blks 4 0
memory_throttle_count 4 82
memory_direct_count 4 1041333318
memory_indirect_count 4 7797896
memory_all_bytes 4 3366395904
memory_free_bytes 4 2489212928
memory_available_bytes 3 2068413440
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 19705
arc_meta_used 4 23942608
arc_dnode_limit 4 252479692
async_upgrade_sync 4 136280
predictive_prefetch 4 333381
demand_hit_predictive_prefetch 4 40833
demand_iohit_predictive_prefetch 4 138814
prescient_prefetch 4 3
demand_hit_prescient_prefetch 4 3
demand_iohit_prescient_prefetch 4 0
arc_need_free 4 0
arc_sys_free 4 420799488
arc_raw_size 4 0
cached_only_in_progress 4 0
abd_chunk_waste_size 4 11842560
Here's a graph of CPU for the same timeframe as the memory graphs above:
I think the spike in "system" around 00:47:45 is when I started the drop_caches experiment from my prior comment above.
And just in case it's useful to verify.. looks like both of the 2 CPUs on the system were busy during that time..
Also, FWIW.. as I logged back in hours later, and started poking around.. I ran:
$ sudo BPFTRACE_BTF=/sys/kernel/btf/vmlinux bpftrace -e 'kretprobe:arc_is_overflowing { printf("returned: %u\n", retval); } '
returned: 2
returned: 2
returned: 2
...
so I think ARC_OVF_SEVERE was set at that time..
right now, now that the ARC has evicted.. it's returning 0:
$ sudo BPFTRACE_BTF=/sys/kernel/btf/vmlinux bpftrace -e 'kretprobe:arc_is_overflowing { printf("returned: %u\n", retval); } '
returned: 0
returned: 0
returned: 0
so, I think at the point where I was able to log back in and poke around, the system hadn't yet evicted.. but while I was trying to inspect the system, eviction was able to occur.. I don't know if, why, or how my shell activity would have affected ARC eviction, but just wanted to mention it..
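For context on those return values, here's how I read the overflow levels in the ARC code (paraphrased from arc_impl.h as I understand it; double-check against the exact source for your version):

typedef enum arc_ovf_level {
	ARC_OVF_NONE,		/* ARC is within its target size */
	ARC_OVF_SOME,		/* ARC is slightly over the target */
	ARC_OVF_SEVERE		/* ARC is well over the target (the "2" above) */
} arc_ovf_level_t;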
@prakashsurya Could you please test this patch: https://github.com/openzfs/zfs/pull/14692 ?
Could you please test this patch: https://github.com/openzfs/zfs/pull/14692 ?
Sure, let me start working on this.. I'll be out part of the day today, so I might not be able to reproduce until tomorrow, but I'll work on this.
Not sure how useful this is, I think it just gives us the same information we already suspected, but here are some graphs of ARC stat info over that same time period, from the same system that I did the drop_caches experiment on (same system as graphs above):
We can see, around 21:44, c begins to collapse, but size does not. Then by 21:45, c looks to be about as low as it'll go, and size and data begin to grow. Then at 21:48, size and data hit some limit, and remain high, until eviction can finally do something about it.
Still not clear why it behaved this way.
Correlating those timestamps to the eviction counters is also interesting..
Why is evict_skip increasing so much? Presumably, the kernel is asking us to evict, and we're skipping eviction for whatever reason.. then eventually, the kernel just stops asking us to evict, since we don't actually evict anything the many times it's asked us to do so.. at which point, we just "sit there", with size well above c, but never actually evict back down..

So, maybe we need to understand why evict_skip is getting triggered? It's been so long since I've been in this code, so I'm kind of guessing at this point.. but the fact that evict_skip is increasing feels relevant.. and I think it correlates with the timestamps and values of size and c..
@grwilson it looks like you must also be increasing the default zfs_arc_max?

We also run with a larger ARC by default. We use 3/4 of all memory vs upstream, which uses 5/8.
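For what it's worth, that lines up with the arcstats above (my arithmetic, assuming memory_all_bytes reflects total system memory): 3366395904 × 3/4 = 2524796928, which is exactly the reported c_max, and c_min = 105199872 is memory_all_bytes / 32.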
	/* prefetch buffers have a minimum lifespan */
	if ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) &&
	    ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access <
	    MSEC_TO_TICK(min_lifetime)) {
		ARCSTAT_BUMP(arcstat_evict_skip);
		return (bytes_evicted);
	}
Hm.. could we be filling the ARC with prefetched data, and then not allowing that to be evicted due to this logic?
Hm.. could we be filling the ARC with prefetched data, and then not allowing that to be evicted due to this logic?
Theoretically we could, but for that we would have to be very bad in our prefetch decisions, since the first demand read clears the prefetch flag, allowing eviction, and even if not, the buffers should be evicted at most 1 or 6 seconds later. Considering that the eviction thread should try it again and again, I am not sure what value of evict_skip I would consider suspicious, but I can't say the 4M I see makes me worry too much.
Could you please test this patch: #14692 ?
Sure, let me start working on this.. I'll be out part of the day today, so I might not be able to reproduce until tomorrow, but I'll work on this.
@prakashsurya Please let me know how it goes and if I should continue searching.
@amotin sorry for the delay, I kicked off 10 builds with your change applied.. let's see how those go..
With that said, I'm skeptical it's a signed-ness issue at play here, since I'm seeing arcstats_evict_not_enough being triggered.. perhaps I'm misunderstanding, though..
Do we have any theories as to why evict_skip increases in the graphs above? Is that normal?
@prakashsurya With the MFU data state being persistently beaten into the ground, I am absolutely not surprised by evict_skip and evict_not_enough growing at the same time. I can't quantify their exact values without knowing how many times eviction was called in general, but so far it all fits within my understanding of the problem that should be fixed by the signed-ness PR.
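As a generic illustration of the class of bug (a minimal standalone sketch under my own assumptions, not the actual #14692 change): when eviction math mixes unsigned types, an amount that should come out negative (for example, a state that is already below its target) wraps to a huge value, and any later clamping or MIN() logic then operates on garbage:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t target = 10ULL << 20;	/* where we want this bucket to be */
	uint64_t size = 8ULL << 20;	/* it is already 2 MiB below target */

	/* Unsigned "bytes to evict" wraps to ~16 EiB instead of zero. */
	uint64_t evict_unsigned = size - target;

	/* Signed math yields -2 MiB, which can then be clamped to zero. */
	int64_t evict_signed = (int64_t)size - (int64_t)target;

	printf("unsigned: %llu\n", (unsigned long long)evict_unsigned);
	printf("signed:   %lld\n", (long long)evict_signed);
	return (0);
}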
OK.. if it makes sense to you, then that sounds good to me.
I was able to get #14692 applied and ran through about 30 builds; none of them failed. I'd say, let's work to get that integrated..
Then I can work to get our product builds running those bits, and see if we run into any problems.. looks good so far.
Closing this now that https://github.com/openzfs/zfs/pull/14692 has been merged.
Just to close the loop on our end, we pulled #14692 into our buildserver images, and haven't seen the OOMs with it applied.. @amotin thanks for your help!
I'm still investigating things, but I wanted to open an issue with the data I have, since I'm thinking it may be related to https://github.com/openzfs/zfs/pull/14359
We have a system/image that's used as our buildserver, building various artifacts for our product (packages, VM images, etc.).
Since we integrated https://github.com/openzfs/zfs/pull/14359 into our version of ZFS used on these buildserver images, we've been seeing OOM kills, causing build failures.
In the one case that I've been able to inspect, I see the following:
So, the system only has ~3G of RAM.. most of which appears to be used.. but, used by what?
It doesn't look to be used by processes...
Looking at the ARC, I see this:
For a nearly 3G RAM system.. using ~2.9G for the ARC seems excessive.. particularly given that the OOM killer is kicking in to kill user space processes.. i.e. why doesn't the ARC evict data?
Here's an example OOM message:
Additionally, I've attached a gzip'ed copy of syslog from the system: syslog.gz

I'm still actively investigating this issue.. and not certain that https://github.com/openzfs/zfs/pull/14359 is to blame yet.. but things point to it so far.. For example, an image generated prior to that change being integrated does not have these OOM issues, but an image generated after that change was integrated does have these OOM issues.. that, coupled with the fact that nearly all 3G of the system's RAM is being used by the ARC, leads me to suspect the changes to ARC eviction are at play here..
As I get more information, I'll report back.. but wanted to post the information I have so far, in case somebody more familiar could chime in, and maybe help me pinpoint the problem..
CC: @amotin @behlendorf @allanjude