rincebrain opened this issue 6 years ago
I don't see a bug here; perhaps this would be better discussed on the mailing list?
What you're calling target doesn't seem correct to me because the size follows target, but that isn't shown here.
I am seeing strangeness wrt unexpected target size drop. @rincebrain can you dig up the no_grow, memory_indirect_count, and memory_direct_count during the time in question? Something like the attached.
@richardelling Sadly, I don't have that particular stat recorded; I can add it, but it's obviously not retroactive.
https://openzfs.topicbox.com/groups/developer/T10a105c53bcce15c-M53cff3b459ecfa98e446241c sounds remarkably like it could do something similar to what I reported.
Agreed, at the least this is an unintentional bug from the aggsum checkin. I'll be testing a trivial fix for this in the next day or so to see if it prevents the target reduction.
That said, I don't understand why your size didn't shrink along with the target.
May I suggest/request this come in 0.7.10 (#7792) as well, unless it turns out the trivial fix sets something on fire, so we don't have to wait for 0.8 for the MFU/MRU to stop see-sawing? :)
@richardelling I think what happened is that, since the eviction bug made it always try, it kept reducing the target it was trying to hit long after it had run out of things to evict. Once it ran out of things to drop from the MRU, it stopped successfully dropping anything, until the target cratered out and the other codepaths let it normalize...until the MRU again became a large enough portion of the ARC, and the cycle repeated. (That would explain why the target craters, slowly grows, then almost exactly tracks arcsz until it hits a limit and reaps. I don't know where the 50% comes from, though, since zfs_arc_meta_limit_percent is 75 and none of the other math seems to compose to that value.)
My current values for those two counters on that machine are:
memory_direct_count 4 856767
memory_indirect_count 4 8464100
methinks there is more than one issue at play. The trivial change to arc_adjust() is correct, but doesn’t solve the target reduction I see in automated testing.
We definitely need to get the trivial aggsum fix reviewed and into master to address that issue. But there's something else going on which needs to be understood, since none of that code is part of the 0.7.x series.
@richardelling's graphs from https://github.com/zfsonlinux/zfs/issues/7820#issuecomment-417051916 are pretty interesting. The memory_*direct_counts get bumped once each time the kernel, not ZFS, determines the system is low on memory. The graphs show that the kernel, for some reason, starts consistently asking ZFS to release some memory. Which ZFS does, then adjusts the new target downward accordingly. The expectation is that at some point the kernel will stop asking and we'll settle into a new steady state. But for some reason that's not happening.
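For context, here is a simplified paraphrase of where those counters get bumped in the Linux shrinker callback (a sketch of the 0.7-era __arc_shrinker_func, not verbatim source):

static spl_shrinker_t
__arc_shrinker_func(struct shrinker *shrink, struct shrink_control *sc)
{
	/*
	 * kswapd performs indirect (background) reclaim; anything
	 * else is direct reclaim on behalf of a process that needs
	 * the memory immediately.
	 */
	if (current_is_kswapd())
		ARCSTAT_BUMP(arcstat_memory_indirect_count);
	else
		ARCSTAT_BUMP(arcstat_memory_direct_count);

	/* ... evict and return the number of pages reclaimed ... */
}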
This might be the result of a kernel change in memory management behavior, but I haven't gone looking for a specific commit yet. So it's possible this only affects newer kernels.
If either of you is game, could you try commit f7312913c9c7c6558121bc1ee2f9296b5d037c1d on one of your test systems to confirm this is what's going on?
After patching the source you'll need to set zfs_arc_shrinker_enabled=0 to test the change. The idea here is to allow the arc_reclaim thread to solely manage the indirect memory reclaim from the ARC. Only direct memory reclaim requests by the kernel will result in memory being freed; these should happen when the kernel needs memory now, likely because of a rapid change in memory requirements. This should prevent the collapse.
I'm game; sorry I hadn't already noticed that the arc_adjust code refactors postdate the 0.7 release. I'm building a 0.7 spin with f731291 now to give it a whirl...
@behlendorf you're a step ahead of me, I was just fixin' to code that almost exactly the same way. I'll get it into test later today.
Meanwhile, the other fix is trivial but AFAIK applies to master. AFAIK, aggsums weren't backported to 0.7
dammitalltohell... hit the wrong button
@richardelling PR's with the trivial fix against master are welcome!
There is still a case where target falls as memory_direct_count starts.
Out of curiosity, what's recording that data and spitting those graphs out? (My above ones are from netdata, but it makes it more difficult than I'd like to add new graphs.)
@richardelling is it possible that with your workload something on the system is legitimately using that memory? Direct reclaim shouldn't be invoked unless the memory is needed elsewhere. Which kernel version are you testing with?
Thanks for being the straight guy :-) I'll present all of that at the OpenZFS 2018 conference soon. The short answer is you can use some combination of: telegraf or node_exporter to collect all of the stats, influxdb or prometheus to store and do math, then grafana to make pretty pictures.
@behlendorf though I rarely test on generic distros, this is an exception. RHEL 7.4 kernel 3.10.0-693.21.1.el7.x86_64
System has 128GiB and ZFS is the only workload under test. That said, getting to the bottom of why the kernel wants reclaim isn't well instrumented. I'm tempted to expose last_free_memory and last_free_reason as kstats to get better insight.
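(Purely a hypothetical sketch of what that might look like, following the existing arcstats pattern; none of this is actual ZFS code:)

	/* hypothetical new arc_stats fields, next to the existing ones */
	kstat_named_t arcstat_last_free_memory;
	kstat_named_t arcstat_last_free_reason;

	/* updated wherever arc_available_memory() records its verdict */
	ARCSTAT(arcstat_last_free_memory) = last_free_memory;
	ARCSTAT(arcstat_last_free_reason) = (uint64_t)last_free_reason;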
Yup, it's definitely still happening, though setting zfs_arc_shrinker_enabled=0 seems to cause arcsz+target to sometimes vary even higher?
I think you can guess from my above statement where I set zfs_arc_shrinker_enabled=0.
Ratio of MFU/MRU. Nothing interesting seems to have changed when I changed that tunable.
Lots of direct memory throttles once I turned shrinker off, bursts of indirect throttles with it on, and the number of ARC elements goes from a steady state to tiny.
(I reproduced this by doing a zfs send -Rec on a 7TB dataset to /dev/null; it happened almost immediately. Confusingly, I couldn't use this to reproduce it right after rebooting into the version with shrinker_enabled in the module...)
This seems to match my experience as well. There are cases where memory_direct_count increases even though we have tens of GB available. I'll try to dig into it more next week.
@rincebrain if possible would you mind cherry-picking this https://github.com/zfsonlinux/zfs/pull/8000/commits/4c4fb6281c014633131733557144e0a0eb0bcec5 fix and seeing if it resolves the issue? Unfortunately I wasn't able to reproduce the issue as easily as you seem to be able to. If you are still able to reproduce it, the full arcstats output from the bad state would be helpful; I didn't see it above.
Cherrypicked it onto 0.7.11; at first glance, it seems to do something useful.
(I made these two graphs by setting arc_min=arc_max, piping a large send into /dev/null, then setting arc_min to 2GB once arcsz had sat at arc_max for a bit.)
Before 4c4fb62:
After:
It also now steady-states around arc_max even after the send is over, which (given the absence of memory pressure on the system) seems like the correct behavior.
Closing. This should be resolved by 4c4fb62 which will be in 0.7.12.
Huh, I thought I commented in here that this appears to only be a partial fix, but apparently not.
Well, as the first line says, it certainly helps and reduces the frequency, but the ARC still ends up periodically dumping to arc_min without any obvious correlated external memory pressure.
Edit: Amusingly, it took about 2.5 minutes to reproduce when I dropped arc_min from =arc_max (48 GiB) to 32 GiB.
The system was not, AFAICT, under any memory pressure when this occurred - available memory shot up, and has not gone down again since.
It just completely evicted the recent queue. I might be misunderstanding the expected ARC behavior, but I don't think it should drop 100% of the MRU as soon as it tries allocating while at arc_max. I suppose that could be reasonable if the entire contents of the MRU were old enough that they'd be evicted without the arc_min setting, but zooming out, the MRU elements do seem to have been getting hit prior to the mass eviction. It still seems strange that it'd evict 100% of one list and not the other, though.
Type | Version/Name
---|---
Distribution Name | Debian
Distribution Version | 9.6
Linux Kernel | 4.9.0-8-amd64 (4.9.130-2 (2018-10-27))
Architecture | amd64
ZFS Version | 0.7.12
SPL Version | 0.7.12
I may have a related issue. My ARC sits at zfs_arc_min despite no apparent memory pressure (30+ GB free memory) and my MRU is essentially pushed to near zero (0.2% of ARC).
total used free shared buff/cache available
Mem: 128908 88659 38708 51 1540 39316
Swap: 0 0 0
ZFS Subsystem Report Sun Jan 13 20:36:21 2019
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 4.25M
Mutex Misses: 1.21k
Evict Skips: 7.85M
ARC Size: 80.00% 80.00 GiB
Target Size: (Adaptive) 80.01% 80.01 GiB
Min Size (Hard Limit): 80.00% 80.00 GiB
Max Size (High Water): 1:1 100.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 0.21% 172.15 MiB
Frequently Used Cache Size: 99.79% 79.35 GiB
ARC Hash Breakdown:
Elements Max: 1.44M
Elements Current: 65.00% 936.15k
Collisions: 609.45k
Chain Max: 5
Chains: 24.35k
ARC Total accesses: 641.02M
Cache Hit Ratio: 96.69% 619.79M
Cache Miss Ratio: 3.31% 21.22M
Actual Hit Ratio: 96.59% 619.16M
Data Demand Efficiency: 98.82% 143.29M
Data Prefetch Efficiency: 17.48% 4.59M
CACHE HITS BY CACHE LIST:
Most Recently Used: 12.78% 79.19M
Most Frequently Used: 87.12% 539.97M
Most Recently Used Ghost: 0.20% 1.23M
Most Frequently Used Ghost: 0.00% 1.34k
CACHE HITS BY DATA TYPE:
Demand Data: 22.85% 141.60M
Prefetch Data: 0.13% 801.63k
Demand Metadata: 76.95% 476.94M
Prefetch Metadata: 0.07% 450.29k
CACHE MISSES BY DATA TYPE:
Demand Data: 7.97% 1.69M
Prefetch Data: 17.83% 3.78M
Demand Metadata: 73.39% 15.58M
Prefetch Metadata: 0.82% 173.67k
L2 ARC Summary: (HEALTHY)
Low Memory Aborts: 6
Free on Write: 0
R/W Clashes: 0
Bad Checksums: 0
IO Errors: 0
L2 ARC Size: (Adaptive) 2.23 GiB
Compressed: 28.65% 654.65 MiB
Header Size: 0.00% 0 Bytes
L2 ARC Breakdown: 21.22M
Hit Ratio: 0.04% 9.53k
Miss Ratio: 99.96% 21.21M
Feeds: 123.95k
L2 ARC Writes:
Writes Sent: 100.00% 36.14k
DMU Prefetch Efficiency: 207.78M
Hit Ratio: 1.49% 3.10M
Miss Ratio: 98.51% 204.68M
ZFS Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 104857600
dbuf_cache_max_shift 5
dmu_object_alloc_chunk_shift 7
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslabs_per_vdev 200
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
spa_slop_shift 5
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_admin_snapshot 1
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 10
zfs_arc_dnode_reduce_percent 10
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 107374182400
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 75
zfs_arc_meta_min 10737418240
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 85899345920
zfs_arc_min_prefetch_lifespan 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_shrink_shift 0
zfs_arc_sys_free 0
zfs_autoimport_disable 1
zfs_checksums_per_second 20
zfs_compressed_arc_enabled 1
zfs_dbgmsg_enable 0
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_deadman_checktime_ms 5000
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delays_per_second 20
zfs_delete_blocks 20480
zfs_dirty_data_max 4294967296
zfs_dirty_data_max_max 4294967296
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync 67108864
zfs_dmu_offset_next_sync 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_max_blocks 100000
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_max_recordsize 1048576
zfs_mdcomp_disable 0
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_multihost_fail_intervals 5
zfs_multihost_history 0
zfs_multihost_import_intervals 10
zfs_multihost_interval 1000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 30
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_recover 0
zfs_recv_queue_length 16777216
zfs_resilver_delay 2
zfs_resilver_min_time_ms 3000
zfs_scan_idle 50
zfs_scan_ignore_errors 0
zfs_scan_min_time_ms 1000
zfs_scrub_delay 4
zfs_send_corrupt_data 0
zfs_send_queue_length 16777216
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_top_maxinflight 32
zfs_txg_history 0
zfs_txg_timeout 5
zfs_vdev_aggregation_limit 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_max_active 1000
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_impl [fastest] original scalar sse2 ssse3
zfs_vdev_read_gap_limit 32768
zfs_vdev_scheduler noop
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 640
zil_replay_disable 0
zil_slog_bulk 786432
zio_delay_max 30000
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1
root@storage-server:/etc/openvpn# cat /proc/spl/kstat/zfs/arcstats
13 1 0x01 96 26112 34656842033 127935894113184
name type data
hits 4 622723998
misses 4 21332134
demand_data_hits 4 142369649
demand_data_misses 4 1700775
demand_metadata_hits 4 479096386
demand_metadata_misses 4 15655646
prefetch_data_hits 4 805936
prefetch_data_misses 4 3801181
prefetch_metadata_hits 4 452027
prefetch_metadata_misses 4 174532
mru_hits 4 79310752
mru_ghost_hits 4 1232252
mfu_hits 4 542775912
mfu_ghost_hits 4 1340
deleted 4 4257143
mutex_miss 4 1215
access_skip 4 490
evict_skip 4 7881662
evict_not_enough 4 295699
evict_l2_cached 4 1304838656
evict_l2_eligible 4 179141345792
evict_l2_ineligible 4 1091968961536
evict_l2_skip 4 0
hash_elements 4 941329
hash_elements_max 4 1440242
hash_collisions 4 612202
hash_chains 4 24534
hash_chain_max 4 5
p 4 480386928
c 4 86023009400
c_min 4 85899345920
c_max 4 107374182400
size 4 85741000384
compressed_size 4 84978538496
uncompressed_size 4 87820712960
overhead_size 4 257947648
hdr_size 4 317168928
data_size 4 84086763008
metadata_size 4 1149723136
dbuf_size 4 39354120
dnode_size 4 110780952
bonus_size 4 37210240
anon_size 4 7573504
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 472462848
mru_evictable_data 4 295796736
mru_evictable_metadata 4 1918464
mru_ghost_size 4 85465085952
mru_ghost_evictable_data 4 80576998912
mru_ghost_evictable_metadata 4 4888087040
mfu_size 4 84756449792
mfu_evictable_data 4 83609634304
mfu_evictable_metadata 4 940071424
mfu_ghost_size 4 201733120
mfu_ghost_evictable_data 4 201733120
mfu_ghost_evictable_metadata 4 0
l2_hits 4 9540
l2_misses 4 21315322
l2_feeds 4 124888
l2_rw_clash 4 0
l2_read_bytes 4 47256064
l2_write_bytes 4 756345856
l2_writes_sent 4 36177
l2_writes_done 4 36177
l2_writes_error 4 0
l2_writes_lock_retry 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 6
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 2396663296
l2_asize 4 686565888
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 711073
memory_indirect_count 4 170203
memory_all_bytes 4 135170146304
memory_free_bytes 4 41371328512
memory_available_bytes 3 39259295744
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 1654237376
arc_meta_limit 4 80530636800
arc_dnode_limit 4 10737418240
arc_meta_max 4 4702071400
arc_meta_min 4 10737418240
sync_wait_for_async 4 23803
demand_hit_predictive_prefetch 4 1669466
arc_need_free 4 0
arc_sys_free 4 2112033536
I also made another observation about my ARC behavior. The MRU cache tends toward zero -- usually stabilizing at 100-200 MiB. As noted above, the ARC tends to stay at zfs_arc_min even when the system has tens of GBs of free/available memory.
Here are updated weekly graphs of memory and ARC usage as well as output from arc_summary and kstats. The system has been up around 9 1/2 days, so it should be at steady-state for ARC.
------------------------------------------------------------------------
ZFS Subsystem Report Mon Jan 21 16:52:14 2019
ARC Summary: (HEALTHY)
Memory Throttle Count: 0
ARC Misc:
Deleted: 29.87M
Mutex Misses: 8.75k
Evict Skips: 64.61M
ARC Size: 69.30% 69.30 GiB
Target Size: (Adaptive) 90.07% 90.07 GiB
Min Size (Hard Limit): 90.00% 90.00 GiB
Max Size (High Water): 1:1 100.00 GiB
ARC Size Breakdown:
Recently Used Cache Size: 2.17% 1.47 GiB
Frequently Used Cache Size: 97.83% 66.20 GiB
ARC Hash Breakdown:
Elements Max: 4.67M
Elements Current: 86.12% 4.02M
Collisions: 8.53M
Chain Max: 6
Chains: 407.29k
ARC Total accesses: 3.35G
Cache Hit Ratio: 95.80% 3.21G
Cache Miss Ratio: 4.20% 140.57M
Actual Hit Ratio: 95.61% 3.20G
Data Demand Efficiency: 98.52% 747.94M
Data Prefetch Efficiency: 16.73% 28.35M
CACHE HITS BY CACHE LIST:
Most Recently Used: 9.37% 300.30M
Most Frequently Used: 90.43% 2.90G
Most Recently Used Ghost: 0.35% 11.25M
Most Frequently Used Ghost: 0.00% 21.08k
CACHE HITS BY DATA TYPE:
Demand Data: 22.99% 736.89M
Prefetch Data: 0.15% 4.74M
Demand Metadata: 76.46% 2.45G
Prefetch Metadata: 0.41% 13.14M
CACHE MISSES BY DATA TYPE:
Demand Data: 7.86% 11.04M
Prefetch Data: 16.80% 23.61M
Demand Metadata: 71.83% 100.98M
Prefetch Metadata: 3.51% 4.94M
DMU Prefetch Efficiency: 1.19G
Hit Ratio: 1.20% 14.22M
Miss Ratio: 98.80% 1.17G
ZFS Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 104857600
dbuf_cache_max_shift 5
dmu_object_alloc_chunk_shift 7
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslabs_per_vdev 200
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
spa_slop_shift 5
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_admin_snapshot 1
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 10
zfs_arc_dnode_reduce_percent 10
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 107374182400
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 75
zfs_arc_meta_min 10737418240
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 96636764160
zfs_arc_min_prefetch_lifespan 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_shrink_shift 0
zfs_arc_sys_free 0
zfs_autoimport_disable 1
zfs_checksums_per_second 20
zfs_compressed_arc_enabled 1
zfs_dbgmsg_enable 0
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_deadman_checktime_ms 5000
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delays_per_second 20
zfs_delete_blocks 20480
zfs_dirty_data_max 4294967296
zfs_dirty_data_max_max 4294967296
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync 67108864
zfs_dmu_offset_next_sync 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_max_blocks 100000
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_max_recordsize 1048576
zfs_mdcomp_disable 0
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_multihost_fail_intervals 5
zfs_multihost_history 0
zfs_multihost_import_intervals 10
zfs_multihost_interval 1000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 30
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_recover 0
zfs_recv_queue_length 16777216
zfs_resilver_delay 2
zfs_resilver_min_time_ms 3000
zfs_scan_idle 50
zfs_scan_ignore_errors 0
zfs_scan_min_time_ms 1000
zfs_scrub_delay 4
zfs_send_corrupt_data 0
zfs_send_queue_length 16777216
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_top_maxinflight 32
zfs_txg_history 0
zfs_txg_timeout 5
zfs_vdev_aggregation_limit 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_max_active 1000
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_impl [fastest] original scalar sse2 ssse3
zfs_vdev_read_gap_limit 32768
zfs_vdev_scheduler noop
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 640
zil_replay_disable 0
zil_slog_bulk 786432
zio_delay_max 30000
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1
13 1 0x01 96 26112 34656842033 804777653833568
name type data
hits 4 3205852475
misses 4 140567980
demand_data_hits 4 736894151
demand_data_misses 4 11042858
demand_metadata_hits 4 2451071653
demand_metadata_misses 4 100976111
prefetch_data_hits 4 4743754
prefetch_data_misses 4 23608835
prefetch_metadata_hits 4 13142917
prefetch_metadata_misses 4 4940176
mru_hits 4 300295430
mru_ghost_hits 4 11254932
mfu_hits 4 2899165514
mfu_ghost_hits 4 21080
deleted 4 29873669
mutex_miss 4 8754
access_skip 4 1977
evict_skip 4 64612791
evict_not_enough 4 1469200
evict_l2_cached 4 216456410112
evict_l2_eligible 4 2796371940864
evict_l2_ineligible 4 4620242411520
evict_l2_skip 4 0
hash_elements 4 4024597
hash_elements_max 4 4673090
hash_collisions 4 8526637
hash_chains 4 407278
hash_chain_max 4 6
p 4 1304205685
c 4 96710664648
c_min 4 96636764160
c_max 4 107374182400
size 4 74415300672
compressed_size 4 72316176896
uncompressed_size 4 90511407616
overhead_size 4 351967744
hdr_size 4 1353753080
data_size 4 65247920640
metadata_size 4 7420224000
dbuf_size 4 84170424
dnode_size 4 228276368
bonus_size 4 80956160
anon_size 4 4094464
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_size 4 1579768832
mru_evictable_data 4 1520373248
mru_evictable_metadata 4 10283008
mru_ghost_size 4 94362664448
mru_ghost_evictable_data 4 52083600384
mru_ghost_evictable_metadata 4 42279064064
mfu_size 4 71084281344
mfu_evictable_data 4 63615495168
mfu_evictable_metadata 4 7061716480
mfu_ghost_size 4 1458978816
mfu_ghost_evictable_data 4 1458978816
mfu_ghost_evictable_metadata 4 0
l2_hits 4 117318
l2_misses 4 70174985
l2_feeds 4 401206
l2_rw_clash 4 0
l2_read_bytes 4 10449594368
l2_write_bytes 4 201448042496
l2_writes_sent 4 134405
l2_writes_done 4 134405
l2_writes_error 4 0
l2_writes_lock_retry 4 2
l2_evict_lock_retry 4 1
l2_evict_reading 4 0
l2_evict_l1cached 4 235879
l2_free_on_write 4 34
l2_abort_lowmem 4 38
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 3432416
memory_indirect_count 4 1689117
memory_all_bytes 4 135170146304
memory_free_bytes 4 49069887488
memory_available_bytes 3 46957854720
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 9167380032
arc_meta_limit 4 80530636800
arc_dnode_limit 4 10737418240
arc_meta_max 4 17917926576
arc_meta_min 4 10737418240
sync_wait_for_async 4 215029
demand_hit_predictive_prefetch 4 7582640
arc_need_free 4 0
arc_sys_free 4 2112033536
Reopening by request
I've noticed that when zfs_arc_min equals zfs_arc_max, the MRU doesn't trend toward zero like it did when I had zfs_arc_min set to a smaller value. Specifically, I had zfs_arc_min set to 90 GiB and zfs_arc_max set to 100 GiB. Since it consistently sat at 90 GiB, I just set zfs_arc_min to match zfs_arc_max. Whereas my MRU was previously only a few hundred MB in size, it's now much larger, over 10 GiB.
I am running 0.7.13 and have recently also noticed wildly fluctuating ARC target size:
I would have to agree this exists in 0.8.1 and 0.8.2.
I believe that #9548 explains (at least partly) this behavior. Furthermore, I have written some noninvasive tracing code that verifies that arc_adjust_cb's mishandling of arc_need_free as described in #9548 can indeed occur and will leave arc_need_free in an inappropriate state. I've also written a patch for arc_adjust_cb that adjusts arc_need_free in a way I believe is appropriate.
As far as I understand (and I would appreciate corrections), the relevant moving parts are:
The kernel calls arc_shrinker_func with a struct shrink_control *sc, which can increment arcstat_need_free by ptob(nr_to_scan) -- how many bytes the kernel wants us to free. arc_need_free and arcstat_need_free are two names for the same uint64_t, per the #define and ARCSTAT macro in arc_impl.h:

#define arc_need_free ARCSTAT(arcstat_need_free) /* bytes to be freed */
In arc.c, arc_reap_cb calls arc_reduce_target_size with the MAX of (arc_c >> arc_shrink_shift) - arc_available_memory() and arc_need_free:

	free_memory = arc_available_memory();

	int64_t to_free =
	    (arc_c >> arc_shrink_shift) - free_memory;
	if (to_free > 0) {
#ifdef _KERNEL
		to_free = MAX(to_free, arc_need_free);
#endif
		arc_reduce_target_size(to_free);
	}
arc_reduce_target_size adjusts arc_c by its argument. If arc_need_free is big, it will dominate to_free and thus be passed to arc_reduce_target_size, which will therefore determine the adjusted arc_c that arc_adjust_cb later acts on. Also, the return value of arc_available_memory can be affected by the value of arc_need_free (I haven't looked into how exactly).
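For reference, arc_reduce_target_size boils down to roughly the following (a paraphrase, not verbatim source; the real function also adjusts arc_p and signals the adjust thread):

void
arc_reduce_target_size(int64_t to_free)
{
	uint64_t c = arc_c;

	/*
	 * The target is clamped at arc_c_min, so a large inherited
	 * arc_need_free can slam arc_c straight to the floor.
	 */
	if (c > to_free && c - to_free > arc_c_min)
		arc_c = c - to_free;
	else
		arc_c = arc_c_min;
}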
arc_adjust() will look at arc_c and will actually do the necessary evictions to get the ARC size under it, and it returns how many bytes were evicted.
arc_adjust() is called by arc_adjust_cb. This is where I believe the problem lives:
static void
arc_adjust_cb(void *arg, zthr_t *zthr)
{
	uint64_t evicted = 0;
	fstrans_cookie_t cookie = spl_fstrans_mark();

	/* Evict from cache */
	evicted = arc_adjust();

	/*
	 * If evicted is zero, we couldn't evict anything
	 * via arc_adjust(). This could be due to hash lock
	 * collisions, but more likely due to the majority of
	 * arc buffers being unevictable. Therefore, even if
	 * arc_size is above arc_c, another pass is unlikely to
	 * be helpful and could potentially cause us to enter an
	 * infinite loop. Additionally, zthr_iscancelled() is
	 * checked here so that if the arc is shutting down, the
	 * broadcast will wake any remaining arc adjust waiters.
	 */
	mutex_enter(&arc_adjust_lock);
	arc_adjust_needed = !zthr_iscancelled(arc_adjust_zthr) &&
	    evicted > 0 && aggsum_compare(&arc_size, arc_c) > 0;
	if (!arc_adjust_needed) {
		/*
		 * We're either no longer overflowing, or we
		 * can't evict anything more, so we should wake
		 * arc_get_data_impl() sooner.
		 */
		cv_broadcast(&arc_adjust_waiters_cv);
		arc_need_free = 0;
	}
	mutex_exit(&arc_adjust_lock);
	spl_fstrans_unmark(cookie);
}
Who can modify arc_need_free (aka arcstat_need_free)? arc_shrinker_func can only increase it, and the only way it can be decreased is via the "arc_need_free = 0" line in arc_adjust_cb. I believe that if arc_adjust() evicts N bytes of memory, arc_need_free should be decremented by N (or set to 0 if arc_adjust evicted more than arc_need_free bytes). If arc_need_free is left unchanged at a nonzero value even though arc_adjust evicted a nonzero number of bytes, we will evict more than necessary (double counting) the next time arc_adjust_cb runs.
arc_need_free can only decrease if it is zeroed -- which only happens when arc_adjust_needed is false. However, it is possible for arc_adjust to evict a nonzero number of bytes without arc_adjust_needed changing! This is even the case when arc_adjust successfully evicted more bytes than arc_need_free!
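To make the double counting concrete, here is a standalone toy model of the accounting (the names mirror the ZFS variables but nothing here is actual ZFS code; the numbers are illustrative and match the first dbgmsg entries further down):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t arc_need_free = 8388608;	/* shrinker asked for 8 MiB */
	uint64_t evicted = 120297984;		/* arc_adjust() freed ~115 MiB */
	int arc_adjust_needed = 1;		/* arc_size still above arc_c */

	/*
	 * Existing logic: the debt is only zeroed when no further
	 * adjustment is needed, so a fully satisfied debt survives
	 * and is double counted on the next pass.
	 */
	if (!arc_adjust_needed)
		arc_need_free = 0;
	printf("old code: arc_need_free = %llu\n",
	    (unsigned long long)arc_need_free);

	/* Proposed accounting: credit the bytes actually evicted. */
	arc_need_free = (evicted >= arc_need_free) ?
	    0 : arc_need_free - evicted;
	printf("patched:  arc_need_free = %llu\n",
	    (unsigned long long)arc_need_free);
	return (0);
}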
To verify that this was happening, I patched arc_adjust_cb as follows (note that this does not affect any ZFS state -- this is noninvasive):
static void
arc_adjust_cb(void *arg, zthr_t *zthr)
{
	uint64_t evicted = 0;
	int debt_adjusted = 0;
	int debt_cleared = 0;
	uint64_t old_arc_need_free = 0;
	fstrans_cookie_t cookie = spl_fstrans_mark();

	/* Evict from cache */
	evicted = arc_adjust();
	zfs_dbgmsg("XXX: STG 1: PREV arcstat_need_free is %llu", arc_need_free);
	zfs_dbgmsg("XXX: STG 1: arc_adjust evicted %llu", evicted);
	old_arc_need_free = arc_need_free;

	/*
	 * If evicted is zero, we couldn't evict anything
	 * via arc_adjust(). This could be due to hash lock
	 * collisions, but more likely due to the majority of
	 * arc buffers being unevictable. Therefore, even if
	 * arc_size is above arc_c, another pass is unlikely to
	 * be helpful and could potentially cause us to enter an
	 * infinite loop. Additionally, zthr_iscancelled() is
	 * checked here so that if the arc is shutting down, the
	 * broadcast will wake any remaining arc adjust waiters.
	 */
	mutex_enter(&arc_adjust_lock);
	if (arc_need_free > evicted) {
		/* arc_need_free (aka kernel shrinker) threshold is extant, but not satisfied yet */
		debt_adjusted = 1;
		zfs_dbgmsg("XXX: STG 2: I WOULD ADJUST DEBT HERE");
	} else if (evicted > arc_need_free) {
		/* We satisfied the arc_need_free debt so we clear arc_need_free */
		debt_cleared = 1;
		zfs_dbgmsg("XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY");
	}
	arc_adjust_needed = !zthr_iscancelled(arc_adjust_zthr) &&
	    evicted > 0 && aggsum_compare(&arc_size, arc_c) > 0;
	if (!arc_adjust_needed) {
		/*
		 * We're either no longer overflowing, or we
		 * can't evict anything more, so we should wake
		 * arc_get_data_impl() sooner.
		 */
		cv_broadcast(&arc_adjust_waiters_cv);
		zfs_dbgmsg("XXX: STG 4: NEW arcstat_need_free is ZERO since arc_adjust_needed == 0");
		arc_need_free = 0;
	} else {
		if ((debt_adjusted == 1) && (old_arc_need_free != 0)) {
			zfs_dbgmsg("XXXXXXXXXXXX OLD CODE WOULD HAVE NOT CREDITED THE EVICTION AMOUNT XXXXXXXXXXXX");
		}
		if ((debt_cleared == 1) && (old_arc_need_free != 0)) {
			zfs_dbgmsg("XXXXXXXXXXXX OLD CODE WOULD HAVE NOT ELIMINATED THE DEBT ALTOGETHER XXXXXXXXXXXX");
		}
	}
	mutex_exit(&arc_adjust_lock);
	spl_fstrans_unmark(cookie);
}
The cases I wanted to detect were "arc_need_free is nonzero at the start of arc_adjust_cb, arc_adjust() evicts more than arc_need_free bytes, and the existing ZFS code does not zero arc_need_free" and "arc_need_free is nonzero at the start of arc_adjust_cb, arc_adjust() evicts nonzero and less than arc_need_free bytes, and the existing ZFS code does not decrement arc_need_free".
After recompiling and rebooting with the patched ZFS module, I ran a workload that both fills the ARC (find . -print0 | xargs -0 b2sum in a directory with hundreds of gigabytes of anime) and also provides memory pressure (stress --verbose -m 10 --vm-bytes 4000M -t 10 on a 32GB system), and looked at /proc/spl/kstat/zfs/dbgmsg:
1579644131 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 8388608
1579644131 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 120297984
1579644131 arc.c:4637:arc_adjust_cb(): XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY
1579644131 arc.c:4658:arc_adjust_cb(): XXXXXXXXXXXX OLD CODE WOULD HAVE NOT ELIMINATED THE DEBT ALTOGETHER XXXXXXXXXXXX
1579644134 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 47185920
1579644134 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 12438662144
1579644134 arc.c:4637:arc_adjust_cb(): XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY
1579644134 arc.c:4658:arc_adjust_cb(): XXXXXXXXXXXX OLD CODE WOULD HAVE NOT ELIMINATED THE DEBT ALTOGETHER XXXXXXXXXXXX
1579644134 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 47185920
1579644134 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 36614144
1579644134 arc.c:4634:arc_adjust_cb(): XXX: STG 2: I WOULD ADJUST DEBT HERE
1579644134 arc.c:4649:arc_adjust_cb(): XXX: STG 4: NEW arcstat_need_free is ZERO since arc_adjust_needed == 0
1579644134 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 0
1579644134 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 41811968
1579644134 arc.c:4637:arc_adjust_cb(): XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY
1579644134 arc.c:4649:arc_adjust_cb(): XXX: STG 4: NEW arcstat_need_free is ZERO since arc_adjust_needed == 0
1579644134 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 0
1579644134 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 55570432
1579644134 arc.c:4637:arc_adjust_cb(): XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY
1579644134 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 0
1579644134 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 524288
1579644134 arc.c:4637:arc_adjust_cb(): XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY
1579644134 arc.c:4649:arc_adjust_cb(): XXX: STG 4: NEW arcstat_need_free is ZERO since arc_adjust_needed == 0
1579644134 arc.c:4613:arc_adjust_cb(): XXX: STG 1: PREV arcstat_need_free is 0
1579644134 arc.c:4614:arc_adjust_cb(): XXX: STG 1: arc_adjust evicted 44281856
1579644134 arc.c:4637:arc_adjust_cb(): XXX: STG 3: I WOULD ANNUL THE DEBT HERE, ENTIRELY
In the first four lines, we see that arc_adjust evicted 120297984 bytes -- far more than arcstat_need_free (8388608 bytes) -- and yet arcstat_need_free is not zeroed; it's left to be incremented further (by a second kernel shrinker call)! arc_adjust_cb runs again and evicts 12438662144 bytes, yet arcstat_need_free is left at 47185920; only on the third call to arc_adjust_cb is arcstat_need_free zeroed.
I believe that this does proper accounting for arcstat_need_free with regard to the bytes that arc_adjust evicts:
	if (arc_need_free > evicted) {
		/* arc_need_free (aka kernel shrinker) threshold is extant, but not satisfied yet */
		ARCSTAT_INCR(arcstat_need_free, -evicted); /* but we evicted SOMETHING, so we decrease our debt */
		debt_adjusted = 1;
		zfs_dbgmsg("XXX: STG 2: NEW arcstat_need_free is %llu since arc_need_free > evicted", arc_need_free);
	} else if (evicted > arc_need_free) {
		/* We satisfied the arc_need_free debt so we clear arc_need_free */
		arc_need_free = 0;
		debt_adjusted = 1;
		zfs_dbgmsg("XXX: STG 3: NEW arcstat_need_free is ZERO since evicted > arc_need_free");
	}
and this shows that this code is indeed hit and does adjust arcstat_need_free in cases where the original code does not touch it:
	if (!arc_adjust_needed) {
		/*
		 * We're either no longer overflowing, or we
		 * can't evict anything more, so we should wake
		 * arc_get_data_impl() sooner.
		 */
		cv_broadcast(&arc_adjust_waiters_cv);
		zfs_dbgmsg("XXX: STG 4: NEW arcstat_need_free is ZERO since arc_adjust_needed == 0");
		arc_need_free = 0;
	} else {
		if (debt_adjusted == 1) {
			zfs_dbgmsg("XXXXXXXXXXXX OLD CODE WOULD HAVE NOT TOUCHED ARC_NEED_FREE XXXXXXXXXXXX");
		}
	}
System information
Describe the problem you're observing
When looking at some stat graphs of e.g. the ARC, I noticed that the target size for the ARC appears to be varying incredibly wildly and rapidly.
https://i.imgur.com/1VHy1Hv.png is a graph of min/max/target/current over a 10m interval. The bottom orange that's fluctuating wildly is the target size value (it says red on the legend to the right, but the translucent red over the green makes it look orange).
In addition, looking at this graph ( https://i.imgur.com/GAVFb9j.png ) of MRU/MFU ratio makes me think either something's evicting all of the MRU at regular intervals or somehow ~100% of the MRU entries are getting on the MFU list, and the timing of the ARC target size drops makes me think it's unlikely to be the latter.
It's especially confusing because the system is idle, so there's not really anything external that should be exerting memory pressure and triggering drops like that.
Describe how to reproduce the problem
Unknown; the system is presently fairly idle, taking hourly snapshots of the datasets but otherwise not doing much of anything.
Include any warning/errors/backtraces from the system logs
Nothing interesting in dmesg.