luismcv opened this issue 2 years ago
I am also seeing this.
I'd struggled with the issue from #9966 as well for months, even after setting these options
options zfs zfs_arc_min=17179869184
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_meta_limit_percent=100
options zfs zfs_arc_dnode_limit_percent=75
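(As a sanity check - and only a sketch, using the standard /sys/module/zfs parameter paths - the values the loaded module is actually running with can be confirmed like this:)
grep . /sys/module/zfs/parameters/zfs_arc_min \
  /sys/module/zfs/parameters/zfs_arc_max \
  /sys/module/zfs/parameters/zfs_arc_meta_limit_percent \
  /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
Each file prints as path:value, which makes it easy to spot when the module is still running with old settings.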
After upgrading the server (AMD chips, 96 cores, 128GB RAM) from 20.04 to 22.04 (and thus ZFS to 2.1.4) and reverting all these options to their defaults, the issue of having 96 arc_prune processes running flat out is gone - I now have one arc_prune process running flat out. Don't get me wrong, that's a huge improvement - I can at least log in and get diagnostics. But it's not there yet.
Here's cat /proc/spl/kstat/zfs/arcstats | egrep '(dnode|arc_meta)' - very similar to the OP. In our case we're running a large regression test which creates tens of thousands of files over a few minutes. It doesn't happen every time despite an almost identical workload.
dnode_size 4 6383114272
arc_meta_used 4 21562042592
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 22070295584
arc_meta_min 4 16777216
I've now tried reapplying the zfs_arc_min and zfs_arc_max options above and will keep an eye on it - next time I'll post arc_summary output before dropping caches, but if there's any other diagnostic info that would help please let me know.
Back again - and I can confirm that forcing zfs_arc_min and zfs_arc_max to 16G has no impact. Here's the output of arc_summary:
------------------------------------------------------------------------
ZFS Subsystem Report Fri Oct 21 10:41:30 2022
Linux 5.15.0-50-generic 2.1.4-0ubuntu0.1
Machine: nova (x86_64) 2.1.4-0ubuntu0.1
ARC status: HEALTHY
Memory throttle count: 0
ARC size (current): 125.2 % 20.0 GiB
Target size (adaptive): 100.0 % 16.0 GiB
Min size (hard limit): 100.0 % 16.0 GiB
Max size (high water): 1:1 16.0 GiB
Most Frequently Used (MFU) cache size: 77.4 % 6.4 GiB
Most Recently Used (MRU) cache size: 22.6 % 1.9 GiB
Metadata cache size (hard limit): 100.0 % 16.0 GiB
Metadata cache size (current): 124.2 % 19.9 GiB
Dnode cache size (hard limit): 75.0 % 12.0 GiB
Dnode cache size (current): 48.9 % 5.9 GiB
ARC hash breakdown:
Elements max: 2.7M
Elements current: 15.1 % 413.8k
Collisions: 3.1M
Chain max: 5
Chains: 5.0k
ARC misc:
Deleted: 39.4M
Mutex misses: 43.8M
Eviction skips: 10.6G
Eviction skips due to L2 writes: 0
L2 cached evictions: 0 Bytes
L2 eligible evictions: 1.1 TiB
L2 eligible MFU evictions: 42.6 % 474.6 GiB
L2 eligible MRU evictions: 57.4 % 638.9 GiB
L2 ineligible evictions: 406.3 GiB
ARC total accesses (hits + misses): 4.5G
Cache hit ratio: 99.2 % 4.5G
Cache miss ratio: 0.8 % 34.6M
Actual hit ratio (MFU + MRU hits): 99.2 % 4.5G
Data demand efficiency: 95.0 % 352.0M
Data prefetch efficiency: 0.8 % 4.8M
Cache hits by cache type:
Most frequently used (MFU): 94.0 % 4.2G
Most recently used (MRU): 5.9 % 265.2M
Most frequently used (MFU) ghost: < 0.1 % 1.9M
Most recently used (MRU) ghost: 0.1 % 4.5M
Cache hits by data type:
Demand data: 7.5 % 334.6M
Demand prefetch data: < 0.1 % 37.9k
Demand metadata: 92.4 % 4.1G
Demand prefetch metadata: 0.1 % 4.2M
Cache misses by data type:
Demand data: 50.4 % 17.4M
Demand prefetch data: 13.8 % 4.8M
Demand metadata: 24.1 % 8.4M
Demand prefetch metadata: 11.7 % 4.1M
DMU prefetch efficiency: 75.5M
Hit ratio: 8.3 % 6.3M
Miss ratio: 91.7 % 69.2M
L2ARC not detected, skipping section
Solaris Porting Layer (SPL):
spl_hostid 0
spl_hostid_path /etc/hostid
spl_kmem_alloc_max 1048576
spl_kmem_alloc_warn 65536
spl_kmem_cache_kmem_threads 4
spl_kmem_cache_magazine_size 0
spl_kmem_cache_max_size 32
spl_kmem_cache_obj_per_slab 8
spl_kmem_cache_reclaim 0
spl_kmem_cache_slab_limit 16384
spl_max_show_tasks 512
spl_panic_halt 0
spl_schedule_hrtimeout_slack_us 0
spl_taskq_kick 0
spl_taskq_thread_bind 0
spl_taskq_thread_dynamic 1
spl_taskq_thread_priority 1
spl_taskq_thread_sequential 4
Tunables:
dbuf_cache_hiwater_pct 10
dbuf_cache_lowater_pct 10
dbuf_cache_max_bytes 18446744073709551615
dbuf_cache_shift 5
dbuf_metadata_cache_max_bytes 18446744073709551615
dbuf_metadata_cache_shift 6
dmu_object_alloc_chunk_shift 7
dmu_prefetch_max 134217728
ignore_hole_birth 1
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_meta_percent 33
l2arc_mfuonly 0
l2arc_noprefetch 1
l2arc_norw 0
l2arc_rebuild_blocks_min_l2size 1073741824
l2arc_rebuild_enabled 1
l2arc_trim_ahead 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_df_max_search 16777216
metaslab_df_use_largest_segment 0
metaslab_force_ganging 16777217
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslab_unload_delay 32
metaslab_unload_delay_ms 600000
send_holes_without_birth_time 1
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_print_vdev_tree 0
spa_load_verify_data 1
spa_load_verify_metadata 1
spa_load_verify_shift 4
spa_slop_shift 5
vdev_file_logical_ashift 9
vdev_file_physical_ashift 9
vdev_removal_max_span 32768
vdev_validate_skip 0
zap_iterate_prefetch 1
zfetch_array_rd_sz 1048576
zfetch_max_distance 8388608
zfetch_max_idistance 67108864
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_abd_scatter_enabled 1
zfs_abd_scatter_max_order 10
zfs_abd_scatter_min_size 1536
zfs_admin_snapshot 0
zfs_allow_redacted_dataset_mount 0
zfs_arc_average_blocksize 8192
zfs_arc_dnode_limit 0
zfs_arc_dnode_limit_percent 75
zfs_arc_dnode_reduce_percent 10
zfs_arc_evict_batch_limit 10
zfs_arc_eviction_pct 200
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 10
zfs_arc_max 17179869184
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 0
zfs_arc_meta_limit_percent 100
zfs_arc_meta_min 0
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 17179869184
zfs_arc_min_prefetch_ms 0
zfs_arc_min_prescient_prefetch_ms 0
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_pc_percent 0
zfs_arc_prune_task_threads 1
zfs_arc_shrink_shift 0
zfs_arc_shrinker_limit 10000
zfs_arc_sys_free 0
zfs_async_block_max_blocks 18446744073709551615
zfs_autoimport_disable 1
zfs_checksum_events_per_second 20
zfs_commit_timeout_pct 5
zfs_compressed_arc_enabled 1
zfs_condense_indirect_commit_entry_delay_ms 0
zfs_condense_indirect_obsolete_pct 25
zfs_condense_indirect_vdevs_enable 1
zfs_condense_max_obsolete_bytes 1073741824
zfs_condense_min_mapping_bytes 131072
zfs_dbgmsg_enable 1
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_ddt_data_is_special 1
zfs_deadman_checktime_ms 60000
zfs_deadman_enabled 1
zfs_deadman_failmode wait
zfs_deadman_synctime_ms 600000
zfs_deadman_ziotime_ms 300000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_delete_blocks 20480
zfs_dirty_data_max 4294967296
zfs_dirty_data_max_max 4294967296
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync_percent 20
zfs_disable_ivset_guid_check 0
zfs_dmu_offset_next_sync 1
zfs_embedded_slog_min_ms 64
zfs_expire_snapshot 300
zfs_fallocate_reserve_percent 110
zfs_flags 0
zfs_free_bpobj_enabled 1
zfs_free_leak_on_eio 0
zfs_free_min_time_ms 1000
zfs_history_output_max 1048576
zfs_immediate_write_sz 32768
zfs_initialize_chunk_size 1048576
zfs_initialize_value 16045690984833335022
zfs_keep_log_spacemaps_at_export 0
zfs_key_max_salt_uses 400000000
zfs_livelist_condense_new_alloc 0
zfs_livelist_condense_sync_cancel 0
zfs_livelist_condense_sync_pause 0
zfs_livelist_condense_zthr_cancel 0
zfs_livelist_condense_zthr_pause 0
zfs_livelist_max_entries 500000
zfs_livelist_min_percent_shared 75
zfs_lua_max_instrlimit 100000000
zfs_lua_max_memlimit 104857600
zfs_max_async_dedup_frees 100000
zfs_max_log_walking 5
zfs_max_logsm_summary_length 10
zfs_max_missing_tvds 0
zfs_max_nvlist_src_size 0
zfs_max_recordsize 1048576
zfs_metaslab_find_max_tries 100
zfs_metaslab_fragmentation_threshold 70
zfs_metaslab_max_size_cache_sec 3600
zfs_metaslab_mem_limit 25
zfs_metaslab_segment_weight_enabled 1
zfs_metaslab_switch_threshold 2
zfs_metaslab_try_hard_before_gang 0
zfs_mg_fragmentation_threshold 95
zfs_mg_noalloc_threshold 0
zfs_min_metaslabs_to_flush 1
zfs_multihost_fail_intervals 10
zfs_multihost_history 0
zfs_multihost_import_intervals 20
zfs_multihost_interval 1000
zfs_multilist_num_sublists 0
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_object_mutex_size 64
zfs_obsolete_min_time_ms 500
zfs_override_estimate_recordsize 0
zfs_pd_bytes_max 52428800
zfs_per_txg_dirty_frees_percent 5
zfs_prefetch_disable 0
zfs_read_history 0
zfs_read_history_hits 0
zfs_rebuild_max_segment 1048576
zfs_rebuild_scrub_enabled 1
zfs_rebuild_vdev_limit 33554432
zfs_reconstruct_indirect_combinations_max 4096
zfs_recover 0
zfs_recv_queue_ff 20
zfs_recv_queue_length 16777216
zfs_recv_write_batch_size 1048576
zfs_removal_ignore_errors 0
zfs_removal_suspend_progress 0
zfs_remove_max_segment 16777216
zfs_resilver_disable_defer 0
zfs_resilver_min_time_ms 3000
zfs_scan_checkpoint_intval 7200
zfs_scan_fill_weight 3
zfs_scan_ignore_errors 0
zfs_scan_issue_strategy 0
zfs_scan_legacy 0
zfs_scan_max_ext_gap 2097152
zfs_scan_mem_lim_fact 20
zfs_scan_mem_lim_soft_fact 20
zfs_scan_strict_mem_lim 0
zfs_scan_suspend_progress 0
zfs_scan_vdev_limit 4194304
zfs_scrub_min_time_ms 1000
zfs_send_corrupt_data 0
zfs_send_no_prefetch_queue_ff 20
zfs_send_no_prefetch_queue_length 1048576
zfs_send_queue_ff 20
zfs_send_queue_length 16777216
zfs_send_unmodified_spill_blocks 1
zfs_slow_io_events_per_second 20
zfs_spa_discard_memory_limit 16777216
zfs_special_class_metadata_reserve_pct 25
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 8
zfs_sync_pass_rewrite 2
zfs_sync_taskq_batch_pct 75
zfs_traverse_indirect_prefetch_limit 32
zfs_trim_extent_bytes_max 134217728
zfs_trim_extent_bytes_min 32768
zfs_trim_metaslab_skip 0
zfs_trim_queue_limit 10
zfs_trim_txg_batch 32
zfs_txg_history 100
zfs_txg_timeout 5
zfs_unflushed_log_block_max 262144
zfs_unflushed_log_block_min 1000
zfs_unflushed_log_block_pct 400
zfs_unflushed_max_mem_amt 1073741824
zfs_unflushed_max_mem_ppm 1000
zfs_unlink_suspend_progress 0
zfs_user_indirect_is_special 1
zfs_vdev_aggregate_trim 0
zfs_vdev_aggregation_limit 1048576
zfs_vdev_aggregation_limit_non_rotating 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 2
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_default_ms_count 200
zfs_vdev_default_ms_shift 29
zfs_vdev_initializing_max_active 1
zfs_vdev_initializing_min_active 1
zfs_vdev_max_active 1000
zfs_vdev_max_auto_ashift 16
zfs_vdev_min_auto_ashift 9
zfs_vdev_min_ms_count 16
zfs_vdev_mirror_non_rotating_inc 0
zfs_vdev_mirror_non_rotating_seek_inc 1
zfs_vdev_mirror_rotating_inc 0
zfs_vdev_mirror_rotating_seek_inc 5
zfs_vdev_mirror_rotating_seek_offset 1048576
zfs_vdev_ms_count_limit 131072
zfs_vdev_nia_credit 5
zfs_vdev_nia_delay 5
zfs_vdev_queue_depth_pct 1000
zfs_vdev_raidz_impl cycle [fastest] original scalar sse2 ssse3 avx2
zfs_vdev_read_gap_limit 32768
zfs_vdev_rebuild_max_active 3
zfs_vdev_rebuild_min_active 1
zfs_vdev_removal_max_active 2
zfs_vdev_removal_min_active 1
zfs_vdev_scheduler unused
zfs_vdev_scrub_max_active 3
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_trim_max_active 2
zfs_vdev_trim_min_active 1
zfs_vdev_write_gap_limit 4096
zfs_vnops_read_chunk_size 1048576
zfs_zevent_len_max 512
zfs_zevent_retain_expire_secs 900
zfs_zevent_retain_max 2000
zfs_zil_clean_taskq_maxalloc 1048576
zfs_zil_clean_taskq_minalloc 1024
zfs_zil_clean_taskq_nthr_pct 100
zil_maxblocksize 131072
zil_nocacheflush 0
zil_replay_disable 0
zil_slog_bulk 786432
zio_deadman_log_all 0
zio_dva_throttle_enabled 1
zio_requeue_io_start_cut_in_line 1
zio_slow_io_ms 30000
zio_taskq_batch_pct 80
zio_taskq_batch_tpq 0
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
zvol_request_sync 0
zvol_threads 32
zvol_volmode 1
VDEV cache disabled, skipping section
ZIL committed transactions: 5.4M
Commit requests: 415.4k
Flushes to stable storage: 415.4k
Transactions to SLOG storage pool: 0 Bytes 0
Transactions to non-SLOG storage pool: 19.6 GiB 551.6k
A few days later, it happens again. I'm attaching the final 4 hours or so of /proc/spl/kstat/zfs/dbgmsg - the log was fairly idle until the overnight jobs kicked in, so the attached 500 lines are all from the last four hours.
When it happens again, can you show the output of the following commands?
grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
arcstat 1
free -m
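If it's hard to catch in the act, something like this rough, untested sketch (the script name and interval are arbitrary) can log the first and last of those periodically so there is a record from just before it happens; arcstat 1 can be started by hand once you notice it:
#!/bin/sh
# arc-diag.sh (hypothetical): append a timestamped diagnostics snapshot every 10 seconds
out=/var/tmp/arc-diag-$(date +%Y%m%d-%H%M%S).log
while true; do
    {
        date
        grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
        free -m
        echo "----"
    } >> "$out"
    sleep 10
done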
With pleasure! I've caught it in the act for once so first here's the output of those commands while the machine is genuinely busy:
# grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size 4 5282924768
memory_throttle_count 4 0
memory_direct_count 4 8
memory_indirect_count 4 4
memory_all_bytes 4 135014055936
memory_free_bytes 4 7666941952
memory_available_bytes 3 3095025664
arc_no_grow 4 0
arc_tempreserve 4 9214
arc_loaned_bytes 4 0
arc_prune 4 213098808
arc_meta_used 4 18098941232
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 2048
arc_sys_free 4 4571916288
arc_raw_size 4 0
# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
18:52:08 10K 26 0 26 0 0 0 23 0 17G 16G 2.3G
18:52:09 1.3M 3.1K 0 3.1K 0 0 0 1.6K 0 17G 16G 2.5G
18:52:10 1.6M 2.8K 0 2.8K 0 0 0 1.4K 0 17G 16G 2.5G
18:52:11 740K 4.0K 0 4.0K 0 0 0 2.0K 0 17G 16G 2.5G
18:52:12 184K 4.3K 2 4.3K 2 0 0 2.1K 4 17G 16G 2.5G
18:52:13 188K 5.0K 2 5.0K 2 0 0 2.6K 5 17G 16G 2.4G
18:52:14 217K 6.9K 3 6.9K 3 0 0 3.2K 6 17G 16G 2.4G
18:52:15 181K 5.6K 3 5.6K 3 0 0 2.3K 4 17G 16G 2.4G
18:52:16 163K 4.1K 2 4.1K 2 7 100 2.1K 5 17G 16G 2.4G
18:52:17 1.3M 5.4K 0 5.4K 0 0 0 2.9K 0 17G 16G 2.4G
18:52:18 1.4M 4.8K 0 4.8K 0 0 0 2.4K 0 17G 16G 2.3G
18:52:19 1.5M 6.8K 0 6.8K 0 0 0 3.6K 0 17G 16G 2.3G
18:52:20 1.8M 2.4K 0 2.4K 0 0 0 1.3K 0 17G 16G 2.3G
18:52:21 1.6M 5.4K 0 5.4K 0 0 0 2.9K 0 17G 16G 2.2G
18:52:22 1.4M 6.0K 0 6.0K 0 0 0 3.1K 0 17G 16G 2.2G
18:52:23 1.5M 5.5K 0 5.5K 0 2 100 2.9K 0 17G 16G 2.1G
18:52:24 246K 1.3K 0 1.1K 0 136 97 711 0 17G 16G 2.1G
18:52:25 15K 619 4 556 3 63 95 357 6 17G 16G 2.1G
18:52:26 14K 696 4 621 4 75 93 397 7 17G 16G 2.2G
18:52:27 13K 476 3 436 3 40 95 281 6 17G 16G 2.2G
18:52:28 25K 1.1K 4 815 3 278 98 704 9 17G 16G 2.1G
18:52:29 181K 523 0 497 0 26 92 297 0 17G 16G 2.1G
18:52:30 619K 1.3K 0 1.2K 0 108 96 723 0 17G 16G 2.1G
18:52:31 23K 698 2 641 2 57 98 383 6 17G 16G 2.1G
18:52:32 28K 826 2 752 2 74 97 470 5 17G 16G 2.1G
18:52:33 17K 545 3 517 2 28 96 268 4 17G 16G 2.1G
18:52:34 53K 1.2K 2 1.1K 2 115 99 653 5 17G 16G 2.1G
... etc ...
18:53:28 17K 1.4K 7 1.4K 7 0 0 350 3 18G 16G 1.4G
... etc ...
18:54:58 1.2M 140 0 101 0 39 97 91 0 18G 16G 1.5G
... etc....
18:55:23 3.4M 34 0 29 0 5 83 21 0 18G 16G 1.4G
... etc ...
18:55:45 2.4M 856 0 854 0 2 100 21 0 18G 16G 1.3G
^C
# free -m
total used free shared buff/cache available
Mem: 128759 121865 5403 27 1491 5532
Swap: 0 0 0
Then I killed the application process which was using all the memory and hammering the disk - normally I don't spot it until the machine has been like this for a while, so this is a new step. arc_prune and arc_evict don't immediately rocket up to 100% - they do hit that level after a few seconds, but then fall back to just 1-2%. Now I see:
grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size 4 5625136000
memory_throttle_count 4 0
memory_direct_count 4 8
memory_indirect_count 4 4
memory_all_bytes 4 135014055936
memory_free_bytes 4 88923025408
memory_available_bytes 3 84351109120
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 213477638
arc_meta_used 4 19282122336
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 4571916288
arc_raw_size 4 0
# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
19:02:49 0 0 0 0 0 0 0 0 0 18G 16G 78G
19:02:50 23 0 0 0 0 0 0 0 0 18G 16G 78G
19:02:51 152 6 3 6 3 0 0 6 4 18G 16G 78G
19:02:52 28 0 0 0 0 0 0 0 0 18G 16G 78G
19:02:53 20 0 0 0 0 0 0 0 0 18G 16G 78G
^C (no obvious variation in this.)
# free -m
total used free shared buff/cache available
Mem: 128759 42694 84603 27 1461 84717
Swap: 0 0 0
The machine has no load running, and several minutes pass. arc_prune and arc_evict are intermittently busy - they go up to 100% in top, then back to 1%. It looks like jobs which normally have very little impact are causing this, e.g. just opening a new SSH connection or running apt update. When I run that in another terminal, here's what I see while it's running:
# grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size 4 5548859136
memory_throttle_count 4 0
memory_direct_count 4 8
memory_indirect_count 4 4
memory_all_bytes 4 135014055936
memory_free_bytes 4 89338064896
memory_available_bytes 3 84766148608
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 213694035
arc_meta_used 4 18778333536
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 4571916288
arc_raw_size 4 0
# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
19:12:17 8 0 0 0 0 0 0 0 0 18G 16G 78G
19:12:18 24 1 4 1 4 0 0 1 4 18G 16G 78G
19:12:19 155 0 0 0 0 0 0 0 0 18G 16G 78G
19:12:20 23 0 0 0 0 0 0 0 0 18G 16G 78G
19:12:21 13 0 0 0 0 0 0 0 0 18G 16G 78G
19:12:22 0 0 0 0 0 0 0 0 0 18G 16G 78G
19:12:23 7.2K 21 0 20 0 1 100 7 0 18G 16G 78G
19:12:24 9.5K 297 3 159 1 138 100 1 0 17G 16G 78G
19:12:25 12K 2 0 2 0 0 0 0 0 17G 16G 78G
19:12:26 40 2 5 2 5 0 0 2 5 17G 16G 78G
19:12:27 8 0 0 0 0 0 0 0 0 17G 16G 78G
19:12:28 6 0 0 0 0 0 0 0 0 17G 16G 78G
19:12:29 195 2 1 2 1 0 0 2 1 18G 16G 78G
^C
# free -m
total used free shared buff/cache available
Mem: 128759 42337 84914 27 1506 85051
Swap: 0 0 0
I don't know what I'm looking at with all this, but it feels like, because I killed the job early, the ZFS issue is now... simmering. The slightest disk activity causes the problem to surface.
I need the machine back, so I drop caches, and we have:
# grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size 4 1878924384
memory_throttle_count 4 0
memory_direct_count 4 8
memory_indirect_count 4 4
memory_all_bytes 4 135014055936
memory_free_bytes 4 99754872832
memory_available_bytes 3 95182956544
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 213744942
arc_meta_used 4 7351729280
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 4571916288
arc_raw_size 4 0
# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
19:15:32 0 0 0 0 0 0 0 0 0 6.9G 16G 88G
19:15:33 0 0 0 0 0 0 0 0 0 6.9G 16G 88G
19:15:34 161 0 0 0 0 0 0 0 0 6.9G 16G 88G
^C
# free -m
total used free shared buff/cache available
Mem: 128759 33112 95045 27 601 94729
Swap: 0 0 0
and arc_prune and arc_evict are no longer to be seen in top.
I'll generate the same output the next time I fail to catch the job in time...
Thanks for your report. The key point is that, until you dropped caches, your metadata was over the limit (arc_meta_used > arc_meta_limit), so arc_prune was furiously trying to free some metadata for eviction, but without success. I don't know why arc_prune was unable to free the required metadata, but the next time it happens try to issue echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
For the next report, can you grep the following: grep "arc_\|memory\|overflow\|evict\|dnode" arcstats ?
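That is, next time it is stuck, roughly (assuming the usual /proc path used elsewhere in this thread for arcstats):
echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats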
Back again - this time I did not catch it in time. More diagnostics:
# grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip 4 42489272488
evict_not_enough 4 12413470777
evict_l2_cached 4 0
evict_l2_eligible 4 5648382219264
evict_l2_eligible_mfu 4 2713057107968
evict_l2_eligible_mru 4 2935325111296
evict_l2_ineligible 4 2005483851776
evict_l2_skip 4 0
dnode_size 4 5139755360
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_evictable_data 4 0
mru_evictable_metadata 4 4096
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
memory_throttle_count 4 0
memory_direct_count 4 15
memory_indirect_count 4 58
memory_all_bytes 4 135014055936
memory_free_bytes 4 90148765696
memory_available_bytes 3 85576849408
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 381556707
arc_meta_used 4 17288867392
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 36544
arc_sys_free 4 4571916288
arc_raw_size 4 0
# echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
# grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip 4 42504492992
evict_not_enough 4 12424903898
evict_l2_cached 4 0
evict_l2_eligible 4 5648413739008
evict_l2_eligible_mfu 4 2713057716224
evict_l2_eligible_mru 4 2935356022784
evict_l2_ineligible 4 2005635026944
evict_l2_skip 4 0
dnode_size 4 5148812320
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_evictable_data 4 0
mru_evictable_metadata 4 8192
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
memory_throttle_count 4 0
memory_direct_count 4 15
memory_indirect_count 4 58
memory_all_bytes 4 135014055936
memory_free_bytes 4 90145370112
memory_available_bytes 3 85573453824
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 381992117
arc_meta_used 4 17349396480
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 73728
arc_sys_free 4 4571916288
arc_raw_size 4 0
# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
08:28:57 60 9 15 9 15 0 0 8 15 16G 16G 79G
08:28:58 4.9K 704 14 613 12 91 94 701 14 16G 16G 79G
08:28:59 1.4K 37 2 10 0 27 62 34 2 16G 16G 79G
08:29:00 1.5K 138 8 59 4 79 84 137 8 16G 16G 79G
^C
Setting zfs_arc_meta_limit_percent to 100 made no difference; arc_evict and arc_prune are still pegged at 100% for a few minutes afterwards. So I dropped caches:
# echo 3 > /proc/sys/vm/drop_caches
# grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip 4 42545416959
evict_not_enough 4 12443886776
evict_l2_cached 4 0
evict_l2_eligible 4 5649859763200
evict_l2_eligible_mfu 4 2713251620864
evict_l2_eligible_mru 4 2936608142336
evict_l2_ineligible 4 2006184136704
evict_l2_skip 4 0
dnode_size 4 1140962688
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_evictable_data 4 281744384
mru_evictable_metadata 4 424766464
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_evictable_data 4 209156096
mfu_evictable_metadata 4 829720064
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
memory_throttle_count 4 0
memory_direct_count 4 16
memory_indirect_count 4 58
memory_all_bytes 4 135014055936
memory_free_bytes 4 103376101376
memory_available_bytes 3 98804185088
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 382636215
arc_meta_used 4 6049864784
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 4571916288
arc_raw_size 4 0
This happens to me regularly, on a daily basis, usually after suspending the system to RAM. Changing zfs_arc_meta_limit_percent to 100 doesn't make a difference; dropping all caches with echo 3 > /proc/sys/vm/drop_caches helps. If more statistics are needed I can capture and post them next time it happens. I also observed arc_meta_used >> arc_meta_limit.
@faceless2 arc_meta_limit as shown in your post is identical both before and after setting zfs_arc_meta_limit_percent. Did you already set it to 100 before? Have you changed zfs_arc_meta_limit (the value in bytes)?
@coot When it happens, please report back the following data:
grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
arcstat 1
free -m
Sorry - zfs_arc_meta_limit_percent was already at 100; I can see that from my first report on Oct 21st. I originally had the following in /etc/modprobe.d/zfs.conf:
options zfs zfs_arc_min=17179869184
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_meta_limit_percent=100
options zfs zfs_arc_dnode_limit_percent=75
I removed these after the kernel upgrade but it looks like I didn't reboot, so those are the options I've been running with. Sorry I didn't notice.
The initial level for zfs_arc_max (50% of RAM, or 64GB) was causing a large Java task we run with 96GB of heap to be killed by the kernel OOM killer. That's why I reduced zfs_arc_max.
I have no particular reason for setting zfs_arc_min to the same level, but based on your comment above
I don't know why arc_prune was unable to free the required metadata
I wonder if this could be a contributing factor? I've now set it to 8GB, half of the 16GB max value, and will see if it makes a difference.
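For the record, the relevant lines in /etc/modprobe.d/zfs.conf should now read roughly as follows (8 GiB and 16 GiB expressed in bytes):
options zfs zfs_arc_min=8589934592
options zfs zfs_arc_max=17179869184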
Surely having zfs_arc_min set so high (relative to the ARC max) can be problematic; let's see if the issue reappears.
cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
75
grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip 4 206552551
evict_not_enough 4 22859306
evict_l2_cached 4 0
evict_l2_eligible 4 50234150912
evict_l2_eligible_mfu 4 19353164288
evict_l2_eligible_mru 4 30880986624
evict_l2_ineligible 4 9036353536
evict_l2_skip 4 0
dnode_size 4 3136246688
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_evictable_data 4 12800
mru_evictable_metadata 4 134144
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 1024
mfu_evictable_data 4 512
mfu_evictable_metadata 4 512
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 33792
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 2010
memory_all_bytes 4 67366936576
memory_free_bytes 4 3758071808
memory_available_bytes 3 1403715840
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 48626010
arc_meta_used 4 33332407424
arc_meta_limit 4 25262601216
arc_dnode_limit 4 2526260121
arc_meta_max 4 35483426448
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 2354355968
arc_raw_size 4 8180123648
arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
09:08:13 338 77 22 76 22 1 100 45 14 31G 31G 1.3G
09:08:14 30K 6.7K 21 6.7K 21 1 100 4.2K 14 31G 31G 1.3G
09:08:15 4.1K 862 20 861 20 1 100 565 14 31G 31G 1.3G
09:08:16 506 102 20 102 20 0 0 67 14 31G 31G 1.3G
09:08:17 527 107 20 107 20 0 0 70 14 31G 31G 1.3G
09:08:18 540 110 20 110 20 0 0 69 14 31G 31G 1.3G
09:08:19 2.4K 100 4 100 4 0 0 66 3 31G 31G 1.3G
09:08:20 707 109 15 109 15 0 0 71 10 31G 31G 1.3G
09:08:21 650 128 19 128 19 0 0 79 13 31G 31G 1.3G
09:08:22 807 128 15 128 15 0 0 78 10 31G 31G 1.3G
09:08:23 592 113 19 113 19 0 0 70 13 31G 31G 1.3G
09:08:24 566 115 20 115 20 0 0 72 14 31G 31G 1.3G
09:08:25 790 115 14 115 14 0 0 70 9 31G 31G 1.3G
09:08:26 585 112 19 112 19 0 0 68 12 31G 31G 1.3G
09:08:27 593 110 18 110 18 0 0 68 12 31G 31G 1.3G
09:08:28 747 149 19 148 19 1 100 74 11 31G 31G 1.3G
09:08:29 745 147 19 147 19 0 0 74 11 31G 31G 1.3G
09:08:30 3.7K 263 7 263 7 0 0 146 4 31G 31G 1.2G
09:08:31 874 142 16 142 16 0 0 76 10 31G 31G 1.2G
09:08:32 1.9K 158 8 120 6 38 100 108 6 31G 31G 1.2G
09:08:33 573 147 25 93 17 54 100 110 20 31G 31G 1.2G
09:08:34 563 130 23 72 14 58 100 102 19 31G 31G 1.2G
09:08:35 795 128 16 72 9 56 100 100 13 31G 31G 1.2G
09:08:36 4.8K 1.3K 26 179 4 1.1K 98 1.2K 25 31G 31G 1.2G
09:08:37 17K 1.5K 8 766 4 754 94 1.2K 7 31G 31G 1.2G
09:08:38 65K 14K 21 9.8K 16 4.3K 86 10K 17 31G 31G 1.3G
09:08:39 72K 16K 22 16K 22 4 100 9.9K 14 31G 31G 1.2G
09:08:40 1.3K 262 20 262 20 0 0 156 13 31G 31G 1.2G
09:08:41 800 159 19 159 19 0 0 95 13 31G 31G 1.2G
09:08:42 5.8K 1.2K 20 1.2K 20 0 0 728 13 31G 31G 1.2G
09:08:43 618 126 20 126 20 0 0 77 13 31G 31G 1.2G
09:08:44 936 131 13 131 13 0 0 80 9 31G 31G 1.2G
09:08:45 581 115 19 115 19 0 0 70 13 31G 31G 1.2G
09:08:46 571 111 19 111 19 0 0 67 12 31G 31G 1.2G
09:08:47 761 111 14 111 14 0 0 67 9 31G 31G 1.2G
09:08:48 696 117 16 117 16 0 0 71 11 31G 31G 1.2G
09:08:49 643 117 18 117 18 0 0 71 12 31G 31G 1.2G
09:08:50 2.4K 112 4 112 4 0 0 68 3 31G 31G 1.2G
09:08:51 788 125 15 125 15 0 0 75 10 31G 31G 1.2G
09:08:52 1.9K 147 7 147 7 0 0 95 5 31G 31G 1.2G
09:08:53 914 175 19 175 19 0 0 105 12 31G 31G 1.2G
09:08:54 4.4K 161 3 160 3 1 100 80 2 31G 31G 1.2G
09:08:55 643 116 18 116 18 0 0 71 12 31G 31G 1.2G
09:08:56 649 124 19 123 18 1 100 72 12 31G 31G 1.2G
09:08:57 849 125 14 125 14 0 0 77 10 31G 31G 1.2G
09:08:58 750 131 17 131 17 0 0 79 11 31G 31G 1.2G
09:08:59 640 126 19 126 19 0 0 77 13 31G 31G 1.2G
09:09:00 760 133 17 133 17 0 0 79 11 31G 31G 1.2G
09:09:01 2.4K 132 5 132 5 0 0 79 3 31G 31G 1.2G
09:09:02 760 124 16 124 16 0 0 76 10 31G 31G 1.2G
09:09:03 648 129 19 129 19 0 0 79 13 31G 31G 1.2G
09:09:04 699 126 18 126 18 0 0 75 11 31G 31G 1.2G
09:09:05 752 127 16 127 16 0 0 77 11 31G 31G 1.2G
09:09:06 654 128 19 128 19 0 0 77 13 31G 31G 1.2G
09:09:07 709 130 18 130 18 0 0 79 12 31G 31G 1.2G
09:09:08 806 126 15 126 15 0 0 76 10 31G 31G 1.2G
09:09:09 854 150 17 150 17 0 0 94 11 31G 31G 1.2G
09:09:10 619 125 20 125 20 0 0 78 13 31G 31G 1.2G
09:09:11 623 126 20 126 20 0 0 77 13 31G 31G 1.2G
09:09:12 625 123 19 123 19 0 0 76 13 31G 31G 1.2G
09:09:13 929 120 12 120 12 0 0 75 8 31G 31G 1.2G
09:09:14 672 128 19 128 19 0 0 76 12 31G 31G 1.2G
09:09:15 636 127 19 127 19 0 0 77 13 31G 31G 1.2G
09:09:16 610 125 20 125 20 0 0 76 13 31G 31G 1.2G
09:09:17 634 127 20 127 20 0 0 77 13 31G 31G 1.2G
free -m
total used free shared buff/cache available
Mem: 64246 58410 2679 882 3156 4303
Swap: 8191 2739 5452
And after
echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip 4 214132970
evict_not_enough 4 23935555
evict_l2_cached 4 0
evict_l2_eligible 4 50574349312
evict_l2_eligible_mfu 4 19446336512
evict_l2_eligible_mru 4 31128012800
evict_l2_ineligible 4 9232410624
evict_l2_skip 4 0
dnode_size 4 3170140352
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_evictable_data 4 0
mru_evictable_metadata 4 32768
mru_ghost_evictable_data 4 1883648
mru_ghost_evictable_metadata 4 6279168
mfu_evictable_data 4 0
mfu_evictable_metadata 4 458752
mfu_ghost_evictable_data 4 472064
mfu_ghost_evictable_metadata 4 712704
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 2010
memory_all_bytes 4 67366936576
memory_free_bytes 4 3435409408
memory_available_bytes 3 1081053440
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 49696362
arc_meta_used 4 33678983392
arc_meta_limit 4 33683468288
arc_dnode_limit 4 3368346828
arc_meta_max 4 35483426448
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 2354355968
arc_raw_size 4 8244556800
arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
09:10:39 386 85 22 85 22 0 0 51 14 31G 31G 1.0G
09:10:40 22K 4.8K 21 4.8K 21 1 100 2.8K 14 31G 31G 1003M
09:10:41 773 135 17 135 17 0 0 82 11 31G 31G 1001M
09:10:42 699 138 19 138 19 0 0 81 12 31G 31G 1000M
09:10:43 782 137 17 137 17 0 0 82 11 31G 31G 999M
free -m
total used free shared buff/cache available
Mem: 64246 58618 2469 882 3157 4095
Swap: 8191 2739 5452
I don't know what has changed recently, but since a week ago (or maybe two) this has been happening to me even with Borg (I haven't used Kopia since I tried it when I opened this issue). And I'm not 100% sure about this, but I think that compiling a big Java/Maven project the other day triggered it too. And it's quite frequent now, once every 2 or 3 days.
ZFS version is still 2.1.4-0ubuntu0.1. But the kernel is now Ubuntu's 5.15.0-53-generic.
$ grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size 4 761390752
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
memory_all_bytes 4 25150902272
memory_free_bytes 4 8646987776
memory_available_bytes 3 7708804480
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 22482562
arc_meta_used 4 4007114624
arc_meta_limit 4 1610612736
arc_dnode_limit 4 161061273
arc_meta_max 4 4112501104
arc_meta_min 4 16777216
arc_need_free 4 66560
arc_sys_free 4 938183296
arc_raw_size 4 871041536
$ arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
23:01:14 0 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:15 895 110 12 110 12 0 0 0 0 3.6G 2.0G 7.4G
23:01:16 64 3 4 3 4 0 0 3 4 3.6G 2.0G 7.4G
23:01:17 195 5 2 5 2 0 0 5 2 3.6G 2.0G 7.4G
23:01:18 281 5 1 5 1 0 0 4 2 3.6G 2.0G 7.4G
23:01:19 17 1 5 1 5 0 0 1 8 3.6G 2.0G 7.4G
23:01:20 339 6 1 6 1 0 0 0 0 3.6G 2.0G 7.4G
23:01:21 22 3 13 3 13 0 0 2 28 3.6G 2.0G 7.4G
23:01:22 653 7 1 7 1 0 0 5 1 3.6G 2.0G 7.4G
23:01:23 159 2 1 2 1 0 0 2 1 3.6G 2.0G 7.4G
23:01:24 341 3 0 3 0 0 0 2 0 3.6G 2.0G 7.4G
23:01:25 294 9 3 9 3 0 0 2 1 3.6G 2.0G 7.4G
23:01:26 320 1 0 1 0 0 0 1 0 3.6G 2.0G 7.4G
23:01:27 603 13 2 13 2 0 0 13 2 3.6G 2.0G 7.4G
23:01:28 410 2 0 2 0 0 0 2 0 3.6G 2.0G 7.4G
23:01:29 12 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:30 282 2 0 2 0 0 0 1 0 3.6G 2.0G 7.4G
23:01:31 15 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:32 1.2K 34 2 34 2 0 0 1 0 3.6G 2.0G 7.4G
23:01:33 154 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:34 19 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:35 57 3 5 3 5 0 0 0 0 3.6G 2.0G 7.4G
23:01:36 23 1 4 1 4 0 0 1 4 3.6G 2.0G 7.4G
23:01:37 270 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:38 227 4 1 4 1 0 0 3 1 3.6G 2.0G 7.4G
23:01:39 221 28 12 28 12 0 0 23 17 3.6G 2.0G 7.4G
23:01:40 984 13 1 11 1 2 100 13 1 3.6G 2.0G 7.4G
23:01:41 13 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:42 271 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:43 201 4 1 4 1 0 0 4 2 3.6G 2.0G 7.4G
23:01:44 271 1 0 1 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:45 14 1 7 1 7 0 0 1 9 3.6G 2.0G 7.4G
23:01:46 694 57 8 57 8 0 0 50 9 3.6G 2.0G 7.4G
23:01:47 5.0K 1 0 1 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:48 435 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:49 630 21 3 21 3 0 0 0 0 3.6G 2.0G 7.4G
23:01:50 260 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:51 100 5 5 5 5 0 0 0 0 3.6G 2.0G 7.4G
23:01:52 510 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:53 156 1 0 1 0 0 0 0 0 3.6G 2.0G 7.4G
23:01:54 68 1 1 1 1 0 0 1 2 3.6G 2.0G 7.4G
23:01:55 320 5 1 5 1 0 0 2 1 3.6G 2.0G 7.4G
23:01:56 79 6 7 6 7 0 0 4 6 3.6G 2.0G 7.4G
23:01:57 18 1 5 1 5 0 0 1 5 3.6G 2.0G 7.4G
23:01:58 841 10 1 10 1 0 0 10 1 3.6G 2.0G 7.4G
23:01:59 220 2 0 2 0 0 0 2 0 3.6G 2.0G 7.4G
23:02:00 8 2 25 2 25 0 0 1 14 3.6G 2.0G 7.4G
23:02:01 51 3 5 3 5 0 0 3 6 3.6G 2.0G 7.4G
23:02:02 271 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:02:03 9 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:02:04 578 10 1 10 1 0 0 0 0 3.6G 2.0G 7.4G
23:02:05 9 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:02:06 244 3 1 3 1 0 0 3 1 3.6G 2.0G 7.4G
23:02:07 34 3 8 3 8 0 0 3 11 3.6G 2.0G 7.4G
23:02:08 305 1 0 1 0 0 0 1 0 3.6G 2.0G 7.4G
23:02:09 189 5 2 5 2 0 0 4 2 3.6G 2.0G 7.4G
23:02:10 690 5 0 4 0 1 100 4 0 3.6G 2.0G 7.4G
23:02:11 29 2 6 2 6 0 0 0 0 3.6G 2.0G 7.4G
23:02:12 263 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:02:13 24 1 4 1 4 0 0 1 5 3.6G 2.0G 7.4G
23:02:14 443 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
23:02:15 43 0 0 0 0 0 0 0 0 3.6G 2.0G 7.4G
$ free -m
total used free shared buff/cache available
Mem: 23985 14190 7370 371 2424 9035
Swap: 12285 0 12285
$ ps aux | grep arc_
root 860 19.3 0.0 0 0 ? S 12:58 119:17 [arc_prune]
root 861 14.1 0.0 0 0 ? S 12:58 86:49 [arc_evict]
root 862 0.0 0.0 0 0 ? SN 12:58 0:00 [arc_reap]
root 866 0.0 0.0 0 0 ? S 12:58 0:02 [l2arc_feed]
The ARC metadata cache keeps going bonkers, this time using more than twice the limit, and even after several hours arc_prune and arc_evict haven't managed to bring it down. As usual, dropping the caches solves the problem temporarily.
Yep, in both cases arc_meta_used is above arc_meta_limit and, even though arc_prune is being called furiously, no memory is freed. Are you both walking/reading/writing a lot of files when the issue happens? Or did some application suddenly increase its memory requirements?
@behlendorf any idea why arc_prune is unsuccessful in releasing allocated memory, while dropping caches immediately works?
It usually happens when I start working. Sometimes it's just after I log in, sometimes after some time when I'm doing something - it might be when compiling a large project, but I'm not sure whether that's correlated.
In my case, it used to happen only (but always) when running a Kopia backup job. I decided to keep using Borg and kind of forgot about the issue.
But since a couple of weeks ago or so it's been happening while & after running Borg Backup. And I'm not sure, but I think that a big Java/Maven project compilation caused it the other day too.
At least in my case, things seem to indicate that it's high disk activity that triggers the problem.
I forgot to mention that with Borg it doesn't happen every time. Right now I have an uptime of almost a day and a half (with several suspend-to-RAM cycles). Borg has run 4 times during this time and everything is still fine:
arc_meta_used 4 1544783680
arc_meta_limit 4 1610612736
About 15 minutes ago the usage was 160....... So it's working fine, getting close to the limit and being reduced again.
And a couple of days ago it happened after just 5 or 6 hours of uptime, and I think it was the first time that Borg had run since booting up.
So it doesn't seem to be related so much to the accumulated amount of disk reads as to particularly high disk activity at a specific point in time.
My impression is that everything works as expected as long as arc_meta_used is kept below the limit. But if for any reason the usage goes above the limit (is it a hard or a soft limit?), the cache enters some weird state from which it doesn't recover until it's reset by a flush.
At least in my case, things seem to indicate that it's high disk activity that triggers the problem.
I think it is more an issue related to memory usage, both from applications and the VFS cache.
My impression is that everything works as expected as long as 'arc_meta_used' is kept below the limit
Yes, but when memory requirements become higher, arc_meta_used must be reduced, and this seems to be the key issue: sometimes arc_prune seems incapable of freeing pinned buffers, spinning furiously (with high CPU load) without freeing anything.
Looks like reducing zfs_arc_min to a value lower than zfs_arc_max made no difference:
# grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip 4 81026320001
evict_not_enough 4 20937787774
evict_l2_cached 4 0
evict_l2_eligible 4 8629143595008
evict_l2_eligible_mfu 4 4180798813696
evict_l2_eligible_mru 4 4448344781312
evict_l2_ineligible 4 3739255898112
evict_l2_skip 4 0
dnode_size 4 5741232736
anon_evictable_data 4 0
anon_evictable_metadata 4 0
mru_evictable_data 4 0
mru_evictable_metadata 4 0
mru_ghost_evictable_data 4 0
mru_ghost_evictable_metadata 4 0
mfu_evictable_data 4 0
mfu_evictable_metadata 4 0
mfu_ghost_evictable_data 4 0
mfu_ghost_evictable_metadata 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_evict_l1cached 4 0
memory_throttle_count 4 0
memory_direct_count 4 16
memory_indirect_count 4 59
memory_all_bytes 4 135014055936
memory_free_bytes 4 87351349248
memory_available_bytes 3 82779432960
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 633177776
arc_meta_used 4 19216927424
arc_meta_limit 4 17179869184
arc_dnode_limit 4 12884901888
arc_meta_max 4 24104624640
arc_meta_min 4 16777216
arc_need_free 4 0
arc_sys_free 4 4571916288
arc_raw_size 4 0
# cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
100
# cat /sys/module/zfs/parameters/zfs_arc_max
17179869184
# cat /sys/module/zfs/parameters/zfs_arc_min
8589934592
# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
16:45:06 16 0 0 0 0 0 0 0 0 18G 16G 77G
16:45:07 16 0 0 0 0 0 0 0 0 18G 16G 77G
16:45:08 3 0 0 0 0 0 0 0 0 18G 16G 77G
16:45:09 0 0 0 0 0 0 0 0 0 18G 16G 77G
^C
edit: I seem to have caught it in the "simmering" stage rather than finding it after it's boiled over - arc_prune is not pegged at 100%, it's going up and down but never dropping to zero even on an essentially idle system. So these stats reflect that.
zfs version 2.1.5, kernel version 5.10.0-60.18.0.50.oe2203.x86_64
When using rsync to back up a large amount of data, the ARC cache cannot be released, arc_prune sits at 100%, and this lasts for 50 minutes.
[root@nas ~]# ps -aux | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 971 0.0 0.0 0 0 ? D 16:45 0:00 [arc_prune]
root 992 0.0 0.0 0 0 ? DN 16:45 0:00 [dbuf_evict]
root 1048 0.0 0.0 55248 4216 ? Ssl 16:45 0:00 /usr/sbin/gssproxy -D
root 2879 0.0 0.0 13936 7504 ? Ss 16:46 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 1000-1000 startups
root 4422 0.4 0.0 0 0 ? D< 16:46 0:03 [z_rd_int_0]
root 4423 0.4 0.0 0 0 ? D< 16:46 0:03 [z_rd_int_1]
root 4754 2.4 2.9 5778352 467576 ? Sl 16:46 0:23 /usr/local/jdk/java-jdk/bin/java -Dnop -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms512m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dfile.encoding=UTF-8 -Djava.endorsed.dirs= -classpath /usr/local/nas/apache-tomcat/bin/bootstrap.jar:/usr/local/nas/apache-tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/nas/apache-tomcat -Dcatalina.home=/usr/local/nas/apache-tomcat -Djava.io.tmpdir=/usr/local/nas/apache-tomcat/temp org.apache.catalina.startup.Bootstrap start
root 4861 0.0 0.0 0 0 ? D 16:46 0:00 [txg_sync]
root 6150 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 6151 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 6152 0.1 0.0 0 0 ? D 16:46 0:01 [nfsd]
root 6155 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 6158 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 6159 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 6160 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 6161 0.0 0.0 0 0 ? D 16:46 0:00 [nfsd]
root 38275 0.3 0.0 0 0 ? D< 16:52 0:02 [z_rd_int_1]
root 38702 0.0 0.0 0 0 ? D< 16:52 0:00 [z_rd_int_0]
root 38703 0.0 0.0 0 0 ? D< 16:52 0:00 [z_rd_int_0]
root 45449 0.0 0.0 15488 6212 ? D 16:55 0:00 zfs list -r -o name,mountpoint,quota,used,avail
root 47045 0.0 0.0 12560 5532 ? D 16:56 0:00 zpool list -H pool
root 54944 0.0 0.0 12560 5592 ? D 16:59 0:00 zpool list -H pool
root 61823 0.0 0.0 12528 5640 ? D 17:01 0:00 zfs list -Ho name,mounted
root 64103 0.0 0.0 6420 2300 pts/4 S+ 17:02 0:00 grep --color=auto D
[root@nas ~]# arcstat 1
time read miss miss% dmis dm% pmis pm% mmis mm% size c avail
17:02:18 0 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
17:02:19 3 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
17:02:20 0 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
17:02:21 0 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
17:02:22 0 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
17:02:23 0 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
17:02:24 3 0 0 0 0 0 0 0 0 8.1G 7.7G 3.7G
[root@nas ~]# free -h
total used free shared buff/cache available
Mem: 15Gi 10Gi 4.1Gi 57Mi 507Mi 4.1Gi
Swap: 0B 0B 0B
sometimes avail is negative
Commenting to note that a) I am still seeing this, but as I now have a cronjob that drops caches every hour (rough sketch below) I'm dealing with it, and b) I missed this question above:
Are you both walking/reading/writing a lot of files when the issue happens? Or did some application suddenly increase its memory requirements?
Oh yes. The process that triggers this creates about half a million files in 10 minutes while reading about 2 million. I should also note that we used to do this on a more resource-limited machine with 32 cores without problems - the same job took 45 minutes, and it was fine. It's when we upgraded to a much faster machine with more cores and NVMe disks that this started to happen. A case of "things happening faster than we can deal with them"?
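The cron entry is essentially the following sketch (the file name is arbitrary; the command is just the drop-caches workaround from earlier in this thread, run hourly):
# /etc/cron.d/zfs-dropcaches (hypothetical) - hourly workaround, not a fix
0 * * * * root /bin/sh -c 'echo 3 > /proc/sys/vm/drop_caches'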
sometimes avail is negative
If possible, try updating to zfs 2.1.7 or 2.1.9 (2.1.8 has a serious issue, so skip it), as a fix for this very same issue was merged.
I set up a brand new server with arc_max at 8GB and arc_min at 8GB - 1, and I repeatedly run into this situation even with 2.1.9, so there is still something weird happening, because the ARC size goes slightly above the set max value.
I set up a brand new server with arc_max at 8GB and arc_min at 8GB - 1, and I repeatedly run into this situation even with 2.1.9, so there is still something weird happening, because the ARC size goes slightly above the set max value.
If you look at zfs_arc_overflow_shift, the ARC is allowed to go slightly above the limit for a short time. And please, if you set some tunables to weird values, don't complain if you see the ARC behaving "weird".
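To get a feel for how much slack that allows on a given box, something along these lines works (the formula is only an approximation of what arc_is_overflowing() computes, so treat it as an assumption):
shift=$(cat /sys/module/zfs/parameters/zfs_arc_overflow_shift)
c=$(awk '$1 == "c" {print $3}' /proc/spl/kstat/zfs/arcstats)
size=$(awk '$1 == "size" {print $3}' /proc/spl/kstat/zfs/arcstats)
echo "target (c):       $c"
echo "current size:     $size"
echo "allowed overflow: ~$((c >> shift)) bytes above target"
A size that stays far beyond that headroom for hours, as in several reports above, is a different problem from the brief, expected overshoot.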
ARC eviction code was completely rewritten in upcoming ZFS 2.2.
ARC eviction code was completely rewritten in upcoming ZFS 2.2.
@amotin interesting, any information to share (or to point at) regarding this change?
@shodanshok https://github.com/openzfs/zfs/pull/14359
Hi all,
Not sure if this helps, but I had the same issue on an older kernel, 5.4.203 (Proxmox), with zfs-2.0.7.
It consistently happened after I received an encrypted dataset, loaded the keys and then mounted the datasets. This in turn triggered several z_upgrade processes that started heavy I/O, which in turn set arc_prune/arc_evict running full time.
After reading this thread, I tried the drop-caches trick and it worked (it took some time), i.e. z_upgrade was still doing heavy I/O but arc_prune/arc_evict were gone.
Hope this helps.
This problem still seems to happen on zfs-2.1.13. I was waiting for 2.2.0 to be available in the RHEL dkms non-testing repository on Rocky 8, but I may have to try the testing version because I can't work when the storage is this slow.
For me this problem was related to a misconfiguration of locate (its updatedb run). After fixing it (uninstalling a duplicated dependency) things started to behave well.
Linux 6.5.9 with ZFS 2.2.0
Saw arc_prune and arc_evict showing up in my HUD for CPU usage tonight and remembered echoing 0 into /sys/module/zfs/parameters/zfs_arc_max earlier today, after changing the size a few times. I was having issues with audio 'popping' with latency for an audio task which required real-time behaviour. I noticed 6.8/32GB of memory used by my system with an uptime of about 12 hours, which made me wonder if ARC was working at all (given ARC usually keeps my used memory for any given day closer to 25GB+ used).
I echoed 16GB (16*1024*1024*1024) into /sys/module/zfs/parameters/zfs_arc_max and then 0 again, and the eviction process stopped, and so did the stuttering I was experiencing.
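In other words, roughly the following (same parameter path as elsewhere in this thread; 16 GiB was just the value I picked):
# set an explicit 16 GiB cap, then write 0 to go back to the built-in default sizing
echo $((16*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max
echo 0 > /sys/module/zfs/parameters/zfs_arc_max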
System information
Describe the problem you're observing
I've been using ZFS for the root fs of my desktop, with ARC limited to 2GB, for several months without any issues until now. When I run disk-intensive tasks, like borg-backup, duc, find, etc., I can see arc_prune and arc_evict working intermittently, with peaks of about 15% of a thread each, and they manage to keep the ARC within its limits:
And once the tasks are finished they both go to sleep. All normal so far.
But yesterday I decided to try another backup tool, Kopia, and I don't know what it does while doing a backup that makes the ARC go out of control. Both arc_prune and arc_evict start using 100% of a CPU thread each. And despite that, they don't manage to keep the memory within the limits, only reaching some kind of balance at around 3.6GB.
But even after Kopia has finished, or I have aborted it, the problem keeps going on indefinitely, even though there's no process disk activity anymore (iostat and my system's disk LED both show some continuous activity though, so it seems it's not just CPU that they're using).
NOTES:
echo 3 > /proc/sys/vm/drop_caches stops it, until I run a backup again.
Describe how to reproduce the problem
Run a Kopia backup of a ZFS filesystem with many files and a low(ish) memory limit for ARC.
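A minimal reproduction sketch (assumptions: the kopia snapshot subcommand and the paths below are illustrative - any dataset with lots of small files should do):
# cap the ARC low, as in my setup, then run a file-heavy backup and watch arc_prune/arc_evict
echo $((2*1024*1024*1024)) > /sys/module/zfs/parameters/zfs_arc_max   # ~2 GiB
kopia snapshot create /home                                           # assumed kopia CLI
top -b -n 1 | grep -E 'arc_(prune|evict)'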