openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

arc_prune and arc_evict at 100% even with no disk activity #14005

Open luismcv opened 2 years ago

luismcv commented 2 years ago

System information

Type Version/Name
Distribution Name Ubuntu
Distribution Version 22.04.1
Kernel Version 5.15.0-48-generic
Architecture x86_64
OpenZFS Version zfs-2.1.4-0ubuntu0.1 / zfs-kmod-2.1.4-0ubuntu0.1

Describe the problem you're observing

I've been using ZFS for the root fs of my desktop, with the ARC limited to 2GB, for several months without any issues until now. When I run disk-intensive tasks like borg-backup, duc, find, etc., I can see arc_prune and arc_evict working intermittently, each peaking at about 15% of a CPU thread, and they manage to keep the ARC within its limits:

$ rg "dnode|arc_meta" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    338218432
arc_meta_used                   4    2096995632
arc_meta_limit                  4    2147483648
arc_dnode_limit                 4    1610612736
arc_meta_max                    4    3871211136
arc_meta_min                    4    16777216

And once the tasks are finished they both go to sleep. All normal so far.
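
For reference, the 2GB cap is set with the usual module option, along these lines (a sketch of the approach rather than my literal config; the value is in bytes):

options zfs zfs_arc_max=2147483648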

But yesterday I decided to try another backup tool, Kopia, and I don't know what it does during a backup, but it makes the ARC go out of control. Both arc_prune and arc_evict start using 100% of a CPU thread each, and despite that they don't manage to keep memory within the limits, only reaching some kind of balance at around 3.6GB:

$ rg "dnode|arc_meta" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    635616064
arc_meta_used                   4    3652514272
arc_meta_limit                  4    2147483648
arc_dnode_limit                 4    1610612736
arc_meta_max                    4    3871211136
arc_meta_min                    4    16777216

But even after Kopia has finished, or I have aborted it, the problem goes on indefinitely, even though no process is generating disk activity anymore (iostat and my system's disk LED both show some continuous activity, though, so it seems it's not just CPU they're using).

Describe how to reproduce the problem

Run a Kopia backup of a ZFS filesystem with many files and a low(ish) memory limit for ARC.
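
While it runs, a minimal sketch for watching the two kernel threads and the metadata counters together (assuming top is available):

# poll the metadata counters and the arc_prune/arc_evict threads once a second
while sleep 1; do
    grep -E 'arc_meta_used|arc_meta_limit|dnode_size' /proc/spl/kstat/zfs/arcstats
    top -b -n 1 | grep -E 'arc_(prune|evict)'
done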

faceless2 commented 2 years ago

I am also seeing this.

I'd struggled with the issue from #9966 as well for months, even after setting these options:

options zfs zfs_arc_min=17179869184
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_meta_limit_percent=100
options zfs zfs_arc_dnode_limit_percent=75
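
(These take effect at module load - on a root-on-ZFS box that may also mean regenerating the initramfs - but for experimenting, the two size limits can also be written at runtime; a sketch, values in bytes:)

echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_min
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max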

After upgrading the server (AMD, 96 cores, 128GB RAM) from 20.04 to 22.04 (and thus ZFS to 2.1.4) and reverting all these options to their defaults, the issue of having 96 arc_prune processes running flat out is gone - I now have one arc_prune process running flat out. Don't get me wrong, that's a huge improvement - I can at least log in and get diagnostics. But it's not there yet.
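
(Possibly relevant: the tunables dump in my next comment shows zfs_arc_prune_task_threads set to 1, which would line up with going from 96 pruning threads to one. It can be read back directly:)

cat /sys/module/zfs/parameters/zfs_arc_prune_task_threads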

Here's cat /proc/spl/kstat/zfs/arcstats | egrep '(dnode|arc_meta)' - very similar to the OP's. In our case we're running a large regression test which creates tens of thousands of files over a few minutes. It doesn't happen every time, despite an almost identical workload.

dnode_size                      4    6383114272
arc_meta_used                   4    21562042592
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    22070295584
arc_meta_min                    4    16777216

I've now tried reapplying the zfs_arc_min and zfs_arc_max options above and will keep an eye on it - next time I'll post arc_summary output before dropping caches, but if there's any other diagnostic info that would help please let me know.

faceless2 commented 2 years ago

Back again - and I can confirm that forcing zfs_arc_min and zfs_arc_max to 16G has no impact. Here's the output of arc_summary:

------------------------------------------------------------------------
ZFS Subsystem Report                            Fri Oct 21 10:41:30 2022
Linux 5.15.0-50-generic                                 2.1.4-0ubuntu0.1
Machine: nova (x86_64)                                  2.1.4-0ubuntu0.1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                   125.2 %   20.0 GiB
        Target size (adaptive):                       100.0 %   16.0 GiB
        Min size (hard limit):                        100.0 %   16.0 GiB
        Max size (high water):                            1:1   16.0 GiB
        Most Frequently Used (MFU) cache size:         77.4 %    6.4 GiB
        Most Recently Used (MRU) cache size:           22.6 %    1.9 GiB
        Metadata cache size (hard limit):             100.0 %   16.0 GiB
        Metadata cache size (current):                124.2 %   19.9 GiB
        Dnode cache size (hard limit):                 75.0 %   12.0 GiB
        Dnode cache size (current):                    48.9 %    5.9 GiB

ARC hash breakdown:
        Elements max:                                               2.7M
        Elements current:                              15.1 %     413.8k
        Collisions:                                                 3.1M
        Chain max:                                                     5
        Chains:                                                     5.0k

ARC misc:
        Deleted:                                                   39.4M
        Mutex misses:                                              43.8M
        Eviction skips:                                            10.6G
        Eviction skips due to L2 writes:                               0
        L2 cached evictions:                                     0 Bytes
        L2 eligible evictions:                                   1.1 TiB
        L2 eligible MFU evictions:                     42.6 %  474.6 GiB
        L2 eligible MRU evictions:                     57.4 %  638.9 GiB
        L2 ineligible evictions:                               406.3 GiB

ARC total accesses (hits + misses):                                 4.5G
        Cache hit ratio:                               99.2 %       4.5G
        Cache miss ratio:                               0.8 %      34.6M
        Actual hit ratio (MFU + MRU hits):             99.2 %       4.5G
        Data demand efficiency:                        95.0 %     352.0M
        Data prefetch efficiency:                       0.8 %       4.8M

Cache hits by cache type:
        Most frequently used (MFU):                    94.0 %       4.2G
        Most recently used (MRU):                       5.9 %     265.2M
        Most frequently used (MFU) ghost:             < 0.1 %       1.9M
        Most recently used (MRU) ghost:                 0.1 %       4.5M

Cache hits by data type:
        Demand data:                                    7.5 %     334.6M
        Demand prefetch data:                         < 0.1 %      37.9k
        Demand metadata:                               92.4 %       4.1G
        Demand prefetch metadata:                       0.1 %       4.2M

Cache misses by data type:
        Demand data:                                   50.4 %      17.4M
        Demand prefetch data:                          13.8 %       4.8M
        Demand metadata:                               24.1 %       8.4M
        Demand prefetch metadata:                      11.7 %       4.1M

DMU prefetch efficiency:                                           75.5M
        Hit ratio:                                      8.3 %       6.3M
        Miss ratio:                                    91.7 %      69.2M

L2ARC not detected, skipping section

Solaris Porting Layer (SPL):
        spl_hostid                                                     0
        spl_hostid_path                                      /etc/hostid
        spl_kmem_alloc_max                                       1048576
        spl_kmem_alloc_warn                                        65536
        spl_kmem_cache_kmem_threads                                    4
        spl_kmem_cache_magazine_size                                   0
        spl_kmem_cache_max_size                                       32
        spl_kmem_cache_obj_per_slab                                    8
        spl_kmem_cache_reclaim                                         0
        spl_kmem_cache_slab_limit                                  16384
        spl_max_show_tasks                                           512
        spl_panic_halt                                                 0
        spl_schedule_hrtimeout_slack_us                                0
        spl_taskq_kick                                                 0
        spl_taskq_thread_bind                                          0
        spl_taskq_thread_dynamic                                       1
        spl_taskq_thread_priority                                      1
        spl_taskq_thread_sequential                                    4

Tunables:
        dbuf_cache_hiwater_pct                                        10
        dbuf_cache_lowater_pct                                        10
        dbuf_cache_max_bytes                        18446744073709551615
        dbuf_cache_shift                                               5
        dbuf_metadata_cache_max_bytes               18446744073709551615
        dbuf_metadata_cache_shift                                      6
        dmu_object_alloc_chunk_shift                                   7
        dmu_prefetch_max                                       134217728
        ignore_hole_birth                                              1
        l2arc_feed_again                                               1
        l2arc_feed_min_ms                                            200
        l2arc_feed_secs                                                1
        l2arc_headroom                                                 2
        l2arc_headroom_boost                                         200
        l2arc_meta_percent                                            33
        l2arc_mfuonly                                                  0
        l2arc_noprefetch                                               1
        l2arc_norw                                                     0
        l2arc_rebuild_blocks_min_l2size                       1073741824
        l2arc_rebuild_enabled                                          1
        l2arc_trim_ahead                                               0
        l2arc_write_boost                                        8388608
        l2arc_write_max                                          8388608
        metaslab_aliquot                                          524288
        metaslab_bias_enabled                                          1
        metaslab_debug_load                                            0
        metaslab_debug_unload                                          0
        metaslab_df_max_search                                  16777216
        metaslab_df_use_largest_segment                                0
        metaslab_force_ganging                                  16777217
        metaslab_fragmentation_factor_enabled                          1
        metaslab_lba_weighting_enabled                                 1
        metaslab_preload_enabled                                       1
        metaslab_unload_delay                                         32
        metaslab_unload_delay_ms                                  600000
        send_holes_without_birth_time                                  1
        spa_asize_inflation                                           24
        spa_config_path                             /etc/zfs/zpool.cache
        spa_load_print_vdev_tree                                       0
        spa_load_verify_data                                           1
        spa_load_verify_metadata                                       1
        spa_load_verify_shift                                          4
        spa_slop_shift                                                 5
        vdev_file_logical_ashift                                       9
        vdev_file_physical_ashift                                      9
        vdev_removal_max_span                                      32768
        vdev_validate_skip                                             0
        zap_iterate_prefetch                                           1
        zfetch_array_rd_sz                                       1048576
        zfetch_max_distance                                      8388608
        zfetch_max_idistance                                    67108864
        zfetch_max_streams                                             8
        zfetch_min_sec_reap                                            2
        zfs_abd_scatter_enabled                                        1
        zfs_abd_scatter_max_order                                     10
        zfs_abd_scatter_min_size                                    1536
        zfs_admin_snapshot                                             0
        zfs_allow_redacted_dataset_mount                               0
        zfs_arc_average_blocksize                                   8192
        zfs_arc_dnode_limit                                            0
        zfs_arc_dnode_limit_percent                                   75
        zfs_arc_dnode_reduce_percent                                  10
        zfs_arc_evict_batch_limit                                     10
        zfs_arc_eviction_pct                                         200
        zfs_arc_grow_retry                                             0
        zfs_arc_lotsfree_percent                                      10
        zfs_arc_max                                          17179869184
        zfs_arc_meta_adjust_restarts                                4096
        zfs_arc_meta_limit                                             0
        zfs_arc_meta_limit_percent                                   100
        zfs_arc_meta_min                                               0
        zfs_arc_meta_prune                                         10000
        zfs_arc_meta_strategy                                          1
        zfs_arc_min                                          17179869184
        zfs_arc_min_prefetch_ms                                        0
        zfs_arc_min_prescient_prefetch_ms                              0
        zfs_arc_p_dampener_disable                                     1
        zfs_arc_p_min_shift                                            0
        zfs_arc_pc_percent                                             0
        zfs_arc_prune_task_threads                                     1
        zfs_arc_shrink_shift                                           0
        zfs_arc_shrinker_limit                                     10000
        zfs_arc_sys_free                                               0
        zfs_async_block_max_blocks                  18446744073709551615
        zfs_autoimport_disable                                         1
        zfs_checksum_events_per_second                                20
        zfs_commit_timeout_pct                                         5
        zfs_compressed_arc_enabled                                     1
        zfs_condense_indirect_commit_entry_delay_ms                    0
        zfs_condense_indirect_obsolete_pct                            25
        zfs_condense_indirect_vdevs_enable                             1
        zfs_condense_max_obsolete_bytes                       1073741824
        zfs_condense_min_mapping_bytes                            131072
        zfs_dbgmsg_enable                                              1
        zfs_dbgmsg_maxsize                                       4194304
        zfs_dbuf_state_index                                           0
        zfs_ddt_data_is_special                                        1
        zfs_deadman_checktime_ms                                   60000
        zfs_deadman_enabled                                            1
        zfs_deadman_failmode                                        wait
        zfs_deadman_synctime_ms                                   600000
        zfs_deadman_ziotime_ms                                    300000
        zfs_dedup_prefetch                                             0
        zfs_delay_min_dirty_percent                                   60
        zfs_delay_scale                                           500000
        zfs_delete_blocks                                          20480
        zfs_dirty_data_max                                    4294967296
        zfs_dirty_data_max_max                                4294967296
        zfs_dirty_data_max_max_percent                                25
        zfs_dirty_data_max_percent                                    10
        zfs_dirty_data_sync_percent                                   20
        zfs_disable_ivset_guid_check                                   0
        zfs_dmu_offset_next_sync                                       1
        zfs_embedded_slog_min_ms                                      64
        zfs_expire_snapshot                                          300
        zfs_fallocate_reserve_percent                                110
        zfs_flags                                                      0
        zfs_free_bpobj_enabled                                         1
        zfs_free_leak_on_eio                                           0
        zfs_free_min_time_ms                                        1000
        zfs_history_output_max                                   1048576
        zfs_immediate_write_sz                                     32768
        zfs_initialize_chunk_size                                1048576
        zfs_initialize_value                        16045690984833335022
        zfs_keep_log_spacemaps_at_export                               0
        zfs_key_max_salt_uses                                  400000000
        zfs_livelist_condense_new_alloc                                0
        zfs_livelist_condense_sync_cancel                              0
        zfs_livelist_condense_sync_pause                               0
        zfs_livelist_condense_zthr_cancel                              0
        zfs_livelist_condense_zthr_pause                               0
        zfs_livelist_max_entries                                  500000
        zfs_livelist_min_percent_shared                               75
        zfs_lua_max_instrlimit                                 100000000
        zfs_lua_max_memlimit                                   104857600
        zfs_max_async_dedup_frees                                 100000
        zfs_max_log_walking                                            5
        zfs_max_logsm_summary_length                                  10
        zfs_max_missing_tvds                                           0
        zfs_max_nvlist_src_size                                        0
        zfs_max_recordsize                                       1048576
        zfs_metaslab_find_max_tries                                  100
        zfs_metaslab_fragmentation_threshold                          70
        zfs_metaslab_max_size_cache_sec                             3600
        zfs_metaslab_mem_limit                                        25
        zfs_metaslab_segment_weight_enabled                            1
        zfs_metaslab_switch_threshold                                  2
        zfs_metaslab_try_hard_before_gang                              0
        zfs_mg_fragmentation_threshold                                95
        zfs_mg_noalloc_threshold                                       0
        zfs_min_metaslabs_to_flush                                     1
        zfs_multihost_fail_intervals                                  10
        zfs_multihost_history                                          0
        zfs_multihost_import_intervals                                20
        zfs_multihost_interval                                      1000
        zfs_multilist_num_sublists                                     0
        zfs_no_scrub_io                                                0
        zfs_no_scrub_prefetch                                          0
        zfs_nocacheflush                                               0
        zfs_nopwrite_enabled                                           1
        zfs_object_mutex_size                                         64
        zfs_obsolete_min_time_ms                                     500
        zfs_override_estimate_recordsize                               0
        zfs_pd_bytes_max                                        52428800
        zfs_per_txg_dirty_frees_percent                                5
        zfs_prefetch_disable                                           0
        zfs_read_history                                               0
        zfs_read_history_hits                                          0
        zfs_rebuild_max_segment                                  1048576
        zfs_rebuild_scrub_enabled                                      1
        zfs_rebuild_vdev_limit                                  33554432
        zfs_reconstruct_indirect_combinations_max                   4096
        zfs_recover                                                    0
        zfs_recv_queue_ff                                             20
        zfs_recv_queue_length                                   16777216
        zfs_recv_write_batch_size                                1048576
        zfs_removal_ignore_errors                                      0
        zfs_removal_suspend_progress                                   0
        zfs_remove_max_segment                                  16777216
        zfs_resilver_disable_defer                                     0
        zfs_resilver_min_time_ms                                    3000
        zfs_scan_checkpoint_intval                                  7200
        zfs_scan_fill_weight                                           3
        zfs_scan_ignore_errors                                         0
        zfs_scan_issue_strategy                                        0
        zfs_scan_legacy                                                0
        zfs_scan_max_ext_gap                                     2097152
        zfs_scan_mem_lim_fact                                         20
        zfs_scan_mem_lim_soft_fact                                    20
        zfs_scan_strict_mem_lim                                        0
        zfs_scan_suspend_progress                                      0
        zfs_scan_vdev_limit                                      4194304
        zfs_scrub_min_time_ms                                       1000
        zfs_send_corrupt_data                                          0
        zfs_send_no_prefetch_queue_ff                                 20
        zfs_send_no_prefetch_queue_length                        1048576
        zfs_send_queue_ff                                             20
        zfs_send_queue_length                                   16777216
        zfs_send_unmodified_spill_blocks                               1
        zfs_slow_io_events_per_second                                 20
        zfs_spa_discard_memory_limit                            16777216
        zfs_special_class_metadata_reserve_pct                        25
        zfs_sync_pass_deferred_free                                    2
        zfs_sync_pass_dont_compress                                    8
        zfs_sync_pass_rewrite                                          2
        zfs_sync_taskq_batch_pct                                      75
        zfs_traverse_indirect_prefetch_limit                          32
        zfs_trim_extent_bytes_max                              134217728
        zfs_trim_extent_bytes_min                                  32768
        zfs_trim_metaslab_skip                                         0
        zfs_trim_queue_limit                                          10
        zfs_trim_txg_batch                                            32
        zfs_txg_history                                              100
        zfs_txg_timeout                                                5
        zfs_unflushed_log_block_max                               262144
        zfs_unflushed_log_block_min                                 1000
        zfs_unflushed_log_block_pct                                  400
        zfs_unflushed_max_mem_amt                             1073741824
        zfs_unflushed_max_mem_ppm                                   1000
        zfs_unlink_suspend_progress                                    0
        zfs_user_indirect_is_special                                   1
        zfs_vdev_aggregate_trim                                        0
        zfs_vdev_aggregation_limit                               1048576
        zfs_vdev_aggregation_limit_non_rotating                   131072
        zfs_vdev_async_read_max_active                                 3
        zfs_vdev_async_read_min_active                                 1
        zfs_vdev_async_write_active_max_dirty_percent                 60
        zfs_vdev_async_write_active_min_dirty_percent                 30
        zfs_vdev_async_write_max_active                               10
        zfs_vdev_async_write_min_active                                2
        zfs_vdev_cache_bshift                                         16
        zfs_vdev_cache_max                                         16384
        zfs_vdev_cache_size                                            0
        zfs_vdev_default_ms_count                                    200
        zfs_vdev_default_ms_shift                                     29
        zfs_vdev_initializing_max_active                               1
        zfs_vdev_initializing_min_active                               1
        zfs_vdev_max_active                                         1000
        zfs_vdev_max_auto_ashift                                      16
        zfs_vdev_min_auto_ashift                                       9
        zfs_vdev_min_ms_count                                         16
        zfs_vdev_mirror_non_rotating_inc                               0
        zfs_vdev_mirror_non_rotating_seek_inc                          1
        zfs_vdev_mirror_rotating_inc                                   0
        zfs_vdev_mirror_rotating_seek_inc                              5
        zfs_vdev_mirror_rotating_seek_offset                     1048576
        zfs_vdev_ms_count_limit                                   131072
        zfs_vdev_nia_credit                                            5
        zfs_vdev_nia_delay                                             5
        zfs_vdev_queue_depth_pct                                    1000
        zfs_vdev_raidz_impl cycle [fastest] original scalar sse2 ssse3 avx2
        zfs_vdev_read_gap_limit                                    32768
        zfs_vdev_rebuild_max_active                                    3
        zfs_vdev_rebuild_min_active                                    1
        zfs_vdev_removal_max_active                                    2
        zfs_vdev_removal_min_active                                    1
        zfs_vdev_scheduler                                        unused
        zfs_vdev_scrub_max_active                                      3
        zfs_vdev_scrub_min_active                                      1
        zfs_vdev_sync_read_max_active                                 10
        zfs_vdev_sync_read_min_active                                 10
        zfs_vdev_sync_write_max_active                                10
        zfs_vdev_sync_write_min_active                                10
        zfs_vdev_trim_max_active                                       2
        zfs_vdev_trim_min_active                                       1
        zfs_vdev_write_gap_limit                                    4096
        zfs_vnops_read_chunk_size                                1048576
        zfs_zevent_len_max                                           512
        zfs_zevent_retain_expire_secs                                900
        zfs_zevent_retain_max                                       2000
        zfs_zil_clean_taskq_maxalloc                             1048576
        zfs_zil_clean_taskq_minalloc                                1024
        zfs_zil_clean_taskq_nthr_pct                                 100
        zil_maxblocksize                                          131072
        zil_nocacheflush                                               0
        zil_replay_disable                                             0
        zil_slog_bulk                                             786432
        zio_deadman_log_all                                            0
        zio_dva_throttle_enabled                                       1
        zio_requeue_io_start_cut_in_line                               1
        zio_slow_io_ms                                             30000
        zio_taskq_batch_pct                                           80
        zio_taskq_batch_tpq                                            0
        zvol_inhibit_dev                                               0
        zvol_major                                                   230
        zvol_max_discard_blocks                                    16384
        zvol_prefetch_bytes                                       131072
        zvol_request_sync                                              0
        zvol_threads                                                  32
        zvol_volmode                                                   1

VDEV cache disabled, skipping section

ZIL committed transactions:                                         5.4M
        Commit requests:                                          415.4k
        Flushes to stable storage:                                415.4k
        Transactions to SLOG storage pool:            0 Bytes          0
        Transactions to non-SLOG storage pool:       19.6 GiB     551.6k

faceless2 commented 2 years ago

A few days later, it happens again. I'm attaching the final 4 hours or so of /proc/spl/kstat/zfs/dbgmsg - the log was fairly quiet until the overnight jobs kicked in; the attached 500 lines all come from those last four hours.

dbgmsg.txt

shodanshok commented 2 years ago

When it happens again, can you show the output of the following commands?

grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats arcstat 1 free -m

faceless2 commented 2 years ago

With pleasure! I've caught it in the act for once, so first here's the output of those commands while the machine is genuinely busy:

# grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    5282924768
memory_throttle_count           4    0
memory_direct_count             4    8
memory_indirect_count           4    4
memory_all_bytes                4    135014055936
memory_free_bytes               4    7666941952
memory_available_bytes          3    3095025664
arc_no_grow                     4    0
arc_tempreserve                 4    9214
arc_loaned_bytes                4    0
arc_prune                       4    213098808
arc_meta_used                   4    18098941232
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    2048
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
18:52:08   10K    26      0    26    0     0    0    23    0   17G   16G   2.3G
18:52:09  1.3M  3.1K      0  3.1K    0     0    0  1.6K    0   17G   16G   2.5G
18:52:10  1.6M  2.8K      0  2.8K    0     0    0  1.4K    0   17G   16G   2.5G
18:52:11  740K  4.0K      0  4.0K    0     0    0  2.0K    0   17G   16G   2.5G
18:52:12  184K  4.3K      2  4.3K    2     0    0  2.1K    4   17G   16G   2.5G
18:52:13  188K  5.0K      2  5.0K    2     0    0  2.6K    5   17G   16G   2.4G
18:52:14  217K  6.9K      3  6.9K    3     0    0  3.2K    6   17G   16G   2.4G
18:52:15  181K  5.6K      3  5.6K    3     0    0  2.3K    4   17G   16G   2.4G
18:52:16  163K  4.1K      2  4.1K    2     7  100  2.1K    5   17G   16G   2.4G
18:52:17  1.3M  5.4K      0  5.4K    0     0    0  2.9K    0   17G   16G   2.4G
18:52:18  1.4M  4.8K      0  4.8K    0     0    0  2.4K    0   17G   16G   2.3G
18:52:19  1.5M  6.8K      0  6.8K    0     0    0  3.6K    0   17G   16G   2.3G
18:52:20  1.8M  2.4K      0  2.4K    0     0    0  1.3K    0   17G   16G   2.3G
18:52:21  1.6M  5.4K      0  5.4K    0     0    0  2.9K    0   17G   16G   2.2G
18:52:22  1.4M  6.0K      0  6.0K    0     0    0  3.1K    0   17G   16G   2.2G
18:52:23  1.5M  5.5K      0  5.5K    0     2  100  2.9K    0   17G   16G   2.1G
18:52:24  246K  1.3K      0  1.1K    0   136   97   711    0   17G   16G   2.1G
18:52:25   15K   619      4   556    3    63   95   357    6   17G   16G   2.1G
18:52:26   14K   696      4   621    4    75   93   397    7   17G   16G   2.2G
18:52:27   13K   476      3   436    3    40   95   281    6   17G   16G   2.2G
18:52:28   25K  1.1K      4   815    3   278   98   704    9   17G   16G   2.1G
18:52:29  181K   523      0   497    0    26   92   297    0   17G   16G   2.1G
18:52:30  619K  1.3K      0  1.2K    0   108   96   723    0   17G   16G   2.1G
18:52:31   23K   698      2   641    2    57   98   383    6   17G   16G   2.1G
18:52:32   28K   826      2   752    2    74   97   470    5   17G   16G   2.1G
18:52:33   17K   545      3   517    2    28   96   268    4   17G   16G   2.1G
18:52:34   53K  1.2K      2  1.1K    2   115   99   653    5   17G   16G   2.1G
... etc ...
18:53:28   17K  1.4K      7  1.4K    7     0    0   350    3   18G   16G   1.4G
... etc ...
18:54:58  1.2M   140      0   101    0    39   97    91    0   18G   16G   1.5G
... etc.... 
18:55:23  3.4M    34      0    29    0     5   83    21    0   18G   16G   1.4G
... etc ...
18:55:45  2.4M   856      0   854    0     2  100    21    0   18G   16G   1.3G
^C

# free -m
               total        used        free      shared  buff/cache   available
Mem:          128759      121865        5403          27        1491        5532
Swap:              0           0           0

Then I killed the application process which was using all the memory and hammering the disk - normally I don't spot the issue until the machine has been like this for a while, so this is a new step. arc_prune and arc_evict don't immediately rocket up to 100% - they do hit that level after a few seconds, but then fall back to just 1-2%. Now I see:

grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    5625136000
memory_throttle_count           4    0
memory_direct_count             4    8
memory_indirect_count           4    4
memory_all_bytes                4    135014055936
memory_free_bytes               4    88923025408
memory_available_bytes          3    84351109120
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    213477638
arc_meta_used                   4    19282122336
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# arcstat 1                                              
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
19:02:49     0     0      0     0    0     0    0     0    0   18G   16G    78G
19:02:50    23     0      0     0    0     0    0     0    0   18G   16G    78G
19:02:51   152     6      3     6    3     0    0     6    4   18G   16G    78G
19:02:52    28     0      0     0    0     0    0     0    0   18G   16G    78G
19:02:53    20     0      0     0    0     0    0     0    0   18G   16G    78G
^C (no obvious variation in this.)

# free -m
               total        used        free      shared  buff/cache   available
Mem:          128759       42694       84603          27        1461       84717
Swap:              0           0           0

The machine has no load running; several minutes pass. arc_prune and arc_evict are intermittently busy - they go up to 100% in top, then back to 1%. It looks like jobs which normally have very little impact are triggering this, e.g. just opening a new SSH connection or running apt update. When I run that in another terminal, here's what I see while it's running:

# grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    5548859136
memory_throttle_count           4    0
memory_direct_count             4    8
memory_indirect_count           4    4
memory_all_bytes                4    135014055936
memory_free_bytes               4    89338064896
memory_available_bytes          3    84766148608
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    213694035
arc_meta_used                   4    18778333536
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
19:12:17     8     0      0     0    0     0    0     0    0   18G   16G    78G
19:12:18    24     1      4     1    4     0    0     1    4   18G   16G    78G
19:12:19   155     0      0     0    0     0    0     0    0   18G   16G    78G
19:12:20    23     0      0     0    0     0    0     0    0   18G   16G    78G
19:12:21    13     0      0     0    0     0    0     0    0   18G   16G    78G
19:12:22     0     0      0     0    0     0    0     0    0   18G   16G    78G
19:12:23  7.2K    21      0    20    0     1  100     7    0   18G   16G    78G
19:12:24  9.5K   297      3   159    1   138  100     1    0   17G   16G    78G
19:12:25   12K     2      0     2    0     0    0     0    0   17G   16G    78G
19:12:26    40     2      5     2    5     0    0     2    5   17G   16G    78G
19:12:27     8     0      0     0    0     0    0     0    0   17G   16G    78G
19:12:28     6     0      0     0    0     0    0     0    0   17G   16G    78G
19:12:29   195     2      1     2    1     0    0     2    1   18G   16G    78G
^C

# free -m
               total        used        free      shared  buff/cache   available
Mem:          128759       42337       84914          27        1506       85051
Swap:              0           0           0

I don't know what I'm looking at with all this, but it feels like because I killed the job early the ZFS issue is now... simmering. The slightest disk activity causes the problem to surface.

I need the machine back, so I drop caches, and we have:

# grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    1878924384
memory_throttle_count           4    0
memory_direct_count             4    8
memory_indirect_count           4    4
memory_all_bytes                4    135014055936
memory_free_bytes               4    99754872832
memory_available_bytes          3    95182956544
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    213744942
arc_meta_used                   4    7351729280
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# arcstat 1                        
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
19:15:32     0     0      0     0    0     0    0     0    0  6.9G   16G    88G
19:15:33     0     0      0     0    0     0    0     0    0  6.9G   16G    88G
19:15:34   161     0      0     0    0     0    0     0    0  6.9G   16G    88G
^C

# free -m
               total        used        free      shared  buff/cache   available
Mem:          128759       33112       95045          27         601       94729
Swap:              0           0           0

and arc_prune and arc_evict are no longer to be seen in top.

I'll generate the same output the next time I fail to catch the job in time...

shodanshok commented 2 years ago

Thanks for the report. The key point is that, until you dropped caches, your metadata usage was over the limit (arc_meta_used > arc_meta_limit), so arc_prune was furiously trying to free some metadata for eviction, but without success. I don't know why arc_prune was unable to free the required metadata, but the next time it happens try issuing echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
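
A quick sketch to quantify the overflow from just those two counters:

# difference between arc_meta_used and arc_meta_limit, in GiB
awk '/^arc_meta_used/ {u=$3} /^arc_meta_limit/ {l=$3} END {printf "over limit by %.2f GiB\n", (u-l)/2^30}' /proc/spl/kstat/zfs/arcstats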

For the next report, can you grep the following: grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats

faceless2 commented 2 years ago

Back again - this time I did not catch it in time. More diagnostics:

# grep "arc_\|memory\|overflow\|evict\|dnode"  /proc/spl/kstat/zfs/arcstats
evict_skip                      4    42489272488
evict_not_enough                4    12413470777
evict_l2_cached                 4    0
evict_l2_eligible               4    5648382219264
evict_l2_eligible_mfu           4    2713057107968
evict_l2_eligible_mru           4    2935325111296
evict_l2_ineligible             4    2005483851776
evict_l2_skip                   4    0
dnode_size                      4    5139755360
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_evictable_data              4    0
mru_evictable_metadata          4    4096
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
memory_throttle_count           4    0
memory_direct_count             4    15
memory_indirect_count           4    58
memory_all_bytes                4    135014055936
memory_free_bytes               4    90148765696
memory_available_bytes          3    85576849408
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    381556707
arc_meta_used                   4    17288867392
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    36544
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent

# grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip                      4    42504492992
evict_not_enough                4    12424903898
evict_l2_cached                 4    0
evict_l2_eligible               4    5648413739008
evict_l2_eligible_mfu           4    2713057716224
evict_l2_eligible_mru           4    2935356022784
evict_l2_ineligible             4    2005635026944
evict_l2_skip                   4    0
dnode_size                      4    5148812320
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_evictable_data              4    0
mru_evictable_metadata          4    8192
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
memory_throttle_count           4    0
memory_direct_count             4    15
memory_indirect_count           4    58
memory_all_bytes                4    135014055936
memory_free_bytes               4    90145370112
memory_available_bytes          3    85573453824
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    381992117
arc_meta_used                   4    17349396480
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    73728
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
08:28:57    60     9     15     9   15     0    0     8   15   16G   16G    79G
08:28:58  4.9K   704     14   613   12    91   94   701   14   16G   16G    79G
08:28:59  1.4K    37      2    10    0    27   62    34    2   16G   16G    79G
08:29:00  1.5K   138      8    59    4    79   84   137    8   16G   16G    79G
^C

Setting zfs_arc_meta_limit_percent to 100 made no difference; arc_evict and arc_prune were still pegged at 100% for a few minutes afterwards. So I dropped caches:

# echo 3 > /proc/sys/vm/drop_caches

# grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip                      4    42545416959
evict_not_enough                4    12443886776
evict_l2_cached                 4    0
evict_l2_eligible               4    5649859763200
evict_l2_eligible_mfu           4    2713251620864
evict_l2_eligible_mru           4    2936608142336
evict_l2_ineligible             4    2006184136704
evict_l2_skip                   4    0
dnode_size                      4    1140962688
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_evictable_data              4    281744384
mru_evictable_metadata          4    424766464
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_evictable_data              4    209156096
mfu_evictable_metadata          4    829720064
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
memory_throttle_count           4    0
memory_direct_count             4    16
memory_indirect_count           4    58
memory_all_bytes                4    135014055936
memory_free_bytes               4    103376101376
memory_available_bytes          3    98804185088
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    382636215
arc_meta_used                   4    6049864784
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

coot commented 1 year ago

This happens to me daily, usually after suspending the system to RAM. Changing zfs_arc_meta_limit_percent to 100 doesn't make a difference; dropping all caches with echo 3 > /proc/sys/vm/drop_caches helps. If more statistics are needed, I can post them the next time it happens. I also observed arc_meta_used >> arc_meta_limit.
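
A possibly gentler variant I haven't verified: dropping only the slab caches (the dentries/inodes that pin ZFS metadata) instead of everything:

echo 2 > /proc/sys/vm/drop_caches   # 2 = slab only; 3 = slab + page cache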

shodanshok commented 1 year ago

@faceless2 arc_meta_limit, as shown in your post, is identical both before and after setting zfs_arc_meta_limit_percent. Had you already set it to 100 before? Have you changed zfs_arc_meta_limit (the value in bytes)?

@coot When happening, please report back the following data:

grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
arcstat 1
free -m

faceless2 commented 1 year ago

Sorry - zfs_arc_meta_limit_percent was already at 100; I can see that from my first report on Oct 21st. I originally had the following in /etc/modprobe.d/zfs.conf:

options zfs zfs_arc_min=17179869184
options zfs zfs_arc_max=17179869184
options zfs zfs_arc_meta_limit_percent=100
options zfs zfs_arc_dnode_limit_percent=75

I removed these after the kernel upgrade, but it looks like I didn't reboot, so those are the options I've been running with. Sorry I didn't notice.
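
For anyone checking the same thing, the effective values can be read back from the running module (a sketch; the brace expansion assumes bash):

grep . /sys/module/zfs/parameters/zfs_arc_{min,max,meta_limit_percent,dnode_limit_percent}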

The initial value of zfs_arc_max (50% of RAM, or 64GB) was causing a large Java task we run with a 96GB heap to be killed by the kernel OOM killer. That's why I reduced zfs_arc_max.

I had no particular reason for setting zfs_arc_min to the same value, but based on your comment above:

I don't know why arc_prune was unable to free the required metadata

I wonder if this could be a contributing factor. I've now set it to 8GB, half of the 16GB max value, and will see if it makes a difference.
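
A sketch of the runtime write too, to avoid waiting on another reboot (value in bytes):

echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_min   # 8 GiB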

shodanshok commented 1 year ago

Having zfs_arc_min set so high (relative to zfs_arc_max) can surely be problematic - with min equal to max, the ARC target has no room to shrink at all. Let's see if the issue reappears.

coot commented 1 year ago

cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent 
75
grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip                      4    206552551
evict_not_enough                4    22859306
evict_l2_cached                 4    0
evict_l2_eligible               4    50234150912
evict_l2_eligible_mfu           4    19353164288
evict_l2_eligible_mru           4    30880986624
evict_l2_ineligible             4    9036353536
evict_l2_skip                   4    0
dnode_size                      4    3136246688
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_evictable_data              4    12800
mru_evictable_metadata          4    134144
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    1024
mfu_evictable_data              4    512
mfu_evictable_metadata          4    512
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    33792
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    2010
memory_all_bytes                4    67366936576
memory_free_bytes               4    3758071808
memory_available_bytes          3    1403715840
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    48626010
arc_meta_used                   4    33332407424
arc_meta_limit                  4    25262601216
arc_dnode_limit                 4    2526260121
arc_meta_max                    4    35483426448
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2354355968
arc_raw_size                    4    8180123648
arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
09:08:13   338    77     22    76   22     1  100    45   14   31G   31G   1.3G
09:08:14   30K  6.7K     21  6.7K   21     1  100  4.2K   14   31G   31G   1.3G
09:08:15  4.1K   862     20   861   20     1  100   565   14   31G   31G   1.3G
09:08:16   506   102     20   102   20     0    0    67   14   31G   31G   1.3G
09:08:17   527   107     20   107   20     0    0    70   14   31G   31G   1.3G
09:08:18   540   110     20   110   20     0    0    69   14   31G   31G   1.3G
09:08:19  2.4K   100      4   100    4     0    0    66    3   31G   31G   1.3G
09:08:20   707   109     15   109   15     0    0    71   10   31G   31G   1.3G
09:08:21   650   128     19   128   19     0    0    79   13   31G   31G   1.3G
09:08:22   807   128     15   128   15     0    0    78   10   31G   31G   1.3G
09:08:23   592   113     19   113   19     0    0    70   13   31G   31G   1.3G
09:08:24   566   115     20   115   20     0    0    72   14   31G   31G   1.3G
09:08:25   790   115     14   115   14     0    0    70    9   31G   31G   1.3G
09:08:26   585   112     19   112   19     0    0    68   12   31G   31G   1.3G
09:08:27   593   110     18   110   18     0    0    68   12   31G   31G   1.3G
09:08:28   747   149     19   148   19     1  100    74   11   31G   31G   1.3G
09:08:29   745   147     19   147   19     0    0    74   11   31G   31G   1.3G
09:08:30  3.7K   263      7   263    7     0    0   146    4   31G   31G   1.2G
09:08:31   874   142     16   142   16     0    0    76   10   31G   31G   1.2G
09:08:32  1.9K   158      8   120    6    38  100   108    6   31G   31G   1.2G
09:08:33   573   147     25    93   17    54  100   110   20   31G   31G   1.2G
09:08:34   563   130     23    72   14    58  100   102   19   31G   31G   1.2G
09:08:35   795   128     16    72    9    56  100   100   13   31G   31G   1.2G
09:08:36  4.8K  1.3K     26   179    4  1.1K   98  1.2K   25   31G   31G   1.2G
09:08:37   17K  1.5K      8   766    4   754   94  1.2K    7   31G   31G   1.2G
09:08:38   65K   14K     21  9.8K   16  4.3K   86   10K   17   31G   31G   1.3G
09:08:39   72K   16K     22   16K   22     4  100  9.9K   14   31G   31G   1.2G
09:08:40  1.3K   262     20   262   20     0    0   156   13   31G   31G   1.2G
09:08:41   800   159     19   159   19     0    0    95   13   31G   31G   1.2G
09:08:42  5.8K  1.2K     20  1.2K   20     0    0   728   13   31G   31G   1.2G
09:08:43   618   126     20   126   20     0    0    77   13   31G   31G   1.2G
09:08:44   936   131     13   131   13     0    0    80    9   31G   31G   1.2G
09:08:45   581   115     19   115   19     0    0    70   13   31G   31G   1.2G
09:08:46   571   111     19   111   19     0    0    67   12   31G   31G   1.2G
09:08:47   761   111     14   111   14     0    0    67    9   31G   31G   1.2G
09:08:48   696   117     16   117   16     0    0    71   11   31G   31G   1.2G
09:08:49   643   117     18   117   18     0    0    71   12   31G   31G   1.2G
09:08:50  2.4K   112      4   112    4     0    0    68    3   31G   31G   1.2G
09:08:51   788   125     15   125   15     0    0    75   10   31G   31G   1.2G
09:08:52  1.9K   147      7   147    7     0    0    95    5   31G   31G   1.2G
09:08:53   914   175     19   175   19     0    0   105   12   31G   31G   1.2G
09:08:54  4.4K   161      3   160    3     1  100    80    2   31G   31G   1.2G
09:08:55   643   116     18   116   18     0    0    71   12   31G   31G   1.2G
09:08:56   649   124     19   123   18     1  100    72   12   31G   31G   1.2G
09:08:57   849   125     14   125   14     0    0    77   10   31G   31G   1.2G
09:08:58   750   131     17   131   17     0    0    79   11   31G   31G   1.2G
09:08:59   640   126     19   126   19     0    0    77   13   31G   31G   1.2G
09:09:00   760   133     17   133   17     0    0    79   11   31G   31G   1.2G
09:09:01  2.4K   132      5   132    5     0    0    79    3   31G   31G   1.2G
09:09:02   760   124     16   124   16     0    0    76   10   31G   31G   1.2G
09:09:03   648   129     19   129   19     0    0    79   13   31G   31G   1.2G
09:09:04   699   126     18   126   18     0    0    75   11   31G   31G   1.2G
09:09:05   752   127     16   127   16     0    0    77   11   31G   31G   1.2G
09:09:06   654   128     19   128   19     0    0    77   13   31G   31G   1.2G
09:09:07   709   130     18   130   18     0    0    79   12   31G   31G   1.2G
09:09:08   806   126     15   126   15     0    0    76   10   31G   31G   1.2G
09:09:09   854   150     17   150   17     0    0    94   11   31G   31G   1.2G
09:09:10   619   125     20   125   20     0    0    78   13   31G   31G   1.2G
09:09:11   623   126     20   126   20     0    0    77   13   31G   31G   1.2G
09:09:12   625   123     19   123   19     0    0    76   13   31G   31G   1.2G
09:09:13   929   120     12   120   12     0    0    75    8   31G   31G   1.2G
09:09:14   672   128     19   128   19     0    0    76   12   31G   31G   1.2G
09:09:15   636   127     19   127   19     0    0    77   13   31G   31G   1.2G
09:09:16   610   125     20   125   20     0    0    76   13   31G   31G   1.2G
09:09:17   634   127     20   127   20     0    0    77   13   31G   31G   1.2G
free -m
               total        used        free      shared  buff/cache   available
Mem:           64246       58410        2679         882        3156        4303
Swap:           8191        2739        5452

And after

echo 100 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
grep "arc_\|memory\|overflow\|evict\|dnode" /proc/spl/kstat/zfs/arcstats
evict_skip                      4    214132970
evict_not_enough                4    23935555
evict_l2_cached                 4    0
evict_l2_eligible               4    50574349312
evict_l2_eligible_mfu           4    19446336512
evict_l2_eligible_mru           4    31128012800
evict_l2_ineligible             4    9232410624
evict_l2_skip                   4    0
dnode_size                      4    3170140352
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_evictable_data              4    0
mru_evictable_metadata          4    32768
mru_ghost_evictable_data        4    1883648
mru_ghost_evictable_metadata    4    6279168
mfu_evictable_data              4    0
mfu_evictable_metadata          4    458752
mfu_ghost_evictable_data        4    472064
mfu_ghost_evictable_metadata    4    712704
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    2010
memory_all_bytes                4    67366936576
memory_free_bytes               4    3435409408
memory_available_bytes          3    1081053440
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    49696362
arc_meta_used                   4    33678983392
arc_meta_limit                  4    33683468288
arc_dnode_limit                 4    3368346828
arc_meta_max                    4    35483426448
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    2354355968
arc_raw_size                    4    8244556800
arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
09:10:39   386    85     22    85   22     0    0    51   14   31G   31G   1.0G
09:10:40   22K  4.8K     21  4.8K   21     1  100  2.8K   14   31G   31G  1003M
09:10:41   773   135     17   135   17     0    0    82   11   31G   31G  1001M
09:10:42   699   138     19   138   19     0    0    81   12   31G   31G  1000M
09:10:43   782   137     17   137   17     0    0    82   11   31G   31G   999M
free -m
               total        used        free      shared  buff/cache   available
Mem:           64246       58618        2469         882        3157        4095
Swap:           8191        2739        5452
luismcv commented 1 year ago

I don't know what has changed recently, but since a week (or maybe two) ago this has been happening to me even with Borg (I haven't used Kopia since I tried it when I opened this issue). I'm not 100% sure about this, but I think that compiling a big Java/Maven project the other day triggered it too. And it happens quite frequently: once every 2 or 3 days.

ZFS version is still 2.1.4-0ubuntu0.1. But the kernel is now Ubuntu's 5.15.0-53-generic.

$ grep "arc_\|memory\|dnode" /proc/spl/kstat/zfs/arcstats
dnode_size                      4    761390752
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
memory_all_bytes                4    25150902272
memory_free_bytes               4    8646987776
memory_available_bytes          3    7708804480
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    22482562
arc_meta_used                   4    4007114624
arc_meta_limit                  4    1610612736
arc_dnode_limit                 4    161061273
arc_meta_max                    4    4112501104
arc_meta_min                    4    16777216
arc_need_free                   4    66560
arc_sys_free                    4    938183296
arc_raw_size                    4    871041536

$ arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
23:01:14     0     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:15   895   110     12   110   12     0    0     0    0  3.6G  2.0G   7.4G
23:01:16    64     3      4     3    4     0    0     3    4  3.6G  2.0G   7.4G
23:01:17   195     5      2     5    2     0    0     5    2  3.6G  2.0G   7.4G
23:01:18   281     5      1     5    1     0    0     4    2  3.6G  2.0G   7.4G
23:01:19    17     1      5     1    5     0    0     1    8  3.6G  2.0G   7.4G
23:01:20   339     6      1     6    1     0    0     0    0  3.6G  2.0G   7.4G
23:01:21    22     3     13     3   13     0    0     2   28  3.6G  2.0G   7.4G
23:01:22   653     7      1     7    1     0    0     5    1  3.6G  2.0G   7.4G
23:01:23   159     2      1     2    1     0    0     2    1  3.6G  2.0G   7.4G
23:01:24   341     3      0     3    0     0    0     2    0  3.6G  2.0G   7.4G
23:01:25   294     9      3     9    3     0    0     2    1  3.6G  2.0G   7.4G
23:01:26   320     1      0     1    0     0    0     1    0  3.6G  2.0G   7.4G
23:01:27   603    13      2    13    2     0    0    13    2  3.6G  2.0G   7.4G
23:01:28   410     2      0     2    0     0    0     2    0  3.6G  2.0G   7.4G
23:01:29    12     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:30   282     2      0     2    0     0    0     1    0  3.6G  2.0G   7.4G
23:01:31    15     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:32  1.2K    34      2    34    2     0    0     1    0  3.6G  2.0G   7.4G
23:01:33   154     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:34    19     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:35    57     3      5     3    5     0    0     0    0  3.6G  2.0G   7.4G
23:01:36    23     1      4     1    4     0    0     1    4  3.6G  2.0G   7.4G
23:01:37   270     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:38   227     4      1     4    1     0    0     3    1  3.6G  2.0G   7.4G
23:01:39   221    28     12    28   12     0    0    23   17  3.6G  2.0G   7.4G
23:01:40   984    13      1    11    1     2  100    13    1  3.6G  2.0G   7.4G
23:01:41    13     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:42   271     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:43   201     4      1     4    1     0    0     4    2  3.6G  2.0G   7.4G
23:01:44   271     1      0     1    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:45    14     1      7     1    7     0    0     1    9  3.6G  2.0G   7.4G
23:01:46   694    57      8    57    8     0    0    50    9  3.6G  2.0G   7.4G
23:01:47  5.0K     1      0     1    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:48   435     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:49   630    21      3    21    3     0    0     0    0  3.6G  2.0G   7.4G
23:01:50   260     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:51   100     5      5     5    5     0    0     0    0  3.6G  2.0G   7.4G
23:01:52   510     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:53   156     1      0     1    0     0    0     0    0  3.6G  2.0G   7.4G
23:01:54    68     1      1     1    1     0    0     1    2  3.6G  2.0G   7.4G
23:01:55   320     5      1     5    1     0    0     2    1  3.6G  2.0G   7.4G
23:01:56    79     6      7     6    7     0    0     4    6  3.6G  2.0G   7.4G
23:01:57    18     1      5     1    5     0    0     1    5  3.6G  2.0G   7.4G
23:01:58   841    10      1    10    1     0    0    10    1  3.6G  2.0G   7.4G
23:01:59   220     2      0     2    0     0    0     2    0  3.6G  2.0G   7.4G
23:02:00     8     2     25     2   25     0    0     1   14  3.6G  2.0G   7.4G
23:02:01    51     3      5     3    5     0    0     3    6  3.6G  2.0G   7.4G
23:02:02   271     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:02:03     9     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:02:04   578    10      1    10    1     0    0     0    0  3.6G  2.0G   7.4G
23:02:05     9     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:02:06   244     3      1     3    1     0    0     3    1  3.6G  2.0G   7.4G
23:02:07    34     3      8     3    8     0    0     3   11  3.6G  2.0G   7.4G
23:02:08   305     1      0     1    0     0    0     1    0  3.6G  2.0G   7.4G
23:02:09   189     5      2     5    2     0    0     4    2  3.6G  2.0G   7.4G
23:02:10   690     5      0     4    0     1  100     4    0  3.6G  2.0G   7.4G
23:02:11    29     2      6     2    6     0    0     0    0  3.6G  2.0G   7.4G
23:02:12   263     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:02:13    24     1      4     1    4     0    0     1    5  3.6G  2.0G   7.4G
23:02:14   443     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G
23:02:15    43     0      0     0    0     0    0     0    0  3.6G  2.0G   7.4G

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           23985       14190        7370         371        2424        9035
Swap:          12285           0       12285

$ ps aux | grep arc_
root         860 19.3  0.0      0     0 ?        S    12:58 119:17 [arc_prune]
root         861 14.1  0.0      0     0 ?        S    12:58  86:49 [arc_evict]
root         862  0.0  0.0      0     0 ?        SN   12:58   0:00 [arc_reap]
root         866  0.0  0.0      0     0 ?        S    12:58   0:02 [l2arc_feed]

The ARC metadata cache keeps going bonkers, this time using more than twice the limit, and even after several hours arc_prune and arc_evict haven't managed to bring it down. As usual, dropping the caches solves the problem temporarily.
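
For reference, the "drop the caches" workaround used throughout this thread is the standard Linux VFS interface, nothing ZFS-specific. A minimal sketch, run as root (choosing 2 vs 3 is my assumption about what matters here: slab objects such as dentries/inodes are what pin ARC metadata, while 3 also drops the page cache):

sync
echo 2 > /proc/sys/vm/drop_caches    # reclaim dentries/inodes (and the ARC buffers they pin)
echo 3 > /proc/sys/vm/drop_caches    # as above, plus the page cache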

shodanshok commented 1 year ago

Yep, in both cases arc_meta_used is above arc_meta_limit, and even though arc_prune is being called furiously, no memory is freed. Are you both walking/reading/writing a lot of files when the issue happens? Or did some application suddenly increase its memory requirements?

@behlendorf any idea why arc_prune fails to release allocated memory, while dropping caches works immediately?

coot commented 1 year ago

It usually happens when I start working. Sometimes it's just after I log in, sometimes later on while I'm doing something, possibly when compiling a large project, but I'm not sure that's correlated.

luismcv commented 1 year ago

In my case, it used to happen only (but always) when running a Kopia backup job. I decided to keep using Borg and kind of forgot about the issue.

But since a couple of weeks ago or so it's been happening while & after running Borg Backup. And I'm not sure, but I think that a big Java/Maven project compilation caused it the other day too.

At least in my case, things seem to indicate that it's high disk activity that triggers the problem.

luismcv commented 1 year ago

I forgot to mention that with Borg it doesn't happen every time. Right now I have an uptime of almost a day and a half (with several suspend-to-RAM cycles). Borg has run 4 times during this period and everything is still fine:

arc_meta_used                   4    1544783680
arc_meta_limit                  4    1610612736

About 15 minutes ago the usage was 160....... So it's working fine, getting close to the limit and being reduced again.

And a couple of days ago it happened after just 5 or 6 hours of uptime, and I think it was the first time that Borg had run since booting up.

So it seems to be related not so much to the accumulated amount of disk reads as to a burst of particularly high disk activity at a specific point in time.

luismcv commented 1 year ago

My impression is that everything works as expected as long as 'arc_meta_used' is kept below the limit. But if for any reason usage goes above the limit (is it a hard or a soft limit?), the cache enters some weird state from which it doesn't recover until it's reset by a flush.

shodanshok commented 1 year ago

At least in my case, things seem to indicate that it's high disk activity that triggers the problem.

I think it is more an issue related to memory usage, both due to applications and vfs cache.

My impression is that everything works as expected as long as 'arc_meta_used' is kept below the limit

Yes, but when memory requirements grow, arc_meta_used must be reduced, and this seems to be the key issue: sometimes arc_prune seems incapable of freeing pinned buffers, spinning furiously (with high CPU load) without freeing anything.
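
One quick way to test the pinned-buffer theory (my suggestion, not something from this thread): watch the kernel's dentry counters next to the ARC metadata numbers. If nr_dentry is huge and only shrinks when caches are dropped, the VFS is what's keeping those buffers pinned:

cat /proc/sys/fs/dentry-state    # first two fields: nr_dentry, nr_unused
grep "arc_meta_used\|arc_prune" /proc/spl/kstat/zfs/arcstats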

faceless2 commented 1 year ago

Looks like reducing zfs_arc_min to a value lower than zfs_arc_max made no difference:

# grep "arc_\|memory\|overflow\|evict\|dnode"  /proc/spl/kstat/zfs/arcstats
evict_skip                      4    81026320001
evict_not_enough                4    20937787774
evict_l2_cached                 4    0
evict_l2_eligible               4    8629143595008
evict_l2_eligible_mfu           4    4180798813696
evict_l2_eligible_mru           4    4448344781312
evict_l2_ineligible             4    3739255898112
evict_l2_skip                   4    0
dnode_size                      4    5741232736
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_evictable_data              4    0
mru_evictable_metadata          4    0
mru_ghost_evictable_data        4    0
mru_ghost_evictable_metadata    4    0
mfu_evictable_data              4    0
mfu_evictable_metadata          4    0
mfu_ghost_evictable_data        4    0
mfu_ghost_evictable_metadata    4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_evict_l1cached               4    0
memory_throttle_count           4    0
memory_direct_count             4    16
memory_indirect_count           4    59
memory_all_bytes                4    135014055936
memory_free_bytes               4    87351349248
memory_available_bytes          3    82779432960
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    633177776
arc_meta_used                   4    19216927424
arc_meta_limit                  4    17179869184
arc_dnode_limit                 4    12884901888
arc_meta_max                    4    24104624640
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    4571916288
arc_raw_size                    4    0

# cat /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
100
# cat /sys/module/zfs/parameters/zfs_arc_max               
17179869184
# cat /sys/module/zfs/parameters/zfs_arc_min
8589934592

# arcstat 1 
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
16:45:06    16     0      0     0    0     0    0     0    0   18G   16G    77G
16:45:07    16     0      0     0    0     0    0     0    0   18G   16G    77G
16:45:08     3     0      0     0    0     0    0     0    0   18G   16G    77G
16:45:09     0     0      0     0    0     0    0     0    0   18G   16G    77G
^C

edit: I seem to have caught it in the "simmering" stage rather than finding it after it's boiled over - arc_prune is not pegged at 100%, it's going up and down but never dropping to zero even on an essentially idle system. So these stats reflect that.
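
A handy way to watch this "simmering" stage live (again, just a suggestion): poll the prune counter next to the metadata numbers; an arc_prune counter that keeps climbing while arc_meta_used stays flat means pruning is spinning without reclaiming anything.

watch -n 1 'grep -E "arc_prune|arc_meta_used|arc_meta_limit" /proc/spl/kstat/zfs/arcstats'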

unkownerror1 commented 1 year ago

zfs version 2.1.5, kernel version 5.10.0-60.18.0.50.oe2203.x86_64

When using rsync to back up a large amount of data, the ARC cache cannot be released; arc_prune sits at 100% and this lasts for 50 minutes.

[root@nas ~]# ps -aux | grep D
USER         PID %CPU %MEM     VSZ    RSS TTY      STAT START   TIME COMMAND
root         971  0.0  0.0       0      0 ?        D    16:45   0:00 [arc_prune]
root         992  0.0  0.0       0      0 ?        DN   16:45   0:00 [dbuf_evict]
root        1048  0.0  0.0   55248   4216 ?        Ssl  16:45   0:00 /usr/sbin/gssproxy -D
root        2879  0.0  0.0   13936   7504 ?        Ss   16:46   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 1000-1000 startups
root        4422  0.4  0.0       0      0 ?        D<   16:46   0:03 [z_rd_int_0]
root        4423  0.4  0.0       0      0 ?        D<   16:46   0:03 [z_rd_int_1]
root        4754  2.4  2.9 5778352 467576 ?        Sl   16:46   0:23 /usr/local/jdk/java-jdk/bin/java -Dnop -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Xms512m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m -Djdk.tls.ephemeralDHKeySize=2048 -Djava.protocol.handler.pkgs=org.apache.catalina.webresources -Dfile.encoding=UTF-8 -Djava.endorsed.dirs= -classpath /usr/local/nas/apache-tomcat/bin/bootstrap.jar:/usr/local/nas/apache-tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/nas/apache-tomcat -Dcatalina.home=/usr/local/nas/apache-tomcat -Djava.io.tmpdir=/usr/local/nas/apache-tomcat/temp org.apache.catalina.startup.Bootstrap start
root        4861  0.0  0.0       0      0 ?        D    16:46   0:00 [txg_sync]
root        6150  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root        6151  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root        6152  0.1  0.0       0      0 ?        D    16:46   0:01 [nfsd]
root        6155  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root        6158  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root        6159  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root        6160  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root        6161  0.0  0.0       0      0 ?        D    16:46   0:00 [nfsd]
root       38275  0.3  0.0       0      0 ?        D<   16:52   0:02 [z_rd_int_1]
root       38702  0.0  0.0       0      0 ?        D<   16:52   0:00 [z_rd_int_0]
root       38703  0.0  0.0       0      0 ?        D<   16:52   0:00 [z_rd_int_0]
root       45449  0.0  0.0   15488   6212 ?        D    16:55   0:00 zfs list -r -o name,mountpoint,quota,used,avail
root       47045  0.0  0.0   12560   5532 ?        D    16:56   0:00 zpool list -H pool
root       54944  0.0  0.0   12560   5592 ?        D    16:59   0:00 zpool list -H pool
root       61823  0.0  0.0   12528   5640 ?        D    17:01   0:00 zfs list -Ho name,mounted
root       64103  0.0  0.0    6420   2300 pts/4    S+   17:02   0:00 grep --color=auto D

[root@nas ~]# arcstat 1
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  size     c  avail
17:02:18     0     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G
17:02:19     3     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G
17:02:20     0     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G
17:02:21     0     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G
17:02:22     0     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G
17:02:23     0     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G
17:02:24     3     0      0     0    0     0    0     0    0  8.1G  7.7G   3.7G

[root@nas ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi        10Gi       4.1Gi        57Mi       507Mi       4.1Gi
Swap:             0B          0B          0B

unkownerror1 commented 1 year ago

sometimes avail is negative


faceless2 commented 1 year ago

Commenting to note that a) I am still seeing this, but as I now have a cronjob that drops caches every hour I'm dealing with it, and b) I missed this question above:

Are you both walking/reading/writing a lot of files when the issue happens? Or did some application suddenly increase its memory requirements?

Oh yes. The process that triggers this creates about half a million files in 10 minutes while reading about 2 million. I should also note that we used to do this on a more resource-limited machine with 32 cores without problems - the same job took 45 minutes, and it was fine. It's only since we upgraded to a much faster machine with more cores and NVMe disks that this started happening. A case of "things happening faster than we can deal with them"?

shodanshok commented 1 year ago

sometimes avail is negative

If possible, try updating to zfs 2.1.7 or 2.1.9 (2.1.8 has a serious issue, so skip it), as a fix for this very same issue was merged.
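
To confirm which version is actually running after an upgrade (a generic check, since a stale kernel module can linger after a dkms update):

zfs version                      # prints both zfs-x.y.z and zfs-kmod-x.y.z
cat /sys/module/zfs/version      # version of the currently loaded module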

IZSkiSurfer commented 1 year ago

I set up a brand-new server with arc_max set to 8GB and arc_min to 8GB - 1, and I repeatedly run into this situation even with 2.1.9, so there is still something weird happening, because the ARC size goes slightly above the set max value.

amotin commented 1 year ago

I set up a brand-new server with arc_max set to 8GB and arc_min to 8GB - 1, and I repeatedly run into this situation even with 2.1.9, so there is still something weird happening, because the ARC size goes slightly above the set max value.

If you look at zfs_arc_overflow_shift, the ARC is allowed to go slightly above the limit for a short time. And please, if you set some tunables to weird values, don't complain when you see the ARC behave "weird".
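
To make that concrete (this is my reading of the 2.1-era arc_is_overflowing() logic, so treat the exact constants as assumptions): the ARC only counts as overflowing once it exceeds the target by max(arc_c >> zfs_arc_overflow_shift, SPA_MAXBLOCKSIZE):

shift=$(cat /sys/module/zfs/parameters/zfs_arc_overflow_shift)          # default 8
arc_max=$((8 * 1024 * 1024 * 1024))                                     # the 8GB example above
allow=$((arc_max >> shift))                                             # 32 MiB at shift=8
[ "$allow" -lt $((16 * 1024 * 1024)) ] && allow=$((16 * 1024 * 1024))   # at least one 16 MiB block
echo "ARC may transiently exceed its target by ~$((allow / 1024 / 1024)) MiB"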

ARC eviction code was completely rewritten in upcoming ZFS 2.2.

shodanshok commented 1 year ago

ARC eviction code was completely rewritten in upcoming ZFS 2.2.

@amotin interesting, any information to share (or to point at) regarding this change?

amotin commented 1 year ago

@shodanshok https://github.com/openzfs/zfs/pull/14359

olidal commented 1 year ago

@shodanshok

Hi all,

Not sure if this is helping, but I had the same issue on an older kernel, 5.4.203 (Proxmox), with zfs-2.0.7.

It consistently happened after I received encrypted datasets, loaded the keys, and then mounted the datasets. This in turn triggered several z_upgrade processes that started heavy I/O, which in turn set arc_prune/arc_evict running full time.

After reading this thread, I tried the drop-cache trick and it worked (it took some time), i.e. z_upgrade was still doing heavy I/O but arc_prune/arc_evict were gone.

Hope this helps.

drescherjm commented 1 year ago

This problem seems to still happen on zfs-2.1.13. I was waiting for 2.2.0 to be available in the RHEL dkms non-testing repository on Rocky 8, but I may have to try the testing version because I can't work when the storage is this slow.

coot commented 1 year ago

For me this problem was related to a misconfiguration of locate (the updatedb job). After fixing it (uninstalling a duplicated dependency) things started to behave well.
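
If an updatedb run walking ZFS datasets is what kicks off the pruning storm, one common mitigation (my suggestion, not the fix described above; the file and variable follow the mlocate/plocate convention, so check your distribution's defaults) is to exclude ZFS from indexing in /etc/updatedb.conf:

PRUNEFS="zfs <your existing entries>"    # updatedb skips any filesystem type listed here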

ipaqmaster commented 1 year ago

Linux 6.5.9 with ZFS 2.2.0

Saw arc_prune and arc_evict showing up in my HUD for CPU usage tonight, and remembered echoing 0 into /sys/module/zfs/parameters/zfs_arc_max earlier today after changing the size a few times. I was having issues with audio "popping" due to latency in an audio task that required real-time behaviour. I noticed 6.8/32GB of memory used by my system with an uptime of about 12 hours, which made me wonder if ARC was working at all (given that ARC usually keeps my used memory closer to 25GB+ on any given day).

I echoed 16GB (16*1024*1024*1024) into /sys/module/zfs/parameters/zfs_arc_max, then 0 again, and the eviction processes stopped, as did the stuttering I was experiencing.
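
For anyone wanting to reproduce that sequence, these are the standard OpenZFS runtime knobs; writing 0 to zfs_arc_max reverts to the built-in default (about half of RAM on Linux, if memory serves):

echo 0 > /sys/module/zfs/parameters/zfs_arc_max                               # back to the default cap
echo $((16 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max    # pin at 16 GiB
grep -w c_max /proc/spl/kstat/zfs/arcstats                                    # confirm the new limit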