openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ARC grows well past the zfs_arc_max #676

Closed (DeHackEd closed this issue 11 years ago)

DeHackEd commented 12 years ago

While running a lot of rsync instances (~45 at once... it makes sense in context), I found my ARC expanding beyond the set limit of 2 gigabytes until ZFS deadlocked: almost all rsync processes hung in the D state, and some kernel threads as well. A reboot was necessary. It only took about 5-10 minutes of this kind of pressure to break things.
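
For context, a minimal sketch of how a 2 GiB cap like this is typically configured on ZFS on Linux of that era, and one way to watch it against the live ARC footprint; the modprobe.d path and the writability of the runtime parameter are assumptions about the installation:

# Cap the ARC at 2 GiB at module load time (persists across reboots).
echo "options zfs zfs_arc_max=2147483648" > /etc/modprobe.d/zfs.conf

# Or poke the runtime parameter directly if the module is already loaded
# (assumed writable on this build).
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max

# Watch the cap (c_max) against the actual ARC footprint (size) while the
# rsync jobs run.
watch -n 5 'egrep "^(c_max|size|arc_meta_used|arc_meta_limit)" /proc/spl/kstat/zfs/arcstats'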

# cat /proc/spl/kstat/zfs/arcstats 
4 1 0x01 77 3696 55271634900 965748228174
name                            type data
hits                            4    5620340
misses                          4    693436
demand_data_hits                4    212193
demand_data_misses              4    6069
demand_metadata_hits            4    3873689
demand_metadata_misses          4    320630
prefetch_data_hits              4    29368
prefetch_data_misses            4    77018
prefetch_metadata_hits          4    1505090
prefetch_metadata_misses        4    289719
mru_hits                        4    872626
mru_ghost_hits                  4    204427
mfu_hits                        4    3213295
mfu_ghost_hits                  4    70138
deleted                         4    221845
recycle_miss                    4    373068
mutex_miss                      4    120515
evict_skip                      4    279067774
evict_l2_cached                 4    0
evict_l2_eligible               4    11435467776
evict_l2_ineligible             4    5027812352
hash_elements                   4    201635
hash_elements_max               4    359811
hash_collisions                 4    211399
hash_chains                     4    47203
hash_chain_max                  4    8
p                               4    97061458
c                               4    753210703
c_min                           4    293601280
c_max                           4    2147483648
size                            4    3514897752
hdr_size                        4    96012384
data_size                       4    2391985664
other_size                      4    1026899704
anon_size                       4    4358144
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    267850240
mru_evict_data                  4    2872320
mru_evict_metadata              4    10894848
mru_ghost_size                  4    465448960
mru_ghost_evict_data            4    8077824
mru_ghost_evict_metadata        4    457371136
mfu_size                        4    2119777280
mfu_evict_data                  4    0
mfu_evict_metadata              4    5223424
mfu_ghost_size                  4    272368640
mfu_ghost_evict_data            4    12076032
mfu_ghost_evict_metadata        4    260292608
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    1
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    1746
arc_meta_used                   4    3512025432
arc_meta_limit                  4    536870912
arc_meta_max                    4    3518600752

Not too long after, all ZFS-involved processes froze up.
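
The key lines in the dump above are size against c_max and arc_meta_used against arc_meta_limit. A quick way to pull just those out and convert them to GiB, assuming the three-column layout shown above:

awk '/^(size|c_max|arc_meta_used|arc_meta_limit) / { v[$1] = $3 }
     END {
       printf "size          %.2f GiB vs c_max          %.2f GiB\n", v["size"]/2^30, v["c_max"]/2^30
       printf "arc_meta_used %.2f GiB vs arc_meta_limit %.2f GiB\n", v["arc_meta_used"]/2^30, v["arc_meta_limit"]/2^30
     }' /proc/spl/kstat/zfs/arcstats

Against the numbers above, that works out to roughly 3.3 GiB of ARC versus a 2 GiB c_max, with nearly all of it counted as metadata against a 512 MiB arc_meta_limit.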

Stack dumps for some hung processes:

SysRq : Show Blocked State
 task                        PC stack   pid father
kswapd0         D ffff8802235d7898     0   407      2 0x00000000
ffff8802235d75f0 0000000000000046 ffffffff8115539c ffff880221d2db20
ffff880216d7d0e0 000000000000ffc0 ffff880221d2dfd8 0000000000004000
ffff880221d2dfd8 000000000000ffc0 ffff8802235d75f0 ffff880221d2c010
Call Trace:
[] ? cpumask_any_but+0x28/0x37
[] ? __pagevec_release+0x19/0x22
[] ? move_active_pages_to_lru+0x130/0x154
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
[] ? mutex_lock+0x12/0x25
[] ? zfs_zinactive+0x5a/0xd4 [zfs]
[] ? zfs_inactive+0x106/0x19e [zfs]
[] ? evict+0x78/0x117
[] ? dispose_list+0x2c/0x36
[] ? shrink_icache_memory+0x278/0x2a8
[] ? shrink_slab+0xe3/0x153
[] ? balance_pgdat+0x33c/0x650
[] ? calculate_pressure_threshold+0x18/0x3b
[] ? kswapd+0x252/0x26b
[] ? wake_up_bit+0x23/0x23
[] ? balance_pgdat+0x650/0x650
[] ? balance_pgdat+0x650/0x650
[] ? kthread+0x7a/0x82
[] ? kernel_thread_helper+0x4/0x10
[] ? kthread_worker_fn+0x139/0x139
[] ? gs_change+0xb/0xb
snmpd           D ffff88022369a368     0  2158      1 0x00000000
ffff88022369a0c0 0000000000000086 ffffffff8115539c 0000000000000000
ffffffff81435020 000000000000ffc0 ffff88022150dfd8 0000000000004000
ffff88022150dfd8 000000000000ffc0 ffff88022369a0c0 ffff88022150c010
Call Trace:
[] ? cpumask_any_but+0x28/0x37
[] ? __pagevec_release+0x19/0x22
[] ? move_active_pages_to_lru+0x130/0x154
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? shrink_active_list+0x2e1/0x2fa
[] ? mutex_lock+0x12/0x25
[] ? zfs_zinactive+0x5a/0xd4 [zfs]
[] ? zfs_inactive+0x106/0x19e [zfs]
[] ? evict+0x78/0x117
[] ? dispose_list+0x2c/0x36
[] ? shrink_icache_memory+0x278/0x2a8
[] ? shrink_slab+0xe3/0x153
[] ? do_try_to_free_pages+0x253/0x3f0
[] ? try_to_free_pages+0x79/0x7e
[] ? __alloc_pages_nodemask+0x48b/0x6c8
[] ? __pollwait+0xd6/0xd6
[] ? __pollwait+0xd6/0xd6
[] ? __pollwait+0xd6/0xd6
[] ? handle_pte_fault+0x17f/0x7d9
[] ? bit_waitqueue+0x14/0x92
[] ? handle_mm_fault+0x3e/0x2cd
[] ? do_page_fault+0x31a/0x33f
[] ? seq_printf+0x56/0x7b
[] ? kobject_get+0x12/0x17
[] ? disk_part_iter_next+0x19/0xba
[] ? diskstats_show+0x3b5/0x3c9
[] ? page_fault+0x1f/0x30
[] ? copy_user_generic_string+0x2d/0x40
[] ? seq_read+0x2c2/0x339
[] ? proc_reg_read+0x6f/0x88
[] ? vfs_read+0xaa/0x14d
[] ? sys_read+0x45/0x6e
[] ? system_call_fastpath+0x16/0x1b
arc_reclaim     D ffff88022189ec78     0  2446      2 0x00000000
ffff88022189e9d0 0000000000000046 0000000000000202 0000000100000000
ffff88022e8a6750 000000000000ffc0 ffff880218767fd8 0000000000004000
ffff880218767fd8 000000000000ffc0 ffff88022189e9d0 ffff880218766010
Call Trace:
[] ? vsnprintf+0x7e/0x428
[] ? getnstimeofday+0x54/0xa5
[] ? call_function_single_interrupt+0xe/0x20
[] ? apic_timer_interrupt+0xe/0x20
[] ? call_function_single_interrupt+0xe/0x20
[] ? apic_timer_interrupt+0xe/0x20
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
[] ? mutex_lock+0x12/0x25
[] ? zfs_zinactive+0x5a/0xd4 [zfs]
[] ? zfs_inactive+0x106/0x19e [zfs]
[] ? evict+0x78/0x117
[] ? dispose_list+0x2c/0x36
[] ? shrink_icache_memory+0x278/0x2a8
[] ? zpl_inode_alloc+0x6e/0x6e [zfs]
[] ? zpl_prune_sbs+0x53/0x5e [zfs]
[] ? arc_adjust_meta+0x137/0x1a4 [zfs]
[] ? __thread_create+0x2df/0x2df [spl]
[] ? arc_reclaim_thread+0xb0/0x11c [zfs]
[] ? arc_adjust_meta+0x1a4/0x1a4 [zfs]
[] ? arc_adjust_meta+0x1a4/0x1a4 [zfs]
[] ? thread_generic_wrapper+0x6a/0x75 [spl]
[] ? kthread+0x7a/0x82
[] ? kernel_thread_helper+0x4/0x10
[] ? kthread_worker_fn+0x139/0x139
[] ? gs_change+0xb/0xb
watch           D ffff88021801e978     0  3060   3027 0x00000004
ffff88021801e6d0 0000000000000082 ffffffff8115539c ffff8802078f5a20
ffff88021807aea0 000000000000ffc0 ffff8802078f5fd8 0000000000004000
ffff8802078f5fd8 000000000000ffc0 ffff88021801e6d0 ffff8802078f4010
Call Trace:
[] ? cpumask_any_but+0x28/0x37
[] ? __pagevec_release+0x19/0x22
[] ? move_active_pages_to_lru+0x130/0x154
[] ? apic_timer_interrupt+0xe/0x20
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
[] ? mutex_lock+0x12/0x25
[] ? zfs_zinactive+0x5a/0xd4 [zfs]
[] ? zfs_inactive+0x106/0x19e [zfs]
[] ? evict+0x78/0x117
[] ? dispose_list+0x2c/0x36
[] ? shrink_icache_memory+0x278/0x2a8
[] ? shrink_slab+0xe3/0x153
[] ? d_instantiate+0x39/0x48
[] ? do_try_to_free_pages+0x253/0x3f0
[] ? try_to_free_pages+0x79/0x7e
[] ? __alloc_pages_nodemask+0x48b/0x6c8
[] ? copy_process+0xd7/0xff3
[] ? get_empty_filp+0x93/0x135
[] ? do_fork+0xf0/0x232
[] ? fd_install+0x27/0x4d
[] ? stub_clone+0x13/0x20
[] ? system_call_fastpath+0x16/0x1b
rsync           D ffff880221873958     0  3265   3264 0x00000000
ffff8802218736b0 0000000000000082 ffffffff8115539c ffff88011f229408
ffff880216d3a240 000000000000ffc0 ffff88011f229fd8 0000000000004000
ffff88011f229fd8 000000000000ffc0 ffff8802218736b0 ffff88011f228010
Call Trace:
[] ? cpumask_any_but+0x28/0x37
[] ? __pagevec_release+0x19/0x22
[] ? move_active_pages_to_lru+0x130/0x154
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? fsnotify_clear_marks_by_inode+0x20/0xbd
[] ? mutex_lock+0x12/0x25
[] ? zfs_zinactive+0x5a/0xd4 [zfs]
[] ? zfs_inactive+0x106/0x19e [zfs]
[] ? evict+0x78/0x117
[] ? dispose_list+0x2c/0x36
[] ? shrink_icache_memory+0x278/0x2a8
[] ? shrink_slab+0xe3/0x153
[] ? do_try_to_free_pages+0x253/0x3f0
[] ? try_to_free_pages+0x79/0x7e
[] ? __alloc_pages_nodemask+0x48b/0x6c8
[] ? __get_free_pages+0x12/0x52
[] ? spl_kmem_cache_alloc+0x236/0x975 [spl]
[] ? cache_alloc_refill+0x86/0x48f
[] ? dnode_create+0x2e/0x144 [zfs]
[] ? dnode_hold_impl+0x2ea/0x43c [zfs]
[] ? dmu_bonus_hold+0x22/0x26e [zfs]
[] ? zfs_zget+0x5c/0x19f [zfs]
[] ? zfs_dirent_lock+0x447/0x48f [zfs]
[] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
[] ? zfs_dirlook+0x20a/0x276 [zfs]
[] ? zfs_lookup+0x26e/0x2b6 [zfs]
[] ? zpl_lookup+0x47/0x80 [zfs]
[] ? d_alloc_and_lookup+0x43/0x60
[] ? do_lookup+0x1c9/0x2bb
[] ? path_lookupat+0xe2/0x5af
[] ? do_path_lookup+0x1d/0x5f
[] ? user_path_at_empty+0x49/0x84
[] ? tsd_exit+0x83/0x18d [spl]
[] ? __schedule+0x727/0x7b0
[] ? cp_new_stat+0xdf/0xf1
[] ? vfs_fstatat+0x43/0x70
[] ? sys_newlstat+0x11/0x2d
[] ? system_call_fastpath+0x16/0x1b
rsync           D ffff880208aa3088     0  3270   3269 0x00000000
ffff880208aa2de0 0000000000000082 000000000000000a ffff880100000000
ffff88022e8d6790 000000000000ffc0 ffff88012f173fd8 0000000000004000
ffff88012f173fd8 000000000000ffc0 ffff880208aa2de0 ffff88012f172010
Call Trace:
[] ? zap_get_leaf_byblk+0x1b5/0x249 [zfs]
[] ? zap_leaf_array_match+0x166/0x197 [zfs]
[] ? remove_reference+0x93/0x9f [zfs]
[] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
[] ? dbuf_rele_and_unlock+0x12b/0x19a [zfs]
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? mutex_lock+0x12/0x25
[] ? zfs_zget+0x46/0x19f [zfs]
[] ? zfs_dirent_lock+0x447/0x48f [zfs]
[] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
[] ? zfs_dirlook+0x20a/0x276 [zfs]
[] ? zfs_lookup+0x26e/0x2b6 [zfs]
[] ? zpl_lookup+0x47/0x80 [zfs]
[] ? d_alloc_and_lookup+0x43/0x60
[] ? do_lookup+0x1c9/0x2bb
[] ? path_lookupat+0xe2/0x5af
[] ? do_path_lookup+0x1d/0x5f
[] ? user_path_at_empty+0x49/0x84
[] ? tsd_exit+0x83/0x18d [spl]
[] ? mutex_lock+0x12/0x25
[] ? cp_new_stat+0xdf/0xf1
[] ? vfs_fstatat+0x43/0x70
[] ? sys_newlstat+0x11/0x2d
[] ? system_call_fastpath+0x16/0x1b
rsync           D ffff88010ec74b38     0  3497   3496 0x00000000
ffff88010ec74890 0000000000000086 ffffffff8115539c 0000000000000000
ffff88022e8d6790 000000000000ffc0 ffff8800b2fabfd8 0000000000004000
ffff8800b2fabfd8 000000000000ffc0 ffff88010ec74890 ffff8800b2faa010
Call Trace:
[] ? cpumask_any_but+0x28/0x37
[] ? __pagevec_release+0x19/0x22
[] ? move_active_pages_to_lru+0x130/0x154
[] ? __mutex_lock_slowpath+0xe2/0x128
[] ? shrink_active_list+0x2e1/0x2fa
[] ? mutex_lock+0x12/0x25
[] ? zfs_zinactive+0x5a/0xd4 [zfs]
[] ? zfs_inactive+0x106/0x19e [zfs]
[] ? evict+0x78/0x117
[] ? dispose_list+0x2c/0x36
[] ? shrink_icache_memory+0x278/0x2a8
[] ? shrink_slab+0xe3/0x153
[] ? do_try_to_free_pages+0x253/0x3f0
[] ? try_to_free_pages+0x79/0x7e
[] ? __alloc_pages_nodemask+0x48b/0x6c8
[] ? __get_free_pages+0x12/0x52
[] ? spl_kmem_cache_alloc+0x236/0x975 [spl]
[] ? __wake_up+0x35/0x46
[] ? cv_wait_common+0xf5/0x141 [spl]
[] ? wake_up_bit+0x23/0x23
[] ? dbuf_create+0x38/0x32e [zfs]
[] ? __dbuf_hold_impl+0x39a/0x3c6 [zfs]
[] ? remove_reference+0x93/0x9f [zfs]
[] ? __dbuf_hold_impl+0x271/0x3c6 [zfs]
[] ? __dbuf_hold_impl+0x17d/0x3c6 [zfs]
[] ? dbuf_hold_impl+0x6e/0x97 [zfs]
[] ? dbuf_prefetch+0x105/0x23d [zfs]
[] ? dmu_zfetch_dofetch+0xd7/0x113 [zfs]
[] ? dmu_zfetch+0x4b1/0xc5e [zfs]
[] ? dbuf_read+0xd1/0x5c0 [zfs]
[] ? dnode_hold_impl+0x1a8/0x43c [zfs]
[] ? dmu_buf_hold+0x33/0x161 [zfs]
[] ? zap_lockdir+0x57/0x5cc [zfs]
[] ? dmu_bonus_hold+0x212/0x26e [zfs]
[] ? zap_lookup_norm+0x40/0x160 [zfs]
[] ? kmem_alloc_debug+0x13c/0x2ba [spl]
[] ? zap_lookup+0x2a/0x30 [zfs]
[] ? zfs_dirent_lock+0x3cc/0x48f [zfs]
[] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
[] ? zfs_dirlook+0x20a/0x276 [zfs]
[] ? zfs_lookup+0x26e/0x2b6 [zfs]
[] ? zpl_lookup+0x47/0x80 [zfs]
[] ? d_alloc_and_lookup+0x43/0x60
[] ? do_lookup+0x1c9/0x2bb
[] ? path_lookupat+0xe2/0x5af
[] ? do_path_lookup+0x1d/0x5f
[] ? user_path_at_empty+0x49/0x84
[] ? tsd_exit+0x83/0x18d [spl]
[] ? cp_new_stat+0xdf/0xf1
[] ? vfs_fstatat+0x43/0x70
[] ? sys_newlstat+0x11/0x2d
[] ? system_call_fastpath+0x16/0x1b

Machine specs: single-socket quad-core Xeon, 8 GB of RAM

SPL version: b29012b99994ece46019b664d67dace29e5c2586
ZFS version: 409dc1a570a836737b2a5bb43658cdde703c935e
Kernel version: 3.0.28 vanilla, custom build
ZPool version: 26 (originally built/run by zfs-fuse)

  pool: tankname1
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
    still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
    pool will no longer be accessible on older software versions.
 scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    tankname1   ONLINE       0     0     0
      sda       ONLINE       0     0     0

errors: No known data errors

I've also tried the module/zfs/arc.c from https://github.com/zfsonlinux/zfs/pull/669 for testing, and reduced the ARC size. RAM usage still exceeds the configured limits.

# egrep "^(c_|size)" /proc/spl/kstat/zfs/arcstats
c_min                           4    293601280
c_max                           4    1610612736
size                            4    2602236704

Nevertheless, it has been running reliably for a few hours now.

(Edit: I also raised vm.min_free_kbytes from its default up to 262144 as part of a shotgun attempt to make this more stable.)
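
For reference, a sketch of that knob; the values are the ones mentioned above, and persisting it via /etc/sysctl.conf is an assumption about the setup:

# One-off change.
sysctl -w vm.min_free_kbytes=262144

# Make it persistent across reboots.
echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf
sysctl -p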

behlendorf commented 11 years ago

@DeHackEd When you say it crashed, do you mean the system panicked or became unstable in some way, or just that we exceeded the meta limit? I believe I now see why that's happening with regard to the meta limit, but I'd like to better understand the instability you're referring to.

behlendorf commented 11 years ago

@DeHackEd Can you please try 6c5207088f732168569d1a0b29f5f949b91bb503, which I believe should address your issue. Basically, under a very heavy metadata workload it was possible to move all the metadata from the MRU/MFU lists onto the MRU/MFU ghost lists. It would then never be freed from there if there was data on the ghost lists which could be reclaimed instead. I'd be interested to hear if this change improves your performance as well.
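
A rough sketch of building and testing a specific commit like that, assuming the autotools flow of the spl/zfs trees of that era; the --with-spl path and the module-reload step are placeholders:

cd zfs
git fetch origin
git checkout 6c5207088f732168569d1a0b29f5f949b91bb503
sh autogen.sh
./configure --with-spl=/path/to/spl    # point at your matching spl checkout
make -j"$(nproc)" && make install && depmod -a
# export the pool, reload the spl/zfs modules (or reboot), then rerun the workload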

DeHackEd commented 11 years ago

While it does seem to help, it hasn't completely eliminated the issue. At best it just delays the inevitable.

arp096's method of using drop_caches does seem to keep it stable, though. Not a great workaround, but still...
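
For anyone following along, the drop_caches workaround amounts to periodically asking the kernel to shed reclaimable dentries and inodes so the ARC's metadata holds can actually be released. A crude sketch; the interval and the value 2 are assumptions about what was actually used:

# 2 frees reclaimable dentries and inodes; 3 would also drop the page cache.
while true; do
    sync
    echo 2 > /proc/sys/vm/drop_caches
    sleep 300
done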

aarrpp commented 11 years ago

I tried the patch, but nothing has changed.

cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 77 3696 2403385908311841 2403855656867260
name                            type data
hits                            4    38524104
misses                          4    368775
demand_data_hits                4    0
demand_data_misses              4    22
demand_metadata_hits            4    37250368
demand_metadata_misses          4    159192
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    1273736
prefetch_metadata_misses        4    209561
mru_hits                        4    1358445
mru_ghost_hits                  4    47602
mfu_hits                        4    36023113
mfu_ghost_hits                  4    19495
deleted                         4    259553
recycle_miss                    4    602959
mutex_miss                      4    62668
evict_skip                      4    564914143
evict_l2_cached                 4    0
evict_l2_eligible               4    19083264
evict_l2_ineligible             4    4364881408
hash_elements                   4    42650
hash_elements_max               4    42650
hash_collisions                 4    15009
hash_chains                     4    1629
hash_chain_max                  4    4
p                               4    144727296
c                               4    1000000000
c_min                           4    838860800
c_max                           4    1000000000
size                            4    4533766376
hdr_size                        4    25180480
data_size                       4    1830418432
other_size                      4    2678167464
anon_size                       4    8264704
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    966695424
mru_evict_data                  4    0
mru_evict_metadata              4    21205504
mru_ghost_size                  4    57323008
mru_ghost_evict_data            4    118272
mru_ghost_evict_metadata        4    57204736
mfu_size                        4    855458304
mfu_evict_data                  4    0
mfu_evict_metadata              4    9123328
mfu_ghost_size                  4    1480704
mfu_ghost_evict_data            4    304640
mfu_ghost_evict_metadata        4    1176064
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    79
arc_meta_used                   4    4533766376
arc_meta_limit                  4    500000000
arc_meta_max                    4    4533770472

DeHackEd commented 11 years ago

Yet more output from arcstats, /proc/meminfo, and some IO stats, captured immediately before and after a drop_caches once I started hitting the redline of memory usage.

http://pastebin.com/8Wh0zeKG http://pastebin.com/777anKPq
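
A sketch of how such before/after snapshots can be captured in one pass; the file names and the use of /proc/diskstats for the IO numbers are arbitrary choices:

ts=$(date +%s)
cat /proc/spl/kstat/zfs/arcstats /proc/meminfo /proc/diskstats > /tmp/arc-pre-$ts.txt
sync
echo 2 > /proc/sys/vm/drop_caches
cat /proc/spl/kstat/zfs/arcstats /proc/meminfo /proc/diskstats > /tmp/arc-post-$ts.txt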

DeHackEd commented 11 years ago

A workaround for issue #1101 seems to fix this issue as well under my typical workload.

aarrpp commented 11 years ago

Yes, this workaround works for me, but I set ZFS_OBJ_MTX_SZ to 10240. It works fine with ~500 concurrent processes. Thank you!
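
For the record, ZFS_OBJ_MTX_SZ is a compile-time constant, so this workaround means patching the source and rebuilding the module. A sketch, assuming the define lives in include/sys/zfs_vfsops.h as it did in the ZoL tree of that era:

# In the zfs source tree: bump the znode hold-mutex array size, then rebuild.
sed -i 's/^#define[[:space:]]*ZFS_OBJ_MTX_SZ.*/#define ZFS_OBJ_MTX_SZ 10240/' \
    include/sys/zfs_vfsops.h
./configure --with-spl=/path/to/spl && make -j"$(nproc)"
make install && depmod -a    # then reload the spl/zfs modules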

behlendorf commented 11 years ago

I'm closing this long-running issue because a fix for the deadlock was merged; see issue #1101. I'll open a new, more succinct issue to track the remaining memory-management concerns.

durandalTR commented 11 years ago

@behlendorf: Do you have a reference to that new memory management issue?

behlendorf commented 11 years ago

See #1132; it's largely a placeholder.