Closed DeHackEd closed 11 years ago
Unfortunately, I seem to have given you the wrong pull request number in IRC. The one that you should have tried was #618, not #669.
How did you manage to create pool version 26 with zfs-fuse?
Perhaps it is a typo and he meant pool version 23.
I was going to suggest that you try the patch in issue #618 but you beat me to it. I'm a bit surprised that it doesn't resolve the issue; clearly there's more to be done.
I haven't had the chance to apply #618 yet. I'll try it tomorrow, time permitting.
No love. With #618 applied, the same scenario still deadlocks after ignoring the ARC size limit.
These outputs are all taken after ZFS jammed up.
# free
             total       used       free     shared    buffers     cached
Mem:       8182236    7540728     641508          0      28528      43364
-/+ buffers/cache:    7468836     713400
Swap:      4194300      14492    4179808

# history | grep modprobe
modprobe zfs zfs_vdev_max_pending=20 zfs_arc_max=2147483648

# cat /proc/spl/kstat/zfs/arcstats
10 1 0x01 77 3696 141145250104682 144360035578750
name type data
hits 4 5947062
misses 4 939766
demand_data_hits 4 215008
demand_data_misses 4 13180
demand_metadata_hits 4 4365140
demand_metadata_misses 4 469195
prefetch_data_hits 4 20650
prefetch_data_misses 4 53521
prefetch_metadata_hits 4 1346264
prefetch_metadata_misses 4 403870
mru_hits 4 895846
mru_ghost_hits 4 277838
mfu_hits 4 3684335
mfu_ghost_hits 4 189079
deleted 4 248071
recycle_miss 4 488895
mutex_miss 4 8046
evict_skip 4 362916066
evict_l2_cached 4 0
evict_l2_eligible 4 14598742528
evict_l2_ineligible 4 5304508416
hash_elements 4 253442
hash_elements_max 4 253519
hash_collisions 4 240599
hash_chains 4 66173
hash_chain_max 4 7
p 4 134225408
c 4 2147483648
c_min 4 293601280
c_max 4 2147483648
size 4 3320870416
hdr_size 4 85214688
data_size 4 2363930112
other_size 4 871725616
anon_size 4 27773952
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 161863680
mru_evict_data 4 0
mru_evict_metadata 4 3620864
mru_ghost_size 4 1977015808
mru_ghost_evict_data 4 908544000
mru_ghost_evict_metadata 4 1068471808
mfu_size 4 2174292480
mfu_evict_data 4 0
mfu_evict_metadata 4 1738240
mfu_ghost_size 4 166193152
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 166193152
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 475136
arc_loaned_bytes 4 262144
arc_prune 4 2313
arc_meta_used 4 3309051920
arc_meta_limit 4 536870912
arc_meta_max 4 3309157328
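In case it helps anyone trying to reproduce this, the handful of arcstats fields that matter for this report can be pulled out without wading through the whole dump. A small shell sketch (nothing official, just awk over the kstat file shown above):

while sleep 5; do
    # print only the ARC size/limit counters relevant to this issue
    awk '$1 ~ /^(size|c|c_max|arc_meta_used|arc_meta_limit)$/ { printf "%-16s %14d\n", $1, $3 }' \
        /proc/spl/kstat/zfs/arcstats
    echo ----
done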
Stack traces, some are additional kernel threads.
z_wr_int/15 D ffff880113a49408 0 18763 2 0x00000000
 ffff880113a49160 0000000000000046 ffffffff811ddf2f ffff880100000000
 ffff88022e8e67d0 000000000000ffc0 ffff8801a8d05fd8 0000000000004000
 ffff8801a8d05fd8 000000000000ffc0 ffff880113a49160 ffff8801a8d04010
Call Trace:
 [ ] ? scsi_request_fn+0x314/0x3ef
 [ ] ? __make_request+0x22a/0x243
 [ ] ? generic_make_request+0x201/0x262
 [ ] ? __mutex_lock_slowpath+0xe2/0x128
 [ ] ? mutex_lock+0x12/0x25
 [ ] ? vdev_cache_write+0x57/0x11d [zfs]
 [ ] ? zio_vdev_io_done+0x6f/0x141 [zfs]
 [ ] ? zio_execute+0xad/0xd1 [zfs]
 [ ] ? taskq_thread+0x2c2/0x508 [spl]
 [ ] ? try_to_wake_up+0x1d9/0x1eb
 [ ] ? try_to_wake_up+0x1eb/0x1eb
 [ ] ? spl_taskq_init+0x159/0x159 [spl]
 [ ] ? spl_taskq_init+0x159/0x159 [spl]
 [ ] ? kthread+0x7a/0x82
 [ ] ? kernel_thread_helper+0x4/0x10
 [ ] ? kthread_worker_fn+0x139/0x139
 [ ] ? gs_change+0xb/0xb

txg_sync D ffff88018865b898 0 18781 2 0x00000000
 ffff88018865b5f0 0000000000000046 ffff88022fc10028 ffff88011160c248
 ffff88011160c200 000000000000ffc0 ffff88015dd89fd8 0000000000004000
 ffff88015dd89fd8 000000000000ffc0 ffff88018865b5f0 ffff88015dd88010
Call Trace:
 [ ] ? check_preempt_curr+0x25/0x62
 [ ] ? ttwu_do_wakeup+0x11/0x83
 [ ] ? try_to_wake_up+0x1d9/0x1eb
 [ ] ? __wake_up_common+0x41/0x78
 [ ] ? cv_wait_common+0xb8/0x141 [spl]
 [ ] ? wake_up_bit+0x23/0x23
 [ ] ? zio_wait+0xe8/0x11c [zfs]
 [ ] ? dsl_pool_sync+0xbf/0x428 [zfs]
 [ ] ? spa_sync+0x47d/0x829 [zfs]
 [ ] ? txg_sync_thread+0x29a/0x3f6 [zfs]
 [ ] ? set_user_nice+0x115/0x139
 [ ] ? txg_thread_exit+0x2b/0x2b [zfs]
 [ ] ? __thread_create+0x2df/0x2df [spl]
 [ ] ? thread_generic_wrapper+0x6a/0x75 [spl]
 [ ] ? kthread+0x7a/0x82
 [ ] ? kernel_thread_helper+0x4/0x10
 [ ] ? kthread_worker_fn+0x139/0x139
 [ ] ? gs_change+0xb/0xb

rsync D ffff88020b19f008 0 19506 19505 0x00000000
 ffff88020b19ed60 0000000000000082 0000000000000001 0000000000000007
 ffff880151257930 000000000000ffc0 ffff88015f65dfd8 0000000000004000
 ffff88015f65dfd8 000000000000ffc0 ffff88020b19ed60 ffff88015f65c010
Call Trace:
 [ ] ? extract_buf+0x76/0xc2
 [ ] ? invalidate_interrupt1+0xe/0x20
 [ ] ? __mutex_lock_slowpath+0xe2/0x128
 [ ] ? mutex_lock+0x12/0x25
 [ ] ? vdev_cache_read+0x92/0x2de [zfs]
 [ ] ? zio_vdev_io_start+0x1c1/0x228 [zfs]
 [ ] ? zio_nowait+0xd0/0xf4 [zfs]
 [ ] ? vdev_mirror_io_start+0x2fa/0x313 [zfs]
 [ ] ? vdev_config_generate+0xac7/0xac7 [zfs]
 [ ] ? zio_nowait+0xd0/0xf4 [zfs]
 [ ] ? arc_read_nolock+0x662/0x673 [zfs]
 [ ] ? arc_read+0xc2/0x146 [zfs]
 [ ] ? dnode_block_freed+0xfe/0x119 [zfs]
 [ ] ? dbuf_fill_done+0x61/0x61 [zfs]
 [ ] ? dbuf_read+0x3ce/0x5c0 [zfs]
 [ ] ? dnode_hold_impl+0x1a8/0x43c [zfs]
 [ ] ? remove_reference+0x93/0x9f [zfs]
 [ ] ? dmu_bonus_hold+0x22/0x26e [zfs]
 [ ] ? zfs_zget+0x5c/0x19f [zfs]
 [ ] ? zfs_dirent_lock+0x447/0x48f [zfs]
 [ ] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
 [ ] ? zfs_dirlook+0x20a/0x276 [zfs]
 [ ] ? zfs_lookup+0x26e/0x2b6 [zfs]
 [ ] ? zpl_lookup+0x47/0x80 [zfs]
 [ ] ? d_alloc_and_lookup+0x43/0x60
 [ ] ? do_lookup+0x1c9/0x2bb
 [ ] ? path_lookupat+0xe2/0x5af
 [ ] ? do_path_lookup+0x1d/0x5f
 [ ] ? user_path_at_empty+0x49/0x84
 [ ] ? tsd_exit+0x83/0x18d [spl]
 [ ] ? cp_new_stat+0xdf/0xf1
 [ ] ? vfs_fstatat+0x43/0x70
 [ ] ? sys_newlstat+0x11/0x2d
 [ ] ? system_call_fastpath+0x16/0x1b

rsync D ffff8801ba0c6cb8 0 19511 19510 0x00000000
 ffff8801ba0c6a10 0000000000000086 ffffffff8115539c 0000000000000001
 ffff8801512572a0 000000000000ffc0 ffff8801b1051fd8 0000000000004000
 ffff8801b1051fd8 000000000000ffc0 ffff8801ba0c6a10 ffff8801b1050010
Call Trace:
 [ ] ? cpumask_any_but+0x28/0x37
 [ ] ? __schedule+0x727/0x7b0
 [ ] ? rcu_implicit_dynticks_qs+0x3f/0x60
 [ ] ? force_quiescent_state+0x1c3/0x230
 [ ] ? __mutex_lock_slowpath+0xe2/0x128
 [ ] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
 [ ] ? mutex_lock+0x12/0x25
 [ ] ? zfs_zinactive+0x5a/0xd4 [zfs]
 [ ] ? zfs_inactive+0x106/0x19e [zfs]
 [ ] ? evict+0x78/0x117
 [ ] ? dispose_list+0x2c/0x36
 [ ] ? shrink_icache_memory+0x278/0x2a8
 [ ] ? shrink_slab+0xe3/0x153
 [ ] ? do_try_to_free_pages+0x253/0x3f0
 [ ] ? get_page_from_freelist+0x47a/0x4ad
 [ ] ? try_to_free_pages+0x79/0x7e
 [ ] ? __alloc_pages_nodemask+0x48b/0x6c8
 [ ] ? __get_free_pages+0x12/0x52
 [ ] ? spl_kmem_cache_alloc+0x236/0x975 [spl]
 [ ] ? dbuf_create+0x38/0x32e [zfs]
 [ ] ? dnode_hold_impl+0x3f0/0x43c [zfs]
 [ ] ? dbuf_create_bonus+0x16/0x1f [zfs]
 [ ] ? dmu_bonus_hold+0x10b/0x26e [zfs]
 [ ] ? zfs_zget+0x5c/0x19f [zfs]
 [ ] ? zfs_dirent_lock+0x447/0x48f [zfs]
 [ ] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
 [ ] ? zfs_dirlook+0x20a/0x276 [zfs]
 [ ] ? zfs_lookup+0x26e/0x2b6 [zfs]
 [ ] ? zpl_lookup+0x47/0x80 [zfs]
 [ ] ? d_alloc_and_lookup+0x43/0x60
 [ ] ? do_lookup+0x1c9/0x2bb
 [ ] ? path_lookupat+0xe2/0x5af
 [ ] ? strncpy_from_user+0x9/0x4e
 [ ] ? do_path_lookup+0x1d/0x5f
 [ ] ? user_path_at_empty+0x49/0x84
 [ ] ? tsd_exit+0x83/0x18d [spl]
 [ ] ? cp_new_stat+0xdf/0xf1
 [ ] ? vfs_fstatat+0x43/0x70
 [ ] ? sys_newlstat+0x11/0x2d
 [ ] ? system_call_fastpath+0x16/0x1b

rsync D ffff88022365e728 0 19522 19521 0x00000000
 ffff88022365e480 0000000000000082 ffffffff8115539c ffff8801b219b438
 ffff8802219bcea0 000000000000ffc0 ffff8801b219bfd8 0000000000004000
 ffff8801b219bfd8 000000000000ffc0 ffff88022365e480 ffff8801b219a010
Call Trace:
 [ ] ? cpumask_any_but+0x28/0x37
 [ ] ? __pagevec_release+0x19/0x22
 [ ] ? move_active_pages_to_lru+0x130/0x154
 [ ] ? select_task_rq_fair+0x35e/0x791
 [ ] ? common_interrupt+0xe/0x13
 [ ] ? sched_clock_local+0x13/0x76
 [ ] ? __mutex_lock_slowpath+0xe2/0x128
 [ ] ? check_preempt_curr+0x25/0x62
 [ ] ? mutex_lock+0x12/0x25
 [ ] ? zfs_zinactive+0x5a/0xd4 [zfs]
 [ ] ? zfs_inactive+0x106/0x19e [zfs]
 [ ] ? evict+0x78/0x117
 [ ] ? dispose_list+0x2c/0x36
 [ ] ? shrink_icache_memory+0x278/0x2a8
 [ ] ? shrink_slab+0xe3/0x153
 [ ] ? do_try_to_free_pages+0x253/0x3f0
 [ ] ? get_page_from_freelist+0x47a/0x4ad
 [ ] ? try_to_free_pages+0x79/0x7e
 [ ] ? __alloc_pages_nodemask+0x48b/0x6c8
 [ ] ? __get_free_pages+0x12/0x52
 [ ] ? spl_kmem_cache_alloc+0x236/0x975 [spl]
 [ ] ? dbuf_create+0x38/0x32e [zfs]
 [ ] ? dnode_hold_impl+0x3f0/0x43c [zfs]
 [ ] ? dbuf_create_bonus+0x16/0x1f [zfs]
 [ ] ? dmu_bonus_hold+0x10b/0x26e [zfs]
 [ ] ? zfs_zget+0x5c/0x19f [zfs]
 [ ] ? zfs_dirent_lock+0x447/0x48f [zfs]
 [ ] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
 [ ] ? zfs_dirlook+0x20a/0x276 [zfs]
 [ ] ? zfs_lookup+0x26e/0x2b6 [zfs]
 [ ] ? zpl_lookup+0x47/0x80 [zfs]
 [ ] ? d_alloc_and_lookup+0x43/0x60
 [ ] ? do_lookup+0x1c9/0x2bb
 [ ] ? path_lookupat+0xe2/0x5af
 [ ] ? do_path_lookup+0x1d/0x5f
 [ ] ? user_path_at_empty+0x49/0x84
 [ ] ? tsd_exit+0x83/0x18d [spl]
 [ ] ? cp_new_stat+0xdf/0xf1
 [ ] ? vfs_fstatat+0x43/0x70
 [ ] ? sys_newlstat+0x11/0x2d
 [ ] ? system_call_fastpath+0x16/0x1b

rsync D ffff8802216c93c8 0 19527 19526 0x00000000
 ffff8802216c9120 0000000000000082 000000000000000a ffff880000000000
 ffff88022e8a6750 000000000000ffc0 ffff88021c0f1fd8 0000000000004000
 ffff88021c0f1fd8 000000000000ffc0 ffff8802216c9120 ffff88021c0f0010
Call Trace:
 [ ] ? zap_get_leaf_byblk+0x1b5/0x249 [zfs]
 [ ] ? zap_leaf_array_match+0x166/0x197 [zfs]
 [ ] ? remove_reference+0x93/0x9f [zfs]
 [ ] ? arc_buf_remove_ref+0xe6/0xf4 [zfs]
 [ ] ? dbuf_rele_and_unlock+0x12b/0x19a [zfs]
 [ ] ? __mutex_lock_slowpath+0xe2/0x128
 [ ] ? mutex_lock+0x12/0x25
 [ ] ? zfs_zget+0x46/0x19f [zfs]
 [ ] ? zfs_dirent_lock+0x447/0x48f [zfs]
 [ ] ? zfs_zaccess_aces_check+0x1d5/0x203 [zfs]
 [ ] ? zfs_dirlook+0x20a/0x276 [zfs]
 [ ] ? zfs_lookup+0x26e/0x2b6 [zfs]
 [ ] ? zpl_lookup+0x47/0x80 [zfs]
 [ ] ? d_alloc_and_lookup+0x43/0x60
 [ ] ? do_lookup+0x1c9/0x2bb
 [ ] ? path_lookupat+0xe2/0x5af
 [ ] ? do_path_lookup+0x1d/0x5f
 [ ] ? user_path_at_empty+0x49/0x84
 [ ] ? tsd_exit+0x83/0x18d [spl]
 [ ] ? cp_new_stat+0xdf/0xf1
 [ ] ? vfs_fstatat+0x43/0x70
 [ ] ? sys_newlstat+0x11/0x2d
 [ ] ? system_call_fastpath+0x16/0x1b
zfs-fuse has pool v26 available in the Git repo.
[Edit: full arcstats included]
We're working on some SPL kmem improvements which may help with this. Expect patches in the next week or so. Two cases have been identified which can result in memory reclaim from the slab being less than optimal.
I'm also experiencing this exact issue when attempting an rsync. Has there been any progress; is there anything I should try? I'm running 0.6.0-rc8 on Debian wheezy/sid using the deb instructions on the website.
Several patches which should help with this have been merged into master post-rc8. Please try the latest master source. Additionally, you can try increasing /proc/sys/vm/min_free_kbytes to, say, 256m or so.
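For reference, one way to apply that suggestion (a sketch; min_free_kbytes is expressed in kilobytes, so 256 MB corresponds to 262144):

# take effect immediately
sysctl -w vm.min_free_kbytes=262144
# persist the setting across reboots
echo "vm.min_free_kbytes = 262144" >> /etc/sysctl.conf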
I was actually just looking at #154 and it seemed to be related, and was just looking for confirmation that I should indeed try the latest master. So, thanks, will do, and will report back.
@jspiros Is there anything useful being output to dmesg?
You might want to run sysctl kernel.hung_task_timeout_secs=15
to reduce the amount of time it takes for the hang check timer to report hangs to dmesg from 120 seconds. This requires that your kernel is compiled with CONFIG_HANGCHECK_TIMER=Y.
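Roughly, on the command line (a sketch; the sysctl is only available when the kernel's hung-task detection support is built in):

# shorten the hung-task report interval from the default 120 seconds
sysctl -w kernel.hung_task_timeout_secs=15
# then watch for the reports as they arrive
dmesg | grep -A 40 "blocked for more than"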
@ryao Indeed, there were many hung tasks, even with the default of 120 seconds. The first three were kswapd0, arc_reclaim, and rsync, though many others hung after that.
I'm currently running the latest master, and I set min_free_kbytes to 256MB as recommended, but if you tell me what information you want should the same problem occur again tonight (I have some rsync backups that run nightly), I'll try to get it here. I was able to run some commands as root, including dmesg, so just tell me what would be helpful and I'll be sure to get as much information as I can.
The hung task information in the dmesg output would be useful.
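If the box is still responsive when it wedges, a blocked-task dump can also be requested on demand instead of waiting for the 120-second timer. A sketch, assuming the magic SysRq interface is enabled in your kernel:

# enable SysRq if it is not already
echo 1 > /proc/sys/kernel/sysrq
# ask the kernel to dump every task stuck in the D state
echo w > /proc/sysrq-trigger
# capture the result
dmesg | tail -n 300 > blocked-tasks.txt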
Fresh git pulls from this morning on both spl and zfs. Versions are 2371321e8a3d29e19cbc3a280f48636211abb480 and ee191e802c515ae509aec382cac33c9d89235dc5
ARC max size is 2^31 bytes. It's the same setup as in my first post.
# cat /proc/spl/kstat/zfs/arcstats
10 1 0x01 77 3696 216989312967 3547642914356
name type data
hits 4 4820390
misses 4 1632473
demand_data_hits 4 2
demand_data_misses 4 58777
demand_metadata_hits 4 4437436
demand_metadata_misses 4 875202
prefetch_data_hits 4 0
prefetch_data_misses 4 0
prefetch_metadata_hits 4 382952
prefetch_metadata_misses 4 698494
mru_hits 4 676077
mru_ghost_hits 4 483254
mfu_hits 4 3761394
mfu_ghost_hits 4 483721
deleted 4 418327
recycle_miss 4 643047
mutex_miss 4 2093
evict_skip 4 96315269
evict_l2_cached 4 0
evict_l2_eligible 4 15802661888
evict_l2_ineligible 4 7292848128
hash_elements 4 267779
hash_elements_max 4 299201
hash_collisions 4 328936
hash_chains 4 71327
hash_chain_max 4 8
p 4 135680512
c 4 2147483648
c_min 4 293601280
c_max 4 2147483648
size 4 2306523480
hdr_size 4 90425088
data_size 4 1618198528
other_size 4 597899864
anon_size 4 3948544
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 199776256
mru_evict_data 4 0
mru_evict_metadata 4 0
mru_ghost_size 4 1946481152
mru_ghost_evict_data 4 16462336
mru_ghost_evict_metadata 4 1930018816
mfu_size 4 1414473728
mfu_evict_data 4 0
mfu_evict_metadata 4 0
mfu_ghost_size 4 200998912
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 200998912
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_hdr_size 4 0
memory_throttle_count 4 0
memory_direct_count 4 0
memory_indirect_count 4 0
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 1992
arc_meta_used 4 2306523480
arc_meta_limit 4 536870912
arc_meta_max 4 2657959776
I caught it before the OS threatened to lock up and SIGSTOP'd all my rsync processes. While they were paused the arc 'size' was slowly dropping, and I resumed once it got below 1.6G. Right now the size is 2,718,915,552 and still climbing unless I stop it again.
@behlendorf Your suggestions helped me make it through the nightly rsync, and everything seems to be working fine.
@ryao Due to this, I'm afraid I do not currently have any dmesg output for you (yet). If something goes wrong again, I'll provide it, of course.
@jspiros That's good news. The hope is that going forward some of Richard's VM work will remove the need for this sort of tuning, but we're not quite there yet.
I (partially?) retract my previous post. I left the rsync running so it would presumably ruin the system, but it didn't die. That was some hours ago and the ARC is now restraining itself at about 570-580 megabytes. I'm turning up the heat a bit to see if I can coerce a useful stack dump out of it or find out if it properly behaves now.
Edit: after turning up the pressure the ARC did eventually grow to the breaking point. It seems to fare better but still cracks under intense pressure.
Okay, well, this morning was doing alright until just now. It probably doesn't help that I decided to scrub my pool, which will take about three days (18TB in [usable] capacity), overlapping with the nightly rsync.
Here's what dmesg is looking like right now.
[161160.692017] INFO: task kswapd0:33 blocked for more than 120 seconds.
[161160.692055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.692093] kswapd0 D ffff88042f388240 0 33 2 0x00000000
[161160.692098] ffff88042f388240 0000000000000046 ffff8801a8c8b600 ffff88042df12830
[161160.692103] 0000000000013540 ffff88042c3affd8 ffff88042c3affd8 ffff88042f388240
[161160.692108] ffff88042d645800 ffffffffa0e81aa4 00000000000080d0 ffff880412b89c20
[161160.692113] Call Trace:
[161160.692158] [<ffffffffa0e81aa4>] ? sa_attr_register_sync.isra.6+0x19d/0x1ac [zfs]
[161160.692169] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.692172] [<ffffffff810363c7>] ? should_resched+0x5/0x23
[161160.692175] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.692178] [<ffffffff810ea5e0>] ? kfree+0x5b/0x6c
[161160.692194] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.692212] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.692216] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.692218] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.692221] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.692224] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.692227] [<ffffffff810c13f5>] ? shrink_slab+0x18f/0x24d
[161160.692229] [<ffffffff810c3cea>] ? balance_pgdat+0x283/0x4b7
[161160.692232] [<ffffffff810c41e8>] ? kswapd+0x2ca/0x2e6
[161160.692235] [<ffffffff8105f533>] ? add_wait_queue+0x3c/0x3c
[161160.692237] [<ffffffff810c3f1e>] ? balance_pgdat+0x4b7/0x4b7
[161160.692239] [<ffffffff8105eeed>] ? kthread+0x76/0x7e
[161160.692243] [<ffffffff81350534>] ? kernel_thread_helper+0x4/0x10
[161160.692245] [<ffffffff8105ee77>] ? kthread_worker_fn+0x139/0x139
[161160.692248] [<ffffffff81350530>] ? gs_change+0x13/0x13
[161160.692258] INFO: task arc_adapt:954 blocked for more than 120 seconds.
[161160.692281] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.692308] arc_adapt D ffff88042bf18140 0 954 2 0x00000000
[161160.692311] ffff88042bf18140 0000000000000046 ffff880300000001 ffff88041b5690a0
[161160.692315] 0000000000013540 ffff88040ed4dfd8 ffff88040ed4dfd8 ffff88042bf18140
[161160.692318] ffff88042d645800 00000001a0e81aa4 ffff880363073680 ffff880412b89c20
[161160.692321] Call Trace:
[161160.692324] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.692326] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.692345] [<ffffffffa0e914c2>] ? txg_rele_to_sync+0x48/0x55 [zfs]
[161160.692361] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.692377] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.692381] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.692383] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.692385] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.692388] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.692404] [<ffffffffa0eb47fa>] ? zfs_sb_prune+0x76/0x9a [zfs]
[161160.692420] [<ffffffffa0ecab51>] ? zpl_inode_alloc+0x66/0x66 [zfs]
[161160.692434] [<ffffffffa0ecab6a>] ? zpl_prune_sb+0x19/0x1e [zfs]
[161160.692438] [<ffffffff810fab4d>] ? iterate_supers_type+0x61/0xc2
[161160.692452] [<ffffffffa0ecab6f>] ? zpl_prune_sb+0x1e/0x1e [zfs]
[161160.692467] [<ffffffffa0ecab9d>] ? zpl_prune_sbs+0x2e/0x39 [zfs]
[161160.692479] [<ffffffffa0e517a3>] ? arc_adjust_meta+0x101/0x171 [zfs]
[161160.692490] [<ffffffffa0e51813>] ? arc_adjust_meta+0x171/0x171 [zfs]
[161160.692503] [<ffffffffa0e51864>] ? arc_adapt_thread+0x51/0xb4 [zfs]
[161160.692510] [<ffffffffa0440848>] ? thread_generic_wrapper+0x6a/0x75 [spl]
[161160.692517] [<ffffffffa04407de>] ? __thread_create+0x2be/0x2be [spl]
[161160.692519] [<ffffffff8105eeed>] ? kthread+0x76/0x7e
[161160.692522] [<ffffffff81350534>] ? kernel_thread_helper+0x4/0x10
[161160.692525] [<ffffffff8105ee77>] ? kthread_worker_fn+0x139/0x139
[161160.692527] [<ffffffff81350530>] ? gs_change+0x13/0x13
[161160.692546] INFO: task txg_quiesce:1178 blocked for more than 120 seconds.
[161160.692570] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.692597] txg_quiesce D ffff88043fc13540 0 1178 2 0x00000000
[161160.692599] ffff8804161a7020 0000000000000046 ffffffff00000000 ffffffff8160d020
[161160.692603] 0000000000013540 ffff88041a32dfd8 ffff88041a32dfd8 ffff8804161a7020
[161160.692606] 0000000000000202 0000000181349424 ffffc900066d12a8 ffffc900066d12a0
[161160.692609] Call Trace:
[161160.692616] [<ffffffffa0445ec6>] ? cv_wait_common+0xb8/0x141 [spl]
[161160.692619] [<ffffffff8105f533>] ? add_wait_queue+0x3c/0x3c
[161160.692638] [<ffffffffa0e92330>] ? txg_quiesce_thread+0x202/0x2ae [zfs]
[161160.692659] [<ffffffffa0e9212e>] ? txg_sync_thread+0x3c6/0x3c6 [zfs]
[161160.692666] [<ffffffffa0440848>] ? thread_generic_wrapper+0x6a/0x75 [spl]
[161160.692672] [<ffffffffa04407de>] ? __thread_create+0x2be/0x2be [spl]
[161160.692675] [<ffffffff8105eeed>] ? kthread+0x76/0x7e
[161160.692677] [<ffffffff81350534>] ? kernel_thread_helper+0x4/0x10
[161160.692680] [<ffffffff8105ee77>] ? kthread_worker_fn+0x139/0x139
[161160.692682] [<ffffffff81350530>] ? gs_change+0x13/0x13
[161160.692693] INFO: task avahi-daemon:3937 blocked for more than 120 seconds.
[161160.692718] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.692745] avahi-daemon D ffff88043fc13540 0 3937 1 0x00000000
[161160.692748] ffff8804247827b0 0000000000000082 ffff88033734d440 ffffffff8160d020
[161160.692751] 0000000000013540 ffff8804236f9fd8 ffff8804236f9fd8 ffff8804247827b0
[161160.692754] ffff88042d645800 ffffffffa0e81aa4 ffff88033734d440 ffff880412b89c20
[161160.692757] Call Trace:
[161160.692777] [<ffffffffa0e81aa4>] ? sa_attr_register_sync.isra.6+0x19d/0x1ac [zfs]
[161160.692780] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.692782] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.692801] [<ffffffffa0e914c2>] ? txg_rele_to_sync+0x48/0x55 [zfs]
[161160.692816] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.692832] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.692836] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.692838] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.692840] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.692843] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.692845] [<ffffffff810c13f5>] ? shrink_slab+0x18f/0x24d
[161160.692848] [<ffffffff810c354d>] ? do_try_to_free_pages+0x25f/0x3de
[161160.692850] [<ffffffff810c3937>] ? try_to_free_pages+0xa9/0xe9
[161160.692853] [<ffffffff810baa2b>] ? __alloc_pages_nodemask+0x4a9/0x726
[161160.692856] [<ffffffff810e3dec>] ? alloc_pages_current+0xb2/0xcc
[161160.692859] [<ffffffff810b7631>] ? __get_free_pages+0x6/0x42
[161160.692862] [<ffffffff81106ebd>] ? __pollwait+0x56/0xce
[161160.692864] [<ffffffff810ff741>] ? pipe_poll+0x2a/0x7b
[161160.692866] [<ffffffff81107b8f>] ? do_sys_poll+0x1dc/0x347
[161160.692869] [<ffffffff81106e67>] ? poll_freewait+0x97/0x97
[161160.692871] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692873] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692876] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692878] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692880] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692883] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692885] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692887] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692889] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.692892] [<ffffffff8104b197>] ? timespec_add_safe+0x32/0x63
[161160.692895] [<ffffffff81107d85>] ? sys_poll+0x43/0xab
[161160.692897] [<ffffffff8134e3d2>] ? system_call_fastpath+0x16/0x1b
[161160.692917] INFO: task ntpd:4648 blocked for more than 120 seconds.
[161160.692939] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.692966] ntpd D ffff88041d332ee0 0 4648 1 0x00000004
[161160.692969] ffff88041d332ee0 0000000000000082 ffff880115471cc0 ffff88014956a080
[161160.692972] 0000000000013540 ffff880426907fd8 ffff880426907fd8 ffff88041d332ee0
[161160.692976] ffff88042d645800 ffffffffa0e81aa4 ffff880115471cc0 ffff880412b89c20
[161160.692979] Call Trace:
[161160.692998] [<ffffffffa0e81aa4>] ? sa_attr_register_sync.isra.6+0x19d/0x1ac [zfs]
[161160.693001] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.693003] [<ffffffff810363c7>] ? should_resched+0x5/0x23
[161160.693006] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.693009] [<ffffffff8107066d>] ? arch_local_irq_save+0x11/0x17
[161160.693024] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.693041] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.693045] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.693047] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.693049] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.693052] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.693054] [<ffffffff810c13f5>] ? shrink_slab+0x18f/0x24d
[161160.693056] [<ffffffff810c354d>] ? do_try_to_free_pages+0x25f/0x3de
[161160.693059] [<ffffffff810c3937>] ? try_to_free_pages+0xa9/0xe9
[161160.693061] [<ffffffff810baa2b>] ? __alloc_pages_nodemask+0x4a9/0x726
[161160.693064] [<ffffffff810e3dec>] ? alloc_pages_current+0xb2/0xcc
[161160.693066] [<ffffffff810b88a9>] ? get_pageblock_flags_group+0x3a/0x6e
[161160.693069] [<ffffffff810b7631>] ? __get_free_pages+0x6/0x42
[161160.693071] [<ffffffff81106ebd>] ? __pollwait+0x56/0xce
[161160.693074] [<ffffffff81283dff>] ? datagram_poll+0x1b/0xd0
[161160.693077] [<ffffffff812d0741>] ? udp_poll+0xf/0x42
[161160.693079] [<ffffffff81107394>] ? do_select+0x2b0/0x46a
[161160.693082] [<ffffffff81106e67>] ? poll_freewait+0x97/0x97
[161160.693084] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693086] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693089] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693091] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693093] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693096] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693098] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693100] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693102] [<ffffffff81106f35>] ? __pollwait+0xce/0xce
[161160.693105] [<ffffffff811076c3>] ? core_sys_select+0x175/0x21a
[161160.693107] [<ffffffff810363c7>] ? should_resched+0x5/0x23
[161160.693110] [<ffffffff81055c48>] ? set_current_blocked+0x2d/0x43
[161160.693113] [<ffffffff8100e342>] ? do_signal+0x547/0x610
[161160.693116] [<ffffffff810363c7>] ? should_resched+0x5/0x23
[161160.693118] [<ffffffff8134810f>] ? _cond_resched+0x7/0x1c
[161160.693121] [<ffffffff810159d8>] ? check_for_xstate+0x27/0x7c
[161160.693123] [<ffffffff81015cc5>] ? restore_i387_xstate+0xaa/0x15d
[161160.693125] [<ffffffff810566b3>] ? do_sigaltstack+0xaa/0x13e
[161160.693128] [<ffffffff811077f0>] ? sys_select+0x88/0xad
[161160.693130] [<ffffffff8134e8ec>] ? stub_rt_sigreturn+0x6c/0xc0
[161160.693133] [<ffffffff8134e3d2>] ? system_call_fastpath+0x16/0x1b
[161160.693135] INFO: task radvd:4719 blocked for more than 120 seconds.
[161160.693157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.693183] radvd D ffff88041d389610 0 4719 1 0x00000000
[161160.693186] ffff88041d389610 0000000000000082 ffff880288008f00 ffff880427534080
[161160.693189] 0000000000013540 ffff8804201d3fd8 ffff8804201d3fd8 ffff88041d389610
[161160.693193] ffff88042d645800 ffffffffa0e81aa4 ffff880288008f00 ffff880412b89c20
[161160.693196] Call Trace:
[161160.693214] [<ffffffffa0e81aa4>] ? sa_attr_register_sync.isra.6+0x19d/0x1ac [zfs]
[161160.693217] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.693220] [<ffffffff810363c7>] ? should_resched+0x5/0x23
[161160.693222] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.693224] [<ffffffff810ea5e0>] ? kfree+0x5b/0x6c
[161160.693240] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.693257] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.693261] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.693263] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.693265] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.693268] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.693270] [<ffffffff810c13f5>] ? shrink_slab+0x18f/0x24d
[161160.693272] [<ffffffff810c354d>] ? do_try_to_free_pages+0x25f/0x3de
[161160.693275] [<ffffffff810c3937>] ? try_to_free_pages+0xa9/0xe9
[161160.693277] [<ffffffff810baa2b>] ? __alloc_pages_nodemask+0x4a9/0x726
[161160.693281] [<ffffffff8113e179>] ? proc_alloc_inode+0x96/0x96
[161160.693284] [<ffffffff810e9b55>] ? kmem_getpages+0x4c/0xd9
[161160.693287] [<ffffffff810eaef5>] ? fallback_alloc+0x123/0x1c2
[161160.693289] [<ffffffff810eb4c8>] ? kmem_cache_alloc_trace+0x79/0xce
[161160.693292] [<ffffffff8111160b>] ? seq_read+0xb5/0x34c
[161160.693294] [<ffffffff8113e6f3>] ? proc_reg_read+0x78/0x91
[161160.693297] [<ffffffff810f9321>] ? vfs_read+0x9f/0xe6
[161160.693300] [<ffffffff810d5933>] ? sys_mmap_pgoff+0x125/0x160
[161160.693302] [<ffffffff810f93ad>] ? sys_read+0x45/0x6b
[161160.693304] [<ffffffff8134e3d2>] ? system_call_fastpath+0x16/0x1b
[161160.693308] INFO: task dropbox:4920 blocked for more than 120 seconds.
[161160.693331] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.693357] dropbox D ffff88043fd13540 0 4920 1 0x00000000
[161160.693360] ffff880427534080 0000000000000086 ffff88018fc43600 ffff88042d074730
[161160.693363] 0000000000013540 ffff8803f479bfd8 ffff8803f479bfd8 ffff880427534080
[161160.693366] ffff88042d645800 ffffffffa0e81aa4 ffff88018fc43600 ffff880412b89c20
[161160.693369] Call Trace:
[161160.693388] [<ffffffffa0e81aa4>] ? sa_attr_register_sync.isra.6+0x19d/0x1ac [zfs]
[161160.693391] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.693394] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.693413] [<ffffffffa0e914c2>] ? txg_rele_to_sync+0x48/0x55 [zfs]
[161160.693428] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.693444] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.693448] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.693450] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.693452] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.693455] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.693457] [<ffffffff810c13f5>] ? shrink_slab+0x18f/0x24d
[161160.693459] [<ffffffff810c354d>] ? do_try_to_free_pages+0x25f/0x3de
[161160.693462] [<ffffffff810c3937>] ? try_to_free_pages+0xa9/0xe9
[161160.693464] [<ffffffff810baa2b>] ? __alloc_pages_nodemask+0x4a9/0x726
[161160.693467] [<ffffffff810e5123>] ? alloc_pages_vma+0xe2/0xe7
[161160.693470] [<ffffffff810cff31>] ? handle_pte_fault+0x165/0x79f
[161160.693472] [<ffffffff810cd641>] ? pte_offset_kernel+0x16/0x35
[161160.693475] [<ffffffff8134c3c3>] ? do_page_fault+0x312/0x337
[161160.693477] [<ffffffff81037e30>] ? set_next_entity+0x32/0x55
[161160.693480] [<ffffffff8100d755>] ? __switch_to+0x1e5/0x258
[161160.693482] [<ffffffff810397ec>] ? finish_task_switch+0x88/0xba
[161160.693485] [<ffffffff813480b4>] ? __schedule+0x5ac/0x5c3
[161160.693487] [<ffffffff813499b5>] ? page_fault+0x25/0x30
[161160.693492] INFO: task irssi:7797 blocked for more than 120 seconds.
[161160.693514] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.693540] irssi D ffff88043fc13540 0 7797 4834 0x00000000
[161160.693543] ffff8804269de300 0000000000000082 ffff880157971800 ffff88042d075510
[161160.693546] 0000000000013540 ffff880366853fd8 ffff880366853fd8 ffff8804269de300
[161160.693549] ffff88042d645800 ffffffffa0e81aa4 ffff880157971800 ffff880412b89c20
[161160.693553] Call Trace:
[161160.693572] [<ffffffffa0e81aa4>] ? sa_attr_register_sync.isra.6+0x19d/0x1ac [zfs]
[161160.693575] [<ffffffff81348a0f>] ? __mutex_lock_common.isra.5+0xff/0x164
[161160.693577] [<ffffffff813488fd>] ? mutex_lock+0x1a/0x2d
[161160.693596] [<ffffffffa0e914c2>] ? txg_rele_to_sync+0x48/0x55 [zfs]
[161160.693611] [<ffffffffa0ec0518>] ? zfs_zinactive+0x43/0xa7 [zfs]
[161160.693627] [<ffffffffa0eb72e6>] ? zfs_inactive+0x102/0x139 [zfs]
[161160.693631] [<ffffffff8110bb48>] ? evict+0x96/0x148
[161160.693633] [<ffffffff8110bdb2>] ? dispose_list+0x27/0x31
[161160.693636] [<ffffffff8110c8f2>] ? prune_icache_sb+0x24f/0x25e
[161160.693638] [<ffffffff810fb5cb>] ? prune_super+0xd5/0x13f
[161160.693640] [<ffffffff810c13f5>] ? shrink_slab+0x18f/0x24d
[161160.693643] [<ffffffff810c354d>] ? do_try_to_free_pages+0x25f/0x3de
[161160.693645] [<ffffffff810c3937>] ? try_to_free_pages+0xa9/0xe9
[161160.693648] [<ffffffff810baa2b>] ? __alloc_pages_nodemask+0x4a9/0x726
[161160.693651] [<ffffffff810e9b55>] ? kmem_getpages+0x4c/0xd9
[161160.693653] [<ffffffff810eaef5>] ? fallback_alloc+0x123/0x1c2
[161160.693656] [<ffffffff810ebb5c>] ? kmem_cache_alloc+0x73/0xea
[161160.693658] [<ffffffff811254c6>] ? fsnotify_create_event+0x37/0x141
[161160.693661] [<ffffffff81124bb5>] ? send_to_group.isra.1+0x130/0x177
[161160.693663] [<ffffffff81124d56>] ? fsnotify+0x15a/0x20c
[161160.693665] [<ffffffff81124f9f>] ? __fsnotify_parent+0x8a/0xa3
[161160.693668] [<ffffffff810f8a36>] ? fsnotify_modify+0x3b/0x5c
[161160.693670] [<ffffffff810f9254>] ? vfs_write+0xbb/0xe9
[161160.693672] [<ffffffff810f9418>] ? sys_write+0x45/0x6b
[161160.693674] [<ffffffff8134e3d2>] ? system_call_fastpath+0x16/0x1b
[161160.693679] INFO: task java:598 blocked for more than 120 seconds.
[161160.693700] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[161160.693727] java D ffff88043fc93540 0 598 1 0x00000000
[161160.693730] ffff880425297850 0000000000000082 0000000000000000 ffff88042f1140c0
[161160.693733] 0000000000013540 ffff8801062bdfd8 ffff8801062bdfd8 ffff880425297850
[161160.693736] 0000000000000001 00000001817a8dd0 ffffffff817a8770 ffff880369fe7be0
[161160.693739] Call Trace:
[161160.693742] [<ffffffff813493b7>] ? rwsem_down_failed_common+0xe0/0x114
[161160.693746] [<ffffffff811b1b93>] ? call_rwsem_down_write_failed+0x13/0x20
[161160.693748] [<ffffffff81348d24>] ? down_write+0x25/0x27
[161160.693751] [<ffffffff810d63dd>] ? sys_mprotect+0xc5/0x1fc
[161160.693753] [<ffffffff8134e3d2>] ? system_call_fastpath+0x16/0x1b
I will try to avoid rebooting the system, in case anyone has any suggestions that might resolve this problem online.
Update: Nevermind, I realized that with kswapd0 in deadlock it's highly unlikely that there's anything I could do to fix this problem without a reboot, so I rebooted. On an unrelated note, I'm happy to see that the scrub is continuing from where it left off.
While using ryao's "gentoo" branch of his own fork here on GitHub I killed the same machine again. Here are more data dumps.
Edit: must have been too much for github. Here's a pastebin of what I wanted: http://pastebin.com/303uhMen
After talking with ryao in IRC I've been asked to add a few points to this issue.
The vdev is a single device, /dev/sda, which is actually 13 drives in a hardware RAID-6 array. It can churn out pretty high IOPS for random reads, which is the significant portion of the system's job: lots of rsync instances (this box is a sync target) all scanning what amounts to tens of millions of files. It's also compressed with a ratio of about 1.7x, so it grows pretty fast. (No dedup.)
The 'size' field from the kstats for the ARC, during heavy IO, just grows past the c and c_max fields, but there also seems to be some kind of background job that is draining the ARC at all times down to around 512 megabytes. When completely idle, the size just shrinks mysteriously. When under relatively light load, it may float in place but still stay under c. It seems to stop shrinking at about 512 megabytes whether idle or not. I'd guess that under heavy load the background shrink just isn't keeping up with the foreground fill.
As a hack I can keep the system stable by SIGSTOP'ing all the rsync processes if the size gets dangerously high (my breaking point seems to be around 3.3 billion bytes) and SIGCONT'ing them once it's fallen back down again.
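A rough sketch of scripting that hack, for anyone who wants to limp along the same way (the thresholds are just the numbers mentioned in this thread, nothing more scientific):

HIGH=3000000000   # pause point, just under the ~3.3 billion byte breaking point
LOW=1600000000    # resume point used earlier in this thread

while sleep 10; do
    size=$(awk '$1 == "size" { print $3 }' /proc/spl/kstat/zfs/arcstats)
    if [ "$size" -gt "$HIGH" ]; then
        pkill -STOP -x rsync      # freeze every rsync while the ARC drains
    elif [ "$size" -lt "$LOW" ]; then
        pkill -CONT -x rsync      # let them run again
    fi
done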
In my most recent reports I've been setting sysctl vm.min_free_kbytes to 524288.
After switching vm.min_free_kbytes from 256M to 512M I've been able to survive the nightly rsyncs; however, I'm now running into exactly the same problem, except with CrashPlan (a Java application for remote backups) causing it instead of rsync. It's exactly the same problem in that kswapd0 deadlocks, followed by arc_reclaim and then everything else. I needed to get the system back online and neglected to copy the dmesg output, but I confirmed that it is essentially identical to the previous output from the rsync-caused problem. I am hesitant to turn CrashPlan back on until there's something new to try.
Still trying to collect useful information. I set the arc_max to 256 megabytes just for experimentation. It still deadlocked.
Stack traces and arcstats from kstat in pastebin: http://pastebin.com/cgKZwaSJ
Two sets of stack traces are provided, set about 2.5 hours apart. The first is from long after the system had hung - I don't know when that actually happened.
vm.min_free_kbytes = 67584 (bootup default)

# free -m
             total       used       free     shared    buffers     cached
Mem:          7990       7835        155          0         52        111
-/+ buffers/cache:        7671        319
Swap:         4095          1       4094

# history | grep zfs.ko
insmod zfs.ko zfs_arc_max=268435456 zfs_arc_min=134217728 zfs_arc_shrink_shift=4 zfs_vdev_max_pending=20 zfs_vdev_min_pending=8
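For anyone loading the module with modprobe rather than insmod, the same parameters can be made persistent through modprobe.d; a sketch using the values above (tune to taste):

cat > /etc/modprobe.d/zfs.conf <<'EOF'
options zfs zfs_arc_max=268435456 zfs_arc_min=134217728 zfs_arc_shrink_shift=4 zfs_vdev_max_pending=20 zfs_vdev_min_pending=8
EOF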
I turned on debugging and turned off a bunch of compiler optimizations and locked it up again the same way. These stack traces look a lot nicer to decode.
After speaking with @DeHackEd on IRC about the cause of this problem, I tried limiting the queue depth (to 16 from 32) on the SATA drives that make up my pool using hdparm. I experienced success with this, in that I survived the night with both CrashPlan and the nightly rsyncs running. This isn't a long term fix, as it does slow everything down, but if someone else is in a situation like mine, it might help until this issue is solved.
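Roughly what that workaround looks like, in case anyone else wants to try it (a sketch; the device names are examples only, substitute your pool members):

for dev in /dev/sd[b-e]; do
    hdparm -Q 16 "$dev"              # drop the NCQ queue depth from 32 to 16
done
# the same knob is also exposed through sysfs:
# echo 16 > /sys/block/sdb/device/queue_depth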
The crashing/deadlocking side of the issue is apparently fixed by https://bugs.gentoo.org/show_bug.cgi?id=416685 (which I applied and cleaned up for kernel 3.0.x). Now it can run for over an hour with nothing controlling its execution speed. I'll update if anything goes wrong on me.
While running in this mode the ARC appears to behave itself better with regard to following the c value. This is just a rough observation right now, as the ARC size is unlimited and generally can't reach the maximum due to memory usage. That said, I did witness c shrink below c_max and the size was generally following the new c value. This is just one brief observation though - more grinding is required.
Note: I am currently using ryao's 'gentoo' tree.
The memory management and swap improvements in issue #883 have now been merged into master and should remove the need for the kernel vmalloc() patch. If you could verify that these resolve your issue, I think we'll be able to close this issue.
I can try installing it probably on Tuesday, but it may be a few days before the result is yay/nay.
No great rush, I'm not going anywhere. But it would be nice to know. :)
I'm heading to DjangoCon for a week, so I'm going to wait until I get back to give this a test. Last thing I need is my server dying when I'm not physically around to reboot it. That said, I'm very excited, and look forward to testing it out when I get back. :)
To be clear (this bug has wandered a bit), the changes merged just address the deadlock, not any misbehavior by the ARC. Assuming the deadlock is resolved (I fully expect it will be), I'd suggest we close this bug and open a new one for any remaining bad ARC behavior.
Since I haven't heard otherwise, I'm closing this issue because the deadlock has been fixed. If someone is able to recreate the issue with the master source we can reopen this bug, or file a new one.
# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 77 3696 3042886422925 64409337601438
name type data
hits 4 248393341
misses 4 75936476
demand_data_hits 4 105856
demand_data_misses 4 10155
demand_metadata_hits 4 234545007
demand_metadata_misses 4 40492914
prefetch_data_hits 4 0
prefetch_data_misses 4 11009
prefetch_metadata_hits 4 13742478
prefetch_metadata_misses 4 35422398
mru_hits 4 18658796
mru_ghost_hits 4 8980156
mfu_hits 4 217641924
mfu_ghost_hits 4 33122819
deleted 4 34384069
recycle_miss 4 52654888
mutex_miss 4 171299
evict_skip 4 28471305530
evict_l2_cached 4 769955428864
evict_l2_eligible 4 286991601152
evict_l2_ineligible 4 286747104256
hash_elements 4 2731819
hash_elements_max 4 2731819
hash_collisions 4 20413265
hash_chains 4 506626
hash_chain_max 4 18
p 4 340169571
c 4 5442450994
c_min 4 680306374
c_max 4 5442450994
size 4 7394144008
hdr_size 4 211937584
data_size 4 4567342592
other_size 4 2046281288
anon_size 4 3329536
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 555188224
mru_evict_data 4 0
mru_evict_metadata 4 688128
mru_ghost_size 4 4887883776
mru_ghost_evict_data 4 0
mru_ghost_evict_metadata 4 4887883776
mfu_size 4 4008824832
mfu_evict_data 4 0
mfu_evict_metadata 4 2129920
mfu_ghost_size 4 554135040
mfu_ghost_evict_data 4 0
mfu_ghost_evict_metadata 4 554135040
l2_hits 4 30877561
l2_misses 4 45058896
l2_feeds 4 61336
l2_rw_clash 4 1219
l2_read_bytes 4 520022814720
l2_write_bytes 4 40076504064
l2_writes_sent 4 48373
l2_writes_done 4 48373
l2_writes_error 4 0
l2_writes_hdr_miss 4 75
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 287468
l2_abort_lowmem 4 11
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 37175568896
l2_hdr_size 4 602028576
memory_throttle_count 4 0
memory_direct_count 4 55
memory_indirect_count 4 1127
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 106172
arc_meta_used 4 7394144008
arc_meta_limit 4 1360612748
arc_meta_max 4 8175635800
ZFS git checkout fc24f7c887a040b6dc9f2a3dd3d5ae0c03a5d639 compiled with only 'configure --with-spl=/path/to/spl-build/'
SPL git checkout ac8ca67a88bdb8be08456785116a8824fba735df compiled with 'configure --enable-debug'
Kernel is RHEL 6.3 up-to-date (2.6.32-279.5.2.el6.x86_64)
The point being that we appear to have exceeded the ARC c_max and arc_meta_limit? But no deadlocks, correct?
Correct, in spite of going over my set limits it was/is stable. The backup cycle has now finished and the ARC has shrunk naturally on its own from much reduced activity (currently floating around 1300 megabytes while it does some last minute work).
It's only been one cycle but I currently have no crashes/hangs/etc. Stability testing comes tonight when I turn the pressure up a bit (more concurrent threads) but that's not directly relevant to this issue.
After a bit of playing around I've come to a possible conclusion. When activity is low (load average ~15 or less... yeah, that's low by my standards) the ARC seems to behave normally. When activity is higher (currently ~23) it will go over its limit.
# cat /proc/spl/kstat/zfs/arcstats
4 1 0x01 77 3696 20842511127 88312299340944
name type data
p 4 340219235
c 4 5442450994
c_min 4 680306374
c_max 4 5442450994
size 4 6010999752
hdr_size 4 179171888
data_size 4 3746268672
other_size 4 1539191432
memory_throttle_count 4 0
memory_direct_count 4 5958
memory_indirect_count 4 12149
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 58588
arc_meta_used 4 5986955720
arc_meta_limit 4 1360612748
arc_meta_max 4 7889218664
arc_adapt D 0000000000000001 0 1671 2 0x00000080
 ffff880410659c30 0000000000000046 ffff880410659bc0 ffffffffa04fbb74
 ffff880410659bc0 ffff880024b4d800 0000000000000000 0000000000000001
 ffff880420ac9098 ffff880410659fd8 000000000000fb88 ffff880420ac9098
Call Trace:
 [ ] ? arc_buf_remove_ref+0xd4/0x120 [zfs]
 [ ] ? dmu_buf_rele+0x30/0x40 [zfs]
 [ ] __mutex_lock_slowpath+0x13e/0x180
 [ ] mutex_lock+0x2b/0x50
 [ ] zfs_zinactive+0x7e/0x110 [zfs]
 [ ] zfs_inactive+0x7f/0x220 [zfs]
 [ ] zpl_clear_inode+0xe/0x10 [zfs]
 [ ] clear_inode+0xac/0x140
 [ ] dispose_list+0x40/0x120
 [ ] shrink_icache_memory+0x274/0x2e0
 [ ] ? zpl_prune_sbs+0x0/0x70 [zfs]
 [ ] zpl_prune_sbs+0x57/0x70 [zfs]
 [ ] arc_adjust_meta+0x120/0x1e0 [zfs]
 [ ] ? arc_adapt_thread+0x0/0xd0 [zfs]
 [ ] ? arc_adapt_thread+0x0/0xd0 [zfs]
 [ ] arc_adapt_thread+0x6a/0xd0 [zfs]
 [ ] thread_generic_wrapper+0x71/0xd0 [spl]
 [ ] ? thread_generic_wrapper+0x0/0xd0 [spl]
 [ ] kthread+0x96/0xa0
 [ ] child_rip+0xa/0x20
 [ ] ? kthread+0x0/0xa0
 [ ] ? child_rip+0x0/0x20

txg_sync D 0000000000000000 0 1807 2 0x00000080
 ffff8804000b5ba0 0000000000000046 000000003e5c2a80 0000000000000001
 ffff88023e5c2700 0000000000000082 ffff8804000b5b60 ffff880420d67d48
 ffff880410769af8 ffff8804000b5fd8 000000000000fb88 ffff880410769af8
Call Trace:
 [ ] ? prepare_to_wait_exclusive+0x4e/0x80
 [ ] cv_wait_common+0xe5/0x3d0 [spl]
 [ ] ? zio_execute+0x0/0xf0 [zfs]
 [ ] ? autoremove_wake_function+0x0/0x40
 [ ] __cv_wait+0x13/0x20 [spl]
 [ ] zio_wait+0xeb/0x160 [zfs]
 [ ] dsl_pool_sync+0x2a7/0x480 [zfs]
 [ ] spa_sync+0x397/0x9a0 [zfs]
 [ ] txg_sync_thread+0x2c1/0x490 [zfs]
 [ ] ? set_user_nice+0xc9/0x130
 [ ] ? txg_sync_thread+0x0/0x490 [zfs]
 [ ] thread_generic_wrapper+0x71/0xd0 [spl]
 [ ] ? thread_generic_wrapper+0x0/0xd0 [spl]
 [ ] kthread+0x96/0xa0
 [ ] child_rip+0xa/0x20
 [ ] ? kthread+0x0/0xa0
 [ ] ? child_rip+0x0/0x20
More stack traces available if you want them.
Currently active jobs: 2 pools (independent disks involved), each scrubbing, and a send/receive in progress across them. The bigger pool (10 TB in a single disk) is running a small number of rsyncs (~5) and a dozen or so threads essentially doing the job of du.
Reopening issue. But for clarity, the deadlock has been resolved and the remaining issue here is the fact that the ARC limit isn't being honored.
I am experiencing similar problems. Under heavy load (~60 rsyncs) the ARC grows over its limit, and arc_adapt and kswapd1 use the CPU very heavily. The system freezes for short stretches and performance degradation occurs. In the end, the system hangs. Tested on the latest version from git.
load average: 49.08, 48.50, 43.52
c_max 4 13421772800
size 4 10599128312
arc_meta_used 4 10474886904
arc_meta_limit 4 3355443200

             total       used       free     shared    buffers     cached
Mem:      24736592   24458768     277824          0       2588      45360
-/+ buffers/cache:   24410820     325772
Swap:      7811064        304    7810760

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24718 root       0 -20     0    0    0 R  100  0.0  21:24.07 arc_adapt
  144 root      20   0     0    0    0 R   75  0.0   0:17.07 kswapd1
  143 root      20   0     0    0    0 R   67  0.0   0:30.36 kswapd0
While running a lot of rsync instances (~45 at once... it makes sense in context) I found my ARC expanding beyond the set limit of 2 gigabytes until ZFS deadlocked - almost all rsync processes hung in the D state and some kernel threads as well. A reboot was necessary. It only took about 5-10 minutes of this kind of pressure to break.
Not too long after, all ZFS-involved processes froze up.
Stack dumps for some hung processes:
Machine specs: single-socket quad-core Xeon, 8 GB of RAM.
SPL version: b29012b99994ece46019b664d67dace29e5c2586
ZFS version: 409dc1a570a836737b2a5bb43658cdde703c935e
Kernel version: 3.0.28 vanilla custom build
ZPool version 26 (originally built/run by zfs-fuse)
I've also tried using the module/zfs/arc.c from https://github.com/zfsonlinux/zfs/pull/669 for testing and reducing the ARC size. RAM usage still exceeds the limits set.
Nevertheless it's been running for a few hours now reliably.
(Edit: I also raised vm.min_free_kbytes from its default up to 262144 as part of a shotgun attempt to make this more stable.)