@BjoKaSH thanks for posting this. This does look like the second issue currently being discussed in #4106. The stacks are slightly different but show the deadlock more clearly. Here's the critical bit from the stacks you posted, with annotations for clarity.
txg_sync_thread
spa_sync
dsl_scan_sync
dsl_scan_visitds
dsl_scan_visitbp
...
arc_read - takes hash_lock
__cv_wait
cv_wait_common - increments waiters on b_cv
- drops hash_lock
- schedule() called and task woken
mutex_lock - waiting on hash_lock
arc_reclaim_thread
arc_adjust
arc_adjust_impl
arc_evict_state
arc_evict_state_impl - takes hash_lock
arc_hdr_destroy
spl_kmem_cache_free
hdr_full_dest
__cv_destroy - waiting for waiters to exit b_cv
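To make the interleaving concrete, here is a minimal userspace sketch of the same pattern (plain pthreads plus a hand-rolled waiter counter standing in for the SPL condvar; all names are illustrative and none of this is the actual ZFS/SPL code). It hangs by design, for the same reason as the stacks above: the waiter cannot drop its reference without re-taking the mutex that the destroyer holds while spinning.

```c
/* deadlock_sketch.c - illustrative only, not ZFS/SPL code.
 * Models the interleaving above with pthreads; compile with:
 *   gcc -pthread deadlock_sketch.c -o deadlock_sketch
 * WARNING: this program deadlocks on purpose. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the SPL condvar: a waiter count plus a destroy() that
 * spins until the count drops to zero (like __cv_destroy). */
typedef struct {
    pthread_cond_t cond;
    volatile int   waiters;   /* plays the role of cv_waiters */
} fake_cv_t;

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static fake_cv_t b_cv = { PTHREAD_COND_INITIALIZER, 0 };
static volatile int signalled;

/* Like cv_wait_common(): bump waiters, sleep, and only drop the waiter
 * reference after the caller's mutex has been re-acquired. */
static void fake_cv_wait(fake_cv_t *cv, pthread_mutex_t *mtx)
{
    cv->waiters++;
    while (!signalled)
        pthread_cond_wait(&cv->cond, mtx);  /* drops and re-takes mtx */
    cv->waiters--;  /* never reached: re-taking mtx blocks forever */
}

/* Like __cv_destroy(): wait until every waiter has left the cv. */
static void fake_cv_destroy(fake_cv_t *cv)
{
    while (cv->waiters != 0)
        usleep(1000);
}

static void *txg_sync(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&hash_lock);    /* arc_read takes hash_lock */
    fake_cv_wait(&b_cv, &hash_lock);   /* waits on b_cv            */
    pthread_mutex_unlock(&hash_lock);
    return NULL;
}

static void *arc_reclaim(void *arg)
{
    (void)arg;
    sleep(1);                           /* let txg_sync block first       */
    pthread_mutex_lock(&hash_lock);     /* arc_evict_state_impl takes it  */
    signalled = 1;
    pthread_cond_broadcast(&b_cv.cond); /* wake the waiter ...            */
    fake_cv_destroy(&b_cv);             /* ... then wait for it to leave,
                                           still holding hash_lock:
                                           deadlock                       */
    pthread_mutex_unlock(&hash_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, txg_sync, NULL);
    pthread_create(&b, NULL, arc_reclaim, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock (you will not see this)");
    return 0;
}
```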
I think what has happened is clear; the mystery still lies in exactly how it happened. If you're familiar with the kgdb or crash utilities it would be helpful if you could dump the contents of both the hash_lock mutex and the b_l1hdr.b_cv as a starting point.
@behlendorf This is a pretty funny situation, but I think it is pretty easy to solve. We really don't need to hold the mutex until we release the ref count on the cvp. We can just use a spin lock private to the cvp to protect it.
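For comparison, here is a hedged sketch of that idea in the same userspace model used above: the waiter bookkeeping lives behind a lock owned by the cv itself, and the waiter drops its reference before re-taking the caller's mutex, so a destroyer holding that mutex can still see the count reach zero. This is only an illustration of the concept, not the actual SPL change.

```c
/* Illustrative only: the "cv-private lock" idea applied to the sketch
 * above.  Not the real SPL patch. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t guard;     /* cv-private lock (the "private spin lock") */
    pthread_cond_t  cond;      /* wakes waiters                             */
    pthread_cond_t  drained;   /* wakes a destroyer once waiters == 0       */
    int             waiters;
} fixed_cv_t;

/* *done must be set non-zero under cv->guard before broadcasting cond. */
static void fixed_cv_wait(fixed_cv_t *cv, pthread_mutex_t *mtx, int *done)
{
    pthread_mutex_lock(&cv->guard);
    cv->waiters++;
    pthread_mutex_unlock(mtx);              /* drop the caller's mutex      */
    while (!*done)
        pthread_cond_wait(&cv->cond, &cv->guard);
    if (--cv->waiters == 0)
        pthread_cond_signal(&cv->drained);  /* release the destroyer        */
    pthread_mutex_unlock(&cv->guard);
    pthread_mutex_lock(mtx);                /* only now re-take caller's mutex */
}

static void fixed_cv_destroy(fixed_cv_t *cv)
{
    pthread_mutex_lock(&cv->guard);
    while (cv->waiters != 0)
        pthread_cond_wait(&cv->drained, &cv->guard);
    pthread_mutex_unlock(&cv->guard);
}
```

The key difference from the deadlocking version is that the waiter's bookkeeping no longer depends on the caller's mutex, so fixed_cv_destroy() can complete even while its caller is still holding that mutex.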
Sooo, I am probably too late, but anyway, here comes the data (all from the arc_reclaim task, pid 487):
hash_lock
(gdb) print *$hl
$155 = (kmutex_t *) 0xffffffffa02b4770 <buf_hash_table+312720>
(gdb) print **$hl
$156 = {m_mutex = {count = {counter = -4}, wait_lock = {{rlock = {raw_lock = {{head_tail = 524296, tickets = {head = 8,
tail = 8}}}}}}, wait_list = {next = 0xffff8803cc0bf410, prev = 0xffff8800c9ebb940},
owner = 0xffff8803f0d59800, spin_mlock = 0x0 <irq_stack_union>}, m_lock = {{rlock = {raw_lock = {{head_tail = 2590808684,
tickets = {head = 39532, tail = 39532}}}}}}, m_owner = 0xffff8803f0d59800}
(gdb) print (*$hl)->m_mutex.owner
$157 = (struct task_struct *) 0xffff8803f0d59800
(gdb) print (*$hl)->m_mutex.owner->comm
$158 = "arc_reclaim\000\000\000\000"
(gdb) print (*$hl)->m_mutex.owner->pid
$174 = 487
(gdb) print (*$hl)->m_mutex.wait_list
$159 = {next = 0xffff8803cc0bf410, prev = 0xffff8800c9ebb940}
(gdb) print (*$hl)->m_mutex.wait_list.next
$161 = (struct list_head *) 0xffff8803cc0bf410
(gdb) set $wm = (struct mutex_waiter *)(*$hl)->m_mutex.wait_list.next
(gdb) print $wm->task->comm
$162 = "txg_sync\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$164 = 10734
(gdb) print $wm->list.next
$165 = (struct list_head *) 0xffff8803502cd770
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$166 = "rsync\000it\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$167 = 16393
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$168 = "z_wr_iss\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$169 = 6447
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$170 = "find\000n\000t\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$171 = 21416
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$172 = "arc_reclaim\000\000\000\000"
(gdb) print $wm->task->pid
$173 = 487
hdr and b_l1hdr.b_cv
(gdb) print $hdr
$177 = (arc_buf_hdr_t *) 0xffff8802f2dc9c30
(gdb) print *$hdr
$178 = {b_dva = {dva_word = {0, 0}}, b_birth = 0, b_freeze_cksum = 0x0 <irq_stack_union>, b_hash_next = 0x0 <irq_stack_union>,
b_flags = (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT | ARC_FLAG_BUFC_METADATA | ARC_FLAG_HAS_L1HDR), b_size = 16384,
b_spa = 6460952436296833629, b_l2hdr = {b_dev = 0x0 <irq_stack_union>, b_daddr = 0, b_hits = 0, b_asize = 0,
b_compress = 0 '\000', b_l2node = {next = 0xdead000000100100, prev = 0xdead000000200200}}, b_l1hdr = {b_freeze_lock = {
m_mutex = {count = {counter = 1}, wait_lock = {{rlock = {raw_lock = {{head_tail = 0, tickets = {head = 0,
tail = 0}}}}}}, wait_list = {next = 0xffff8802f2dc9ca0, prev = 0xffff8802f2dc9ca0},
owner = 0x0 <irq_stack_union>, spin_mlock = 0x0 <irq_stack_union>}, m_lock = {{rlock = {raw_lock = {{head_tail = 0,
tickets = {head = 0, tail = 0}}}}}}, m_owner = 0x0 <irq_stack_union>}, b_buf = 0x0 <irq_stack_union>,
b_datacnt = 0, b_cv = {cv_magic = 879052277, cv_event = {lock = {{rlock = {raw_lock = {{head_tail = 262148, tickets = {
head = 4, tail = 4}}}}}}, task_list = {next = 0xffff8802f2dc9cf0, prev = 0xffff8802f2dc9cf0}},
cv_destroy = {lock = {{rlock = {raw_lock = {{head_tail = 1244940852, tickets = {head = 18996, tail = 18996}}}}}},
task_list = {next = 0xffff8803f0d4fc70, prev = 0xffff8803f0d4fc70}}, cv_refs = {counter = 1}, cv_waiters = {
counter = 1}, cv_mutex = 0xffffffffa02b4770 <buf_hash_table+312720>}, b_state = 0xffffffffa02e85e0 <ARC_anon>,
b_arc_node = {next = 0xdead000000100100, prev = 0xdead000000200200}, b_arc_access = 4376410910, b_mru_hits = 1,
b_mru_ghost_hits = 1, b_mfu_hits = 0, b_mfu_ghost_hits = 0, b_l2_hits = 0, b_refcnt = {rc_count = 0},
b_acb = 0x0 <irq_stack_union>, b_tmp_cdata = 0x0 <irq_stack_union>}}
b_l1hdr.b_cv
(gdb) print $hdr->b_l1hdr.b_cv
$180 = {cv_magic = 879052277, cv_event = {lock = {{rlock = {raw_lock = {{head_tail = 262148, tickets = {head = 4,
tail = 4}}}}}}, task_list = {next = 0xffff8802f2dc9cf0, prev = 0xffff8802f2dc9cf0}}, cv_destroy = {lock = {{
rlock = {raw_lock = {{head_tail = 1244940852, tickets = {head = 18996, tail = 18996}}}}}}, task_list = {
next = 0xffff8803f0d4fc70, prev = 0xffff8803f0d4fc70}}, cv_refs = {counter = 1}, cv_waiters = {counter = 1},
cv_mutex = 0xffffffffa02b4770 <buf_hash_table+312720>}
(gdb) print $hdr->b_l1hdr.b_cv.cv_event.task_list.next
$182 = (struct list_head *) 0xffff8802f2dc9cf0
(gdb) print &$hdr->b_l1hdr.b_cv.cv_event.task_list.next
$186 = (struct list_head **) 0xffff8802f2dc9cf0
cv_event.task_list is empty.
cv_destroy.task_list contains a single node, but I don't know what it refers to. What is the true type that $hdr->b_l1hdr.b_cv.cv_destroy.task_list.next points to, i.e. what structure is the list_head embedded in?
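For what it's worth, I believe on a 3.13-era kernel those nodes are the task_list members embedded in ordinary wait-queue entries (struct __wait_queue), whose private field points back to the waiting task_struct, so recovering the container is the usual offsetof/container_of arithmetic. A small userspace illustration of that pointer math follows (the struct layout only mimics the old kernel definition; this is not the actual kernel header):

```c
/* container_sketch.c - illustrative pointer math only, not kernel code.
 * Shows how an embedded list_head maps back to its containing
 * wait-queue entry.  Compile: gcc container_sketch.c -o container_sketch */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Approximate layout of a 3.13-era wait-queue entry (struct __wait_queue). */
struct wait_queue_entry {
    unsigned int      flags;
    void             *private;   /* -> waiting task_struct                  */
    void             *func;      /* wake function, simplified to void* here */
    struct list_head  task_list; /* what cv_destroy.task_list points at     */
};

/* container_of: subtract the member offset to recover the container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
    struct wait_queue_entry w = { 0 };
    struct list_head *node = &w.task_list;   /* like the address in the dump */
    struct wait_queue_entry *e =
        container_of(node, struct wait_queue_entry, task_list);
    printf("offset of task_list: %zu, container recovered: %d\n",
           offsetof(struct wait_queue_entry, task_list), e == &w);
    return 0;
}
```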
Took a bit longer to locate the data. It turned out kgdb doesn't work with that box because it doesn't have a serial port, so I had to use plain GDB against /proc/kcore and "unwind" the stacks manually (by looking at "disass ..." output and memory dumps of the stack pages), which is a bit messy.
The box is still alive, so I can try to retrieve more data if still needed.
@BjoKaSH ping
Do you have the possibility to modify and try out alternative deb packages?
If yes, there were several fixes mentioned in #4106; https://github.com/zfsonlinux/zfs/issues/3979#issuecomment-171649725 is a summary of those.
@kernelOfTruth I have tried once in a while, but Debian packages are largely black magic to me, and anything dkms-related is deep black magic. Nevertheless I'll try again in the next few days.
Right now I wonder if there is a way to untangle the two threads so that I can get a clean shutdown instead of just power-cycling the box, or whether any attempt to modify the state would inevitably cause disaster.
@behlendorf were the contents of the hash_lock mutex and the b_l1hdr.b_cv above still useful, or do you need anything else?
@BjoKaSH thanks for the debugging, we got what we needed from the back traces. The fix for this has been merged to the zfs master branch. You can roll custom Debian packages if you need to:
http://zfsonlinux.org/generic-deb.html
zfsonlinux/spl@e843553 Don't hold mutex until release cv in cv_wait
c96c36f Fix zsb->z_hold_mtx deadlock
0720116 Add zfs_object_mutex_size module option
@BjoKaSH it looks like there are weekly builds available at https://launchpad.net/~zfs-native/+archive/ubuntu/daily
So I assume the next build should include these fixes and the latest master changes.
After upgrading to 0.6.5.3 (Ubuntu 14.04 with ZFS 0.6.5.3-1~trusty), pool I/O hangs on moderate load with arc_reclaim in state "D" (but slowly spinning).
This may be related to issue #4106, although the workload and stack traces are different. In my case arc_reclaim is using very little CPU (~1%) and no NFS is involved. The stack traces of processes accessing the pool contain arc_read and, further down, buf_hash_find.
The box is still running and at least partially responsive. If helpful, I can (try to) attach (k)gdb and try to get some more insights on the internal state of ZFS. But I would need some guidance as I am unfamiliar with the ZoL code base.
Workload causing hang:
The box was running "rsync ... --link-dest /some/zfs/dir /some/zfs/otherdir", receiving a file tree of some 600,000 files and directories from a remote box. The involved ZFS dataset is deduped with a dedup ratio of 2.1.
System state:
The ZFS pool is a 2-way mirror:
The box has a second pool, but that was not used at the time. I have not tried to touch the second pool in any way in order to prevent further lock-up. Its state is unknown.
The box has 16GB of RAM, of which 5GB are currently free
Kernel is 3.13.0-71-generic #114-Ubuntu
The box has one CPU with 2 cores. One is idle, the other is sitting in "wait" state; atop output:
The load of "7.03" corresponds to the seven tasks in state "D" (excerpt from "top"):
The hanging zpool command was trying to start a scrub of the pool (run by cron). The find was started manually to look for some files; its failure to come back led to the discovery of the hang. The rsync (run by cron) got stuck some days earlier, but went unnoticed until the interactive find didn't respond. Interestingly, I could "ls" the base of the tree searched by find right before starting it; maybe the data was already in the cache.
There are 935 ZFS related tasks:
I have no idea where the z_unmount comes from. There should have been no mount / umount activity going on.
Relevant stack traces according to sysrq-t
arc_reclaim : stack trace (1) (Yes, it has been hanging for at least two weeks - I was away and couldn't investigate until now.)
z_wr_iss : stack trace (2)
txg_sync -> stack trace (3, the short running)
txg_sync -> stack trace (4, the long running)
rsync
find