openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zio deadlock #496

Closed: behlendorf closed this issue 12 years ago

behlendorf commented 12 years ago

After ZFS had been running smoothly since I reverted to a 3.0 kernel and limited the ARC size to 2GB (1/8th of memory), I got a lockup this morning.

At the time, there was an rsync going on from a remote machine onto a 6TiB ext4 FS mounted on top of a sparse zvol (with compression), and a ddrescue running on a remote host onto a zvol served over nbd (nbd-server exporting the zvol), again with compression and with a block size of 4kiB.

At the time, ddrescue was dumping zeros, so highly compressible data.
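
For context, a minimal sketch of how a setup like the one described above could be built. The pool, volume, and mount names and the 2T size of the ddrescue target are hypothetical, and the exact nbd-server invocation may differ from what was actually used; none of that was posted.

    # sparse zvol with compression, ext4 on top (the rsync target)
    zfs create -s -o compression=on -V 6T tank/backup
    mkfs.ext4 /dev/zvol/tank/backup
    mount /dev/zvol/tank/backup /mnt/backup

    # second sparse zvol with compression, exported over nbd for the remote ddrescue
    zfs create -s -o compression=on -V 2T tank/rescue
    nbd-server 10809 /dev/zvol/tank/rescue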

Hold on: another similar lockup after I rebooted and resumed the ddrescue. This time there was no rsync running, just the nbd-server, but the machine booted into 3.1.4 (my mistake). Here is some serial console capture (from the first occurrence):

[501360.008635] zvol/2          D ffffffff81605120     0   496      2 0x00000000
[501360.008635]  ffff88020c7adc20 0000000000000046 0000000000000001 ffff8804124902e0
[501360.008635]  ffff88020c7adfd8 ffff88020c7adfd8 ffff88020c7adfd8 0000000000012a40
[501360.008635]  ffff8802130e1720 ffff88020c7a2e40 ffff88020c7adc30 ffff880412490330
[501360.008635] Call Trace:
[501360.008635]  [] schedule+0x3f/0x60
[501360.008635]  [] cv_wait_common+0x77/0xd0 [spl]
[501360.008635]  [] ? add_wait_queue+0x60/0x60
[501360.008635]  [] __cv_wait+0x13/0x20 [spl]
[501360.008635]  [] txg_wait_open+0x73/0xa0 [zfs]
[501360.008635]  [] dmu_tx_wait+0xed/0xf0 [zfs]
[501360.008635]  [] dmu_tx_assign+0x66/0x420 [zfs]
[501360.008635]  [] ? dsl_dataset_block_freeable+0x40/0x60 [zfs]
[501360.008635]  [] zvol_write+0x7f/0x3a0 [zfs]
[501360.008635]  [] ? default_spin_lock_flags+0x9/0x10
[501360.008635]  [] taskq_thread+0x1b5/0x390 [spl]
[501360.008635]  [] ? try_to_wake_up+0x200/0x200
[501360.008635]  [] ? task_alloc+0x160/0x160 [spl]
[501360.008635]  [] kthread+0x8c/0xa0
[501360.008635]  [] kernel_thread_helper+0x4/0x10
[501360.008635]  [] ? flush_kthread_worker+0xa0/0xa0
[501360.008635]  [] ? gs_change+0x13/0x13

[501360.008635] txg_sync        D 0000000000000002     0   901      2 0x00000000
[501360.008635]  ffff880207679580 0000000000000046 ffffffff815f318e ffff880207679600
[501360.008635]  ffff880207679fd8 ffff880207679fd8 ffff880207679fd8 0000000000012a40
[501360.008635]  ffff880209d90000 ffff88020983ae40 ffff880207679fd8 ffff8800218d24c0
[501360.008635] Call Trace:
[501360.008635]  [] ? apic_timer_interrupt+0xe/0x20
[501360.008635]  [] schedule+0x3f/0x60
[501360.008635]  [] __mutex_lock_slowpath+0xd7/0x150
[501360.008635]  [] ? __kmalloc+0x13d/0x190
[501360.008635]  [] mutex_lock+0x2a/0x50
[501360.008635]  [] zio_add_child+0x61/0x120 [zfs]
[501360.008635]  [] zio_create+0x426/0x520 [zfs]
[501360.008635]  [] zio_write+0x7c/0xa0 [zfs]
[501360.008635]  [] ? arc_evict_ghost+0x350/0x350 [zfs]
[501360.008635]  [] arc_write+0xcf/0xf0 [zfs]
[501360.008635]  [] ? arc_cksum_compute.isra.16+0xc0/0xc0 [zfs]
[501360.008635]  [] ? arc_evict_ghost+0x350/0x350 [zfs]
[501360.008635]  [] dbuf_write.isra.8+0x19d/0x330 [zfs]
[501360.008635]  [] ? dbuf_destroy+0x170/0x170 [zfs]
[501360.008635]  [] ? dbuf_rele_and_unlock+0x210/0x210 [zfs]
[501360.008635]  [] ? mutex_lock+0x1d/0x50
[501360.008635]  [] dbuf_sync_leaf+0xd4/0x320 [zfs]
[501360.008635]  [] dbuf_sync_list+0x95/0xa0 [zfs]
[501360.008635]  [] dbuf_sync_indirect+0xa2/0x110 [zfs]
[501360.008635]  [] dbuf_sync_list+0x77/0xa0 [zfs]
[501360.008635]  [] dbuf_sync_indirect+0xa2/0x110 [zfs]
[501360.008635]  [] dbuf_sync_list+0x77/0xa0 [zfs]
[501360.008635]  [] dbuf_sync_indirect+0xa2/0x110 [zfs]
[501360.008635]  [] dbuf_sync_list+0x77/0xa0 [zfs]
[501360.008635]  [] dbuf_sync_indirect+0xa2/0x110 [zfs]
[501360.008635]  [] dbuf_sync_list+0x77/0xa0 [zfs]
[501360.008635]  [] dnode_sync+0x557/0xa40 [zfs]
[501360.008635]  [] ? dbuf_sync_list+0x77/0xa0 [zfs]
[501360.008635]  [] ? arc_write+0xcf/0xf0 [zfs]
[501360.008635]  [] dmu_objset_sync_dnodes+0xbb/0xe0 [zfs]
[501360.008635]  [] dmu_objset_sync+0x198/0x330 [zfs]
[501360.008635]  [] ? secondary_cache_changed_cb+0x20/0x20 [zfs]
[501360.008635]  [] ? dmu_objset_sync+0x330/0x330 [zfs]
[501360.008635]  [] ? mutex_lock+0x1d/0x50
[501360.008635]  [] dsl_dataset_sync+0x57/0x70 [zfs]
[501360.008635]  [] dsl_pool_sync+0xb1/0x450 [zfs]
[501360.008635]  [] spa_sync+0x38e/0xa00 [zfs]
[501360.008635]  [] ? __wake_up+0x53/0x70
[501360.008635]  [] ? __wake_up+0x53/0x70
[501360.008635]  [] txg_sync_thread+0x216/0x390 [zfs]
[501360.008635]  [] ? txg_init+0x260/0x260 [zfs]
[501360.008635]  [] ? txg_init+0x260/0x260 [zfs]
[501360.008635]  [] thread_generic_wrapper+0x78/0x90 [spl]
[501360.008635]  [] ? __thread_create+0x160/0x160 [spl]
[501360.008635]  [] kthread+0x8c/0xa0
[501360.008635]  [] kernel_thread_helper+0x4/0x10

[501360.008635] z_wr_iss/0      D 0000000000000000     0   737      2 0x00000000
[501360.008635]  ffff880209d89d00 0000000000000046 ffffffff815f318e ffff880209d89d80
[501360.008635]  ffff880209d89fd8 ffff880209d89fd8 ffff880209d89fd8 0000000000012a40
[501360.008635]  ffff88020fa31720 ffff880209d84560 ffff8800218d24c0 ffff8800218d24c0
[501360.008635] Call Trace:
[501360.008635]  [] ? apic_timer_interrupt+0xe/0x20
[501360.008635]  [] schedule+0x3f/0x60
[501360.008635]  [] __mutex_lock_slowpath+0xd7/0x150
[501360.008635]  [] ? dbuf_write_ready+0x122/0x1c0 [zfs]
[501360.008635]  [] mutex_lock+0x2a/0x50
[501360.008635]  [] zio_ready+0x1e0/0x3b0 [zfs]
[501360.008635]  [] zio_execute+0x9f/0xf0 [zfs]
[501360.008635]  [] taskq_thread+0x1b5/0x390 [spl]
[501360.008635]  [] ? try_to_wake_up+0x200/0x200
[501360.008635]  [] ? task_alloc+0x160/0x160 [spl]
[501360.008635]  [] kthread+0x8c/0xa0
[501360.008635]  [] kernel_thread_helper+0x4/0x10
[501360.008635]  [] ? flush_kthread_worker+0xa0/0xa0
[501360.008635]  [] ? gs_change+0x13/0x13
[501360.008635] z_wr_iss/6      D 0000000000000006     0   743      2 0x00000000
[501360.008635]  ffff880209d9dd00 0000000000000046 ffffffff815f318e ffff880209d9dd80
[501360.008635]  ffff880209d9dfd8 ffff880209d9dfd8 ffff880209d9dfd8 0000000000012a40
[501360.008635]  ffff880209d6ae40 ffff880209d95c80 ffff8800218d24c0 ffff8800218d24c0
[501360.008635] Call Trace:
[501360.008635]  [] ? apic_timer_interrupt+0xe/0x20
[501360.008635]  [] schedule+0x3f/0x60
[501360.008635]  [] __mutex_lock_slowpath+0xd7/0x150
[501360.008635]  [] ? dbuf_write_ready+0x122/0x1c0 [zfs]
[501360.008635]  [] mutex_lock+0x2a/0x50
[501360.008635]  [] zio_ready+0x1e0/0x3b0 [zfs]
[501360.008635]  [] zio_execute+0x9f/0xf0 [zfs]
[501360.008635]  [] taskq_thread+0x1b5/0x390 [spl]
[501360.008635]  [] ? try_to_wake_up+0x200/0x200
[501360.008635]  [] ? task_alloc+0x160/0x160 [spl]
[501360.008635]  [] kthread+0x8c/0xa0
[501360.008635]  [] kernel_thread_helper+0x4/0x10
[501360.008635]  [] ? flush_kthread_worker+0xa0/0xa0
[501360.008635]  [] ? gs_change+0x13/0x13
[501360.008635] z_wr_iss/11     D 000000000000000b     0   748      2 0x00000000
[501360.008635]  ffff880209dbfd00 0000000000000046 ffffffff815f318e ffff880209dbfd80
[501360.008635]  ffff880209dbffd8 ffff880209dbffd8 ffff880209dbffd8 0000000000012a40
[501360.008635]  ffff880209d82e40 ffff880209dadc80 ffff8800218d24c0 ffff8800218d24c0
[501360.008635] Call Trace:
[501360.008635]  [] ? apic_timer_interrupt+0xe/0x20
[501360.008635]  [] schedule+0x3f/0x60
[501360.008635]  [] __mutex_lock_slowpath+0xd7/0x150
[501360.008635]  [] ? dbuf_write_ready+0x122/0x1c0 [zfs]
[501360.008635]  [] mutex_lock+0x2a/0x50
[501360.008635]  [] zio_ready+0x1e0/0x3b0 [zfs]
[501360.008635]  [] zio_execute+0x9f/0xf0 [zfs]
[501360.008635]  [] taskq_thread+0x1b5/0x390 [spl]
[501360.008635]  [] ? try_to_wake_up+0x200/0x200
[501360.008635]  [] ? task_alloc+0x160/0x160 [spl]
[501360.008635]  [] kthread+0x8c/0xa0
[501360.008635]  [] kernel_thread_helper+0x4/0x10
[501360.008635]  [] ? flush_kthread_worker+0xa0/0xa0
[501360.008635]  [] ? gs_change+0x13/0x13
behlendorf commented 12 years ago

It looks like you've found a deadlock which must be more likely to occur with your particular workload. From the stacks you posted, the following appears to have occurred.

The full list of stacks can be found in the mailing list archive: http://groups.google.com/a/zfsonlinux.org/group/zfs-devel/browse_thread/thread/24a87da1893a7241#

stephane-chazelas commented 12 years ago

Thanks Brian. In case it's of any help, here's the Blocked State output from another occurrence. This time the only thing happening was nbd-server writing to the zvol (per an strace, in 31 * 4kB write(2) calls).

As I said on the ML, if I limit the rate at which the data is transferred and written to the zvol, it doesn't occur. Otherwise, it occurs quite consistently after a few GB of data have been transferred.

http://pastebin.com/raw.php?i=iNmKvKas

gunnarbeutner commented 12 years ago

If you can reliably reproduce this issue it would be interesting to see the output of "zpool events -v" once that deadlock has happened.

stephane-chazelas commented 12 years ago

2011-12-17 04:29:51 -0800, Gunnar Beutner:

If you can reliably reproduce this issue it would be interesting to see the output of "zpool events -v" once that deadlock has happened. [...]

The thing is that I can't run anything when that happens, and after a hard reset (sysrq-b), zpool events doesn't show anything from around the deadlock time.

Stephane
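
For reference, when the machine is wedged like this but the serial console still accepts input, the kernel's stock SysRq facility can dump the stacks of all tasks stuck in uninterruptible sleep without any userspace command having to run; this is plain kernel functionality, not anything ZFS-specific:

    # enable SysRq, then dump all blocked (D-state) tasks to the console/dmesg
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger    # or send SysRq-w directly over the serial line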

behlendorf commented 12 years ago

If you set the zfs_zevent_console=1 module option, all zfs events will also be logged immediately to the console. That way you won't need a functional system to run zpool events on at the time of the problem.
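
A minimal sketch of setting that option, assuming the usual modprobe.d and sysfs conventions (the config file name here is just an example):

    # make the setting persistent across module loads
    echo "options zfs zfs_zevent_console=1" >> /etc/modprobe.d/zfs.conf

    # or, if the zfs module is already loaded, flip it at runtime
    echo 1 > /sys/module/zfs/parameters/zfs_zevent_console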

stephane-chazelas commented 12 years ago

Thanks. I just did that.

New occurrence, no event. I did get the "statechange" events upon mounting the FS, but nothing after that.

Transferred some data maxing out the 1Gb/s link. It hung after about 90GB had been transferred.

tisoft commented 12 years ago

I had a similar issue after updating the kernel to 3.0.0-14-server. Downgrading to 3.0.0-13-server solved this for me.

putnam commented 12 years ago

I think I am experiencing this issue consistently under heavy I/O on my storage server.

I am using Ubuntu Server (Oneiric) and the issue occurred initially while using kernel 3.0.0-13. I upgraded my kernel and zfs to the latest versions hoping for a fix, but it happened again under a similar load. So unlike user "tisoft" above, I experience this issue with both 3.0.0-13 and 3.0.0-15.

In my case the deadlock occurred while running the unrar utility on a large archive (~8GB) on the FS while simultaneously writing ~6MB/sec to the same FS. The issue reproduces within 12 hours of rebooting the machine as long as it keeps performing these tasks. After the deadlock, unrar is hung in I/O wait and the system repeatedly spews stall warnings on a single CPU until it eventually succumbs to the lock situation and is no longer accessible without a reboot.

All of the following data was acquired in a single session following a deadlock that I caught before the whole system crapped its pants.

Output of uname -a:

Linux dwight 3.0.0-15-server #26-Ubuntu SMP Fri Jan 20 19:07:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Output of dpkg -l | egrep 'zfs|spl':

ii  libspl1                            0.6.0.45-0ubuntu1~oneiric1              Solaris Portability Layer library for Linux
ii  libzfs1                            0.6.0.51-0ubuntu1~oneiric1              Native ZFS filesystem library for Linux
ii  mountall                           2.31-zfs1                               filesystem mounting tool
ii  spl                                0.6.0.51-0ubuntu1~oneiric1              Solaris Porting Layer utilities for Linux
ii  spl-dkms                           0.6.0.51-0ubuntu1~oneiric1              Solaris Porting Layer kernel modules for Linux
ii  ubuntu-zfs                         6~oneiric                               Native ZFS filesystem metapackage for Ubuntu.
ii  zfs-dkms                           0.6.0.51-0ubuntu1~oneiric1              Native ZFS filesystem kernel modules for Linux
ii  zfsutils                           0.6.0.51-0ubuntu1~oneiric1              Native ZFS management utilities for Linux

Here is the output of zpool status for the affected pool:

 state: ONLINE
 scan: none requested
config:

    NAME        STATE     READ WRITE CKSUM
    media       ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdm     ONLINE       0     0     0
        sdu     ONLINE       0     0     0
        sdx     ONLINE       0     0     0
        sdw     ONLINE       0     0     0
        sdn     ONLINE       0     0     0
        sdv     ONLINE       0     0     0
      raidz2-1  ONLINE       0     0     0
        sdk     ONLINE       0     0     0
        sdp     ONLINE       0     0     0
        sdj     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sdi     ONLINE       0     0     0
        sdl     ONLINE       0     0     0
        sdo     ONLINE       0     0     0
      raidz2-2  ONLINE       0     0     0
        sda     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        sdh     ONLINE       0     0     0
        sdq     ONLINE       0     0     0
    spares
      sdr       AVAIL   
      sds       AVAIL   
      sdt       AVAIL   

errors: No known data errors

Here is the trace from the syslog:

Feb 13 21:41:01 dwight kernel: [76531.327998] INFO: rcu_sched_state detected stall on CPU 1 (t=15000 jiffies)
Feb 13 21:43:30 dwight kernel: [76680.548019] INFO: task z_wr_iss/0:547 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.549336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.550937] z_wr_iss/0      D 0000000000000000     0   547      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.550943]  ffff880226bc9ce0 0000000000000046 ffffffff8160f54e ffff880226bc9d60
Feb 13 21:43:30 dwight kernel: [76680.550947]  ffff880226bc9fd8 ffff880226bc9fd8 ffff880226bc9fd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.550952]  ffff88022e318000 ffff88022e31dc80 ffff880093143e58 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.550956] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.550964]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.550968]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.550972]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.551000]  [<ffffffffa01d02bd>] ? dbuf_write_ready+0xfd/0x1c0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.551003]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.551033]  [<ffffffffa0260cc0>] zio_ready+0x1e0/0x3b0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.551061]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.551071]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.551076]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.551080]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.551088]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.551091]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.551095]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.551099]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.551101]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.551105] INFO: task z_wr_iss/2:549 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.552532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.554087] z_wr_iss/2      D 0000000000000002     0   549      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.554091]  ffff880226bcdce0 0000000000000046 ffffffff8160f54e ffff880226bcdd60
Feb 13 21:43:30 dwight kernel: [76680.554098]  ffff880226bcdfd8 ffff880226bcdfd8 ffff880226bcdfd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.554103]  ffff88022e31ae40 ffff880226bd1720 ffff880093143e58 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.554107] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.554111]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.554114]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.554118]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.554135]  [<ffffffffa01d02bd>] ? dbuf_write_ready+0xfd/0x1c0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.554139]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.554166]  [<ffffffffa0260cc0>] zio_ready+0x1e0/0x3b0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.554193]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.554201]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.554205]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.554208]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.554216]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.554219]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.554222]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.554226]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.554229]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.554231] INFO: task z_wr_iss/3:550 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.555633] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.557185] z_wr_iss/3      D 0000000000000003     0   550      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.557199]  ffff880226bcfce0 0000000000000046 ffffffff8160f54e ffff880226bcfd60
Feb 13 21:43:30 dwight kernel: [76680.557217]  ffff880226bcffd8 ffff880226bcffd8 ffff880226bcffd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.557232]  ffff88008c93ae40 ffff880226bd2e40 ffff880093143e58 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.557249] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.557257]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.557266]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.557272]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.557294]  [<ffffffffa01d02bd>] ? dbuf_write_ready+0xfd/0x1c0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.557302]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.557332]  [<ffffffffa0260cc0>] zio_ready+0x1e0/0x3b0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.557363]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.557374]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.557381]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.557387]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.557397]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.557404]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.557410]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.557418]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.557424]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.557430] INFO: task z_wr_int/2:558 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.558681] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.560185] z_wr_int/2      D 0000000000000002     0   558      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.560198]  ffff880226befbe0 0000000000000046 ffffffff8160f54e ffff880226befc60
Feb 13 21:43:30 dwight kernel: [76680.560215]  ffff880226beffd8 ffff880226beffd8 ffff880226beffd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.560232]  ffff88022e31ae40 ffff880226bf0000 ffff88006fe924a0 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.560248] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.560255]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.560268]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.560274]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.560280]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.560315]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.560346]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.560378]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.560410]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.560440]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.560470]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.560481]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.560488]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.560494]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.560505]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.560511]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.560517]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.560525]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.560531]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.560536] INFO: task z_wr_int/3:559 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.561723] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.563185] z_wr_int/3      D 0000000000000003     0   559      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.563189]  ffff880226bf9be0 0000000000000046 ffffffff8160fa2e ffff880226bf9c60
Feb 13 21:43:30 dwight kernel: [76680.563194]  ffff880226bf9fd8 ffff880226bf9fd8 ffff880226bf9fd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.563198]  ffff880092119720 ffff880226bf1720 ffff88006fe91cc0 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.563202] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.563205]  [<ffffffff8160fa2e>] ? reschedule_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.563209]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.563212]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.563220]  [<ffffffffa01489cb>] ? kmem_free_debug+0x4b/0x150 [spl]
Feb 13 21:43:30 dwight kernel: [76680.563224]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.563251]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.563278]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.563306]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.563333]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.563360]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.563387]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.563395]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.563399]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.563402]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.563410]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.563413]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.563416]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.563420]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.563423]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.563425] INFO: task z_wr_int/4:560 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.564654] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.566107] z_wr_int/4      D 0000000000000000     0   560      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.566111]  ffff880226bfbbe0 0000000000000046 ffffffff8160f54e ffff880226bfbc60
Feb 13 21:43:30 dwight kernel: [76680.566116]  ffff880226bfbfd8 ffff880226bfbfd8 ffff880226bfbfd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.566120]  ffff88022e318000 ffff880226bf2e40 ffff880058138130 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.566124] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.566128]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.566131]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.566134]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.566138]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.566165]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.566192]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.566220]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.566247]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.566274]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.566301]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.566309]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.566313]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.566316]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.566324]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.566327]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.566330]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.566334]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.566337]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.566339] INFO: task z_wr_int/6:562 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.567550] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.568998] z_wr_int/6      D 0000000000000002     0   562      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.569010]  ffff880226bffb80 0000000000000046 ffffffff8160f54e ffff880226bffc00
Feb 13 21:43:30 dwight kernel: [76680.569027]  ffff880226bfffd8 ffff880226bfffd8 ffff880226bfffd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.569042]  ffff88022641ae40 ffff880226bf5c80 ffff8800581394e0 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.569057] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.569065]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.569073]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.569080]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.569086]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.569116]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569147]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569177]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569208]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569237]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569267]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569297]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.569308]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.569315]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.569321]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.569332]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.569338]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.569345]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.569351]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.569357]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.569362] INFO: task z_wr_int/7:563 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.570525] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.571944] z_wr_int/7      D 0000000000000003     0   563      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.571948]  ffff880226409be0 0000000000000046 ffffffff8160f54e ffff880226409c60
Feb 13 21:43:30 dwight kernel: [76680.571952]  ffff880226409fd8 ffff880226409fd8 ffff880226409fd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.571957]  ffff88022e31c560 ffff880226400000 ffff88006fe92890 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.571961] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.571964]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.571968]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.571971]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.571975]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.572002]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.572038]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.572071]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.572103]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.572134]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.572164]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.572176]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.572182]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.572189]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.572200]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.572206]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.572212]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.572220]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.572227]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.572235] INFO: task z_wr_int/8:564 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.573367] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.574748] z_wr_int/8      D 0000000000000000     0   564      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.574752]  ffff88022640bb80 0000000000000046 ffffffff8160f54e ffff88022640bc00
Feb 13 21:43:30 dwight kernel: [76680.574757]  ffff88022640bfd8 ffff88022640bfd8 ffff88022640bfd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.574761]  ffff880226bf2e40 ffff880226401720 ffff880090947850 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.574765] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.574769]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.574772]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.574776]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.574779]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.574806]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574833]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574861]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574888]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574915]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574942]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574969]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.574977]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.574981]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.574984]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.574992]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.574995]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.574998]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.575002]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.575005]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:43:30 dwight kernel: [76680.575008] INFO: task z_wr_int/11:567 blocked for more than 120 seconds.
Feb 13 21:43:30 dwight kernel: [76680.576127] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 13 21:43:30 dwight kernel: [76680.577485] z_wr_int/11     D 0000000000000003     0   567      2 0x00000000
Feb 13 21:43:30 dwight kernel: [76680.577489]  ffff880226411b80 0000000000000046 ffffffff8160f54e ffff880226411c00
Feb 13 21:43:30 dwight kernel: [76680.577493]  ffff880226411fd8 ffff880226411fd8 ffff880226411fd8 0000000000012a40
Feb 13 21:43:30 dwight kernel: [76680.577497]  ffff88022e31c560 ffff880226405c80 ffff88006fe918d0 ffff880093143e58
Feb 13 21:43:30 dwight kernel: [76680.577501] Call Trace:
Feb 13 21:43:30 dwight kernel: [76680.577505]  [<ffffffff8160f54e>] ? apic_timer_interrupt+0xe/0x20
Feb 13 21:43:30 dwight kernel: [76680.577509]  [<ffffffff8160492f>] schedule+0x3f/0x60
Feb 13 21:43:30 dwight kernel: [76680.577512]  [<ffffffff81605727>] __mutex_lock_slowpath+0xd7/0x150
Feb 13 21:43:30 dwight kernel: [76680.577520]  [<ffffffffa01489cb>] ? kmem_free_debug+0x4b/0x150 [spl]
Feb 13 21:43:30 dwight kernel: [76680.577523]  [<ffffffff816052e2>] mutex_lock+0x22/0x40
Feb 13 21:43:30 dwight kernel: [76680.577550]  [<ffffffffa025c328>] zio_remove_child+0x48/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577577]  [<ffffffffa026114f>] zio_done+0x2bf/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577605]  [<ffffffffa0228834>] ? vdev_mirror_map_free+0x24/0x30 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577632]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577659]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577686]  [<ffffffffa02614a5>] zio_done+0x615/0xac0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577713]  [<ffffffffa025d10f>] zio_execute+0x9f/0xf0 [zfs]
Feb 13 21:43:30 dwight kernel: [76680.577721]  [<ffffffffa014d0fb>] taskq_thread+0x24b/0x5a0 [spl]
Feb 13 21:43:30 dwight kernel: [76680.577725]  [<ffffffff8104e7b9>] ? finish_task_switch+0x49/0xf0
Feb 13 21:43:30 dwight kernel: [76680.577728]  [<ffffffff810573a0>] ? try_to_wake_up+0x200/0x200
Feb 13 21:43:30 dwight kernel: [76680.577736]  [<ffffffffa014ceb0>] ? task_done+0x140/0x140 [spl]
Feb 13 21:43:30 dwight kernel: [76680.577739]  [<ffffffff81080bbc>] kthread+0x8c/0xa0
Feb 13 21:43:30 dwight kernel: [76680.577742]  [<ffffffff8160fca4>] kernel_thread_helper+0x4/0x10
Feb 13 21:43:30 dwight kernel: [76680.577746]  [<ffffffff81080b30>] ? flush_kthread_worker+0xa0/0xa0
Feb 13 21:43:30 dwight kernel: [76680.577748]  [<ffffffff8160fca0>] ? gs_change+0x13/0x13
Feb 13 21:44:01 dwight kernel: [76711.447997] INFO: rcu_sched_state detected stall on CPU 1 (t=60030 jiffies)
Feb 13 21:47:01 dwight kernel: [76891.567998] INFO: rcu_sched_state detected stall on CPU 1 (t=105060 jiffies)

Output of zpool events -v following the deadlock:


TIME                           CLASS
Feb 13 2012 00:25:33.521446000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x67c02e772050e122
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f14a270 

Feb 13 2012 00:25:33.521448000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x37785b53b4e33326
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f14aa40 

Feb 13 2012 00:25:33.521456000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x7e3da6b293154695
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f14c980 

Feb 13 2012 00:25:33.521456000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x1016121fa971ac20
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f14c980 

Feb 13 2012 00:25:33.521577000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xfcf122cac009dd3d
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f16a228 

Feb 13 2012 00:25:33.521579000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xbb63e4cd76f17d38
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f16a9f8 

Feb 13 2012 00:25:33.521708000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xc9d46b647dee0fa7
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f18a1e0 

Feb 13 2012 00:25:33.521800000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xa66e80c44d362fde
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1a0940 

Feb 13 2012 00:25:33.521803000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x8b4523a6c245afac
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1a14f8 

Feb 13 2012 00:25:33.521807000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x42ae0138cdc77f37
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1a2498 

Feb 13 2012 00:25:33.521807000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xcad6b44b647cacd2
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1a2498 

Feb 13 2012 00:25:33.521833000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xa7e831769626651c
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1a8a28 

Feb 13 2012 00:25:33.521840000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x36ac1c930f7bec1
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1aa580 

Feb 13 2012 00:25:33.521844000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x42c80cd08335ba98
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f1ab520 

Feb 13 2012 00:25:33.522268000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xb773068ccfd67fc8
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f212d60 

Feb 13 2012 00:25:33.522580000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x144e3dcf701ea54d
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f25f020 

Feb 13 2012 00:25:33.522748000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x9586ca4a2e1e4821
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f288060 

Feb 13 2012 00:25:33.522782000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xc10340c5f615ccb8
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f290530 

Feb 13 2012 00:25:33.522878000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x3413239ecdfde92c
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f2a7c30 

Feb 13 2012 00:25:33.523189000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xb7c4949b63943277
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f2f3b08 

Feb 13 2012 00:25:33.523353000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xab34eb6019e3495
        vdev_state = 0x7
        time = 0x4f38acdd 0x1f31bba8 

Feb 13 2012 00:25:33.971102000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xa7e831769626651c
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e1d730 

Feb 13 2012 00:25:33.971103000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x42ae0138cdc77f37
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e1db18 

Feb 13 2012 00:25:33.971103000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xa66e80c44d362fde
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e1db18 

Feb 13 2012 00:25:33.971103000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xcad6b44b647cacd2
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e1db18 

Feb 13 2012 00:25:33.971136000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x8b4523a6c245afac
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e25c00 

Feb 13 2012 00:25:33.971197000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x42c80cd08335ba98
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e34a48 

Feb 13 2012 00:25:33.971232000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xfcf122cac009dd3d
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e3d300 

Feb 13 2012 00:25:33.971254000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x37785b53b4e33326
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e428f0 

Feb 13 2012 00:25:33.971257000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x67c02e772050e122
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e434a8 

Feb 13 2012 00:25:33.971273000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x1016121fa971ac20
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e47328 

Feb 13 2012 00:25:33.971287000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x7e3da6b293154695
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e4a9d8 

Feb 13 2012 00:25:33.971497000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x36ac1c930f7bec1
        vdev_state = 0x7
        time = 0x4f38acdd 0x39e7de28 

Feb 13 2012 00:25:33.971643000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xbb63e4cd76f17d38
        vdev_state = 0x7
        time = 0x4f38acdd 0x39ea1878 

Feb 13 2012 00:25:33.971682000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xc9d46b647dee0fa7
        vdev_state = 0x7
        time = 0x4f38acdd 0x39eab0d0 

Feb 13 2012 00:25:33.972103000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xc10340c5f615ccb8
        vdev_state = 0x7
        time = 0x4f38acdd 0x39f11d58 

Feb 13 2012 00:25:33.972199000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xb7c4949b63943277
        vdev_state = 0x7
        time = 0x4f38acdd 0x39f29458 

Feb 13 2012 00:25:33.972572000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x3413239ecdfde92c
        vdev_state = 0x7
        time = 0x4f38acdd 0x39f84560 

Feb 13 2012 00:25:33.972614000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xb773068ccfd67fc8
        vdev_state = 0x7
        time = 0x4f38acdd 0x39f8e970 

Feb 13 2012 00:25:33.973099000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x144e3dcf701ea54d
        vdev_state = 0x7
        time = 0x4f38acdd 0x3a004ff8 

Feb 13 2012 00:25:33.973102000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xab34eb6019e3495
        vdev_state = 0x7
        time = 0x4f38acdd 0x3a005bb0 

Feb 13 2012 00:25:33.973250000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x9586ca4a2e1e4821
        vdev_state = 0x7
        time = 0x4f38acdd 0x3a029dd0 

Feb 13 2012 00:25:34.236912000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0x23fa16996cfb596b
        vdev_state = 0x7
        time = 0x4f38acde 0xe1efd80 

Feb 13 2012 00:25:34.287349000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xadcd306fd38021a
        vdev_state = 0x7
        time = 0x4f38acde 0x11209908 

Feb 13 2012 00:25:34.337133000 resource.fs.zfs.statechange
        version = 0x0
        class = "resource.fs.zfs.statechange"
        pool_guid = 0x1c9891ba99fceccc
        vdev_guid = 0xdd5140d28eb81c57
        vdev_state = 0x7
        time = 0x4f38acde 0x14183dc8 

Hopefully relevant output of ps following the deadlock:


USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         2  0.0  0.0      0     0 ?        S    Feb13   0:00 [kthreadd]
root         3  0.0  0.0      0     0 ?        S    Feb13   0:15 [ksoftirqd/0]
root         6  0.0  0.0      0     0 ?        S    Feb13   0:00 [migration/0]
root         7  0.0  0.0      0     0 ?        S    Feb13   0:00 [migration/1]
root         8  0.0  0.0      0     0 ?        S    Feb13   0:49 [kworker/1:0]
root         9  0.0  0.0      0     0 ?        S    Feb13   0:00 [ksoftirqd/1]
root        11  0.0  0.0      0     0 ?        S    Feb13   0:00 [migration/2]
root        12  0.0  0.0      0     0 ?        S    Feb13   1:15 [kworker/2:0]
root        13  0.0  0.0      0     0 ?        S    Feb13   0:01 [ksoftirqd/2]
root        14  0.0  0.0      0     0 ?        S    Feb13   0:00 [migration/3]
root        16  0.0  0.0      0     0 ?        S    Feb13   0:00 [ksoftirqd/3]
root        17  0.0  0.0      0     0 ?        S<   Feb13   0:00 [cpuset]
root        18  0.0  0.0      0     0 ?        S<   Feb13   0:00 [khelper]
root        19  0.0  0.0      0     0 ?        S<   Feb13   0:00 [netns]
root        21  0.0  0.0      0     0 ?        S    Feb13   0:00 [sync_supers]
root        22  0.0  0.0      0     0 ?        S    Feb13   0:00 [bdi-default]
root        23  0.0  0.0      0     0 ?        S<   Feb13   0:00 [kintegrityd]
root        24  0.0  0.0      0     0 ?        S<   Feb13   0:00 [kblockd]
root        25  0.0  0.0      0     0 ?        S<   Feb13   0:00 [ata_sff]
root        26  0.0  0.0      0     0 ?        S    Feb13   0:00 [khubd]
root        27  0.0  0.0      0     0 ?        S<   Feb13   0:00 [md]
root        28  0.0  0.0      0     0 ?        R    Feb13   0:51 [kworker/1:1]
root        29  0.0  0.0      0     0 ?        S    Feb13   0:44 [kworker/2:1]
root        31  0.0  0.0      0     0 ?        S    Feb13   0:00 [khungtaskd]
root        32  0.0  0.0      0     0 ?        S    Feb13   0:00 [kswapd0]
root        33  0.0  0.0      0     0 ?        SN   Feb13   0:00 [ksmd]
root        34  0.0  0.0      0     0 ?        SN   Feb13   0:00 [khugepaged]
root        35  0.0  0.0      0     0 ?        S    Feb13   0:00 [fsnotify_mark]
root        36  0.0  0.0      0     0 ?        S    Feb13   0:00 [ecryptfs-kthrea]
root        37  0.0  0.0      0     0 ?        S<   Feb13   0:00 [crypto]
root        45  0.0  0.0      0     0 ?        S<   Feb13   0:00 [kthrotld]
root       193  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_0]
root       235  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_1]
root       236  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_2]
root       237  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_3]
root       238  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_4]
root       239  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_5]
root       241  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_6]
root       263  0.0  0.0      0     0 ?        S    Feb13   0:00 [kworker/u:18]
root       264  0.0  0.0      0     0 ?        S    Feb13   0:00 [kworker/u:19]
root       371  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_7]
root       374  0.0  0.0      0     0 ?        S    Feb13   0:00 [scsi_eh_8]
root       379  0.0  0.0      0     0 ?        S    Feb13   0:04 [jbd2/sdy1-8]
root       380  0.0  0.0      0     0 ?        S<   Feb13   0:00 [ext4-dio-unwrit]
root       403  0.0  0.0      0     0 ?        S<   Feb13   0:00 [spl_system_task]
root       404  0.0  0.0      0     0 ?        S<   Feb13   0:00 [spl_system_task]
root       405  0.0  0.0      0     0 ?        S<   Feb13   0:00 [spl_system_task]
root       406  0.0  0.0      0     0 ?        S<   Feb13   0:00 [spl_system_task]
root       407  0.0  0.0      0     0 ?        S<   Feb13   0:16 [arc_reclaim]
root       408  0.0  0.0      0     0 ?        S<   Feb13   0:00 [l2arc_feed]
root       409  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/0]
root       410  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/1]
root       411  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/2]
root       412  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/3]
root       413  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/4]
root       414  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/5]
root       415  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/6]
root       416  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/7]
root       417  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/8]
root       418  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/9]
root       419  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/10]
root       420  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/11]
root       421  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/12]
root       422  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/13]
root       423  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/14]
root       424  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/15]
root       425  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/16]
root       426  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/17]
root       427  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/18]
root       428  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/19]
root       429  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/20]
root       430  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/21]
root       431  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/22]
root       432  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/23]
root       433  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/24]
root       434  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/25]
root       435  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/26]
root       436  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/27]
root       437  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/28]
root       438  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/29]
root       439  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/30]
root       440  0.0  0.0      0     0 ?        SN   Feb13   0:00 [zvol/31]
root       533  0.0  0.0      0     0 ?        SN   Feb13   0:42 [z_null_iss/0]
root       534  0.0  0.0      0     0 ?        SN   Feb13   0:02 [z_null_int/0]
root       535  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/0]
root       536  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/1]
root       537  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/2]
root       538  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/3]
root       539  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/4]
root       540  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/5]
root       541  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/6]
root       542  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_rd_iss/7]
root       543  5.3  0.0      0     0 ?        SN   Feb13  89:01 [z_rd_int/0]
root       544  3.3  0.0      0     0 ?        RN   Feb13  54:51 [z_rd_int/1]
root       545  3.4  0.0      0     0 ?        SN   Feb13  56:36 [z_rd_int/2]
root       546  3.2  0.0      0     0 ?        SN   Feb13  53:08 [z_rd_int/3]
root       547  0.3  0.0      0     0 ?        DN   Feb13   6:24 [z_wr_iss/0]
root       548  0.3  0.0      0     0 ?        RN   Feb13   6:06 [z_wr_iss/1]
root       549  0.3  0.0      0     0 ?        DN   Feb13   6:18 [z_wr_iss/2]
root       550  0.3  0.0      0     0 ?        DN   Feb13   6:14 [z_wr_iss/3]
root       551  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_iss_h/0]
root       552  0.0  0.0      0     0 ?        RN   Feb13   0:00 [z_wr_iss_h/1]
root       553  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_iss_h/2]
root       554  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_iss_h/3]
root       555  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_iss_h/4]
root       556  0.1  0.0      0     0 ?        SN   Feb13   2:10 [z_wr_int/0]
root       557  0.1  0.0      0     0 ?        RN   Feb13   2:10 [z_wr_int/1]
root       558  0.1  0.0      0     0 ?        DN   Feb13   2:13 [z_wr_int/2]
root       559  0.1  0.0      0     0 ?        DN   Feb13   2:12 [z_wr_int/3]
root       560  0.1  0.0      0     0 ?        DN   Feb13   2:10 [z_wr_int/4]
root       561  0.1  0.0      0     0 ?        RN   Feb13   2:12 [z_wr_int/5]
root       562  0.1  0.0      0     0 ?        DN   Feb13   2:14 [z_wr_int/6]
root       563  0.1  0.0      0     0 ?        DN   Feb13   2:11 [z_wr_int/7]
root       564  0.1  0.0      0     0 ?        DN   Feb13   2:10 [z_wr_int/8]
root       565  0.1  0.0      0     0 ?        RN   Feb13   2:11 [z_wr_int/9]
root       566  0.1  0.0      0     0 ?        SN   Feb13   2:12 [z_wr_int/10]
root       567  0.1  0.0      0     0 ?        DN   Feb13   2:12 [z_wr_int/11]
root       568  0.1  0.0      0     0 ?        DN   Feb13   2:09 [z_wr_int/12]
root       569  0.1  0.0      0     0 ?        RN   Feb13   2:10 [z_wr_int/13]
root       570  0.1  0.0      0     0 ?        DN   Feb13   2:13 [z_wr_int/14]
root       571  0.1  0.0      0     0 ?        DN   Feb13   2:11 [z_wr_int/15]
root       572  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_int_h/0]
root       573  0.0  0.0      0     0 ?        RN   Feb13   0:00 [z_wr_int_h/1]
root       574  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_int_h/2]
root       575  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_int_h/3]
root       576  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_wr_int_h/4]
root       577  0.0  0.0      0     0 ?        SN   Feb13   0:06 [z_fr_iss/0]
root       578  0.0  0.0      0     0 ?        SN   Feb13   0:05 [z_fr_iss/1]
root       579  0.0  0.0      0     0 ?        SN   Feb13   0:06 [z_fr_iss/2]
root       580  0.0  0.0      0     0 ?        SN   Feb13   0:07 [z_fr_iss/3]
root       581  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_fr_int/0]
root       582  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_cl_iss/0]
root       583  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_cl_int/0]
root       584  0.0  0.0      0     0 ?        SN   Feb13   0:01 [z_ioctl_iss/0]
root       585  0.0  0.0      0     0 ?        SN   Feb13   0:00 [z_ioctl_int/0]
root       624  0.0  0.0      0     0 ?        S<   Feb13   0:00 [zfs_iput_taskq/]
root       625  0.0  0.0      0     0 ?        S    Feb13   0:01 [flush-65:128]
root       626  0.0  0.0      0     0 ?        S<   Feb13   0:00 [txg_quiesce]
root       627  0.1  0.0      0     0 ?        D<   Feb13   2:28 [txg_sync]
root       724  0.0  0.0      0     0 ?        S<   Feb13   0:00 [edac-poller]
root      1156  0.0  0.0      0     0 ?        S    Feb13   0:00 [ivtv0]
root      1368  0.0  0.0      0     0 ?        S<   Feb13   0:02 [zil_clean/0]
root      1467  0.0  0.0      0     0 ?        S<   Feb13   0:00 [ttm_swap]
root      1738  0.0  0.0  36276  7024 ?        S    Feb13   0:01 ddclient - sleeping for 130 seconds
root      8059  1.1  0.0      0     0 ?        S    Feb13   5:01 [kworker/0:0]
root      8770  0.0  0.0      0     0 ?        S    Feb13   0:08 [kworker/3:0]
root      8772  0.0  0.0      0     0 ?        S    Feb13   0:01 [kworker/0:2]
aaa   8837  0.4  0.0  22468  1876 ?        D    Feb13   1:35 /usr/bin/unrar x [...]
root      9002  0.0  0.0      0     0 ?        S    00:11   0:00 [flush-zfs-1]
root      9072  0.0  0.0      0     0 ?        S    01:14   0:00 [kworker/3:1]
behlendorf commented 12 years ago

I was able to reproduce this issue by running a test case which simply creates a large number of files in the filesystem. After running for 10 hours and creating 224 million files, the system deadlocked. The root cause appears to be that the z_null_iss thread got stuck spinning on a zio mutex's wait_lock while attempting to drop that mutex. Unfortunately, despite significant investigation, I was unable to determine which thread was holding the spin lock. Regardless, here's the detailed analysis.
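
For reference, the reproduction load was essentially just a tight loop of small file creates. A minimal sketch of that kind of test is below; it is not the actual "cm" tool from the traces, and the target path is a placeholder.

/*
 * Hypothetical file-creation load, only to illustrate the kind of test
 * that reproduced the hang (not the actual "cm" test case).
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : "/tank/fs";   /* placeholder path */
	char path[4096];

	for (long i = 0; ; i++) {        /* run until it hangs or is killed */
		snprintf(path, sizeof (path), "%s/f%ld", dir, i);
		int fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		close(fd);
	}
	return 0;
}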

One of several threads creating files; it's blocked waiting for the txg to sync.

PID: 5597   TASK: ffff88062ecb00c0  CPU: 7   COMMAND: "cm"
 #0 [ffff8804fccb9a58] schedule at ffffffff814ee4a0
 #1 [ffff8804fccb9b20] cv_wait_common at ffffffffa04d34ec [spl]
 #2 [ffff8804fccb9ba0] __cv_wait at ffffffffa04d3623 [spl]
 #3 [ffff8804fccb9bb0] txg_wait_open at ffffffffa061136b [zfs]
 #4 [ffff8804fccb9c10] dmu_tx_wait at ffffffffa05d6c2d [zfs]
 #5 [ffff8804fccb9c40] zfs_create at ffffffffa0644e9f [zfs]
 #6 [ffff8804fccb9d20] zpl_create at ffffffffa0658817 [zfs]
 #7 [ffff8804fccb9d90] vfs_create at ffffffff81184dd4
 #8 [ffff8804fccb9dd0] do_filp_open at ffffffff811889df
 #9 [ffff8804fccb9f20] do_sys_open at ffffffff81174b99
#10 [ffff8804fccb9f70] sys_open at ffffffff81174cb0
#11 [ffff8804fccb9f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00002aaaaada5200  RSP: 00007fffffffd888  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff8100b0f2  RCX: 0000000000000000
    RDX: 00000000000001a4  RSI: 0000000000000042  RDI: 0000000000601e40
    RBP: 00007fffffffe2b0   R8: 0000000000000000   R9: 0000000000000001
    R10: 00000000fffffff0  R11: 0000000000000246  R12: ffffffff81174cb0
    R13: ffff8804fccb9f78  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

The txg_sync thread, which is responsible for ensuring txgs are synced in a reasonable time, is blocked waiting on an outstanding write.

PID: 5333   TASK: ffff88062f6db580  CPU: 7   COMMAND: "txg_sync"
 #0 [ffff880636a99850] schedule at ffffffff814ee4a0
 #1 [ffff880636a99918] cv_wait_common at ffffffffa04d34ec [spl]
 #2 [ffff880636a99998] __cv_wait at ffffffffa04d3623 [spl]
 #3 [ffff880636a999a8] zio_wait at ffffffffa065480b [zfs]
 #4 [ffff880636a999d8] dbuf_read at ffffffffa05c3fa9 [zfs]
 #5 [ffff880636a99a98] dbuf_will_dirty at ffffffffa05c5cf6 [zfs]
 #6 [ffff880636a99ad8] dmu_write at ffffffffa05cc3b0 [zfs]
 #7 [ffff880636a99b58] space_map_sync at ffffffffa060fd2a [zfs]
 #8 [ffff880636a99c48] metaslab_sync at ffffffffa05fa93a [zfs]
 #9 [ffff880636a99cc8] vdev_sync at ffffffffa06141ff [zfs]
#10 [ffff880636a99d18] spa_sync at ffffffffa06003eb [zfs]
#11 [ffff880636a99de8] txg_sync_thread at ffffffffa0611c61 [zfs]
#12 [ffff880636a99eb8] thread_generic_wrapper at ffffffffa04cb748 [spl]
#13 [ffff880636a99ee8] kthread at ffffffff810906a6
#14 [ffff880636a99f48] kernel_thread at ffffffff8100c14a

A child read zio, issued as a consequence of the aforementioned write; it's blocked on its parent's zio mutex.

PID: 5210   TASK: ffff8806358c3540  CPU: 7   COMMAND: "z_rd_iss/7"
 #0 [ffff88062f64fc30] schedule at ffffffff814ee4a0
 #1 [ffff88062f64fcf8] __mutex_lock_slowpath at ffffffff814efb3e
 #2 [ffff88062f64fd68] mutex_lock at ffffffff814ef9db
 #3 [ffff88062f64fd88] zio_ready at ffffffffa0654d78 [zfs]
 #4 [ffff88062f64fde8] zio_execute at ffffffffa06546c9 [zfs]
 #5 [ffff88062f64fe18] taskq_thread at ffffffffa04cc6c2 [spl]
 #6 [ffff88062f64fee8] kthread at ffffffff810906a6
 #7 [ffff88062f64ff48] kernel_thread at ffffffff8100c14a

The z_null_iss thread, which is holding the parent zio mutex and blocking the child zio. It's attempting to drop the mutex, but in the process it must take the mutex's wait_lock to safely wake all waiters. That spin lock, however, is contended, and the thread spins on it consuming an entire CPU; this is easily visible in top.

 #6 [ffff880636b0fd50] _spin_lock at ffffffff814f0dce
 #7 [ffff880636b0fd58] __mutex_unlock_slowpath at ffffffff814ef735
 #8 [ffff880636b0fd78] mutex_unlock at ffffffff814ef70b
 #9 [ffff880636b0fd88] zio_ready at ffffffffa0654cf0 [zfs]
#10 [ffff880636b0fde8] zio_execute at ffffffffa06546c9 [zfs]
#11 [ffff880636b0fe18] taskq_thread at ffffffffa04cc6c2 [spl]
#12 [ffff880636b0fee8] kthread at ffffffff810906a6
#13 [ffff880636b0ff48] kernel_thread at ffffffff8100c14a
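
For context on where that spin happens: unlocking a contended sleeping mutex goes through the kernel's unlock slow path, which must take the mutex's internal wait_lock (an x86 ticket spin lock) before waking a waiter. Below is a simplified model of the ticket-lock part, purely to illustrate why a freshly reinitialized lock can strand a spinner; it is a sketch, not the kernel's code.

/*
 * Simplified ticket spin lock, loosely modelled after the x86 spin lock
 * used for the mutex's wait_lock.  Illustration only, not kernel source.
 */
#include <stdatomic.h>

struct ticket_lock {
	atomic_ushort head;   /* ticket currently being served */
	atomic_ushort tail;   /* next ticket to hand out       */
};                            /* packed together, these form "slock" */

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/* Take a ticket, then spin until it is our turn. */
	unsigned short me = atomic_fetch_add(&l->tail, 1);
	while (atomic_load(&l->head) != me)
		;   /* roughly where the _spin_lock frame above is spinning */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add(&l->head, 1);   /* serve the next ticket */
}

/*
 * mutex_unlock()'s slow path acquires this lock to walk the wait list and
 * wake a sleeper.  If the mutex memory is freed and mutex_init() runs on
 * it again (the suspected use-after-free) while a thread is still spinning
 * here, head/tail restart from zero and that thread's ticket is never
 * served: it spins forever at 100% CPU, matching the observed symptom.
 */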

Unfortunately, I couldn't determine what (if anything) is holding that spin lock. I dug up the exact location of the lock in memory and dumped its contents, but everything looks sane. I even verified that the one task struct listed on the wait list is the previous thread. Here's the lock in memory.


crash> p *(struct mutex *)0xffff88017c00b678
$9 = {
  count = {
    counter = -1
  }, 
  wait_lock = {
    raw_lock = {
      slock = 131074
    }
  }, 
  wait_list = {
    next = 0xffff88062f64fd10, 
    prev = 0xffff88062f64fd10
  }, 
  owner = 0x0
}
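
For what it's worth, the dumped slock value decodes to a nearly fresh ticket lock. The decode below assumes the 16-bit head/tail split (how the halves are laid out depends on the kernel's NR_CPUS configuration), so treat it as an illustration rather than a definitive reading.

/* Decode the dumped slock value, assuming a 16-bit head/tail ticket layout. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t slock = 131074;                      /* value from the crash dump */
	uint16_t head  = (uint16_t)(slock & 0xffff);  /* ticket being served       */
	uint16_t tail  = (uint16_t)(slock >> 16);     /* next ticket to issue      */

	/* head == tail: not held; both tiny, so only a couple of
	 * acquire/release cycles have happened since mutex_init(). */
	printf("slock=0x%08x head=%u tail=%u held=%s\n",
	    slock, head, tail, head == tail ? "no" : "yes");
	return (0);
}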

So there are a couple of possible explanations for this, but I haven't been able to confirm either from the crash dump alone.

The next step is going to have to be to reproduce this issue with CONFIG_DEBUG_SPINLOCK set. This will allow us to identify the task struct which owns the spin lock. I was unable to do this simply by looking at all the stack traces.

ryao commented 12 years ago

Brian, is it possible that a thread freed this lock, but the free wasn't done correctly?

prakashsurya commented 12 years ago

In the above mutex it shows owner = 0x0; is that valid?

behlendorf commented 12 years ago

@Prakash-Surya That's to be expected since we're past the point in mutex_unlock() which clears the owner.

prakashsurya commented 12 years ago

I ran fdtree -l 30 -d 2 -f 1024 -s 1 -o $ZFS_DIR over the weekend (2+ days) with ASSERTs enabled in ZFS and SPL, along with a debug kernel with some lock debugging enabled and didn't see this issue come up. So either the extra debugging makes this condition occur less frequently, or it's just very difficult to trigger in general (or both?). Either way, I thought I should report this.

behlendorf commented 12 years ago

Thanks Prakash. I should also update this issue with our current thinking on how this bug must be happening... even if we can't reproduce it. The only scenario I can think of which entirely explains this issue is a use-after-free bug: a zio is destroyed while another thread is still in the middle of dropping that zio's mutex, its memory is immediately reused for a new zio, and the mutex is reinitialized underneath the thread that is still unlocking it.

This effectively results in the deadlock observed above, although the race window for this to occur is only a few instructions wide. However, we do have some evidence that this is exactly what happened: the value of the ticket spinlock above is 0x00020002, which indicates it was recently initialized. This value is far too low for anything but a new zio, since it gets incremented on every mutex_enter() or mutex_exit().
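
To make the window concrete, a rough sketch of the pre-workaround zio lifecycle is below. Every allocation initializes the embedded lock in place and every free destroys it, so a stale unlocker can race against a brand-new lock. This is a reconstruction of the general shape, not the exact source; field names such as io_lock and io_cv are assumptions.

/*
 * Sketch of the zio lifecycle before the workaround: the mutex lives
 * inside the zio and is initialized/destroyed on every alloc and free.
 */
static zio_t *
zio_alloc_sketch(void)
{
	zio_t *zio = kmem_cache_alloc(zio_cache, KM_SLEEP);

	bzero(zio, sizeof (zio_t));
	mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);   /* fresh lock */
	cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);
	return (zio);
}

static void
zio_free_sketch(zio_t *zio)
{
	mutex_destroy(&zio->io_lock);
	cv_destroy(&zio->io_cv);
	kmem_cache_free(zio_cache, zio);
	/*
	 * If another thread is still inside mutex_exit() on io_lock at this
	 * point, the memory can be reallocated and mutex_init()'d again
	 * underneath it -- the suspected use-after-free.
	 */
}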

Unfortunately, I haven't been able to find the exact call path where this is happening, and it's been difficult to reproduce. Incidentally, this issue could potentially go unobserved in other ports. The only reason it causes a problem on Linux is the nitty-gritty of how an x86_64 spin lock is implemented; an alternate mutex implementation could easily not hit an issue here.

behlendorf commented 12 years ago

The above patch isn't a fix, but it might help avoid the problem by moving the mutex_init() call into a constructor for the slab, which will be called far, far less frequently. If someone who is suffering from this issue could try the patch it would be helpful. Incidentally, you can tell whether this is the issue you're suffering from by checking whether one of the z_* threads is consuming 100% of one CPU when it occurs.
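
For anyone curious what moving mutex_init() into a slab constructor means in practice, a sketch of the idea is below. The function names (zio_cons, zio_dest) and the exact kmem_cache_create() arguments are guesses at the SPL slab API rather than a quote of the patch; the point is that the lock is initialized once per slab object, when the object is constructed, instead of on every zio allocation, so a reallocation no longer reinitializes a lock someone may still be touching.

/*
 * Sketch of the workaround: initialize the zio's locks in the slab object
 * constructor, which runs when the slab object is created, not on every
 * allocation from the cache.
 */
static kmem_cache_t *zio_cache;

static int
zio_cons(void *buf, void *unused, int kmflag)
{
	zio_t *zio = buf;

	bzero(zio, sizeof (zio_t));
	mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);

	return (0);
}

static void
zio_dest(void *buf, void *unused)
{
	zio_t *zio = buf;

	mutex_destroy(&zio->io_lock);
	cv_destroy(&zio->io_cv);
}

static void
zio_init_sketch(void)
{
	/* Register the constructor/destructor with the zio slab cache. */
	zio_cache = kmem_cache_create("zio_cache", sizeof (zio_t), 0,
	    zio_cons, zio_dest, NULL, NULL, NULL, 0);
}

/*
 * Note: per-allocation code must then reset the zio's fields individually
 * rather than bzero()ing the whole struct, or it would wipe the live locks.
 */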

asnow0 commented 12 years ago

I've been using ZFS with this patch for almost 3 days now, and so far so good. The systems in question were under a good amount of load (several large file copies to the zpool, multiple recv operations, etc) and nothing has failed yet.

prakashsurya commented 12 years ago

@asnow0 Is it safe to assume you were hitting this deadlock regularly prior to this patch?

tisoft commented 12 years ago

I had this deadlock regularly before this patch. The server has now been running for 4 days with the patch and I haven't hit the deadlock.

asnow0 commented 12 years ago

@Prakash-Surya yes, I was seeing this issue pretty frequently before applying the patch.

behlendorf commented 12 years ago

I've merged this workaround into master because we have good evidence that it does help avoid the issue. However, I'm going to leave this issue open until we resolve the real root cause, at which point the workaround can be reverted.