openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

dbuf_evict / txg_sync hang #6880

Closed cwedgwood closed 5 years ago

cwedgwood commented 7 years ago

linux 4.9.61 zfs d4a72f23863382bdf6d0ae33196f5b5decbc48fd (origin/master as of a few hours ago)

dbuf_evict      D12200   508      2 0x00000000
 ffff880c2425b880 00000000ffffffff ffff88098d3ada00 ffff88061f113880
 ffff880c3fd97900 ffffc9000deebd48 ffffffff816ad811 ffff8805e0830938
 00ffc9000deebd30 ffff880c3fd97900 ffffc9000deebd70 ffff88061f113880
Call Trace:
 [<ffffffff816ad811>] ? __schedule+0x221/0x6f0
 [<ffffffff816add16>] schedule+0x36/0x80
 [<ffffffff816adfae>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff816af7f0>] __mutex_lock_slowpath+0x90/0x100
 [<ffffffff816af877>] mutex_lock+0x17/0x30
 [<ffffffffa1293907>] dbuf_destroy+0x157/0x740 [zfs]
 [<ffffffffa1293fcb>] dbuf_evict_one+0xdb/0x160 [zfs]
 [<ffffffffa12941f3>] dbuf_evict_thread+0x1a3/0x330 [zfs]
 [<ffffffffa1294050>] ? dbuf_evict_one+0x160/0x160 [zfs]
 [<ffffffffa0f5c440>] ? __thread_exit+0x20/0x20 [spl]
 [<ffffffffa0f5c4bb>] thread_generic_wrapper+0x7b/0xc0 [spl]
 [<ffffffff810a5ae7>] kthread+0xd7/0xf0
 [<ffffffff810a5a10>] ? kthread_park+0x60/0x60
 [<ffffffff816b2a47>] ret_from_fork+0x27/0x40

txg_sync        D10456  3551      2 0x00000000
 ffff880c241ed4c0 00000000ffffffff ffff8806225e0000 ffff88061b871c40
 ffff880c3fcd7900 ffffc90028ba79a8 ffffffff816ad811 000000000003ee90
 00ff8805f845ae18 ffff880c3fcd7900 ffffffff811d39da ffff88061b871c40
Call Trace:
 [<ffffffff816ad811>] ? __schedule+0x221/0x6f0
 [<ffffffff811d39da>] ? kmem_cache_free+0x1ba/0x1e0
 [<ffffffff816add16>] schedule+0x36/0x80
 [<ffffffff816adfae>] schedule_preempt_disabled+0xe/0x10
 [<ffffffff816af7f0>] __mutex_lock_slowpath+0x90/0x100
 [<ffffffff816af877>] mutex_lock+0x17/0x30
 [<ffffffffa1293907>] dbuf_destroy+0x157/0x740 [zfs]
 [<ffffffffa1293fcb>] dbuf_evict_one+0xdb/0x160 [zfs]
 [<ffffffffa129344a>] dbuf_rele_and_unlock+0x50a/0x540 [zfs]
 [<ffffffffa0f58e5a>] ? spl_kmem_free+0x2a/0x40 [spl]
 [<ffffffffa1352f5b>] ? __dprintf+0xfb/0x140 [zfs]
 [<ffffffffa129377b>] dbuf_rele+0x4b/0x80 [zfs]
 [<ffffffffa129817e>] dmu_buf_rele+0xe/0x10 [zfs]
 [<ffffffffa13427be>] zap_put_leaf+0x6e/0x90 [zfs]
 [<ffffffffa1343445>] fzap_lookup+0x125/0x130 [zfs]
 [<ffffffffa1348c1c>] zap_lookup_impl+0xcc/0x1d0 [zfs]
 [<ffffffffa1349570>] zap_lookup_norm+0x80/0xb0 [zfs]
 [<ffffffffa134961b>] zap_contains+0x3b/0x50 [zfs]
 [<ffffffffa130d25a>] spa_sync+0x96a/0x1240 [zfs]
 [<ffffffffa13257f6>] txg_sync_thread+0x2c6/0x4f0 [zfs]
 [<ffffffffa1325530>] ? txg_quiesce_thread+0x4f0/0x4f0 [zfs]
 [<ffffffffa0f5c440>] ? __thread_exit+0x20/0x20 [spl]
 [<ffffffffa0f5c4bb>] thread_generic_wrapper+0x7b/0xc0 [spl]
 [<ffffffff810a5ae7>] kthread+0xd7/0xf0
 [<ffffffff810a5a10>] ? kthread_park+0x60/0x60
 [<ffffffff816b2a47>] ret_from_fork+0x27/0x40

@behlendorf if this happens again (i'm going to reboot i guess) how ideally do i find out who is holding the mutex?

behlendorf commented 7 years ago

@cwedgwood unfortunately it's not much fun. If you have crash(8) set up on your system you can use it to dump full stack frames, then find the offending dbuf pointer on the stack. From there you can extract the mutex and print the stored owner.
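
For anyone who hits this and wants to try that, a rough sketch of such a crash(8) session follows; the dbuf address is a placeholder you'd pull from the dumped frames, and having matching vmlinux and zfs/spl debug symbols available is assumed:

# attach to the running kernel; needs a vmlinux with debug info for this build
crash /usr/lib/debug/boot/vmlinux-$(uname -r) /proc/kcore

# load module symbols so zfs/spl structures can be decoded
crash> mod -S

# dump full stack frames for the stuck dbuf_evict thread (PID 508 above)
# and look through the saved frame data for the dmu_buf_impl_t pointer
crash> bt -f 508

# print the dbuf and then the mutex it is waiting on; <dbuf-addr> and
# <mutex-addr> are placeholders, and the exact member layout (and whether
# the owner field is populated) depends on the kernel/zfs build
crash> struct dmu_buf_impl_t <dbuf-addr>
crash> struct mutex <mutex-addr>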

If this is a new issue, do you have a feeling for roughly when it was introduced?

cwedgwood commented 7 years ago

@behlendorf

new to me ... i reluctantly upgraded recently from what i found to be quite stable

upgraded from 4.9.20+ w/ d45670852 to 4.9.61+ w/ d4a72f238

behlendorf commented 7 years ago

@cwedgwood definitely let us know if you hit this again.

cwedgwood commented 6 years ago

@behlendorf i hit this again, both times with 'heavy' zvol access

what's unclear to me is why other people are not seeing this as i'm sure the use case of running VMs over zvols is very common

kernel 4.14.x zfs from master ba88ef6bbdcfbf4cb6edfdfceb2d7e5a5e36c72b


INFO: task dbuf_evict:1210 blocked for more than 45 seconds.
      Tainted: G           O    4.14.1.cw2 #37
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbuf_evict      D14624  1210      2 0x80000000
Call Trace:
 ? __schedule+0x39c/0x830
 schedule+0x32/0x80
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.2+0x21a/0x490
 ? kmem_cache_free+0x1a5/0x1d0
 ? dbuf_destroy+0x14b/0x730 [zfs]
 dbuf_destroy+0x14b/0x730 [zfs]
 dbuf_evict_one+0xed/0x1c0 [zfs]
 dbuf_evict_thread+0x19e/0x330 [zfs]
 ? dbuf_evict_one+0x1c0/0x1c0 [zfs]
 thread_generic_wrapper+0x76/0xb0 [spl]
 kthread+0xfc/0x130
 ? __thread_exit+0x20/0x20 [spl]
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x27/0x40

INFO: task qemu-system-x86:12484 blocked for more than 45 seconds.
      Tainted: G           O    4.14.1.cw2 #37
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
qemu-system-x86 D13064 12484      1 0x00000000
Call Trace:
 ? __schedule+0x39c/0x830
 schedule+0x32/0x80
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.2+0x21a/0x490
 ? kmem_cache_free+0x1a5/0x1d0
 ? dbuf_destroy+0x14b/0x730 [zfs]
 dbuf_destroy+0x14b/0x730 [zfs]
 dbuf_evict_one+0xed/0x1c0 [zfs]
 dbuf_rele_and_unlock+0x4e4/0x510 [zfs]
 ? kmem_cache_free+0x1a5/0x1d0
 ? kmem_cache_free+0x1a5/0x1d0
 ? dbuf_destroy+0x389/0x730 [zfs]
 dnode_evict_dbufs+0x1de/0x260 [zfs]
 dmu_objset_evict_dbufs+0x122/0x2d0 [zfs]
 zvol_last_close+0x62/0xb0 [zfs]
 zvol_release+0x1df/0x440 [zfs]
 __blkdev_put+0x189/0x1f0
 blkdev_close+0x21/0x30
 __fput+0xc7/0x1c0
 task_work_run+0x90/0xb0
 exit_to_usermode_loop+0xa2/0xb0
 syscall_return_slowpath+0x80/0xa0
 entry_SYSCALL_64_fastpath+0xa6/0xa8
RIP: 0033:0x7f722f6b326d
RSP: 002b:00007ffdc3a95690 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000561e6475fb20 RCX: 00007f722f6b326d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000d
RBP: 0000561e64708d30 R08: 00007ffdc3a95600 R09: 00007f7234b43b28
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000010
R13: 0000000000000003 R14: 0000000000000000 R15: 0000561e62698f80
behlendorf commented 6 years ago

> what's unclear to me is why other people are not seeing this.

@cwedgwood you may not be the only one. It looks like this is related to #6952, and you're seeing the stacks since hung_task_timeout_secs is tuned to a more aggressive 45s on your system (which I agree is still crazy long).
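
(For reference, that watchdog threshold lives in procfs and can be checked or put back near the usual stock value at runtime; a minimal sketch:)

# current hung-task warning threshold, in seconds
cat /proc/sys/kernel/hung_task_timeout_secs

# restore the common 120s default (0 disables the warnings entirely)
echo 120 > /proc/sys/kernel/hung_task_timeout_secs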

According to the profiling in #6952 there's a significant amount of lock contention evicting from the dbuf cache when it's filled with many small 8k blocks from zvols. And your backtraces suggest that dnode_evict_dbufs() can take quite a long time to evict those dbufs. We're going to need to determine exactly why.

While we run this down you could make the following safe change which should avoid the problem. Normally, when the last process to have a zvol open closes it, all of the cached buffers backing it are immediately dropped. That's taking a long time on your system and isn't strictly needed; the buffers will be evicted from the ARC as needed anyway.

diff --git a/module/zfs/zvol.c b/module/zfs/zvol.c
index 6ea8224..9949e92 100644
--- a/module/zfs/zvol.c
+++ b/module/zfs/zvol.c
@@ -1211,7 +1211,6 @@ zvol_shutdown_zv(zvol_state_t *zv)
        if (dsl_dataset_is_dirty(dmu_objset_ds(zv->zv_objset)) &&
            !(zv->zv_flags & ZVOL_RDONLY))
                txg_wait_synced(dmu_objset_pool(zv->zv_objset), 0);
-       (void) dmu_objset_evict_dbufs(zv->zv_objset);
 }

 /*
cwedgwood commented 6 years ago

@behlendorf i will make the suggested change later today

that said, the zvols here are not opened/closed in my case but stay open (for days at a time)

the last time i reported this, things were hung for many hours before i caught it; attempting to nuke processes/etc did not help and it usually required a hard (ipmi) reboot

cwedgwood commented 6 years ago

@behlendorf is there any reason, when making this change, that i should or should not update to current origin/master (89a66a0457cd392ab8c6ad6d9c138fedaa425067 right now)?

behlendorf commented 6 years ago

@cwedgwood then I misunderstood, I didn't realize the entire system was deadlocked on this. So removing the dmu_objset_evict_dbufs() call may prevent this for now but we'll absolutely need to sort out the deadlock. As for upgrading to 89a66a0 everything is currently in good shape.

cwedgwood commented 6 years ago

@behlendorf reading over this now i see your comment wrt dmu_objset_evict_dbufs likely came out of a stack trace where i killed qemu ... but that was after the hang ...

i don't have a clean understanding of why things hang in the first place, at one point i thought it was because of a bad disk in l2arc but i'm running without that these days so i can't place blame there

is there some way i can build the kernel to know who owns a lock should this happen again? (without completely crushing the performance)

also, is there a tag or something in github to remind me to close this if it doesn't show up in 90 days or so?

behlendorf commented 6 years ago

@cwedgwood it's only been 50 days, but if you haven't seen this again do you mind if we close this one out?

cwedgwood commented 6 years ago

@behlendorf the bug is still there, just now i got the trace below

it's definitely related to qemu/kvm closing/opening zvols and i'm going to retest with your suggested change above as i think that helped last time i had it

i think i can probably reproduce this fairly easily (happens within a day usually) if you have some suggestions on what else to try

INFO: task dbuf_evict:1222 blocked for more than 45 seconds.
      Tainted: P           O    4.14.23.cw3 #43
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbuf_evict      D14648  1222      2 0x80000000
Call Trace:
 ? __schedule+0x39c/0x830
 schedule+0x32/0x80
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.2+0x21a/0x490
 ? kmem_cache_free+0x1a5/0x1d0
 ? dbuf_destroy+0x14b/0x7b0 [zfs]
 dbuf_destroy+0x14b/0x7b0 [zfs]
 dbuf_evict_one+0x129/0x1e0 [zfs]
 dbuf_evict_thread+0x19c/0x320 [zfs]
 ? dbuf_evict_one+0x1e0/0x1e0 [zfs]
 thread_generic_wrapper+0x76/0xb0 [spl]
 kthread+0xfc/0x130
 ? __thread_exit+0x20/0x20 [spl]
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x24/0x30
INFO: task txg_sync:5811 blocked for more than 45 seconds.
      Tainted: P           O    4.14.23.cw3 #43
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync        D12344  5811      2 0x80000000
Call Trace:
 ? __schedule+0x39c/0x830
 schedule+0x32/0x80
 schedule_preempt_disabled+0xa/0x10
 __mutex_lock.isra.2+0x21a/0x490
 ? kmem_cache_free+0x1a5/0x1d0
 ? dbuf_destroy+0x14b/0x7b0 [zfs]
 dbuf_destroy+0x14b/0x7b0 [zfs]
 dbuf_evict_one+0x129/0x1e0 [zfs]
 dbuf_rele_and_unlock+0x565/0x590 [zfs]
 ? zap_lookup_impl+0x8f/0x1c0 [zfs]
 zap_lookup_norm+0x7f/0xa0 [zfs]
 zap_contains+0x1a/0x30 [zfs]
 spa_sync+0x953/0x1170 [zfs]
 txg_sync_thread+0x2c4/0x4d0 [zfs]
 ? txg_quiesce_thread+0x4e0/0x4e0 [zfs]
 thread_generic_wrapper+0x76/0xb0 [spl]
 kthread+0xfc/0x130
 ? __thread_exit+0x20/0x20 [spl]
 ? kthread_create_on_node+0x40/0x40
 ret_from_fork+0x24/0x30

kernel: 4.14.23 zfs: e086e717c spl: 3673d03

richardelling commented 6 years ago

@cwedgwood in the past, we've been able to debug the zvol evicting its ARC buffers on close() by simply doing a dd to a zvol. When dd calls close(), the zvol_shutdown_zv() function described above kicks in. In VMs, this would occur when a VM that uses a zvol is stopped. Hopefully you'll be able to reproduce without having to wait a few days before stopping a VM.
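
A minimal sketch of that reproducer, with hypothetical pool/volume names and a small volblocksize chosen only to generate lots of small cached dbufs:

# create a throwaway zvol with 8k blocks (names are examples only)
zfs create -V 10G -b 8k tank/testvol

# write enough small blocks to populate the dbuf/ARC caches, then let dd's
# close() run zvol_last_close() -> dmu_objset_evict_dbufs() on the volume
dd if=/dev/zero of=/dev/zvol/tank/testvol bs=8k count=500000 oflag=direct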

behlendorf commented 6 years ago

@cwedgwood @richardelling PR #7273 may also help with this issue. This is a port of OpenZFS https://github.com/openzfs/openzfs/pull/564 which increases the default dbuf cache size. This can improve performance by keeping more indirect blocks in the dbuf cache which is particularly important when the block size is small.

You can see if this helps without patching anything by increasing the dbuf_cache_max_bytes module option to 1/32 of the target ARC size.
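
For example, with a 32 GiB ARC target that works out to a 1 GiB dbuf cache; a sketch of setting it, assuming the parameter is writable at runtime on your build (the size is only an illustration):

# check the current dbuf cache cap
cat /sys/module/zfs/parameters/dbuf_cache_max_bytes

# raise it to 1 GiB (1/32 of an assumed 32 GiB ARC target)
echo 1073741824 > /sys/module/zfs/parameters/dbuf_cache_max_bytes

# or make it persistent across module reloads
echo "options zfs dbuf_cache_max_bytes=1073741824" >> /etc/modprobe.d/zfs.conf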

cwedgwood commented 6 years ago

@behlendorf i saw the upstream issue, it's unclear to me how a performance fix will prevent what's a hard lock/failure in this case

richardelling commented 6 years ago

Today we can already tune the dbuf cache size in sysfs, so it is easy to test.

However, I don't think the size is where the bug lies, though increasing the size might make it easier to trigger the bug when compressed ARC is in play.

behlendorf commented 6 years ago

@cwedgwood yes of course, I was thinking this was solely a performance issue. Reading back through the issue again, it's pretty clear that things are in fact blocking on the dn->dn_dbufs_mtx in dbuf_destroy(). What remains unclear is where that mutex is held.

STRML commented 6 years ago

Have received a similar lockup on a machine that receives very frequent compressed recvs for 4 datasets (not zvols) using znapzend.

Using 0.7.8 release.

Full syslog is lengthy but starts the same way.

[360861.121445] INFO: task dbuf_evict:281 blocked for more than 120 seconds.
[360861.124470]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.127280] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.130785] dbuf_evict      D ffff88007bb63d58     0   281      2 0x00000000
[360861.130790]  ffff88007bb63d58 0000000099cb3e01 ffff8807f5d25280 ffff88007bb64000
[360861.130792]  ffff8801fadd3754 ffff8807f5d25280 00000000ffffffff ffff8801fadd3758
[360861.130795]  ffff88007bb63d70 ffffffff817e9075 ffff8801fadd3750 ffff88007bb63d80
[360861.130798] Call Trace:
[360861.130805]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.130808]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.130811]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.130813]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.130858]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.130878]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.130898]  [<ffffffffc02bfdf7>] dbuf_evict_thread+0x107/0x1d0 [zfs]
[360861.130917]  [<ffffffffc02bfcf0>] ? dbuf_evict_one+0xd0/0xd0 [zfs]
[360861.130923]  [<ffffffffc00853a0>] ? __thread_exit+0x20/0x20 [spl]
[360861.130927]  [<ffffffffc0085413>] thread_generic_wrapper+0x73/0x80 [spl]
[360861.130931]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.130933]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.130937]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.130939]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.130943] INFO: task z_wr_iss:632 blocked for more than 120 seconds.
[360861.133905]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.136785] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.140319] z_wr_iss        D ffff8807f7c3b768     0   632      2 0x00000000
[360861.140323]  ffff8807f7c3b768 0000000000000000 ffff8807f9a76e00 ffff8807f7c3c000
[360861.140326]  ffff8801fadd3754 ffff8807f9a76e00 00000000ffffffff ffff8801fadd3758
[360861.140330]  ffff8807f7c3b780 ffffffff817e9075 ffff8801fadd3750 ffff8807f7c3b790
[360861.140336] Call Trace:
[360861.140340]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.140346]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.140349]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.140354]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.140381]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.140401]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.140419]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.140439]  [<ffffffffc02c32ac>] ? dmu_buf_replace_user+0x6c/0x90 [zfs]
[360861.140442]  [<ffffffff817eae02>] ? down_read+0x12/0x30
[360861.140474]  [<ffffffffc033629d>] ? zap_get_leaf_byblk.isra.12+0xbd/0x220 [zfs]
[360861.140494]  [<ffffffffc02bf8c6>] dbuf_rele+0x36/0x40 [zfs]
[360861.140513]  [<ffffffffc02c219e>] dmu_buf_rele+0xe/0x10 [zfs]
[360861.140544]  [<ffffffffc0336c89>] zap_put_leaf+0x39/0x50 [zfs]
[360861.140576]  [<ffffffffc033780a>] fzap_length+0x7a/0x90 [zfs]
[360861.140607]  [<ffffffffc033bb2d>] ? zap_name_alloc_uint64+0x5d/0x70 [zfs]
[360861.140637]  [<ffffffffc033c59c>] zap_length_uint64+0x7c/0xf0 [zfs]
[360861.140659]  [<ffffffffc02c72d5>] ddt_zap_lookup+0x55/0xc0 [zfs]
[360861.140678]  [<ffffffffc02c4d4b>] ddt_object_lookup+0x5b/0x60 [zfs]
[360861.140698]  [<ffffffffc02c6082>] ddt_lookup+0xb2/0x1b0 [zfs]
[360861.140726]  [<ffffffffc0308c79>] ? abd_checksum_SHA256+0x69/0xd0 [zfs]
[360861.140759]  [<ffffffffc0376ca1>] ? zio_checksum_compute+0x191/0x320 [zfs]
[360861.140762]  [<ffffffff811e0088>] ? __kmalloc_node+0xb8/0x2c0
[360861.140765]  [<ffffffff817eac22>] ? mutex_lock+0x12/0x2f
[360861.140797]  [<ffffffffc0374a01>] zio_ddt_write+0x61/0x4b0 [zfs]
[360861.140802]  [<ffffffffc0082f94>] ? spl_kmem_alloc+0xc4/0x170 [spl]
[360861.140834]  [<ffffffffc0370d19>] ? zio_push_transform+0x39/0x90 [zfs]
[360861.140866]  [<ffffffffc037120b>] ? zio_write_compress+0x49b/0x670 [zfs]
[360861.140906]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.140913]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.140919]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.140949]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.140954]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.140957]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.140959]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.140962]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.140964]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.140967] INFO: task z_wr_iss:633 blocked for more than 120 seconds.
[360861.143901]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.146687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.150086] z_wr_iss        D ffff8807f2c8f6f8     0   633      2 0x00000000
[360861.150090]  ffff8807f2c8f6f8 00000001002e0023 ffff8807f63d2940 ffff8807f2c90000
[360861.150094]  ffff8801fadd3754 ffff8807f63d2940 00000000ffffffff ffff8801fadd3758
[360861.150097]  ffff8807f2c8f710 ffffffff817e9075 ffff8801fadd3750 ffff8807f2c8f720
[360861.150101] Call Trace:
[360861.150108]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.150116]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.150119]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.150123]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.150152]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.150173]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.150191]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.150210]  [<ffffffffc02be956>] ? dbuf_read+0x646/0x960 [zfs]
[360861.150229]  [<ffffffffc02bf8c6>] dbuf_rele+0x36/0x40 [zfs]
[360861.150248]  [<ffffffffc02c219e>] dmu_buf_rele+0xe/0x10 [zfs]
[360861.150280]  [<ffffffffc033616b>] zap_idx_to_blk+0xfb/0x170 [zfs]
[360861.150311]  [<ffffffffc03364a1>] zap_deref_leaf+0xa1/0xd0 [zfs]
[360861.150341]  [<ffffffffc03377c2>] fzap_length+0x32/0x90 [zfs]
[360861.150372]  [<ffffffffc033bb2d>] ? zap_name_alloc_uint64+0x5d/0x70 [zfs]
[360861.150403]  [<ffffffffc033c59c>] zap_length_uint64+0x7c/0xf0 [zfs]
[360861.150424]  [<ffffffffc02c72d5>] ddt_zap_lookup+0x55/0xc0 [zfs]
[360861.150443]  [<ffffffffc02c4d4b>] ddt_object_lookup+0x5b/0x60 [zfs]
[360861.150463]  [<ffffffffc02c6082>] ddt_lookup+0xb2/0x1b0 [zfs]
[360861.150491]  [<ffffffffc0308c79>] ? abd_checksum_SHA256+0x69/0xd0 [zfs]
[360861.150494]  [<ffffffff810b2064>] ? set_next_entity+0xa4/0x710
[360861.150496]  [<ffffffff810b4355>] ? pick_next_entity+0xa5/0x160
[360861.150500]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.150502]  [<ffffffff810a7253>] ? finish_task_switch+0x73/0x2a0
[360861.150504]  [<ffffffff817e8af6>] ? __schedule+0x2d6/0x820
[360861.150507]  [<ffffffff817e9272>] ? preempt_schedule_common+0x18/0x26
[360861.150509]  [<ffffffff817e929c>] ? _cond_resched+0x1c/0x30
[360861.150511]  [<ffffffff817eac22>] ? mutex_lock+0x12/0x2f
[360861.150543]  [<ffffffffc0374a01>] zio_ddt_write+0x61/0x4b0 [zfs]
[360861.150548]  [<ffffffffc0082f94>] ? spl_kmem_alloc+0xc4/0x170 [spl]
[360861.150580]  [<ffffffffc0370d19>] ? zio_push_transform+0x39/0x90 [zfs]
[360861.150612]  [<ffffffffc037120b>] ? zio_write_compress+0x49b/0x670 [zfs]
[360861.150649]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.150657]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.150660]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.150694]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.150699]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.150702]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.150704]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.150707]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.150709]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.150712] INFO: task z_wr_iss:634 blocked for more than 120 seconds.
[360861.153727]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.156507] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.159833] z_wr_iss        D ffff8807f927b6f8     0   634      2 0x00000000
[360861.159838]  ffff8807f927b6f8 0000000000000000 ffff8807f5dc3700 ffff8807f927c000
[360861.159843]  ffff8801fadd3754 ffff8807f5dc3700 00000000ffffffff ffff8801fadd3758
[360861.159846]  ffff8807f927b710 ffffffff817e9075 ffff8801fadd3750 ffff8807f927b720
[360861.159850] Call Trace:
[360861.159854]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.159857]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.159865]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.159871]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.159901]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.159929]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.159949]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.159968]  [<ffffffffc02be956>] ? dbuf_read+0x646/0x960 [zfs]
[360861.159987]  [<ffffffffc02bf8c6>] dbuf_rele+0x36/0x40 [zfs]
[360861.160005]  [<ffffffffc02c219e>] dmu_buf_rele+0xe/0x10 [zfs]
[360861.160037]  [<ffffffffc033616b>] zap_idx_to_blk+0xfb/0x170 [zfs]
[360861.160069]  [<ffffffffc03364a1>] zap_deref_leaf+0xa1/0xd0 [zfs]
[360861.160100]  [<ffffffffc03377c2>] fzap_length+0x32/0x90 [zfs]
[360861.160131]  [<ffffffffc033bb2d>] ? zap_name_alloc_uint64+0x5d/0x70 [zfs]
[360861.160162]  [<ffffffffc033c59c>] zap_length_uint64+0x7c/0xf0 [zfs]
[360861.160183]  [<ffffffffc02c72d5>] ddt_zap_lookup+0x55/0xc0 [zfs]
[360861.160203]  [<ffffffffc02c4d4b>] ddt_object_lookup+0x5b/0x60 [zfs]
[360861.160223]  [<ffffffffc02c6082>] ddt_lookup+0xb2/0x1b0 [zfs]
[360861.160251]  [<ffffffffc0308c79>] ? abd_checksum_SHA256+0x69/0xd0 [zfs]
[360861.160284]  [<ffffffffc0376ca1>] ? zio_checksum_compute+0x191/0x320 [zfs]
[360861.160287]  [<ffffffff811e0221>] ? __kmalloc_node+0x251/0x2c0
[360861.160290]  [<ffffffff817eac22>] ? mutex_lock+0x12/0x2f
[360861.160322]  [<ffffffffc0374a01>] zio_ddt_write+0x61/0x4b0 [zfs]
[360861.160327]  [<ffffffffc0082f94>] ? spl_kmem_alloc+0xc4/0x170 [spl]
[360861.160358]  [<ffffffffc0370d19>] ? zio_push_transform+0x39/0x90 [zfs]
[360861.160394]  [<ffffffffc037120b>] ? zio_write_compress+0x49b/0x670 [zfs]
[360861.160432]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.160438]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.160441]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.160472]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.160477]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.160480]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.160482]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.160485]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.160487]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.160490] INFO: task z_wr_int_0:636 blocked for more than 120 seconds.
[360861.163328]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.166227] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.169819] z_wr_int_0      D ffff8807f7a7baa8     0   636      2 0x00000000
[360861.169823]  ffff8807f7a7baa8 00000001002e0014 ffff8807f5dc1b80 ffff8807f7a7c000
[360861.169829]  ffff88062f3e1cd4 ffff8807f5dc1b80 00000000ffffffff ffff88062f3e1cd8
[360861.169832]  ffff8807f7a7bac0 ffffffff817e9075 ffff88062f3e1cd0 ffff8807f7a7bad0
[360861.169836] Call Trace:
[360861.169840]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.169843]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.169850]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.169858]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.169888]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.169916]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.169936]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.169961]  [<ffffffffc02eef73>] ? dsl_dir_diduse_space+0x113/0x180 [zfs]
[360861.169965]  [<ffffffffc0082d5a>] ? spl_kmem_free+0x2a/0x40 [spl]
[360861.169968]  [<ffffffff811e0fb7>] ? kfree+0x117/0x150
[360861.169986]  [<ffffffffc02c298d>] dbuf_write_done+0xed/0x1b0 [zfs]
[360861.170005]  [<ffffffffc02b601a>] arc_write_done+0x24a/0x3b0 [zfs]
[360861.170037]  [<ffffffffc0375891>] zio_done+0x2a1/0xca0 [zfs]
[360861.170040]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.170072]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.170077]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.170080]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.170112]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.170118]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.170120]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.170122]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.170125]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.170127]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.170130] INFO: task z_wr_int_3:639 blocked for more than 120 seconds.
[360861.173162]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.176701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.181234] z_wr_int_3      D ffff8807f82bbaa8     0   639      2 0x00000000
[360861.181238]  ffff8807f82bbaa8 00000001002e0028 ffff8807f5dc6040 ffff8807f82bc000
[360861.181240]  ffff88062f3e1cd4 ffff8807f5dc6040 00000000ffffffff ffff88062f3e1cd8
[360861.181243]  ffff8807f82bbac0 ffffffff817e9075 ffff88062f3e1cd0 ffff8807f82bbad0
[360861.181245] Call Trace:
[360861.181249]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.181251]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.181253]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.181256]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.181277]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.181296]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.181315]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.181341]  [<ffffffffc02eef73>] ? dsl_dir_diduse_space+0x113/0x180 [zfs]
[360861.181346]  [<ffffffffc0082d5a>] ? spl_kmem_free+0x2a/0x40 [spl]
[360861.181348]  [<ffffffff811e0fb7>] ? kfree+0x117/0x150
[360861.181367]  [<ffffffffc02c298d>] dbuf_write_done+0xed/0x1b0 [zfs]
[360861.181387]  [<ffffffffc02b601a>] arc_write_done+0x24a/0x3b0 [zfs]
[360861.181444]  [<ffffffffc0375891>] zio_done+0x2a1/0xca0 [zfs]
[360861.181452]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.181498]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.181509]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.181520]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.181568]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.181576]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.181580]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.181582]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.181586]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.181589]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.181592] INFO: task z_wr_int_4:640 blocked for more than 120 seconds.
[360861.185499]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.189374] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.194246] z_wr_int_4      D ffff8807f7c6baa8     0   640      2 0x00000000
[360861.194251]  ffff8807f7c6baa8 00000001002e0022 ffff8807f5dc0000 ffff8807f7c6c000
[360861.194258]  ffff8801fadd3754 ffff8807f5dc0000 00000000ffffffff ffff8801fadd3758
[360861.194261]  ffff8807f7c6bac0 ffffffff817e9075 ffff8801fadd3750 ffff8807f7c6bad0
[360861.194264] Call Trace:
[360861.194269]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.194272]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.194277]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.194286]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.194320]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.194340]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.194359]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.194384]  [<ffffffffc02eef73>] ? dsl_dir_diduse_space+0x113/0x180 [zfs]
[360861.194389]  [<ffffffffc0082d5a>] ? spl_kmem_free+0x2a/0x40 [spl]
[360861.194391]  [<ffffffff811e0fb7>] ? kfree+0x117/0x150
[360861.194410]  [<ffffffffc02c298d>] dbuf_write_done+0xed/0x1b0 [zfs]
[360861.194428]  [<ffffffffc02b601a>] arc_write_done+0x24a/0x3b0 [zfs]
[360861.194461]  [<ffffffffc0375891>] zio_done+0x2a1/0xca0 [zfs]
[360861.194464]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.194497]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.194502]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.194505]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.194536]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.194541]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.194544]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.194546]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.194549]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.194551]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.194553] INFO: task z_wr_int_5:641 blocked for more than 120 seconds.
[360861.198554]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.202399] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.206951] z_wr_int_5      D ffff8807f8157aa8     0   641      2 0x00000000
[360861.206955]  ffff8807f8157aa8 00000001002e0025 ffff8807f5dc6e00 ffff8807f8158000
[360861.206961]  ffff88062f3e1cd4 ffff8807f5dc6e00 00000000ffffffff ffff88062f3e1cd8
[360861.206964]  ffff8807f8157ac0 ffffffff817e9075 ffff88062f3e1cd0 ffff8807f8157ad0
[360861.206967] Call Trace:
[360861.206971]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.206977]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.206983]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.206986]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.207016]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.207046]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.207066]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.207091]  [<ffffffffc02eef73>] ? dsl_dir_diduse_space+0x113/0x180 [zfs]
[360861.207095]  [<ffffffffc0082d5a>] ? spl_kmem_free+0x2a/0x40 [spl]
[360861.207098]  [<ffffffff811e0fb7>] ? kfree+0x117/0x150
[360861.207117]  [<ffffffffc02c298d>] dbuf_write_done+0xed/0x1b0 [zfs]
[360861.207135]  [<ffffffffc02b601a>] arc_write_done+0x24a/0x3b0 [zfs]
[360861.207168]  [<ffffffffc0375891>] zio_done+0x2a1/0xca0 [zfs]
[360861.207170]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.207204]  [<ffffffffc03701a5>] zio_execute+0x95/0xf0 [zfs]
[360861.207209]  [<ffffffffc0086aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[360861.207212]  [<ffffffff810a9bf0>] ? wake_up_q+0x80/0x80
[360861.207244]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.207249]  [<ffffffffc00867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[360861.207252]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.207254]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.207257]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.207259]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.207264] INFO: task txg_sync:947 blocked for more than 120 seconds.
[360861.210992]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.214715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.219436] txg_sync        D ffff8807f3403b08     0   947      2 0x00000000
[360861.219440]  ffff8807f3403b08 ffffffffc0370110 ffff8807f99c6040 ffff8807f3404000
[360861.219446]  0000000000000000 7fffffffffffffff ffff8807f91337e8 0000000000000001
[360861.219449]  ffff8807f3403b20 ffffffff817e9075 ffff880800ad7180 ffff8807f3403bc8
[360861.219452] Call Trace:
[360861.219498]  [<ffffffffc0370110>] ? zio_taskq_member.isra.7.constprop.10+0x80/0x80 [zfs]
[360861.219505]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.219514]  [<ffffffff817eba5b>] schedule_timeout+0x23b/0x2d0
[360861.219558]  [<ffffffffc036f88f>] ? zio_taskq_dispatch+0x8f/0xa0 [zfs]
[360861.219602]  [<ffffffffc036f8d2>] ? zio_issue_async+0x12/0x20 [zfs]
[360861.219607]  [<ffffffff81022b95>] ? xen_clocksource_get_cycles+0x15/0x20
[360861.219610]  [<ffffffff810ef56e>] ? ktime_get+0x3e/0xb0
[360861.219613]  [<ffffffff817e87b6>] io_schedule_timeout+0xa6/0x110
[360861.219619]  [<ffffffffc008a2bd>] cv_wait_common+0xad/0x120 [spl]
[360861.219622]  [<ffffffff810c05f0>] ? prepare_to_wait_event+0xf0/0xf0
[360861.219627]  [<ffffffffc008a368>] __cv_wait_io+0x18/0x20 [spl]
[360861.219660]  [<ffffffffc0373867>] zio_wait+0xf7/0x190 [zfs]
[360861.219686]  [<ffffffffc02f18c8>] dsl_pool_sync+0xa8/0x3e0 [zfs]
[360861.219715]  [<ffffffffc030e5c6>] spa_sync+0x3f6/0xcf0 [zfs]
[360861.219718]  [<ffffffff810a9c02>] ? default_wake_function+0x12/0x20
[360861.219749]  [<ffffffffc0321513>] txg_sync_thread+0x2e3/0x4d0 [zfs]
[360861.219779]  [<ffffffffc0321230>] ? txg_quiesce_thread+0x3e0/0x3e0 [zfs]
[360861.219783]  [<ffffffffc00853a0>] ? __thread_exit+0x20/0x20 [spl]
[360861.219788]  [<ffffffffc0085413>] thread_generic_wrapper+0x73/0x80 [spl]
[360861.219790]  [<ffffffff8109dee8>] kthread+0xd8/0xf0
[360861.219792]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.219795]  [<ffffffff817ecdb5>] ret_from_fork+0x55/0x80
[360861.219797]  [<ffffffff8109de10>] ? kthread_park+0x60/0x60
[360861.219809] INFO: task zfs:27315 blocked for more than 120 seconds.
[360861.223791]       Tainted: P           OE   4.4.0-1016-aws #16-Ubuntu
[360861.227735] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[360861.232534] zfs             D ffff88038ad4f758     0 27315  27311 0x00000000
[360861.232538]  ffff88038ad4f758 00000001002e0026 ffff8802dfc4a940 ffff88038ad50000
[360861.232540]  ffff88062f3e1cd4 ffff8802dfc4a940 00000000ffffffff ffff88062f3e1cd8
[360861.232543]  ffff88038ad4f770 ffffffff817e9075 ffff88062f3e1cd0 ffff88038ad4f780
[360861.232545] Call Trace:
[360861.232549]  [<ffffffff817e9075>] schedule+0x35/0x80
[360861.232551]  [<ffffffff817e930e>] schedule_preempt_disabled+0xe/0x10
[360861.232553]  [<ffffffff817eab95>] __mutex_lock_slowpath+0x95/0x110
[360861.232556]  [<ffffffff817eac2f>] mutex_lock+0x1f/0x2f
[360861.232577]  [<ffffffffc02bfbba>] dbuf_destroy+0x2ea/0x350 [zfs]
[360861.232596]  [<ffffffffc02bfcce>] dbuf_evict_one+0xae/0xd0 [zfs]
[360861.232616]  [<ffffffffc02bf795>] dbuf_rele_and_unlock+0x465/0x470 [zfs]
[360861.232619]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.232621]  [<ffffffff811e0b64>] ? kmem_cache_free+0x1d4/0x1f0
[360861.232640]  [<ffffffffc02bf8c6>] dbuf_rele+0x36/0x40 [zfs]
[360861.232658]  [<ffffffffc02bfb42>] dbuf_destroy+0x272/0x350 [zfs]
[360861.232682]  [<ffffffffc02df5dc>] dnode_evict_dbufs+0x17c/0x1a0 [zfs]
[360861.232704]  [<ffffffffc02cf8b6>] dmu_objset_evict_dbufs+0x1d6/0x200 [zfs]
[360861.232737]  [<ffffffffc035af9a>] zfsvfs_teardown+0x14a/0x290 [zfs]
[360861.232770]  [<ffffffffc035b200>] zfs_suspend_fs+0x10/0x20 [zfs]
[360861.232803]  [<ffffffffc03514d0>] zfs_ioc_recv_impl+0x560/0xff0 [zfs]
[360861.232806]  [<ffffffff810c7511>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[360861.232837]  [<ffffffffc0352135>] zfs_ioc_recv+0x1d5/0x2d0 [zfs]
[360861.232841]  [<ffffffff8118dde1>] ? get_page_from_freelist+0x421/0x920
[360861.232873]  [<ffffffffc034e572>] zfsdev_ioctl+0x5c2/0x600 [zfs]
[360861.232877]  [<ffffffff81214bb7>] do_vfs_ioctl+0x2e7/0x4d0
[360861.232880]  [<ffffffff81068e37>] ? __do_page_fault+0x1b7/0x430
[360861.232882]  [<ffffffff81214e19>] SyS_ioctl+0x79/0x90
[360861.232885]  [<ffffffff817ec948>] entry_SYSCALL_64_fastpath+0x1c/0xbb

Received a similar lockup on 0.7.7 a few weeks back, but on z_zvol.

Mar 27 13:36:55 hostname kernel: [562470.055797] INFO: task z_zvol:27225 blocked for more than 120 seconds.
[562470.058786]       Tainted: P           OE   4.4.0-1014-aws #14-Ubuntu
[562470.061611] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[562470.064829] z_zvol          D ffff8807cda2f8c8     0 27225      2 0x00000000
[562470.064834]  ffff8807cda2f8c8 ffffffffc04af8d9 ffff8807fa85d280 ffff8807cda30000
[562470.064836]  ffff880684f11b84 ffff8807fa85d280 00000000ffffffff ffff880684f11b88
[562470.064839]  ffff8807cda2f8e0 ffffffff817e7985 ffff880684f11b80 ffff8807cda2f8f0
[562470.064842] Call Trace:
[562470.064892]  [<ffffffffc04af8d9>] ? dsl_dataset_hold_obj+0xb9/0xa00 [zfs]
[562470.064896]  [<ffffffff817e7985>] schedule+0x35/0x80
[562470.064899]  [<ffffffff817e7c1e>] schedule_preempt_disabled+0xe/0x10
[562470.064901]  [<ffffffff817e9485>] __mutex_lock_slowpath+0x95/0x110
[562470.064904]  [<ffffffff817e951f>] mutex_lock+0x1f/0x2f
[562470.064926]  [<ffffffffc049bc4e>] dmu_objset_from_ds+0x2e/0x140 [zfs]
[562470.064947]  [<ffffffffc049bdc1>] dmu_objset_hold+0x61/0xb0 [zfs]
[562470.064954]  [<ffffffffc0382d5a>] ? spl_kmem_free+0x2a/0x40 [spl]
[562470.064979]  [<ffffffffc04c19a5>] dsl_prop_get+0x35/0x80 [zfs]
[562470.064983]  [<ffffffff810c7361>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[562470.065007]  [<ffffffffc04c1a0e>] dsl_prop_get_integer+0x1e/0x20 [zfs]
[562470.065040]  [<ffffffffc054ccdc>] zvol_create_minors_cb+0x2c/0x110 [zfs]
[562470.065061]  [<ffffffffc049a959>] dmu_objset_find_impl+0xf9/0x3e0 [zfs]
[562470.065094]  [<ffffffffc054ccb0>] ? zvol_set_volmode_sync_cb+0xb0/0xb0 [zfs]
[562470.065116]  [<ffffffffc049aa12>] dmu_objset_find_impl+0x1b2/0x3e0 [zfs]
[562470.065148]  [<ffffffffc054ccb0>] ? zvol_set_volmode_sync_cb+0xb0/0xb0 [zfs]
[562470.065170]  [<ffffffffc049aa12>] dmu_objset_find_impl+0x1b2/0x3e0 [zfs]
[562470.065201]  [<ffffffffc054ccb0>] ? zvol_set_volmode_sync_cb+0xb0/0xb0 [zfs]
[562470.065232]  [<ffffffffc054ccb0>] ? zvol_set_volmode_sync_cb+0xb0/0xb0 [zfs]
[562470.065253]  [<ffffffffc049ac89>] dmu_objset_find+0x49/0x70 [zfs]
[562470.065283]  [<ffffffffc054f833>] zvol_task_cb+0x503/0x570 [zfs]
[562470.065286]  [<ffffffff810c7361>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[562470.065289]  [<ffffffff810c7361>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[562470.065294]  [<ffffffffc0386aa4>] taskq_thread+0x2b4/0x4d0 [spl]
[562470.065298]  [<ffffffff810a9bb0>] ? wake_up_q+0x80/0x80
[562470.065302]  [<ffffffffc03867f0>] ? taskq_thread_spawn+0x50/0x50 [spl]
[562470.065306]  [<ffffffff8109dec8>] kthread+0xd8/0xf0
[562470.065308]  [<ffffffff8109ddf0>] ? kthread_park+0x60/0x60
[562470.065311]  [<ffffffff817eb6b5>] ret_from_fork+0x55/0x80
[562470.065313]  [<ffffffff8109ddf0>] ? kthread_park+0x60/0x60
mtippmann commented 6 years ago

Seeing something similar with 0.7.5-1ubuntu13 - happens after a few days of heavy buildbot activity on the machine.

# cat /etc/modprobe.d/zfs.conf 
options zfs zfs_vdev_cache_size=33554432
options zfs l2arc_write_max=157286400
options zfs l2arc_noprefetch=0
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_vdev_async_write_min_active=1
options zfs zfs_vdev_async_write_max_active=1
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_txg_timeout=30
options zfs zfs_prefetch_disable=0
options zfs l2arc_headroom=16
[1653669.107979] INFO: task make:30572 blocked for more than 120 seconds.
[1653669.114715]       Tainted: P           O     4.15.0-20-generic #21-Ubuntu
[1653669.121883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1653669.130149] make            D    0 30572  29560 0x00000100
[1653669.130153] Call Trace:
[1653669.130166]  __schedule+0x297/0x8b0
[1653669.130260]  ? dbuf_rele_and_unlock+0x4b0/0x4b0 [zfs]
[1653669.130304]  ? zio_null+0x2f/0x40 [zfs]
[1653669.130308]  schedule+0x2c/0x80
[1653669.130312]  io_schedule+0x16/0x40
[1653669.130323]  cv_wait_common+0xb2/0x140 [spl]
[1653669.130327]  ? wait_woken+0x80/0x80
[1653669.130332]  __cv_wait_io+0x18/0x20 [spl]
[1653669.130376]  zio_wait+0xf8/0x1b0 [zfs]
[1653669.130407]  dbuf_read+0x7cd/0x920 [zfs]
[1653669.130446]  dnode_hold_impl+0xef/0xbd0 [zfs]
[1653669.130491]  ? zfs_znode_hold_cache_constructor+0x24/0x50 [zfs]
[1653669.130496]  ? spl_kmem_cache_alloc+0xfc/0x140 [spl]
[1653669.130502]  ? avl_add+0x67/0x90 [zavl]
[1653669.130504]  ? _cond_resched+0x19/0x40
[1653669.130505]  ? mutex_lock+0x12/0x40
[1653669.130542]  dnode_hold+0x1b/0x20 [zfs]
[1653669.130575]  dmu_bonus_hold+0x34/0x1b0 [zfs]
[1653669.130618]  sa_buf_hold+0xe/0x10 [zfs]
[1653669.130662]  zfs_zget+0x11e/0x240 [zfs]
[1653669.130708]  zfs_dirent_lock+0x4bd/0x660 [zfs]
[1653669.130753]  zfs_dirlook+0x92/0x290 [zfs]
[1653669.130799]  zfs_lookup+0x21e/0x3a0 [zfs]
[1653669.130843]  zpl_lookup+0xc9/0x1e0 [zfs]
[1653669.130848]  lookup_slow+0xab/0x170
[1653669.130850]  walk_component+0x1c3/0x470
[1653669.130852]  path_lookupat+0x84/0x1f0
[1653669.130855]  ? profile_path_perm.part.7+0x78/0xa0
[1653669.130857]  filename_lookup+0xb6/0x190
[1653669.130859]  ? mutex_lock+0x12/0x40
[1653669.130862]  ? __check_object_size+0xaf/0x1b0
[1653669.130866]  ? strncpy_from_user+0x4d/0x170
[1653669.130868]  user_path_at_empty+0x36/0x40
[1653669.130869]  ? user_path_at_empty+0x36/0x40
[1653669.130872]  vfs_statx+0x76/0xe0
[1653669.130875]  SYSC_newstat+0x3d/0x70
[1653669.130879]  ? __secure_computing+0x3f/0x100
[1653669.130882]  ? syscall_trace_enter+0xd7/0x2d0
[1653669.130884]  SyS_newstat+0xe/0x10
[1653669.130886]  do_syscall_64+0x73/0x130
[1653669.130888]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
STRML commented 6 years ago

I have yet to have the behavior come back after upgrading from 0.7.8 to 0.7.9. Hope it stays that way.

EDIT: Nevermind, it's back again, same trace as above.

alek-p commented 6 years ago

we may have had a 0.7.2 server hit this one at @datto yesterday. No deadman stack traces, but I took a kernel thread dump and see a ton of threads stuck at dbuf_destroy()

kernel: [17722868.374295] txg_sync        D ffff883fc31777c8     0 11467      2 0x00000080
kernel: [17722868.374298]  ffff883fc31777c8 0000000000000018 ffff881ff2ed4600 ffff881fad0c6200
kernel: [17722868.374311]  ffff883fc3178000 ffff880a4c716c54 ffff881fad0c6200 00000000ffffffff
kernel: [17722868.374327]  ffff880a4c716c58 ffff883fc31777e0 ffffffff8183f0d5 ffff880a4c716c50
kernel: [17722868.374342] Call Trace:
kernel: [17722868.374348]  [<ffffffff8183f0d5>] schedule+0x35/0x80
kernel: [17722868.374354]  [<ffffffff8183f37e>] schedule_preempt_disabled+0xe/0x10
kernel: [17722868.374360]  [<ffffffff81840fb9>] __mutex_lock_slowpath+0xb9/0x130
kernel: [17722868.374366]  [<ffffffff8184104f>] mutex_lock+0x1f/0x30
kernel: [17722868.374409]  [<ffffffffc097f1f0>] dbuf_destroy+0x110/0x340 [zfs]
kernel: [17722868.374450]  [<ffffffffc097f4cb>] dbuf_evict_one+0xab/0xd0 [zfs]
kernel: [17722868.374492]  [<ffffffffc097f93d>] dbuf_rele_and_unlock+0x44d/0x470 [zfs]
kernel: [17722868.374548]  [<ffffffffc09b278c>] ? dsl_dir_diduse_space+0x12c/0x170 [zfs]
kernel: [17722868.374589]  [<ffffffffc097fa86>] dmu_buf_rele+0x36/0x40 [zfs]
kernel: [17722868.374628]  [<ffffffffc097d959>] bpobj_iterate_impl+0x299/0x760 [zfs]
kernel: [17722868.374678]  [<ffffffffc09ba430>] ? dsl_scan_zil_block+0x100/0x100 [zfs]
kernel: [17722868.374724]  [<ffffffffc099f733>] ? dnode_rele_and_unlock+0x53/0x80 [zfs]
kernel: [17722868.374766]  [<ffffffffc0988c69>] ? dmu_bonus_hold+0xb9/0x1a0 [zfs]
kernel: [17722868.374805]  [<ffffffffc097cc8a>] ? bpobj_open+0xba/0x100 [zfs]
kernel: [17722868.374809]  [<ffffffff81841042>] ? mutex_lock+0x12/0x30
kernel: [17722868.374848]  [<ffffffffc097da36>] bpobj_iterate_impl+0x376/0x760 [zfs]
kernel: [17722868.374897]  [<ffffffffc09ba430>] ? dsl_scan_zil_block+0x100/0x100 [zfs]
kernel: [17722868.374937]  [<ffffffffc097de34>] bpobj_iterate+0x14/0x20 [zfs]
kernel: [17722868.374986]  [<ffffffffc09bd23d>] dsl_scan_sync+0x56d/0xc70 [zfs]
kernel: [17722868.374993]  [<ffffffffc04a3af5>] ? spl_kmem_cache_free+0x135/0x1c0 [spl]
kernel: [17722868.375055]  [<ffffffffc0a3bd46>] ? zio_wait+0x146/0x1b0 [zfs]
kernel: [17722868.375109]  [<ffffffffc09d3b1f>] spa_sync+0x46f/0xda0 [zfs]
kernel: [17722868.375114]  [<ffffffff810c3df4>] ? __wake_up+0x44/0x50
kernel: [17722868.375170]  [<ffffffffc09e6882>] txg_sync_thread+0x2e2/0x4b0 [zfs]
kernel: [17722868.375226]  [<ffffffffc09e65a0>] ? txg_delay+0x170/0x170 [zfs]
kernel: [17722868.375233]  [<ffffffffc04a44a0>] ? __thread_exit+0x20/0x20 [spl]
kernel: [17722868.375241]  [<ffffffffc04a4511>] thread_generic_wrapper+0x71/0x80 [spl]
kernel: [17722868.375245]  [<ffffffff810a0c95>] kthread+0xe5/0x100
kernel: [17722868.375248]  [<ffffffff810a0bb0>] ? kthread_create_on_node+0x1e0/0x1e0
kernel: [17722868.375253]  [<ffffffff8184358f>] ret_from_fork+0x3f/0x70
kernel: [17722868.375256]  [<ffffffff810a0bb0>] ? kthread_create_on_node+0x1e0/0x1e0
tcaputi commented 6 years ago

The issue is still present in 0.7.8. However, the auto-rebooter got to the machine before I could collect meaningful debug info.

alek-p commented 6 years ago

looks like we may need to backport https://github.com/zfsonlinux/zfs/pull/7602 to 0.7+

behlendorf commented 5 years ago

Yes it looks that way, I've added #7602 to the list for 0.7.13.

cwedgwood commented 5 years ago

@behlendorf you're confident this is the last fix needed here? if that is the case i will look at switching my tree over and test

behlendorf commented 5 years ago

@cwedgwood yes, as far as we're aware.

cwedgwood commented 5 years ago

@behlendorf this looks to be resolved using a more recent version from git