@BjoKaSH thanks for posting this. This does look like the second issue currently being discussed in #4106. The stacks are slightly different but show the deadlock more clearly. Here's the critical bit from the stacks you posted, with annotations for clarity.
txg_sync_thread
spa_sync
dsl_scan_sync
dsl_scan_visitds
dsl_scan_visitbp
...
arc_read - takes hash_lock
__cv_wait
cv_wait_common - increments waiters on b_cv
- drops hash_lock
- schedule() called and task woken
mutex_lock - waiting on hash_lock
arc_reclaim_thread
arc_adjust
arc_adjust_impl
arc_evict_state
arc_evict_state_impl - takes hash_lock
arc_hdr_destroy
spl_kmem_cache_free
hdr_full_dest
__cv_destroy - waiting for waiters to exit b_cv
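To make the interleaving concrete, here is a minimal userspace sketch of the same pattern (plain pthreads plus a hand-rolled waiter counter standing in for the SPL condvar; all names are illustrative and none of this is the actual ZFS/SPL code). It hangs by design, for the same reason as the stacks above: the waiter cannot drop its reference without re-taking the mutex that the destroyer holds while spinning.

```c
/* deadlock_sketch.c - illustrative only, not ZFS/SPL code.
 * Models the interleaving above with pthreads; compile with:
 *   gcc -pthread deadlock_sketch.c -o deadlock_sketch
 * WARNING: this program deadlocks on purpose. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the SPL condvar: a waiter count plus a destroy() that
 * spins until the count drops to zero (like __cv_destroy). */
typedef struct {
    pthread_cond_t cond;
    volatile int   waiters;   /* plays the role of cv_waiters */
} fake_cv_t;

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static fake_cv_t b_cv = { PTHREAD_COND_INITIALIZER, 0 };
static volatile int signalled;

/* Like cv_wait_common(): bump waiters, sleep, and only drop the waiter
 * reference after the caller's mutex has been re-acquired. */
static void fake_cv_wait(fake_cv_t *cv, pthread_mutex_t *mtx)
{
    cv->waiters++;
    while (!signalled)
        pthread_cond_wait(&cv->cond, mtx);  /* drops and re-takes mtx */
    cv->waiters--;  /* never reached: re-taking mtx blocks forever */
}

/* Like __cv_destroy(): wait until every waiter has left the cv. */
static void fake_cv_destroy(fake_cv_t *cv)
{
    while (cv->waiters != 0)
        usleep(1000);
}

static void *txg_sync(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&hash_lock);    /* arc_read takes hash_lock */
    fake_cv_wait(&b_cv, &hash_lock);   /* waits on b_cv            */
    pthread_mutex_unlock(&hash_lock);
    return NULL;
}

static void *arc_reclaim(void *arg)
{
    (void)arg;
    sleep(1);                           /* let txg_sync block first       */
    pthread_mutex_lock(&hash_lock);     /* arc_evict_state_impl takes it  */
    signalled = 1;
    pthread_cond_broadcast(&b_cv.cond); /* wake the waiter ...            */
    fake_cv_destroy(&b_cv);             /* ... then wait for it to leave,
                                           still holding hash_lock:
                                           deadlock                       */
    pthread_mutex_unlock(&hash_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, txg_sync, NULL);
    pthread_create(&b, NULL, arc_reclaim, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("no deadlock (you will not see this)");
    return 0;
}
```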
I think what has happened is clear; the mystery still lies in exactly how it happened. If you're familiar with the kgdb or crash utilities it would be helpful if you could dump the contents of both the hash_lock mutex and the b_l1hdr.b_cv as a starting point.
@behlendorf This is a pretty funny situation, but I think it is pretty easy to solve. We really don't need to hold the mutex until we release the ref count on the cvp. We can just use a spin lock private to the cvp to protect it.
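For comparison, here is a hedged sketch of that idea in the same userspace model used above: the waiter bookkeeping lives behind a lock owned by the cv itself, and the waiter drops its reference before re-taking the caller's mutex, so a destroyer holding that mutex can still see the count reach zero. This is only an illustration of the concept, not the actual SPL change.

```c
/* Illustrative only: the "cv-private lock" idea applied to the sketch
 * above.  Not the real SPL patch. */
#include <pthread.h>

typedef struct {
    pthread_mutex_t guard;     /* cv-private lock (the "private spin lock") */
    pthread_cond_t  cond;      /* wakes waiters                             */
    pthread_cond_t  drained;   /* wakes a destroyer once waiters == 0       */
    int             waiters;
} fixed_cv_t;

/* *done must be set non-zero under cv->guard before broadcasting cond. */
static void fixed_cv_wait(fixed_cv_t *cv, pthread_mutex_t *mtx, int *done)
{
    pthread_mutex_lock(&cv->guard);
    cv->waiters++;
    pthread_mutex_unlock(mtx);              /* drop the caller's mutex      */
    while (!*done)
        pthread_cond_wait(&cv->cond, &cv->guard);
    if (--cv->waiters == 0)
        pthread_cond_signal(&cv->drained);  /* release the destroyer        */
    pthread_mutex_unlock(&cv->guard);
    pthread_mutex_lock(mtx);                /* only now re-take caller's mutex */
}

static void fixed_cv_destroy(fixed_cv_t *cv)
{
    pthread_mutex_lock(&cv->guard);
    while (cv->waiters != 0)
        pthread_cond_wait(&cv->drained, &cv->guard);
    pthread_mutex_unlock(&cv->guard);
}
```

The key difference from the deadlocking version is that the waiter's bookkeeping no longer depends on the caller's mutex, so fixed_cv_destroy() can complete even while its caller is still holding that mutex.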
Sooo, I am probably too late, but anyway, here comes the data (all from the arc_reclaim task, pid 487):
hash_lock
(gdb) print *$hl
$155 = (kmutex_t *) 0xffffffffa02b4770 <buf_hash_table+312720>
(gdb) print **$hl
$156 = {m_mutex = {count = {counter = -4}, wait_lock = {{rlock = {raw_lock = {{head_tail = 524296, tickets = {head = 8,
tail = 8}}}}}}, wait_list = {next = 0xffff8803cc0bf410, prev = 0xffff8800c9ebb940},
owner = 0xffff8803f0d59800, spin_mlock = 0x0 <irq_stack_union>}, m_lock = {{rlock = {raw_lock = {{head_tail = 2590808684,
tickets = {head = 39532, tail = 39532}}}}}}, m_owner = 0xffff8803f0d59800}
(gdb) print (*$hl)->m_mutex.owner
$157 = (struct task_struct *) 0xffff8803f0d59800
(gdb) print (*$hl)->m_mutex.owner->comm
$158 = "arc_reclaim\000\000\000\000"
(gdb) print (*$hl)->m_mutex.owner->pid
$174 = 487
(gdb) print (*$hl)->m_mutex.wait_list
$159 = {next = 0xffff8803cc0bf410, prev = 0xffff8800c9ebb940}
(gdb) print (*$hl)->m_mutex.wait_list.next
$161 = (struct list_head *) 0xffff8803cc0bf410
(gdb) set $wm = (struct mutex_waiter *)(*$hl)->m_mutex.wait_list.next
(gdb) print $wm->task->comm
$162 = "txg_sync\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$164 = 10734
(gdb) print $wm->list.next
$165 = (struct list_head *) 0xffff8803502cd770
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$166 = "rsync\000it\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$167 = 16393
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$168 = "z_wr_iss\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$169 = 6447
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$170 = "find\000n\000t\000\000\000\000\000\000\000"
(gdb) print $wm->task->pid
$171 = 21416
(gdb) set $wm = (struct mutex_waiter *) $wm->list.next
(gdb) print $wm->task->comm
$172 = "arc_reclaim\000\000\000\000"
(gdb) print $wm->task->pid
$173 = 487
hdr and b_l1hdr.b_cv
(gdb) print $hdr
$177 = (arc_buf_hdr_t *) 0xffff8802f2dc9c30
(gdb) print *$hdr
$178 = {b_dva = {dva_word = {0, 0}}, b_birth = 0, b_freeze_cksum = 0x0 <irq_stack_union>, b_hash_next = 0x0 <irq_stack_union>,
b_flags = (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT | ARC_FLAG_BUFC_METADATA | ARC_FLAG_HAS_L1HDR), b_size = 16384,
b_spa = 6460952436296833629, b_l2hdr = {b_dev = 0x0 <irq_stack_union>, b_daddr = 0, b_hits = 0, b_asize = 0,
b_compress = 0 '\000', b_l2node = {next = 0xdead000000100100, prev = 0xdead000000200200}}, b_l1hdr = {b_freeze_lock = {
m_mutex = {count = {counter = 1}, wait_lock = {{rlock = {raw_lock = {{head_tail = 0, tickets = {head = 0,
tail = 0}}}}}}, wait_list = {next = 0xffff8802f2dc9ca0, prev = 0xffff8802f2dc9ca0},
owner = 0x0 <irq_stack_union>, spin_mlock = 0x0 <irq_stack_union>}, m_lock = {{rlock = {raw_lock = {{head_tail = 0,
tickets = {head = 0, tail = 0}}}}}}, m_owner = 0x0 <irq_stack_union>}, b_buf = 0x0 <irq_stack_union>,
b_datacnt = 0, b_cv = {cv_magic = 879052277, cv_event = {lock = {{rlock = {raw_lock = {{head_tail = 262148, tickets = {
head = 4, tail = 4}}}}}}, task_list = {next = 0xffff8802f2dc9cf0, prev = 0xffff8802f2dc9cf0}},
cv_destroy = {lock = {{rlock = {raw_lock = {{head_tail = 1244940852, tickets = {head = 18996, tail = 18996}}}}}},
task_list = {next = 0xffff8803f0d4fc70, prev = 0xffff8803f0d4fc70}}, cv_refs = {counter = 1}, cv_waiters = {
counter = 1}, cv_mutex = 0xffffffffa02b4770 <buf_hash_table+312720>}, b_state = 0xffffffffa02e85e0 <ARC_anon>,
b_arc_node = {next = 0xdead000000100100, prev = 0xdead000000200200}, b_arc_access = 4376410910, b_mru_hits = 1,
b_mru_ghost_hits = 1, b_mfu_hits = 0, b_mfu_ghost_hits = 0, b_l2_hits = 0, b_refcnt = {rc_count = 0},
b_acb = 0x0 <irq_stack_union>, b_tmp_cdata = 0x0 <irq_stack_union>}}
b_l1hdr.b_cv
(gdb) print $hdr->b_l1hdr.b_cv
$180 = {cv_magic = 879052277, cv_event = {lock = {{rlock = {raw_lock = {{head_tail = 262148, tickets = {head = 4,
tail = 4}}}}}}, task_list = {next = 0xffff8802f2dc9cf0, prev = 0xffff8802f2dc9cf0}}, cv_destroy = {lock = {{
rlock = {raw_lock = {{head_tail = 1244940852, tickets = {head = 18996, tail = 18996}}}}}}, task_list = {
next = 0xffff8803f0d4fc70, prev = 0xffff8803f0d4fc70}}, cv_refs = {counter = 1}, cv_waiters = {counter = 1},
cv_mutex = 0xffffffffa02b4770 <buf_hash_table+312720>}
(gdb) print $hdr->b_l1hdr.b_cv.cv_event.task_list.next
$182 = (struct list_head *) 0xffff8802f2dc9cf0
(gdb) print &$hdr->b_l1hdr.b_cv.cv_event.task_list.next
$186 = (struct list_head **) 0xffff8802f2dc9cf0
cv_event.task_list is empty.
cv_destroy.task_list contains a single node, but I don't know what it refers to. What is the true type that $hdr->b_l1hdr.b_cv.cv_destroy.task_list.next points to, i.e. what structure is the list_head embedded in?
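For what it's worth, I believe on a 3.13-era kernel those nodes are the task_list members embedded in ordinary wait-queue entries (struct __wait_queue), whose private field points back to the waiting task_struct, so recovering the container is the usual offsetof/container_of arithmetic. A small userspace illustration of that pointer math follows (the struct layout only mimics the old kernel definition; this is not the actual kernel header):

```c
/* container_sketch.c - illustrative pointer math only, not kernel code.
 * Shows how an embedded list_head maps back to its containing
 * wait-queue entry.  Compile: gcc container_sketch.c -o container_sketch */
#include <stddef.h>
#include <stdio.h>

struct list_head { struct list_head *next, *prev; };

/* Approximate layout of a 3.13-era wait-queue entry (struct __wait_queue). */
struct wait_queue_entry {
    unsigned int      flags;
    void             *private;   /* -> waiting task_struct                  */
    void             *func;      /* wake function, simplified to void* here */
    struct list_head  task_list; /* what cv_destroy.task_list points at     */
};

/* container_of: subtract the member offset to recover the container. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
    struct wait_queue_entry w = { 0 };
    struct list_head *node = &w.task_list;   /* like the address in the dump */
    struct wait_queue_entry *e =
        container_of(node, struct wait_queue_entry, task_list);
    printf("offset of task_list: %zu, container recovered: %d\n",
           offsetof(struct wait_queue_entry, task_list), e == &w);
    return 0;
}
```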
Took a bit longer to locate the data. It turned out kgdb doesn't work with that box because it doesn't have a serial port, so I had to use plain GDB against /proc/kcore and "unwind" the stacks manually (by looking at "disass ..." output and memory dumps of the stack pages), which is a bit messy.
The box is still alive, so I can try to retrieve more data if still needed.
@BjoKaSH ping
Do you have the possibility to modify and try out alternative deb packages?
If yes, there were several fixes mentioned in #4106; https://github.com/zfsonlinux/zfs/issues/3979#issuecomment-171649725 is a summary of those.
@kernelOfTruth I have tried once in a while, but Debian packages are largely black magic to me, and anything dkms-related is deep black magic. Nevertheless I'll try again in the next few days.
Right now I wonder if there is a way to untangle the two threads so that I can get a clean shutdown instead of just power-cycling the box, or whether any attempt to modify the state would inevitably cause disaster.
@behlendorf were the contents of the hash_lock mutex and the b_l1hdr.b_cv above still useful, or do you need anything else?
@BjoKaSH thanks for the debugging, we got what we needed from the back traces. The fix for this has been merged to the zfs master branch. You can roll custom Debian packages if you need to:
http://zfsonlinux.org/generic-deb.html
zfsonlinux/spl@e843553 Don't hold mutex until release cv in cv_wait
c96c36f Fix zsb->z_hold_mtx deadlock
0720116 Add zfs_object_mutex_size module option
@BjoKaSH it looks like there are weekly builds available at https://launchpad.net/~zfs-native/+archive/ubuntu/daily
So I assume the next build should include these fixes and the latest master changes.
After upgrading to 0.6.5.3 (Ubuntu 14.04 with ZFS 0.6.5.3-1~trusty), pool I/O hangs on moderate load with arc_reclaim in state "D" (but slowly spinning).
This may be related to issue #4106, although the workload and stack traces are different. In my case arc_reclaim is using very little CPU (~1%) and no NFS is involved. The stack traces of processes accessing the pool contain arc_read and, further down, buf_hash_find.
The box is still running and at least partially responsive. If helpful, I can (try to) attach (k)gdb and try to get some more insights on the internal state of ZFS. But I would need some guidance as I am unfamiliar with the ZoL code base.
Workload causing hang:
The box was running "rsync ... --link-dest /some/zfs/dir /some/zfs/otherdir", receiving a file tree of some 600,000 files and directories from a remote box. The involved ZFS dataset is deduped with a dedup ratio of 2.1.
System state:
The ZFS pool is a 2-way mirror:
The box has a second pool, but that was not used at the time. I have not tried to touch the second pool in any way in order to prevent further lock-up. Its state is unknown.
The box has 16GB of RAM, of which 5GB are currently free
Kernel is 3.13.0-71-generic #114-Ubuntu
The box has one CPU with 2 cores. One is idle, the other is sitting in "wait" state; atop output:
The load of "7.03" corresponds to the seven tasks in state "D" (excerpt from "top"):
The hanging zpool command was trying to start a scrub of the pool (run by cron). The find was started manually to look for some files; its failure to come back led to the discovery of the hang. The rsync (run by cron) got stuck some days earlier, but went unnoticed until the interactive find didn't respond. Interestingly, I could "ls" the base of the tree searched by find right before starting it; maybe the data was already in the cache.
There are 935 ZFS related tasks:
I have no idea where the z_unmount comes from. There should have been no mount / umount activity going on.
Relevant stack traces according to sysrq-t
arc_reclaim : stack trace (1) (Yes, it has been hanging for at least two weeks - I was away and couldn't investigate until now.)
z_wr_iss : stack trace (2)
txg_sync -> stack trace (3, the short running)
txg_sync -> stack trace (4, the long running)
rsync
find