Closed behlendorf closed 9 years ago
This is again xfstests 127 but in this case a deadlock was observed. It appears that we basically have a rare (but possible) lock inversion here. between the zfs range lock and individual page writeback bits.
zpl_writepages write_cache_pages zpl_putpage zfs_putpage (set page writeback bit) zfs_range_lock (waits on zfs range lock)
zpl_fallocate zpl_fallocate_common zfs_space zfs_freesp zfs_free_range (takes zfs range lock) truncate_inode_pages_range wait_on_page_writeback (waits on page writeback bit)
[ 3720.484058] INFO: task kworker/u8:1:6307 blocked for more than 120 seconds.
[ 3720.484086] Tainted: P OX 3.13.0-37-generic #64-Ubuntu
[ 3720.484096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3720.484109] kworker/u8:1 D ffff88007fd14480 0 6307 2 0x00000000
[ 3720.484118] Workqueue: writeback bdi_writeback_workfn (flush-zfs-22)
[ 3720.484121] ffff8800578ff868 0000000000000046 ffff880056e2c800 ffff8800578fffd8
[ 3720.484125] 0000000000014480 0000000000014480 ffff880056e2c800 ffff880035d32e38
[ 3720.484128] ffff880004a0d660 ffff880035d32e40 ffff880004a0d688 0000000000000000
[ 3720.484132] Call Trace:
[ 3720.484139] [<ffffffff81723129>] schedule+0x29/0x70
[ 3720.484152] [<ffffffffa01cc755>] cv_wait_common+0x125/0x270 [spl]
[ 3720.484156] [<ffffffff810aafd0>] ? prepare_to_wait_event+0x100/0x100
[ 3720.484163] [<ffffffffa01cc8b5>] __cv_wait+0x15/0x20 [spl]
[ 3720.484211] [<ffffffffa0312cac>] zfs_range_lock+0x22c/0x6b0 [zfs]
[ 3720.484247] [<ffffffffa0322beb>] zfs_putpage+0x10b/0x4d0 [zfs]
[ 3720.484280] [<ffffffffa031e920>] ? zfs_link+0x480/0x480 [zfs]
[ 3720.484283] [<ffffffff8172f4bf>] ? ftrace_call+0x5/0x2f
[ 3720.484286] [<ffffffff8172f4bf>] ? ftrace_call+0x5/0x2f
[ 3720.484320] [<ffffffffa033f498>] zpl_putpage+0x58/0x110 [zfs]
[ 3720.484324] [<ffffffff81159575>] write_cache_pages+0x235/0x480
[ 3720.484356] [<ffffffffa033f440>] ? zpl_iterate+0x80/0x80 [zfs]
[ 3720.484458] [<ffffffffa033ec99>] zpl_writepages+0x99/0x190 [zfs]
[ 3720.484511] [<ffffffff8115a99e>] do_writepages+0x1e/0x40
[ 3720.484516] [<ffffffff811e6760>] __writeback_single_inode+0x40/0x210
[ 3720.484519] [<ffffffff811e7517>] writeback_sb_inodes+0x247/0x3e0
[ 3720.484523] [<ffffffff811e774f>] __writeback_inodes_wb+0x9f/0xd0
[ 3720.484526] [<ffffffff811e79c3>] wb_writeback+0x243/0x2c0
[ 3720.484530] [<ffffffff811e9382>] bdi_writeback_workfn+0x2a2/0x430
[ 3720.484536] [<ffffffff810839c2>] process_one_work+0x182/0x450
[ 3720.484539] [<ffffffff810847b1>] worker_thread+0x121/0x410
[ 3720.484542] [<ffffffff81084690>] ? rescuer_thread+0x430/0x430
[ 3720.484546] [<ffffffff8108b492>] kthread+0xd2/0xf0
[ 3720.484549] [<ffffffff8108b3c0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3720.484552] [<ffffffff8172f77c>] ret_from_fork+0x7c/0xb0
[ 3720.484555] [<ffffffff8108b3c0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3720.484581] INFO: task fsx:19179 blocked for more than 120 seconds.
[ 3720.484592] Tainted: P OX 3.13.0-37-generic #64-Ubuntu
[ 3720.484602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3720.484612] fsx D ffff88007fd94480 0 19179 19034 0x00000000
[ 3720.484615] ffff88006d123b88 0000000000000086 ffff88007b626000 ffff88006d123fd8
[ 3720.484619] 0000000000014480 0000000000014480 ffff88007b626000 ffff88007fd94d18
[ 3720.484622] ffff88007ffd55e8 0000000000000002 ffffffff8114e440 ffff88006d123c00
[ 3720.484626] Call Trace:
[ 3720.484631] [<ffffffff8114e440>] ? wait_on_page_read+0x60/0x60
[ 3720.484635] [<ffffffff8172344d>] io_schedule+0x9d/0x140
[ 3720.484637] [<ffffffff8114e44e>] sleep_on_page+0xe/0x20
[ 3720.484640] [<ffffffff817238d2>] __wait_on_bit+0x62/0x90
[ 3720.484644] [<ffffffff8114e20f>] wait_on_page_bit+0x7f/0x90
[ 3720.484647] [<ffffffff810ab010>] ? autoremove_wake_function+0x40/0x40
[ 3720.484650] [<ffffffff8114eaf3>] ? unlock_page+0x23/0x30
[ 3720.484654] [<ffffffff8115dbee>] truncate_inode_pages_range+0x2fe/0x5a0
[ 3720.484690] [<ffffffffa03291cd>] zfs_freesp+0x4cd/0x530 [zfs]
[ 3720.484694] [<ffffffff8172f4bf>] ? ftrace_call+0x5/0x2f
[ 3720.484817] [<ffffffffa02fae69>] ? zfs_zaccess_common+0x169/0x1a0 [zfs]
[ 3720.484851] [<ffffffffa032321b>] zfs_space+0x14b/0x1b0 [zfs]
[ 3720.484884] [<ffffffffa033fc86>] zpl_fallocate_common+0xc6/0x150 [zfs]
[ 3720.484917] [<ffffffffa033fd26>] zpl_fallocate+0x16/0x20 [zfs]
[ 3720.484920] [<ffffffff811bb509>] do_fallocate+0xf9/0x190
[ 3720.484924] [<ffffffff811bb5ec>] SyS_fallocate+0x4c/0x80
[ 3720.484927] [<ffffffff8172f82d>] system_call_fastpath+0x1a/0x1f
@behlendorf As to the first failure, I ran into something weird when trying to get a log of fsx's operations when using the 191110531 seed value: In the log dump above, operation 2962 is a write from 0x165fa through 0x1ef39. When I run fsx by hand and collect the log of all operations, that particular write is operation 1962. I'm a bit mystified as to why there's a difference.
In the second sub-issue, nice catch on the lock inversion. I never encountered it in my testing, obviously. Isn't it zfs_putpage()
rather than zpl_putpage()
which sets the writeback bit in the zpl_writepages()
path? Couldn't zfs_putpage()
take the range lock a bit earlier?
I'm not quite sure what to make of that either. But I've certainly seen this for other seed values.
On the second issue your right it is zfs_putpage()
, I've updated my comment above. As for re-factoring the code to take the range lock a bit earlier @ryao investigated this in #2413. The idea was to pull it up so it covered the entire call to zpl_writepages()
. While this sounded like a good idea on the surface. After getting it implemented and benchmarked it was clear there were cases where this adversely impacted performance so it was abandoned.
But you're absolutely right, I don't see why we couldn't do something much more modest and just take the range lock a few lines earlier in zfs_putpage()
. That way the writeback bit only get's set under the range lock and we don't have this inversion. Seems like a nice easy fix, something like this (untested):
diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index d05ccef..8834b05 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -3899,10 +3899,11 @@ zfs_putpage(struct inode *ip, struct page *pp, struct writeback_control *wbc)
}
#endif
+ rl = zfs_range_lock(zp, pgoff, pglen, RL_WRITER);
+
set_page_writeback(pp);
unlock_page(pp);
- rl = zfs_range_lock(zp, pgoff, pglen, RL_WRITER);
tx = dmu_tx_create(zsb->z_os);
dmu_tx_hold_write(tx, zp->z_id, pgoff, pglen);
@behlendorf Yes, that was exactly what I had in mind. As to the other approach, I was concerned that covering the entire zpl_writepages()
would have some negative performance ramifications. It sounds like my concern was correct.
I've opened #2980 with the proposed patch above so it can roll through testing and we don't loose track of it. The original issue I'm still not sure about but here's another example from the buildbot which seems to be hitting the issue more frequently now.
While the lock inversion has been resolved. The original issue here still needs to be investigated.
@behlendorf FYI: buildbot just hit this on: http://buildbot.zfsonlinux.org/builders/ubuntu-14.04-amd64-current-builder/builds/2365
Commit 21a96fb635fb14a59efe44d7ea6767c5f9a586bc which is now in master should address this issue. Unfortunately, I've never had much luck being able to reproduce this so we'll need to take a wait and see approach. I'm closing the issue but we'll want to keep an eye on the automated testing to see if it ever reoccurs.
@dweeezil the following rare failure was accidentally introduced by commit 223df0161fad50f53a8fa5ffeea8cc4f8137d522 and still exists as of zfs-0.6.3-159-gc944be5. It causes an occasional xfstests failure during the regression tests. I'm filing this issue so we can track it but I have not yet seriously investigated the issue.