openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

FSX mmap failure #2976

Closed: behlendorf closed this issue 9 years ago

behlendorf commented 9 years ago

@dweeezil the following rare failure was accidentally introduced by commit 223df0161fad50f53a8fa5ffeea8cc4f8137d522 and still exists as of zfs-0.6.3-159-gc944be5. It causes an occasional xfstests failure during the regression tests (a sketch of the access pattern fsx is flagging follows the log below). I'm filing this issue so we can track it, but I have not yet investigated it seriously.

127  - output mismatch (see 127.out.bad)
--- 127.out 2014-12-17 15:16:05.986205001 -0800
+++ 127.out.bad 2014-12-17 15:40:07.042205001 -0800
@@ -10,4 +10,1041 @@
 === FSX Standard Mode, No Memory Mapping ===
 All operations completed A-OK!
 === FSX Standard Mode, Memory Mapping ===
-All operations completed A-OK!
+ltp/fsx -f -q -l 262144 -o 65536 -S 191110531 -N 100000 fsx_std_mmap
+READ BAD DATA: offset = 0x2a413, size = 0x6c50, fname = /tank/xfstests/fsx_std_mmap
+OFFSET GOOD    BAD RANGE
+0x2d000    0x0000  0x8393  0x    0
+operation# (mod 256) for the bad data may be 131
+0x2d001    0x0000  0x9383  0x    1
+operation# (mod 256) for the bad data may be 131
+0x2d002    0x0000  0x8348  0x    2
+operation# (mod 256) for the bad data may be 131
+0x2d003    0x0000  0x4883  0x    3
+operation# (mod 256) for the bad data may be 131
+0x2d004    0x0000  0x8373  0x    4
+operation# (mod 256) for the bad data may be 131
+0x2d005    0x0000  0x7383  0x    5
+operation# (mod 256) for the bad data may be 131
+0x2d006    0x0000  0x8357  0x    6
+operation# (mod 256) for the bad data may be 131
+0x2d007    0x0000  0x5783  0x    7
+operation# (mod 256) for the bad data may be 131
+0x2d008    0x0000  0x83ff  0x    8
+operation# (mod 256) for the bad data may be 131
+0x2d009    0x0000  0xff83  0x    9
+operation# (mod 256) for the bad data may be 131
+0x2d00a    0x0000  0x83f2  0x    a
+operation# (mod 256) for the bad data may be 131
+0x2d00b    0x0000  0xf283  0x    b
+operation# (mod 256) for the bad data may be 131
+0x2d00c    0x0000  0x8355  0x    c
+operation# (mod 256) for the bad data may be 131
+0x2d00d    0x0000  0x5583  0x    d
+operation# (mod 256) for the bad data may be 131
+0x2d00e    0x0000  0x83d5  0x    e
+operation# (mod 256) for the bad data may be 131
+0x2d00f    0x0000  0xd583  0x    f
+operation# (mod 256) for the bad data may be 131
+LOG DUMP (2961 total operations):
+2962(146 mod 256): WRITE    0x165fa thru 0x1ef39   (0x8940 bytes)
+2963(147 mod 256): TRUNCATE DOWN   from 0x2610a to 0x2513e
+2964(148 mod 256): MAPWRITE 0x223d thru 0x11114    (0xeed8 bytes)
+2965(149 mod 256): MAPREAD  0x5a20 thru 0x75c9 (0x1baa bytes)
+2966(150 mod 256): SKIPPED (no operation)
+2967(151 mod 256): READ     0x1536d thru 0x22562   (0xd1f6 bytes)
+2968(152 mod 256): SKIPPED (no operation)
+2969(153 mod 256): SKIPPED (no operation)
+2970(154 mod 256): WRITE    0x118a thru 0x9291 (0x8108 bytes)
+2971(155 mod 256): READ     0x7da thru 0x6048  (0x586f bytes)
+2972(156 mod 256): SKIPPED (no operation)
+2973(157 mod 256): SKIPPED (no operation)
+2974(158 mod 256): MAPREAD  0xf4e2 thru 0x19f42    (0xaa61 bytes)
+2975(159 mod 256): WRITE    0xbd9b thru 0x12ba3    (0x6e09 bytes)
+2976(160 mod 256): TRUNCATE DOWN   from 0x2513e to 0x764
+2977(161 mod 256): SKIPPED (no operation)
+2978(162 mod 256): MAPREAD  0x1ba thru 0x763   (0x5aa bytes)
+2979(163 mod 256): SKIPPED (no operation)
+2980(164 mod 256): MAPREAD  0xda thru 0x763    (0x68a bytes)
+2981(165 mod 256): SKIPPED (no operation)
+2982(166 mod 256): TRUNCATE UP from 0x764 to 0x34785   ******WWWW
+2983(167 mod 256): MAPREAD  0x2495c thru 0x2f1b6   (0xa85b bytes)  ***RRRR***
+2984(168 mod 256): MAPREAD  0x3222a thru 0x34784   (0x255b bytes)
+2985(169 mod 256): READ     0x5955 thru 0x1402c    (0xe6d8 bytes)
+2986(170 mod 256): SKIPPED (no operation)
+2987(171 mod 256): READ     0x9544 thru 0x13a97    (0xa554 bytes)
+2988(172 mod 256): SKIPPED (no operation)
+2989(173 mod 256): MAPREAD  0x2bbde thru 0x30a4d   (0x4e70 bytes)  ***RRRR***
+2990(174 mod 256): READ     0x29d7f thru 0x34784   (0xaa06 bytes)  ***RRRR***
+2991(175 mod 256): MAPREAD  0x28da9 thru 0x2ca71   (0x3cc9 bytes)
+2992(176 mod 256): SKIPPED (no operation)
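
For reference, the interesting tail of the log is the TRUNCATE UP at operation 2982 followed by MAPREADs that return nonzero (0x83) data in a region that should read back as zeros. Below is a minimal standalone sketch of that access pattern, as an illustration only: the /tank/fsx_sketch path is hypothetical, and real fsx interleaves thousands of operations before hitting this.

/*
 * Minimal sketch of the pattern fsx flags above (illustration only):
 * write nonzero data, TRUNCATE UP past EOF, then MAPREAD the extended
 * region.  POSIX requires the newly extended bytes to read back as
 * zeros; the failure above shows stale 0x83 data instead.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	const char *path = "/tank/fsx_sketch";	/* hypothetical test file */
	size_t small = 0x764, big = 0x34785;	/* sizes from the log above */
	char buf[0x764];
	unsigned char *map;
	int fd;

	if ((fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644)) < 0) {
		perror("open");
		return (1);
	}

	/* Give the file nonzero content, then TRUNCATE UP past old EOF. */
	memset(buf, 0x83, sizeof (buf));
	if (write(fd, buf, sizeof (buf)) != (ssize_t)sizeof (buf) ||
	    ftruncate(fd, big) < 0) {
		perror("write/ftruncate");
		return (1);
	}

	/* MAPREAD: every byte beyond the old EOF must now be zero. */
	map = mmap(NULL, big, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return (1);
	}
	for (size_t off = small; off < big; off++) {
		if (map[off] != 0) {
			printf("READ BAD DATA: offset = 0x%zx\n", off);
			break;
		}
	}

	munmap(map, big);
	close(fd);
	return (0);
}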
behlendorf commented 9 years ago

This is again xfstests 127, but in this case a deadlock was observed. It appears that we have a rare (but possible) lock inversion here between the ZFS range lock and the individual page writeback bits; the hung-task traces below show both sides, and a minimal userspace sketch of the pattern follows them.

zpl_writepages -> write_cache_pages -> zpl_putpage -> zfs_putpage (sets page writeback bit) -> zfs_range_lock (waits on ZFS range lock)

zpl_fallocate -> zpl_fallocate_common -> zfs_space -> zfs_freesp -> zfs_free_range (takes ZFS range lock) -> truncate_inode_pages_range -> wait_on_page_writeback (waits on page writeback bit)

[ 3720.484058] INFO: task kworker/u8:1:6307 blocked for more than 120 seconds.
[ 3720.484086]       Tainted: P           OX 3.13.0-37-generic #64-Ubuntu
[ 3720.484096] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3720.484109] kworker/u8:1    D ffff88007fd14480     0  6307      2 0x00000000
[ 3720.484118] Workqueue: writeback bdi_writeback_workfn (flush-zfs-22)
[ 3720.484121]  ffff8800578ff868 0000000000000046 ffff880056e2c800 ffff8800578fffd8
[ 3720.484125]  0000000000014480 0000000000014480 ffff880056e2c800 ffff880035d32e38
[ 3720.484128]  ffff880004a0d660 ffff880035d32e40 ffff880004a0d688 0000000000000000
[ 3720.484132] Call Trace:
[ 3720.484139]  [<ffffffff81723129>] schedule+0x29/0x70
[ 3720.484152]  [<ffffffffa01cc755>] cv_wait_common+0x125/0x270 [spl]
[ 3720.484156]  [<ffffffff810aafd0>] ? prepare_to_wait_event+0x100/0x100
[ 3720.484163]  [<ffffffffa01cc8b5>] __cv_wait+0x15/0x20 [spl]
[ 3720.484211]  [<ffffffffa0312cac>] zfs_range_lock+0x22c/0x6b0 [zfs]
[ 3720.484247]  [<ffffffffa0322beb>] zfs_putpage+0x10b/0x4d0 [zfs]
[ 3720.484280]  [<ffffffffa031e920>] ? zfs_link+0x480/0x480 [zfs]
[ 3720.484283]  [<ffffffff8172f4bf>] ? ftrace_call+0x5/0x2f
[ 3720.484286]  [<ffffffff8172f4bf>] ? ftrace_call+0x5/0x2f
[ 3720.484320]  [<ffffffffa033f498>] zpl_putpage+0x58/0x110 [zfs]
[ 3720.484324]  [<ffffffff81159575>] write_cache_pages+0x235/0x480
[ 3720.484356]  [<ffffffffa033f440>] ? zpl_iterate+0x80/0x80 [zfs]
[ 3720.484458]  [<ffffffffa033ec99>] zpl_writepages+0x99/0x190 [zfs]
[ 3720.484511]  [<ffffffff8115a99e>] do_writepages+0x1e/0x40
[ 3720.484516]  [<ffffffff811e6760>] __writeback_single_inode+0x40/0x210
[ 3720.484519]  [<ffffffff811e7517>] writeback_sb_inodes+0x247/0x3e0
[ 3720.484523]  [<ffffffff811e774f>] __writeback_inodes_wb+0x9f/0xd0
[ 3720.484526]  [<ffffffff811e79c3>] wb_writeback+0x243/0x2c0
[ 3720.484530]  [<ffffffff811e9382>] bdi_writeback_workfn+0x2a2/0x430
[ 3720.484536]  [<ffffffff810839c2>] process_one_work+0x182/0x450
[ 3720.484539]  [<ffffffff810847b1>] worker_thread+0x121/0x410
[ 3720.484542]  [<ffffffff81084690>] ? rescuer_thread+0x430/0x430
[ 3720.484546]  [<ffffffff8108b492>] kthread+0xd2/0xf0
[ 3720.484549]  [<ffffffff8108b3c0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3720.484552]  [<ffffffff8172f77c>] ret_from_fork+0x7c/0xb0
[ 3720.484555]  [<ffffffff8108b3c0>] ? kthread_create_on_node+0x1c0/0x1c0
[ 3720.484581] INFO: task fsx:19179 blocked for more than 120 seconds.
[ 3720.484592]       Tainted: P           OX 3.13.0-37-generic #64-Ubuntu
[ 3720.484602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3720.484612] fsx             D ffff88007fd94480     0 19179  19034 0x00000000
[ 3720.484615]  ffff88006d123b88 0000000000000086 ffff88007b626000 ffff88006d123fd8
[ 3720.484619]  0000000000014480 0000000000014480 ffff88007b626000 ffff88007fd94d18
[ 3720.484622]  ffff88007ffd55e8 0000000000000002 ffffffff8114e440 ffff88006d123c00
[ 3720.484626] Call Trace:
[ 3720.484631]  [<ffffffff8114e440>] ? wait_on_page_read+0x60/0x60
[ 3720.484635]  [<ffffffff8172344d>] io_schedule+0x9d/0x140
[ 3720.484637]  [<ffffffff8114e44e>] sleep_on_page+0xe/0x20
[ 3720.484640]  [<ffffffff817238d2>] __wait_on_bit+0x62/0x90
[ 3720.484644]  [<ffffffff8114e20f>] wait_on_page_bit+0x7f/0x90
[ 3720.484647]  [<ffffffff810ab010>] ? autoremove_wake_function+0x40/0x40
[ 3720.484650]  [<ffffffff8114eaf3>] ? unlock_page+0x23/0x30
[ 3720.484654]  [<ffffffff8115dbee>] truncate_inode_pages_range+0x2fe/0x5a0
[ 3720.484690]  [<ffffffffa03291cd>] zfs_freesp+0x4cd/0x530 [zfs]
[ 3720.484694]  [<ffffffff8172f4bf>] ? ftrace_call+0x5/0x2f
[ 3720.484817]  [<ffffffffa02fae69>] ? zfs_zaccess_common+0x169/0x1a0 [zfs]
[ 3720.484851]  [<ffffffffa032321b>] zfs_space+0x14b/0x1b0 [zfs]
[ 3720.484884]  [<ffffffffa033fc86>] zpl_fallocate_common+0xc6/0x150 [zfs]
[ 3720.484917]  [<ffffffffa033fd26>] zpl_fallocate+0x16/0x20 [zfs]
[ 3720.484920]  [<ffffffff811bb509>] do_fallocate+0xf9/0x190
[ 3720.484924]  [<ffffffff811bb5ec>] SyS_fallocate+0x4c/0x80
[ 3720.484927]  [<ffffffff8172f82d>] system_call_fastpath+0x1a/0x1f
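
For illustration, this is the classic ABBA deadlock pattern. Below is a minimal userspace sketch, with pthread mutexes standing in for the range lock and the per-page writeback bit; both are richer primitives in the kernel, so this models only the lock ordering, not the real locking.

/*
 * Userspace sketch of the ABBA inversion (illustration only): pthread
 * mutexes stand in for the ZFS range lock and the page writeback bit.
 * With unlucky timing the two threads block on each other forever,
 * as in the traces above.
 */
#include <pthread.h>

static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t writeback_bit = PTHREAD_MUTEX_INITIALIZER;

/* Writeback path: sets the writeback bit, then waits on the range lock. */
static void *
writeback_path(void *arg)
{
	pthread_mutex_lock(&writeback_bit);	/* set_page_writeback() */
	pthread_mutex_lock(&range_lock);	/* zfs_range_lock() blocks */
	pthread_mutex_unlock(&range_lock);
	pthread_mutex_unlock(&writeback_bit);
	return (NULL);
}

/* Truncate path: takes the range lock, then waits on the writeback bit. */
static void *
truncate_path(void *arg)
{
	pthread_mutex_lock(&range_lock);	/* zfs_free_range() */
	pthread_mutex_lock(&writeback_bit);	/* wait_on_page_writeback() blocks */
	pthread_mutex_unlock(&writeback_bit);
	pthread_mutex_unlock(&range_lock);
	return (NULL);
}

int
main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writeback_path, NULL);
	pthread_create(&b, NULL, truncate_path, NULL);
	pthread_join(a, NULL);	/* with the wrong interleaving, never returns */
	pthread_join(b, NULL);
	return (0);
}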
dweeezil commented 9 years ago

@behlendorf As to the first failure, I ran into something weird when trying to log fsx's operations with the 191110531 seed value: in the log dump above, operation 2962 is a write from 0x165fa through 0x1ef39, but when I run fsx by hand and collect a log of all operations, that particular write is operation 1962. I'm a bit mystified as to why there's a difference.

On the second sub-issue, nice catch on the lock inversion. I never encountered it in my testing, obviously. Isn't it zfs_putpage() rather than zpl_putpage() that sets the writeback bit in the zpl_writepages() path? Couldn't zfs_putpage() take the range lock a bit earlier?

behlendorf commented 9 years ago

I'm not quite sure what to make of that either, but I've certainly seen this for other seed values.

On the second issue you're right, it is zfs_putpage(); I've updated my comment above. As for refactoring the code to take the range lock a bit earlier, @ryao investigated this in #2413. The idea was to pull the lock up so it covered the entire call to zpl_writepages(). While this sounded like a good idea on the surface, once it was implemented and benchmarked it was clear there were cases where it adversely impacted performance, so it was abandoned.

But you're absolutely right, I don't see why we couldn't do something much more modest and just take the range lock a few lines earlier in zfs_putpage(). That way the writeback bit only gets set under the range lock and we don't have this inversion. It seems like a nice, easy fix, something like this (untested):

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index d05ccef..8834b05 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -3899,10 +3899,11 @@ zfs_putpage(struct inode *ip, struct page *pp, struct writeback_control *wbc)
        }
 #endif

+       rl = zfs_range_lock(zp, pgoff, pglen, RL_WRITER);
+
        set_page_writeback(pp);
        unlock_page(pp);

-       rl = zfs_range_lock(zp, pgoff, pglen, RL_WRITER);
        tx = dmu_tx_create(zsb->z_os);

        dmu_tx_hold_write(tx, zp->z_id, pgoff, pglen);
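
With the range lock taken before set_page_writeback(), both paths acquire the range lock first, so the truncate path can no longer sit on the range lock while waiting for a writeback bit whose owner is blocked on that same lock.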
dweeezil commented 9 years ago

@behlendorf Yes, that was exactly what I had in mind. As to the other approach, I was concerned that holding the lock across the entire zpl_writepages() call would have negative performance ramifications. It sounds like that concern was correct.

behlendorf commented 9 years ago

I've opened #2980 with the proposed patch above so it can roll through testing and we don't lose track of it. I'm still not sure about the original issue, but here's another example from the buildbot, which seems to be hitting it more frequently now.

http://buildbot.zfsonlinux.org/builders/ubuntu-13.10-amd64-builder/builds/2071/steps/shell_13/logs/stdio

behlendorf commented 9 years ago

While the lock inversion has been resolved, the original issue here still needs to be investigated.

kernelOfTruth commented 9 years ago

@behlendorf FYI: buildbot just hit this on: http://buildbot.zfsonlinux.org/builders/ubuntu-14.04-amd64-current-builder/builds/2365

https://github.com/zfsonlinux/zfs/pull/3216

behlendorf commented 9 years ago

Commit 21a96fb635fb14a59efe44d7ea6767c5f9a586bc, which is now in master, should address this issue. Unfortunately, I've never had much luck reproducing it, so we'll need to take a wait-and-see approach. I'm closing the issue, but we'll want to keep an eye on the automated testing to see if it ever recurs.