Closed: thulle closed this 6 months ago
Well, falling back to generic_file_llseek still breaks in the BRT case, as @rrevans said above, but we appear to no longer break in the BRT-disabled case for that, suggesting at least two issues at hand, unless it's just so rare now that I've not hit it yet.
@rrevans what do you think about short-circuiting to that for the SEEK_DATA case, assuming Linux's llseek has correct visibility into state, and encouraging people to leave bclone disabled for now? Can you think of any cases it doesn't cover?
I would appreciate it if people could take the current state of #16000 for a spin - I haven't been able to reproduce the wrong behavior again on that branch with any of the above repros, though I'm still running them to build more confidence.
@rincebrain As I currently understand it, the problem for non-BRT is that sometimes mmap page dirtiness doesn't reach the DMU before dnode_is_dirty. IIRC generic_file_llseek implements "the whole file is data" based on the inode size, which so far seems correct. Given that ZFS has no special handling for SEEK_END, it should work to disable hole seeks.
That said, that approach is a regression for sparse copies without mmap, and I'm not sure we need to break that case.
I'm thoroughly convinced that the original fix works for the original problem and its root cause, though it's still got theoretical data races which I think are unreachable in practice. I've been studying this space and all the code ever since in pursuit of an implementation that doesn't need to force txg syncs because I think ZFS shouldn't have limits in this regard (and I have working code that indeed does that and passes ZTS, ztest, and my own stress tests).
While #16000 is super conservative, I'd vote for adding code that keeps llseek functional EXCEPT for files with mappings (dirty or not), which is an easy test to add. Maybe leave in the option to fully disable it, as cheap insurance for users in case this analysis is wrong somehow...?
Given the history, I'm really in favor of "super conservative" at the moment. I'm not opposed to other improvements, obviously, but given that distros tend to ship releases and take forever to cherrypick fixes from the future if they won't update (if they do it at all), I'm really in favor of safety models that don't require assuming people run a newer version if we're wrong about it being safe.
Also, I have a similar document detailing the history, with commits, so it's interesting to read someone else's deep dive, thank you. A little curious why you prepared it, but.
@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with zfs_dmu_offset_next_sync=0?
Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases, including this one, if memory serves.
Breaks on 5.10.0-27 with zfs_dmu_offset_next_sync=0 and bclone_enabled=0.
Files x.1109 and x.1109.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1116 and x.1116.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1839 and x.1839.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1874 and x.1874.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1615 and x.1615.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1961 and x.1961.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
e: Were you actually asking about 2.1.5, or no? Because bclone only went in on 2.2, so that setting wouldn't exist there...
Same applies to 2.1.5 stock with either zfs_dmu_offset_next_sync value. With 1:
Files x.1816 and x.1816.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
With 0:
Files x.1958 and x.1958.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.639 and x.639.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1261 and x.1261.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1885 and x.1885.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with zfs_dmu_offset_next_sync=0? Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases including this one if memory serves.
That was weird. When reducing the files down to 200 (or 20) I couldn't trigger the bug at all, even with zfs_dmu_offset_next_sync=1.
I tested different numbers with zfs_dmu_offset_next_sync=1 and found 400 files to be the sweet spot where the script failed immediately.
Then I repeated the test with zfs_dmu_offset_next_sync=0 and indeed, it also failed but took a few minutes of running.
ZFS-Version: zfs-kmod-2.1.15-1
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4
Files x.346 and x.346.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.341 and x.341.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Well, falling back to generic_file_llseek still breaks in the BRT case, as @rrevans said above, but we appear to no longer break in the BRT-disabled case for that, suggesting at least two issues at hand, unless it's just so rare now that I've not hit it yet.
I don't think the problems at hand are possible without fancy llseek or block clone.
If we attempt a direct read, then the cached mapping pages are used as the source of truth of the file state instead of the dirty buffers in the DMU/ARC. This happens as long as cp is forced to actually read the file or have copy_file_range do it indirectly (which also DTRT - eventually the kernel calls splice_read which becomes zpl_iter_read which ultimately calls zfs_read; that in turn does a read of the pagecache for mapped ranges).
Thus, assuming those paths work, skipping fancy llseek and reflink is always going to be correct and avoid any dependence on writeback or flushing dirty pages to the DMU.
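To make the claim above concrete: a copier that never probes for holes and never reflinks just pulls every byte through the normal read path, so it cannot be fooled by stale hole metadata. A minimal sketch of such a plain copier (hypothetical illustration, not the coreutils implementation):

```python
import os


def plain_copy(src: str, dst: str, bufsize: int = 1 << 20) -> None:
    """Copy src to dst using plain reads and writes only.

    No SEEK_DATA/SEEK_HOLE probing and no reflink ioctls: the result
    depends only on what read() returns, and holes come back as zeros
    through the normal read path, so the copy is always byte-identical.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            buf = fin.read(bufsize)
            if not buf:
                break
            fout.write(buf)
```

The trade-off is exactly the one discussed above: the destination is dense (no sparse copies), but correctness no longer depends on writeback having flushed dirty mmap pages to the DMU.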
@rrevans what do you think about short-circuiting to that for the SEEK_DATA case, assuming Linux's llseek has correct visibility into state, and encouraging people to leave bclone disabled for now? Can you think of any cases it doesn't cover?
No, I can't think of any other cases. Fancy llseek getting it wrong is the proximate cause of all known problems pre-BRT. It's the same problem as #15526 but triggered by mmap instead of racing syncs. Edit: I'm also convinced Linux has the correct state already because the other seek flavors work, especially SEEK_END (see also O_APPEND maybe).
In short, while at least some older releases are affected, I'm convinced that triggering the writeback flavor of this bug for zfs_bclone_enabled=0 requires approximately all of:
1) a recent cp tool that is sparse-aware (plain reads do not trigger this)
2) a sparse file (heuristics will likely skip fancy llseek in dense files)
3) mmap used to make some changes to that file (plain writes do not trigger this)
4) copying the file soon after changing it (before normal writeback happens)
Note that for golang it seems to rewrite the outputs during build, so the files are briefly sparse during the build process but this is enough to create a window for (2): stat reports on-disk block usage from the prior moment the file was sparse if it happens to be flushed as part of a normal txg sync during the window where it is sparse. (I think this explains why golang repros are really hard pre-BRT: in addition to hitting a writeback race the system also needs to hit a txg flush in the small window that the output file is sparse.)
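The conditions above can be sketched end-to-end in userspace. This is not a corruption reproducer, just a demonstration of the sparse-file, mmap-write, then SEEK_DATA-probe pattern that a sparse-aware cp performs; all names here are illustrative:

```python
import mmap
import os
import tempfile


def mmap_then_probe(size: int = 4 << 20, data_at: int = 1 << 20) -> int:
    """Sparse file + mmap write + SEEK_DATA probe, mirroring what a
    sparse-aware copier does. Returns the first data offset found."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)                  # 2) a sparse file
        with mmap.mmap(fd, size) as mm:         # 3) changed via mmap
            mm[data_at:data_at + 4096] = b"A" * 4096
            mm.flush()                          # msync; skipping this is
                                                # the writeback race window
        # 4) probe soon after the change, as cp's sparse detection does
        return os.lseek(fd, 0, os.SEEK_DATA)
    finally:
        os.close(fd)
        os.unlink(path)


# On a correct filesystem the probe lands at or before the dirtied
# region; reporting a hole there instead is exactly the bug.
print("first data offset:", mmap_then_probe())
```

On filesystems without hole support the kernel falls back to treating the whole file as data, so the probe returns 0, which is also safe.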
Okay... kernel tracing to the rescue. TL;DR zfs_putpage fails on pushback from dsl_dir_tempreserve_space, and nothing in Linux retries the writeback after this happens (in part because the error is not propagated to the kernel).
I ran my repro with the zn_flush_cached_data(B_TRUE) patch in both zfs_holey_common and zfs_clone_range. Then I installed bpftrace probes focused on a specific DMU object to try and figure out why mmap flush is broken.
On flush, the kernel calls zpl_writepages which calls write_cache_pages twice, and that in turn calls zfs_putpage for each dirty page. I added a trace for each page and found that some of the blocks that were mismatching were ones being written in the second write_cache_pages call.
After a bit more poking, I gathered stacks for end_page_writeback and noticed that the traces included stacks with write_cache_pages, which shouldn't happen as the normal call happens during zil_commit callbacks:
@epw[
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_trampoline_6442492082+71
end_page_writeback+9
zfs_putpage+698
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
]: 4
@epw[
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_trampoline_6442492082+71
end_page_writeback+9
zfs_putpage+698
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
zpl_writepages+149
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
]: 5
@epw[
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_trampoline_6442492082+71
end_page_writeback+9
zfs_putpage+698
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
]: 13
@epw[
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_trampoline_6442492082+71
end_page_writeback+9
zfs_putpage+698
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+149
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
]: 22
@epw[
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_trampoline_6442492082+71
end_page_writeback+9
zil_itx_destroy+174
zil_lwb_flush_vdevs_done+212
zio_done+975
zio_execute+243
taskq_thread+510
kthread+232
ret_from_fork+52
ret_from_fork_asm+27
]: 992
@epw[
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
bpf_trampoline_6442492082+71
end_page_writeback+9
zil_itx_destroy+174
zil_itxg_clean+149
taskq_thread+510
kthread+232
ret_from_fork+52
ret_from_fork_asm+27
]: 2660
generated from
kfunc:end_page_writeback
{
if (args.page->mapping->host->i_ino == 3337) {
printf("end %d\n", args.page->index);
@epw[kstack()] = count();
}
}
invoked as (from ZFS src root):
# bpftrace --include zfs_config.h -I include/ -I include/spl -I include/os/linux/spl -I include/os/linux/zfs -I include/os/linux/kernel trace_mmap.bt
The only place that zfs_putpage ends writeback early is when it hands the page back due to dmu_tx_assign failing.
Sure enough, if I trace the dmu_tx_abort call in that branch of zfs_putpage it is indeed getting hit:
kprobe:zfs_putpage
{
$ip = (struct inode*)arg0;
if ($ip->i_ino == 3337) {
@put[tid] = 1;
}
}
kretprobe:zfs_putpage /@put[tid] != 0/
{
delete(@put[tid]);
}
kprobe:dmu_tx_abort /@put[tid] != 0/
{
$tx = (struct dmu_tx*)arg0;
printf("abort %d %d %d\n", $tx->tx_err, $tx->tx_needassign_txh, $tx->tx_wait_dirty);
print(kstack());
}
And with a bit more poking I find that dsl_dir_tempreserve_space is returning ERESTART, which is checked at the bottom of dmu_tx_assign. Tracing also confirms zfs_putpage is returning ERESTART.
kretprobe:dsl_dir_tempreserve_space /@put[tid] != 0 && retval != 0/
{
printf("reserve %d\n", retval);
}
The final failing sequence prints something like:
writepages 3337 1
bpf_prog_6097e4b9567b95c5_kfunc_vmlinux_write_cache_pages+405
bpf_prog_6097e4b9567b95c5_kfunc_vmlinux_write_cache_pages+405
bpf_trampoline_6442532424+98
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
put 112
end 112
put 260
end 260
put 522
reserve 85
abort 0 0 0
dmu_tx_abort+5
zfs_putpage+647
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
end 522
put 523
end 523
put 524
reserve 85
abort 0 0 0
dmu_tx_abort+5
zfs_putpage+647
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
end 524
put 586
end 586
put 992
end 992
put 993
reserve 85
abort 0 0 0
dmu_tx_abort+5
zfs_putpage+647
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
end 993
put 994
reserve 85
abort 0 0 0
dmu_tx_abort+5
zfs_putpage+647
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
end 994
put 995
reserve 85
abort 0 0 0
dmu_tx_abort+5
zfs_putpage+647
zpl_putpage+86
zpl_putfolio+14
write_cache_pages+322
bpf_trampoline_6442532424+141
write_cache_pages+9
zpl_writepages+274
do_writepages+207
__writeback_single_inode+61
writeback_single_inode+175
write_inode_now+118
icp_init+45
write_inode_now+9
zfs_clone_range+1247
zpl_clone_file_range_impl+250
zpl_remap_file_range+93
do_clone_file_range+261
vfs_clone_file_range+62
ioctl_file_clone+73
do_vfs_ioctl+1752
__x64_sys_ioctl+114
do_syscall_64+96
entry_SYSCALL_64_after_hwframe+110
end 995
put 996
end 996
ret 0
filemap_fdatawait
done write_inode_now 0
My repro has new instrumentation to dump out blocks that don't match, and I'm running that with set -x and piping through grep x.364 to filter messages for the focus object.
This outputs something like this immediately after the traces above:
+ write x.364
+ truncate --size=4M x.364
' x.364
++ cp --debug x.364 x.364.2
+ cpout=''\''x.364'\'' -> '\''x.364.2'\''
++ diff -q x.364 x.364.2
+ diff='Files x.364 and x.364.2 differ'
++ echo ''\''x.364'\'' -> '\''x.364.2'\''
+ echo 'Files x.364 and x.364.2 differ' '(copy offload: unknown, reflink: yes, sparse detection: unknown)'
Files x.364 and x.364.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
' x.364 x.364.2
x.364 block 993 differs
x.364 block 994 differs
x.364 block 995 differs
All of blocks 993, 994, and 995 failed putpage in the trace. Some others did too, but they were probably zeros that didn't change from the last write to the mapping in the repro loop.
The new instrumentation:
# write, copy, compare, and print an error on mismatch
write() {
truncate --size=4M "${f?}"
python3 -c "${tool}" "${f?}" &>/dev/null
cpout=$(cp --debug "${f?}" "${f?}.2" 2>&1)
diff=$(diff -q "${f?}" "${f?}.2")
if (($?)); then
echo "$diff" "($(echo "$cpout" | tail -1))"
python3 -c "
import sys
def blocks(f):
while True:
b = f.read(1<<12)
if b:
yield b
else:
break
with open(sys.argv[1], 'rb') as f1, open(sys.argv[2], 'rb') as f2:
for n, (b1, b2) in enumerate(zip(blocks(f1), blocks(f2))):
if b1 != b2:
print(sys.argv[1], 'block', n, 'differs')
" "${f?}" "${f?}.2"
fi
}
TL;DR With more tracing, I confirmed all this by comparing the mismatched blocks reported in the repro to those that failed to write during the second call to write_cache_pages (the one which uses WB_SYNC_ALL, which is the one that matters). Skipping the writeback due to tx pressure causes the data to be missed later.
Meanwhile, Linux only retries failed writeback pages for wbc->sync_mode == WB_SYNC_ALL && err == -ENOMEM, which is implemented in do_writepages. Other errors break out of the loop (though all pages are written once), and the error is returned to the caller. Not that this error path does anything as written: ZFS doesn't return errors from zpl_writefolio or zpl_writepage, nor does ZFS check that write_inode_now actually succeeds in either code path.
Note this is happening block-by-block, with the first write_cache_pages call writing most of the data and the second call writing out the rest. When both calls hit tx pressure the data goes unwritten and bclone fails. The general pattern I'm seeing in traces is that a few individual blocks fail and then dmu_tx_wait waits for pressure before attempting more. As a result, only individual blocks get skipped, and this probably explains why llseek is immune when sync=TRUE: every DMU record is >1 page, so the next page succeeds with high probability, some block is thus dirtied, fancy llseek will notice dirty data, and DTRT because the record is dirty even though the DMU is missing some writebacks.
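The "record is >1 page" margin is easy to quantify. Assuming the default 128 KiB recordsize and 4 KiB pages (both assumptions; recordsize is tunable per dataset), a single skipped page leaves many sibling pages that can still dirty the record:

```python
RECORDSIZE = 128 * 1024   # default ZFS recordsize (tunable, an assumption here)
PAGE_SIZE = 4 * 1024      # typical x86-64 page size

pages_per_record = RECORDSIZE // PAGE_SIZE
print(pages_per_record)   # 32 pages per record

# If any one of those pages is written back successfully, the whole
# DMU record becomes dirty, and a dirty record is enough for llseek
# with zfs_dmu_offset_next_sync=1 to report the range as data.
```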
A simple fix, I guess, is to plumb the error up and loop on flush until it succeeds if it returns ERESTART.
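That control flow can be sketched in miniature: propagate the error out of the flush path and have the caller loop until writeback fully succeeds. This is a userspace simulation only; flush_once and make_flaky_flush are hypothetical stand-ins, with ERESTART standing in for the tx-pressure pushback:

```python
import errno
import itertools


def make_flaky_flush(failures: int):
    """Simulate a writeback pass that hits tx pressure a few times."""
    attempts = itertools.count(1)

    def flush_once() -> int:
        if next(attempts) <= failures:
            return errno.ERESTART   # pushback: some pages were skipped
        return 0                    # all dirty pages reached the DMU
    return flush_once


def flush_until_clean(flush_once, max_retries: int = 100) -> int:
    """Loop on flush until it stops returning ERESTART."""
    for _ in range(max_retries):
        err = flush_once()
        if err != errno.ERESTART:
            return err              # success, or a real error to report
    raise RuntimeError("writeback kept hitting pressure")


# e.g. flush_until_clean(make_flaky_flush(3)) returns 0 on the 4th call
```

The bounded retry count is a safety valve in the simulation; a real kernel fix would need to decide between bounding, backing off, or waiting on the txg as discussed later in the thread.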
Edit: See trace_mmap.bt for my full bpftrace tooling.
rrevans@c7100fc32c35f90d1155d5d4eab3157ae240c856 is my attempt at a simple retry loop to fix writeback gaps.
I'm a few minutes into testing, and so far I am unable to trigger failures with any of:
zfs_bclone_enabled = 0, zfs_dmu_offset_next_sync = 1
zfs_bclone_enabled = 0, zfs_dmu_offset_next_sync = 0
zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 0, zfs_dmu_offset_next_sync = 1
zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 1, zfs_dmu_offset_next_sync = 1
Write throughput is lower due to the synchronous ZIL commits. zil_commit_count from /proc/spl/kstat/zfs/test/objset-0x36 shows 5k ZIL commits per second with bclone and 2k files.
I would appreciate any feedback or attempts at reproducing with this patch (especially any glaring problems caused by synchronous flushing).
EDIT: I've run the repro loop and the golang emerge test script for 6 hours without reproducing any corruption. During this time I had a tuning script randomly setting the above module parameters to different values every 90 seconds. ZTS also passes running locally.
@rrevans Stupid question, will the fix still work with sync=disabled?
sync=disabled ignores you requesting a sync, not syncing in general.
@IvanVolosyuk Yes, and I've confirmed it works.
@rrevans - Kudos on your outstanding investigation. While I haven't been able to reproduce the corruption running your script overnight, it's worth noting that I got several instances of the following stack trace. Not sure if these are related to your patch though. I ran with zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 0, zfs_dmu_offset_next_sync = 1:
[ 3648.422271] INFO: task sync:72175 blocked for more than 120 seconds.
[ 3648.422340] Tainted: P OE 6.5.0-26-generic #26-Ubuntu
[ 3648.422366] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3648.422390] task:sync state:D stack:0 pid:72175 ppid:62726 flags:0x00000002
[ 3648.422431] Call Trace:
[ 3648.422438] <TASK>
[ 3648.422448] __schedule+0x2cb/0x760
[ 3648.422575] ? __pfx_sync_inodes_one_sb+0x10/0x10
[ 3648.422589] schedule+0x63/0x110
[ 3648.422597] wb_wait_for_completion+0x89/0xc0
[ 3648.422608] ? __pfx_autoremove_wake_function+0x10/0x10
[ 3648.422620] sync_inodes_sb+0xd6/0x2c0
[ 3648.422632] ? __pfx_sync_inodes_one_sb+0x10/0x10
[ 3648.422638] sync_inodes_one_sb+0x1b/0x30
[ 3648.422644] iterate_supers+0x7a/0x100
[ 3648.422726] ksys_sync+0x42/0xb0
[ 3648.422736] __do_sys_sync+0xe/0x20
[ 3648.422743] do_syscall_64+0x59/0x90
[ 3648.422752] ? irqentry_exit_to_user_mode+0x17/0x20
[ 3648.422764] ? irqentry_exit+0x43/0x50
[ 3648.422775] ? exc_page_fault+0x94/0x1b0
[ 3648.422787] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3648.422800] RIP: 0033:0x7d3e85325c8b
[ 3648.422840] RSP: 002b:00007ffd3ba8efb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[ 3648.422849] RAX: ffffffffffffffda RBX: 00007ffd3ba8f1a8 RCX: 00007d3e85325c8b
[ 3648.422853] RDX: 00007d3e85405a01 RSI: 00007ffd3ba90ee8 RDI: 00007d3e853c1b94
[ 3648.422958] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 3648.422964] R10: 000062f1a1d5d19e R11: 0000000000000246 R12: 00007ffd3ba90681
[ 3648.422967] R13: 00007d3e853fe428 R14: 000062f1a1d5eb80 R15: 000062f1a1d5d034
[ 3648.422975] </TASK>
@ixhamza Thanks I appreciate the feedback
This message is the userland thread calling the sync system call taking a long time to finish. The kernel notices after a while and flags it.
It's somewhat expected as the repro script is spamming a lot of load, but I suspect that there may be a separate bug causing sync to livelock (block on ongoing writes vs. triggering a single pass of writeback work).
If you stop the script, does the sync process eventually terminate? It would be a large concern if the thread has become stuck forever. Note that lots and lots of I/O may have to flush first - you can look at the Dirty and Writeback page counts in /proc/meminfo. These should be decreasing with writeback after interrupting the script.
Investigation of the sync call is on my to-do list. I've seen other unusual behaviors, like sync(3) blocking on background txg sync instead of triggering one, and I don't understand enough there yet to know what is going on.
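Watching those two counters drain is easy to script; a small helper (an illustration, not part of the repro tooling) that parses /proc/meminfo on Linux for the fields mentioned above:

```python
def meminfo_pages(path: str = "/proc/meminfo") -> dict:
    """Return the Dirty and Writeback counters (in kB) from meminfo."""
    wanted = {"Dirty", "Writeback"}
    out = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in wanted:
                out[key] = int(rest.split()[0])  # values are reported in kB
    return out


# Poll this after interrupting the repro: both values should fall
# toward zero as writeback drains.
print(meminfo_pages())
```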
@rrevans - I just ran the script again for about 12 minutes and noticed that only one sync process was able to spawn; it wasn't able to finish during the script's lifetime. After interrupting the script with Ctrl + C, it took around 20 seconds for the script to terminate along with the sync process. Additionally, both Dirty and Writeback pages reset back to zero after the termination. Also, the Dirty and Writeback pages kept resetting to 0 during the script run as well.
rrevans/zfs@c7100fc is my attempt at a simple retry loop to fix writeback gaps.
@rrevans The addition of error handling looks right to me, unless the kernel expects errors to be silently ignored and is supposed to silently reissue the writes later by itself.
What makes me really wonder, though, is why we even use TXG_NOWAIT with the dmu_tx_assign() in Linux zfs_putpage(). I see Illumos switched to TXG_WAIT 10 years ago: https://www.illumos.org/issues/4347 , FreeBSD did it 7 years ago: https://github.com/freebsd/freebsd-src/commit/16b46572fa997 , so why does Linux still use TXG_NOWAIT? My first guess was that the context where it is called is not sleepable, but I see there:
	err = dmu_tx_assign(tx, TXG_NOWAIT);
	if (err != 0) {
		if (err == ERESTART)
			dmu_tx_wait(tx);
, where dmu_tx_wait() does sleep, even though only once, without retrying. It makes no sense to me. Could we switch it to TXG_WAIT similar to FreeBSD and call it a day?
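The difference between the two patterns can be modeled side by side: TXG_WAIT effectively performs the wait-and-retry loop inside dmu_tx_assign, whereas the Linux zfs_putpage path waits once and gives up. A userspace model of that difference (every name here is a stand-in, not the real DMU API):

```python
import errno


def model_assign(pressure_left: list) -> int:
    """Stand-in for dmu_tx_assign(tx, TXG_NOWAIT): fail while there is
    simulated dirty-data pressure, then succeed."""
    if pressure_left[0] > 0:
        return errno.ERESTART
    return 0


def model_wait(pressure_left: list) -> None:
    """Stand-in for dmu_tx_wait(): one unit of pressure drains."""
    pressure_left[0] -= 1


def assign_nowait_once(pressure: list) -> int:
    # Current Linux zfs_putpage pattern: wait once, no retry, so the
    # page is handed back unwritten whenever pressure persists.
    err = model_assign(pressure)
    if err == errno.ERESTART:
        model_wait(pressure)
        return errno.ERESTART
    return err


def assign_txg_wait(pressure: list) -> int:
    # TXG_WAIT semantics: keep waiting and retrying until the
    # assignment succeeds, as Illumos and FreeBSD do.
    while True:
        err = model_assign(pressure)
        if err != errno.ERESTART:
            return err
        model_wait(pressure)
```

With two units of pressure, assign_nowait_once returns ERESTART (page skipped) while assign_txg_wait returns 0 after draining the pressure.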
@amotin I agree it makes no sense, and TXG_WAIT is better. See rrevans@5270f1553e77127503fc0f092742aef5825bc18d. I am running the repro tests and it looks OK so far.
As for how did we get here, openzfs@3c0e5c0f45 removed the loop but left the wait behind. I don't know why TXG_WAIT was not used at that time since it is still sleeping.
@rrevans nice work running this down!
Could we switch it to TXG_WAIT similar to FreeBSD and call it a day?
For the most part I think so. It looks like when the original Illumos change was ported, zfs_putpage() was intentionally not updated to use TXG_WAIT. The comment indicates the concern was that this function is reachable via direct memory reclaim, so we'd prefer not to get stuck waiting on a sync. It'd potentially be better to redirty the page, allow direct reclaim to free memory elsewhere, and let the VFS try again a few moments later to write out the page when conditions may have improved.
zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version. This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.
But as @rrevans pointed out here's where things went wrong:
Meanwhile, Linux only retries failed writeback pages for wbc->sync_mode == WB_SYNC_ALL && err == -ENOMEM which is implemented in do_writepages.
Over the years the kernel has significantly changed exactly how this page writeback is done and how retries are handled. Unfortunately, not so much that it obviously broke things. When this change was originally ported from Illumos, the Linux 3.10 kernel was hot off the presses.
I'd also suggest we switch to TXG_WAIT to help keep things simple. Adding the missing error propagation also looks like a good idea. Looking at some very old kernels it seems they largely ignored the return value, but that's no longer the case.
TXG_WAIT (same as my previous test patches). I have some ideas how to improve the writeback code to avoid synchronous writes, but they need more invasive refactoring of how writeback works.
@thulle if possible would you mind confirming your compiling-go reproducer is unable to reproduce this with 2.2.4 when zfs_bclone_enabled=1. All of the reported BRT fixes have been included in 2.2.4, but it sure wouldn't hurt to get some more testing before we consider enabling it by default.
@thulle if possible would you mind confirming your compiling go reproducer is unable to reproduce this with 2.2.4 when zfs_bclone_enabled=1. All of the reported BRT fixes have been included in 2.2.4 but it sure wouldn't hurt to get some more testing before we consider enabling it by default.
Is that still open or done?
Yours lopiuh
System information
Describe the problem you're observing
After seeing 2.2.3 released and #15526 closed I enabled blockcloning again and tried my unfailing reproducer: compiling go. It still seems to trigger the/some bug.
Worth noting is that I'm running on top of LUKS. Scrub reports no issues.
Describe how to reproduce the problem