openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

some copied files are still corrupted (chunks replaced by zeros) #15933

Closed thulle closed 6 months ago

thulle commented 7 months ago

System information

Type                  Version/Name
Distribution Name     Gentoo
Distribution Version  (rolling)
Kernel Version        6.7.6-gentoo-x86_64
Architecture          amd64
OpenZFS Version       2.2.3

Describe the problem you're observing

After seeing 2.2.3 released and #15526 closed, I enabled block cloning again and tried my unfailing reproducer: compiling go. It still seems to trigger the bug, or some related one.

Worth noting is that I'm running on top of LUKS. Scrub reports no issues.

Describe how to reproduce the problem

# echo 1 > /sys/module/zfs/parameters/zfs_bclone_enabled 

# emerge --buildpkg=n --usepkg=n --quiet-build=y dev-lang/go 
>>> Completed (1 of 1) dev-lang/go-1.22.0::gentoo

# file /usr/lib/go/pkg/tool/linux_amd64/*|grep data$
/usr/lib/go/pkg/tool/linux_amd64/asm:       data
/usr/lib/go/pkg/tool/linux_amd64/cgo:       data
/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/usr/lib/go/pkg/tool/linux_amd64/link:      data
/usr/lib/go/pkg/tool/linux_amd64/vet:       data

# hexdump -C /usr/lib/go/pkg/tool/linux_amd64/asm
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 53 43 51 58  |............SCQX|
00000fb0  63 56 34 74 49 75 66 65  59 67 64 4a 32 54 54 4e  |cV4tIufeYgdJ2TTN|
00000fc0  2f 51 78 63 67 68 50 6b  43 33 67 48 31 54 66 64  |/QxcghPkC3gH1Tfd|
00000fd0  70 59 48 39 33 2f 43 36  32 31 50 69 70 58 45 71  |pYH93/C621PipXEq|
00000fe0  30 6f 5f 39 4c 57 65 71  49 4d 2f 48 32 52 64 56  |0o_9LWeqIM/H2RdV|
00000ff0  50 64 46 45 4c 77 6d 77  4a 37 42 31 51 33 47 00  |PdFELwmwJ7B1Q3G.|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
002346c0  53 43 51 58 63 56 34 74  49 75 66 65 59 67 64 4a  |SCQXcV4tIufeYgdJ|
002346d0  32 54 54 4e 2f 51 78 63  67 68 50 6b 43 33 67 48  |2TTN/QxcghPkC3gH|
002346e0  31 54 66 64 70 59 48 39  33 2f 43 36 32 31 50 69  |1TfdpYH93/C621Pi|
002346f0  70 58 45 71 30 6f 5f 39  4c 57 65 71 49 4d 2f 48  |pXEq0o_9LWeqIM/H|
00234700  32 52 64 56 50 64 46 45  4c 77 6d 77 4a 37 42 31  |2RdVPdFELwmwJ7B1|
00234710  51 33 47 00 00 00 00 00  00 00 00 00 00 00 00 00  |Q3G.............|
00234720  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003a88d0  00 00 00 00 00 00 00 00  00 00 00 00 00           |.............|
003a88dd

# zfs -V
zfs-2.2.3-r0-gentoo
zfs-kmod-2.2.3-r0-gentoo
rincebrain commented 6 months ago

Well, falling back to generic_file_llseek still breaks in the BRT case, as @rrevans said above, but we appear to no longer break in the BRT-disabled case for that, suggesting at least two issues at hand, unless it's just so rare now that I've not hit it yet.

@rrevans what do you think about shorting out to that for the SEEK_DATA case, assuming Linux's llseek has correct visibility into state, and encouraging people to leave bclone disabled for now? Can you think of any cases that doesn't cover?

rincebrain commented 6 months ago

I would appreciate it if people could take the current state of #16000 for a spin - I haven't been able to reproduce it doing the wrong thing again on that branch, with any of the above repros, though I'm still running them to try and be more confident for now.

rrevans commented 6 months ago

@rincebrain As I currently understand it, the problem for non-BRT is that mmap page dirtiness sometimes doesn't reach the DMU before the dnode_is_dirty check. IIRC generic_file_llseek implements "the whole file is data" based on the inode size, which so far seems correct. Given that ZFS has no special handling for SEEK_END, it should work to disable hole seeks.
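
For reference, the generic helper resolves SEEK_DATA/SEEK_HOLE purely from i_size, roughly like this (a paraphrase of the kernel's generic_file_llseek_size(), not verbatim):

switch (whence) {
case SEEK_DATA:
        /* the whole file is data: any offset below EOF is data */
        if (offset >= eof)
                return (-ENXIO);
        break;
case SEEK_HOLE:
        /* the only (virtual) hole is at EOF */
        if (offset >= eof)
                return (-ENXIO);
        offset = eof;
        break;
}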

That said, that approach is a regression for sparse copies without mmap, and I'm not sure we need to break that case.

I'm thoroughly convinced that the original fix works for the original problem and its root cause, though it's still got theoretical data races which I think are unreachable in practice. I've been studying this space and all the code ever since in pursuit of an implementation that doesn't need to force txg syncs because I think ZFS shouldn't have limits in this regard (and I have working code that indeed does that and passes ZTS, ztest, and my own stress tests).

While #16000 is super conservative, I'd vote for adding code that keeps llseek functional EXCEPT for files with mappings (dirty or not), which is an easy test to add. Maybe leave in the option to fully disable it as cheap advice for users in case this analysis is wrong somehow...?
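
Something like the following is what I have in mind (an illustrative sketch only, not the #16000 code; zpl_use_fancy_llseek() is a made-up name, while mapping_mapped() is the stock kernel helper that reports whether any VMA currently maps the file):

static boolean_t
zpl_use_fancy_llseek(struct inode *ip)
{
        /* file has memory mappings (dirty or not): fall back to the
         * generic "whole file is data" behavior */
        if (mapping_mapped(ip->i_mapping))
                return (B_FALSE);
        return (B_TRUE);
}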

rincebrain commented 6 months ago

Given the history, I'm really in favor of "super conservative" at the moment. I'm not opposed to other improvements, obviously, but since distros tend to ship releases and take forever to cherry-pick fixes from the future if they won't update (if they do it at all), I prefer safety models that don't assume people are running a newer version if we turn out to be wrong about it being safe.

Also, I have a similar document detailing the history, with commits, so it's interesting to read someone else's deep dive, thank you. I'm a little curious why you prepared it, though.

rrevans commented 6 months ago

@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with zfs_dmu_offset_next_sync=0?

Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases including this one if memory serves.

rincebrain commented 6 months ago

Breaks on 5.10.0-27 with zfs_dmu_offset_next_sync=0 and bclone_enabled=0.

Files x.1109 and x.1109.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1116 and x.1116.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1839 and x.1839.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1874 and x.1874.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1615 and x.1615.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1961 and x.1961.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

Edit: Were you actually asking about 2.1.5, or no? Because bclone only went in on 2.2, so that setting wouldn't exist there...

rincebrain commented 6 months ago

The same applies to stock 2.1.5 with either zfs_dmu_offset_next_sync value. With zfs_dmu_offset_next_sync=1:

Files x.1816 and x.1816.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)

With zfs_dmu_offset_next_sync=0:

Files x.1958 and x.1958.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.639 and x.639.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1261 and x.1261.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1885 and x.1885.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Gendra13 commented 6 months ago

@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with zfs_dmu_offset_next_sync=0?

Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases including this one if memory serves.

That was weird. When reducing the files down to 200 (or 20) I couldn't trigger the bug at all, even with zfs_dmu_offset_next_sync=1. I tested different numbers with zfs_dmu_offset_next_sync=1 and found 400 files to be the sweet spot where the script failed immediately.

Then I repeated the test with zfs_dmu_offset_next_sync=0 and indeed, it also failed, but took a few minutes of running.

ZFS version: zfs-kmod-2.1.15-1
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4

Files x.346 and x.346.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.341 and x.341.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
rrevans commented 6 months ago

Well, falling back to generic_file_llseek still breaks in the BRT case, as @rrevans said above, but we appear to no longer break in the BRT-disabled case for that, suggesting at least two issues at hand, unless it's just so rare now that I've not hit it yet.

I don't think the problems at hand are possible without fancy llseek or block clone.

If we attempt a direct read, then the cached mapping pages are used as the source of truth for the file state instead of the dirty buffers in the DMU/ARC. This holds as long as cp is forced to actually read the file, or has copy_file_range do it indirectly (which also does the right thing: eventually the kernel calls splice_read, which becomes zpl_iter_read and ultimately calls zfs_read; that in turn reads the page cache for mapped ranges).

Thus, assuming those paths work, skipping fancy llseek and reflink is always going to be correct and avoid any dependence on writeback or flushing dirty pages to the DMU.
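
For context, the dispatch in the read path looks roughly like this (paraphrased from zfs_read(), not verbatim; zn_has_cached_data() and mappedread() are the existing helpers):

if (zn_has_cached_data(zp, zfs_uio_offset(uio),
    zfs_uio_offset(uio) + nbytes - 1)) {
        /* mapped range: read through the page cache so mmap'd data
         * that hasn't been written back yet is still returned */
        error = mappedread(zp, nbytes, uio);
} else {
        error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl), uio, nbytes);
}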

@rrevans what do you think about shorting out to that for the SEEK_DATA case, assuming Linux's llseek has correct visibility into state, and encouraging people to leave bclone disabled for now? Can you think of any cases that doesn't cover?

No, I can't think of any other cases. Fancy llseek getting it wrong is the proximate cause of all known problems pre-BRT. It's the same problem as #15526 but triggered by mmap instead of racing syncs. Edit: I'm also convinced Linux has the correct state already because the other seek flavors work, especially SEEK_END (see also O_APPEND maybe).

In short, while at least some older releases are affected, I'm convinced that triggering the writeback flavor of this bug for zfs_bclone_enabled=0 requires approximately all of:

1) a recent cp that is sparse-aware (plain reads do not trigger this)
2) a sparse file (the heuristics will likely skip fancy llseek on dense files)
3) mmap used to make some changes to that file (plain writes do not trigger this)
4) copying the file soon after changing it (before normal writeback happens)

Note that golang seems to rewrite its outputs during the build, so the files are briefly sparse during the build process, and this is enough to create a window for (2): stat reports on-disk block usage from the moment the file was sparse if it happens to be flushed as part of a normal txg sync during that window. (I think this explains why the golang repro is really hard to hit pre-BRT: in addition to hitting a writeback race, the system also needs to hit a txg flush in the small window where the output file is sparse.)
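
To make the four conditions above concrete, here is a minimal userspace sketch of the pattern (a hypothetical illustration, not the repro script itself; whether it actually trips the bug depends on timing, writeback, and txg syncs):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, 4 << 20) != 0)      /* sparse 4 MiB file */
                return 1;

        char *p = mmap(NULL, 4 << 20, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        memset(p + (1 << 20), 0xab, 4096);      /* dirty one page via mmap */

        /* A sparse-aware cp would now probe with SEEK_DATA/SEEK_HOLE.
         * If the dirty page hasn't reached the DMU yet, its offset may
         * be (wrongly) reported as part of a hole. */
        off_t data = lseek(fd, 0, SEEK_DATA);
        printf("first data offset: %lld\n", (long long)data);
        return 0;
}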

rrevans commented 6 months ago

Okay... kernel tracing to the rescue. TL;DR zfs_putpage fails on pushback from dsl_dir_tempreserve_space, and nothing in Linux retries the writeback after this happens (in part because the error is not propagated to the kernel).

I ran my repro with the zn_flush_cached_data(B_TRUE) patch in both zfs_holey_common and zfs_clone_range. Then I installed bpftrace probes focused on a specific DMU object to try and figure out why mmap flush is broken.

On flush, the kernel calls zpl_writepages, which calls write_cache_pages twice; that in turn calls zfs_putpage for each dirty page. I added a trace for each page and found that some of the mismatching blocks were ones being written in the second write_cache_pages call.

After a bit more poking, I gathered stacks for end_page_writeback and noticed that the traces included stacks containing write_cache_pages, which shouldn't happen, since the normal completion path runs from zil_commit callbacks:

@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    bpf_trampoline_6442532424+141
    write_cache_pages+9
    zpl_writepages+274
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 4
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    zpl_writepages+149
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 5
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    bpf_trampoline_6442532424+141
    write_cache_pages+9
    zpl_writepages+274
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    icp_init+45
    write_inode_now+9
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 13
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zfs_putpage+698
    zpl_putpage+86
    zpl_putfolio+14
    write_cache_pages+322
    bpf_trampoline_6442532424+141
    write_cache_pages+9
    zpl_writepages+149
    do_writepages+207
    __writeback_single_inode+61
    writeback_single_inode+175
    write_inode_now+118
    icp_init+45
    write_inode_now+9
    zfs_clone_range+1247
    zpl_clone_file_range_impl+250
    zpl_remap_file_range+93
    do_clone_file_range+261
    vfs_clone_file_range+62
    ioctl_file_clone+73
    do_vfs_ioctl+1752
    __x64_sys_ioctl+114
    do_syscall_64+96
    entry_SYSCALL_64_after_hwframe+110
]: 22
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zil_itx_destroy+174
    zil_lwb_flush_vdevs_done+212
    zio_done+975
    zio_execute+243
    taskq_thread+510
    kthread+232
    ret_from_fork+52
    ret_from_fork_asm+27
]: 992
@epw[
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_prog_2bb67ce8f65af3d2_kfunc_vmlinux_end_page_writeback+301
    bpf_trampoline_6442492082+71
    end_page_writeback+9
    zil_itx_destroy+174
    zil_itxg_clean+149
    taskq_thread+510
    kthread+232
    ret_from_fork+52
    ret_from_fork_asm+27
]: 2660

generated from

// end_page_writeback fires once per page as writeback completes;
// inode 3337 is the file being traced in this repro run
kfunc:end_page_writeback
{
    if (args.page->mapping->host->i_ino == 3337) {
        printf("end %d\n", args.page->index);
        @epw[kstack()] = count();
    }
}

invoked as (from ZFS src root):

# bpftrace --include zfs_config.h -I include/ -I include/spl -I include/os/linux/spl -I include/os/linux/zfs -I include/os/linux/kernel trace_mmap.bt

The only place that zfs_putpage ends writeback early is when it hands the page back due to dmu_tx_assign failing.
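
For reference, that branch looks roughly like this (paraphrased from the Linux zfs_putpage() in 2.2.x, not verbatim; the redirty call is filemap_dirty_folio() or __set_page_dirty_nobuffers() depending on kernel version):

err = dmu_tx_assign(tx, TXG_NOWAIT);
if (err != 0) {
        if (err == ERESTART)
                dmu_tx_wait(tx);

        dmu_tx_abort(tx);
        /* hand the page back: redirty it and end writeback, so the
         * DMU never sees this page's data on this pass */
        __set_page_dirty_nobuffers(pp);
        ClearPageError(pp);
        end_page_writeback(pp);
        zfs_rangelock_exit(lr);
        zfs_exit(zfsvfs, FTAG);
        return (err);
}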

Sure enough, if I trace the dmu_tx_abort call in that branch of zfs_putpage it is indeed getting hit:

// flag threads that are inside zfs_putpage for the traced inode
kprobe:zfs_putpage
{
        $ip = (struct inode*)arg0;
        if ($ip->i_ino == 3337) {
                @put[tid] = 1;
        }
}
kretprobe:zfs_putpage /@put[tid] != 0/
{
        delete(@put[tid]);
}
// report any dmu_tx_abort seen while that flag is set
kprobe:dmu_tx_abort /@put[tid] != 0/
{
        $tx = (struct dmu_tx*)arg0;
        printf("abort %d %d %d\n", $tx->tx_err, $tx->tx_needassign_txh, $tx->tx_wait_dirty);
        print(kstack());
}

With a bit more poking, I find that dsl_dir_tempreserve_space is returning ERESTART, which is checked at the bottom of dmu_tx_assign. Tracing also confirms that zfs_putpage is returning ERESTART.

// nonzero returns from dsl_dir_tempreserve_space under zfs_putpage
// (85 == ERESTART on Linux)
kretprobe:dsl_dir_tempreserve_space /@put[tid] != 0 && retval != 0/
{
        printf("reserve %d\n", retval);
}

The final failing sequence prints something like:

writepages 3337 1

        bpf_prog_6097e4b9567b95c5_kfunc_vmlinux_write_cache_pages+405
        bpf_prog_6097e4b9567b95c5_kfunc_vmlinux_write_cache_pages+405
        bpf_trampoline_6442532424+98
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

put 112
end 112
put 260
end 260
put 522
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 522
put 523
end 523
put 524
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 524
put 586
end 586
put 992
end 992
put 993
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 993
put 994
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 994
put 995
reserve 85
abort 0 0 0

        dmu_tx_abort+5
        zfs_putpage+647
        zpl_putpage+86
        zpl_putfolio+14
        write_cache_pages+322
        bpf_trampoline_6442532424+141
        write_cache_pages+9
        zpl_writepages+274
        do_writepages+207
        __writeback_single_inode+61
        writeback_single_inode+175
        write_inode_now+118
        icp_init+45
        write_inode_now+9
        zfs_clone_range+1247
        zpl_clone_file_range_impl+250
        zpl_remap_file_range+93
        do_clone_file_range+261
        vfs_clone_file_range+62
        ioctl_file_clone+73
        do_vfs_ioctl+1752
        __x64_sys_ioctl+114
        do_syscall_64+96
        entry_SYSCALL_64_after_hwframe+110

end 995
put 996
end 996
ret 0
filemap_fdatawait
done write_inode_now 0

My repro has new instrumentation to dump out blocks that don't match, and I'm running that with set -x and piping through grep x.364 to filter messages for the focus object.

This outputs something like this immediately after the traces above:

+ write x.364
+ truncate --size=4M x.364
' x.364
++ cp --debug x.364 x.364.2
+ cpout=''\''x.364'\'' -> '\''x.364.2'\''
++ diff -q x.364 x.364.2
+ diff='Files x.364 and x.364.2 differ'
++ echo ''\''x.364'\'' -> '\''x.364.2'\''
+ echo 'Files x.364 and x.364.2 differ' '(copy offload: unknown, reflink: yes, sparse detection: unknown)'
Files x.364 and x.364.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
' x.364 x.364.2
x.364 block 993 differs
x.364 block 994 differs
x.364 block 995 differs

All of blocks 993, 994, and 995 failed putpage in the trace. Some others did too, but they were probably zeros that didn't change from the last write to the mapping in the repro loop.

The new instrumentation:

# write, copy, compare, and print an error on mismatch
# (${f?} is the target file and ${tool} is the Python snippet that writes
# to it; both are set by the enclosing repro script)
write() {
    truncate --size=4M "${f?}"
    python3 -c "${tool}" "${f?}" &>/dev/null
    cpout=$(cp --debug "${f?}" "${f?}.2" 2>&1)
    diff=$(diff -q "${f?}" "${f?}.2")
    if (($?)); then
        echo "$diff" "($(echo "$cpout" | tail -1))"
        # dump which 4 KiB blocks differ between the source and the copy
        python -c "
import sys
def blocks(f):
    while True:
        b = f.read(1<<12)
        if b:
            yield b
        else:
            break
with open(sys.argv[1], 'rb') as f1, open(sys.argv[2], 'rb') as f2:
    for n, (b1, b2) in enumerate(zip(blocks(f1), blocks(f2))):
        if b1 != b2:
            print(sys.argv[1], 'block', n, 'differs')
" "${f?}" "${f?}.2"
    fi
}

TL;DR With more tracing, I confirmed all this by comparing the mismatched blocks reported in the repro to those that failed to write during the second call to write_cache_pages (the one which uses WB_SYNC_ALL, which is the one that matters). Skipping the writeback due to tx pressure causes the data to be missed later.

Meanwhile, Linux only retries failed writeback pages when wbc->sync_mode == WB_SYNC_ALL && err == -ENOMEM, which is implemented in do_writepages. Other errors break out of the loop (though all pages are written once), and the error is returned to the caller. Not that this error path accomplishes anything as written: ZFS doesn't return errors from zpl_writefolio or zpl_writepage, nor does ZFS check that write_inode_now actually succeeds in either code path.
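
For reference, the retry logic in question looks roughly like this (paraphrased from recent kernels' do_writepages() in mm/page-writeback.c, not verbatim):

while (1) {
        ret = mapping->a_ops->writepages(mapping, wbc);
        if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
                break;
        /* throttle briefly (congestion_wait()/reclaim_throttle(),
         * depending on kernel version), then retry the whole pass */
}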

Note this is happening block-by-block, with the first write_cache_pages call writing most of the data and the second call writing out the rest. When both calls hit tx pressure, the data goes unwritten and bclone fails. The general pattern I'm seeing in traces is that a few individual blocks fail and then dmu_tx_wait waits out the pressure before attempting more. As a result, only individual blocks get skipped, and this probably explains why llseek is immune when sync=TRUE: every DMU record is >1 page, so the next page succeeds with high probability, some block in the record is thus dirtied, and fancy llseek will notice the dirty data and DTRT because the record is dirty even though the DMU is missing some writebacks.

A simple fix, I guess, is to plumb the error up and loop on flush until it succeeds if it returns ERESTART.
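
For example, something along these lines in the flush path (a hypothetical sketch only: it assumes the ERESTART from zfs_putpage is actually propagated up through zpl_writepages()/do_writepages(), which today it is not, and it glosses over error-sign conventions between ZFS and the VFS):

for (;;) {
        error = write_inode_now(ZTOI(zp), 1);   /* 1 = wait for writeback */
        if (error != ERESTART && error != -ERESTART)
                break;  /* success, or a non-retryable error */
}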

Edit: See trace_mmap.bt for my full bpftrace tooling.

rrevans commented 6 months ago

rrevans@c7100fc32c35f90d1155d5d4eab3157ae240c856 is my attempt at a simple retry loop to fix writeback gaps.

I'm a few minutes into testing, and so far I am unable to trigger failures with any of:

Write throughput is lower due to the synchronous ZIL commits. zil_commit_count from /proc/spl/kstat/zfs/test/objset-0x36 shows 5k ZIL commits per second with bclone and 2k files.

I would appreciate any feedback or attempts at reproducing with this patch (especially any glaring problems caused by synchronous flushing).

EDIT: I've run the repro loop and the golang emerge test script for 6 hours without reproducing any corruption. During this time I had a tuning script randomly setting the above module parameters to different values every 90 seconds. ZTS also passes running locally.

IvanVolosyuk commented 6 months ago

@rrevans Stupid question, will the fix still work with sync=disabled?

rincebrain commented 6 months ago

sync=disabled ignores your requests for a sync; it doesn't disable syncing in general.

rrevans commented 6 months ago

@IvanVolosyuk Yes, and I've confirmed it works.

ixhamza commented 6 months ago

@rrevans - Kudos on your outstanding investigation. While I haven't been able to reproduce the corruption running your script overnight, it's worth noting that I got several instances of the following stack trace. Not sure if these are related to your patch though. I ran with zfs_bclone_enabled = 1, zfs_bclone_wait_dirty = 0, zfs_dmu_offset_next_sync = 1:

[ 3648.422271] INFO: task sync:72175 blocked for more than 120 seconds.
[ 3648.422340]       Tainted: P           OE      6.5.0-26-generic #26-Ubuntu
[ 3648.422366] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3648.422390] task:sync            state:D stack:0     pid:72175 ppid:62726  flags:0x00000002
[ 3648.422431] Call Trace:
[ 3648.422438]  <TASK>
[ 3648.422448]  __schedule+0x2cb/0x760
[ 3648.422575]  ? __pfx_sync_inodes_one_sb+0x10/0x10
[ 3648.422589]  schedule+0x63/0x110
[ 3648.422597]  wb_wait_for_completion+0x89/0xc0
[ 3648.422608]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 3648.422620]  sync_inodes_sb+0xd6/0x2c0
[ 3648.422632]  ? __pfx_sync_inodes_one_sb+0x10/0x10
[ 3648.422638]  sync_inodes_one_sb+0x1b/0x30
[ 3648.422644]  iterate_supers+0x7a/0x100
[ 3648.422726]  ksys_sync+0x42/0xb0
[ 3648.422736]  __do_sys_sync+0xe/0x20
[ 3648.422743]  do_syscall_64+0x59/0x90
[ 3648.422752]  ? irqentry_exit_to_user_mode+0x17/0x20
[ 3648.422764]  ? irqentry_exit+0x43/0x50
[ 3648.422775]  ? exc_page_fault+0x94/0x1b0
[ 3648.422787]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 3648.422800] RIP: 0033:0x7d3e85325c8b
[ 3648.422840] RSP: 002b:00007ffd3ba8efb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a2
[ 3648.422849] RAX: ffffffffffffffda RBX: 00007ffd3ba8f1a8 RCX: 00007d3e85325c8b
[ 3648.422853] RDX: 00007d3e85405a01 RSI: 00007ffd3ba90ee8 RDI: 00007d3e853c1b94
[ 3648.422958] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 3648.422964] R10: 000062f1a1d5d19e R11: 0000000000000246 R12: 00007ffd3ba90681
[ 3648.422967] R13: 00007d3e853fe428 R14: 000062f1a1d5eb80 R15: 000062f1a1d5d034
[ 3648.422975]  </TASK>
rrevans commented 6 months ago

@ixhamza Thanks, I appreciate the feedback.

This message means the userland thread calling the sync system call is taking a long time to finish. The kernel notices after a while and flags it.

It's somewhat expected as the repro script is spamming a lot of load, but I suspect that there may be a separate bug causing sync to livelock (block on ongoing writes vs. triggering a single pass of writeback work).

If you stop the script, does the sync process eventually terminate? It would be a large concern if the thread has become stuck forever. Note that lots and lots of I/O may have to flush first - you can look at the dirty and writeback page counts in /proc/meminfo. These should be decreasing with writeback after interrupting the script.

Investigation of the sync call is on my to-do list. I've seen other unusual behaviors, like sync(3) blocking on a background txg sync instead of triggering one, and I don't understand enough there yet to know what is going on.

ixhamza commented 6 months ago

@rrevans - I just ran the script again for about 12 minutes and noticed that only one sync process was able to spawn; it wasn't able to finish during the script's lifetime. After interrupting the script with Ctrl + C, it took around 20 seconds for the script to terminate along with the sync process. Additionally, both Dirty and Writeback pages reset back to zero after the termination. Also, the Dirty and Writeback pages kept resetting to 0 during the script run as well.

amotin commented 6 months ago

rrevans/zfs@c7100fc is my attempt at a simple retry loop to fix writeback gaps.

@rrevans The addition of error handling looks right to me, unless the kernel expects errors to be silently ignored and is supposed to silently reissue the writes later by itself.

What really makes me wonder, though, is why we even use TXG_NOWAIT with dmu_tx_assign() in the Linux zfs_putpage(). I see Illumos switched to TXG_WAIT 10 years ago: https://www.illumos.org/issues/4347 , and FreeBSD did the same 7 years ago: https://github.com/freebsd/freebsd-src/commit/16b46572fa997 , so why does Linux still use TXG_NOWAIT? My first guess was that the context where it is called is not sleepable, but I see there:

err = dmu_tx_assign(tx, TXG_NOWAIT);
if (err != 0) {
        if (err == ERESTART)
                dmu_tx_wait(tx);

, where dmu_tx_wait() does sleep, even if only once and without retrying. It makes no sense to me. Could we switch it to TXG_WAIT similar to FreeBSD and call it a day?
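
For comparison, the TXG_WAIT variant would collapse to roughly this (a sketch, not the actual FreeBSD code or any proposed patch; the error path mirrors the existing Linux code):

err = dmu_tx_assign(tx, TXG_WAIT);      /* sleeps and retries ERESTART internally */
if (err != 0) {
        /* only hard failures (e.g. ENOSPC/EDQUOT) reach this point;
         * the page is still handed back as before */
        dmu_tx_abort(tx);
        __set_page_dirty_nobuffers(pp);
        ClearPageError(pp);
        end_page_writeback(pp);
        zfs_rangelock_exit(lr);
        zfs_exit(zfsvfs, FTAG);
        return (err);
}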

rrevans commented 6 months ago

@amotin I agree it makes no sense, and TXG_WAIT is better. See rrevans@5270f1553e77127503fc0f092742aef5825bc18d. I am running the repro tests and it looks OK so far.

As for how we got here: openzfs@3c0e5c0f45 removed the loop but left the wait behind. I don't know why TXG_WAIT wasn't used at that time, since the code still sleeps.

behlendorf commented 6 months ago

@rrevans nice work running this down!

Could we switch it to TXG_WAIT similar to FreeBSD and call it a day?

For the most part I think so. It looks like when the original Illumos change was ported, zfs_putpage() was intentionally not updated to use TXG_WAIT. The comment indicates the concern was that this function is reachable via direct memory reclaim, so we'd prefer not to get stuck waiting on a sync. It'd potentially be better to redirty the page, allow direct reclaim to free memory elsewhere, and let the VFS try again a few moments later to write out the page when conditions may have improved.

    zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version.  This
    case may be a contention point just like zfs_write(), however it is not
    safe to block here since it may be called during memory reclaim.

But as @rrevans pointed out, here's where things went wrong:

Meanwhile, Linux only retries failed writeback pages when wbc->sync_mode == WB_SYNC_ALL && err == -ENOMEM, which is implemented in do_writepages.

Over the years the kernel has significantly changed exactly how this page writeback is done and how retries are handled. Unfortunately, not so much that it obviously broke things. When this change was originally ported from Illumos, the Linux 3.10 kernel was hot off the presses.

I'd also suggest we switch to TXG_WAIT to help keep things simple. Adding the missing error propagation also looks like a good idea. Looking at some very old kernels it seems they largely ignored the return value, but that's no longer the case.

rrevans commented 6 months ago

#16019 fixes this with synchronous writeback and TXG_WAIT (same as my previous test patches).

I have some ideas how to improve the writeback code to avoid synchronous writes, but they need more invasive refactoring of how writeback works.

behlendorf commented 4 months ago

@thulle if possible would you mind confirming your compiling go reproducer is unable to reproduce this with 2.2.4 when zfs_bclone_enabled=1. All of the reported BRT fixes have been included in 2.2.4 but it sure wouldn't hurt to get some more testing before we consider enabling it by default.

lopiuh commented 2 months ago

@thulle if possible would you mind confirming your compiling go reproducer is unable to reproduce this with 2.2.4 when zfs_bclone_enabled=1. All of the reported BRT fixes have been included in 2.2.4 but it sure wouldn't hurt to get some more testing before we consider enabling it by default.

Is that still open or done?

Yours lopiuh