Closed thulle closed 6 months ago
cc @robn
# echo 1 > /sys/module/zfs/parameters/zfs_bclone_wait_dirty
Started the script again, but it stalled and something seemed stuck writing at ~10MB/s to the nvme. Killed the script and saw this in syslog:
Feb 26 15:35:27 soma kernel: INFO: task kworker/u65:1:22874 blocked for more than 122 seconds.
Feb 26 15:35:27 soma kernel: Tainted: P W OE T 6.7.6-gentoo-x86_64 #2
Feb 26 15:35:27 soma kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 15:35:27 soma kernel: task:kworker/u65:1 state:D stack:0 pid:22874 tgid:22874 ppid:2 flags:0x00004000
Feb 26 15:35:27 soma kernel: Workqueue: writeback wb_workfn (flush-zfs-5)
Feb 26 15:35:27 soma kernel: Call Trace:
Feb 26 15:35:27 soma kernel: <TASK>
Feb 26 15:35:27 soma kernel: __schedule+0x277/0x680
Feb 26 15:35:27 soma kernel: schedule+0x31/0xe0
Feb 26 15:35:27 soma kernel: __cv_broadcast+0x154/0x190 [spl]
Feb 26 15:35:27 soma kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 26 15:35:27 soma kernel: zfs_rangelock_exit+0x815/0x930 [zfs]
Feb 26 15:35:27 soma kernel: zfs_putpage+0x128/0x7f0 [zfs]
Feb 26 15:35:27 soma kernel: zio_do_crypt_abd+0x21fd/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel: write_cache_pages+0x13e/0x3b0
Feb 26 15:35:27 soma kernel: ? zio_do_crypt_abd+0x21c0/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: zio_do_crypt_abd+0x1bb1/0x38f0 [zfs]
Feb 26 15:35:27 soma kernel: do_writepages+0xce/0x1f0
Feb 26 15:35:27 soma kernel: __writeback_single_inode+0x3d/0x290
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: writeback_sb_inodes+0x206/0x4e0
Feb 26 15:35:27 soma kernel: __writeback_inodes_wb+0x4c/0xf0
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: wb_writeback+0x29e/0x310
Feb 26 15:35:27 soma kernel: wb_do_writeback+0x238/0x2c0
Feb 26 15:35:27 soma kernel: wb_workfn+0x5b/0x270
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: ? queue_delayed_work_on+0x8d/0xa0
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: process_one_work+0x173/0x340
Feb 26 15:35:27 soma kernel: worker_thread+0x28b/0x3b0
Feb 26 15:35:27 soma kernel: ? preempt_count_add+0x7a/0xb0
Feb 26 15:35:27 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 15:35:27 soma kernel: ? __pfx_worker_thread+0x10/0x10
Feb 26 15:35:27 soma kernel: kthread+0xf6/0x130
Feb 26 15:35:27 soma kernel: ? __pfx_kthread+0x10/0x10
Feb 26 15:35:27 soma kernel: ret_from_fork+0x30/0x50
Feb 26 15:35:27 soma kernel: ? __pfx_kthread+0x10/0x10
Feb 26 15:35:27 soma kernel: ret_from_fork_asm+0x1b/0x30
Feb 26 15:35:27 soma kernel: </TASK>
Checking the process list, I suspect this straggler is the culprit; it's the only remaining child of the ebuild python process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 9953 33.3 0.0 10136 2008 ? RN 15:31 12:42 cp -R api bin doc lib pkg misc src test /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go
edit: As soon as I disabled zfs_bclone_wait_dirty the write load disappeared and the cp-process died.
Rebooted and tried the same thing again, but not killing the script this time, instead just disabling zfs_bclone_wait_dirty. Waited for the trace first:
Feb 26 16:37:51 soma kernel: INFO: task kworker/u66:2:2121 blocked for more than 122 seconds.
Feb 26 16:37:51 soma kernel: Tainted: P W OE T 6.7.6-gentoo-x86_64 #2
Feb 26 16:37:51 soma kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 26 16:37:51 soma kernel: task:kworker/u66:2 state:D stack:0 pid:2121 tgid:2121 ppid:2 flags:0x00004000
Feb 26 16:37:51 soma kernel: Workqueue: writeback wb_workfn (flush-zfs-5)
Feb 26 16:37:51 soma kernel: Call Trace:
Feb 26 16:37:51 soma kernel: <TASK>
Feb 26 16:37:51 soma kernel: __schedule+0x277/0x680
Feb 26 16:37:51 soma kernel: schedule+0x31/0xe0
Feb 26 16:37:51 soma kernel: __cv_broadcast+0x154/0x190 [spl]
Feb 26 16:37:51 soma kernel: ? __pfx_autoremove_wake_function+0x10/0x10
Feb 26 16:37:51 soma kernel: zfs_rangelock_exit+0x815/0x930 [zfs]
Feb 26 16:37:51 soma kernel: zfs_putpage+0x128/0x7f0 [zfs]
Feb 26 16:37:51 soma kernel: zio_do_crypt_abd+0x21fd/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel: write_cache_pages+0x13e/0x3b0
Feb 26 16:37:51 soma kernel: ? zio_do_crypt_abd+0x21c0/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel: zio_do_crypt_abd+0x1bb1/0x38f0 [zfs]
Feb 26 16:37:51 soma kernel: do_writepages+0xce/0x1f0
Feb 26 16:37:51 soma kernel: __writeback_single_inode+0x3d/0x290
Feb 26 16:37:51 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel: writeback_sb_inodes+0x206/0x4e0
Feb 26 16:37:51 soma kernel: __writeback_inodes_wb+0x4c/0xf0
Feb 26 16:37:51 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel: wb_writeback+0x29e/0x310
Feb 26 16:37:51 soma kernel: wb_do_writeback+0x238/0x2c0
Feb 26 16:37:51 soma kernel: wb_workfn+0x5b/0x270
Feb 26 16:37:51 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel: ? srso_alias_return_thunk+0x5/0xfbef5
Feb 26 16:37:51 soma kernel: ? try_to_wake_up+0x9d/0x690
Feb 26 16:37:51 soma kernel: process_one_work+0x173/0x340
Feb 26 16:37:51 soma kernel: worker_thread+0x28b/0x3b0
Feb 26 16:37:51 soma kernel: ? __pfx_worker_thread+0x10/0x10
Feb 26 16:37:51 soma kernel: kthread+0xf6/0x130
Feb 26 16:37:51 soma kernel: ? __pfx_kthread+0x10/0x10
Feb 26 16:37:51 soma kernel: ret_from_fork+0x30/0x50
Feb 26 16:37:51 soma kernel: ? __pfx_kthread+0x10/0x10
Feb 26 16:37:51 soma kernel: ret_from_fork_asm+0x1b/0x30
Feb 26 16:37:51 soma kernel: </TASK>
Not killing the script, just disabling zfs_bclone_wait_dirty, it completes instead, with no corruption detected. I let it loop like this another 6 times, surprisingly without issue, until I realized my CPU was limited to 3GHz: the default limit a script of mine applies when it doesn't detect any of the listed processes running (processes needing more speed, like portage). After uncapping the CPU speed, it fails in the next loop:
Loop 8 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet: data
Two theories:
Stabilizing the core speeds a bit:
# cpupower frequency-set -g performance -d 4000000 -u 5200000
Loop 3 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet: data
Letting it vary more:
# cpupower frequency-set -g schedutil -d 550000 -u 5200000
Loop 2 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet: data
NAME PROPERTY VALUE SOURCE
kc3000/EPHEMERAL/usrtmp type filesystem -
kc3000/EPHEMERAL/usrtmp creation fre nov 24 16:51 2023 -
kc3000/EPHEMERAL/usrtmp used 7.31G -
kc3000/EPHEMERAL/usrtmp available 449G -
kc3000/EPHEMERAL/usrtmp referenced 7.31G -
kc3000/EPHEMERAL/usrtmp compressratio 2.40x -
kc3000/EPHEMERAL/usrtmp mounted yes -
kc3000/EPHEMERAL/usrtmp quota none default
kc3000/EPHEMERAL/usrtmp reservation none default
kc3000/EPHEMERAL/usrtmp recordsize 128K default
kc3000/EPHEMERAL/usrtmp mountpoint /usr/tmp local
kc3000/EPHEMERAL/usrtmp sharenfs off default
kc3000/EPHEMERAL/usrtmp checksum blake3 inherited from kc3000
kc3000/EPHEMERAL/usrtmp compression zstd-3 inherited from kc3000
kc3000/EPHEMERAL/usrtmp atime on default
kc3000/EPHEMERAL/usrtmp devices on default
kc3000/EPHEMERAL/usrtmp exec on default
kc3000/EPHEMERAL/usrtmp setuid on default
kc3000/EPHEMERAL/usrtmp readonly off default
kc3000/EPHEMERAL/usrtmp zoned off default
kc3000/EPHEMERAL/usrtmp snapdir hidden default
kc3000/EPHEMERAL/usrtmp aclmode discard default
kc3000/EPHEMERAL/usrtmp aclinherit restricted default
kc3000/EPHEMERAL/usrtmp createtxg 15 -
kc3000/EPHEMERAL/usrtmp canmount on default
kc3000/EPHEMERAL/usrtmp xattr sa inherited from kc3000
kc3000/EPHEMERAL/usrtmp copies 1 default
kc3000/EPHEMERAL/usrtmp version 5 -
kc3000/EPHEMERAL/usrtmp utf8only on -
kc3000/EPHEMERAL/usrtmp normalization formD -
kc3000/EPHEMERAL/usrtmp casesensitivity sensitive -
kc3000/EPHEMERAL/usrtmp vscan off default
kc3000/EPHEMERAL/usrtmp nbmand off default
kc3000/EPHEMERAL/usrtmp sharesmb off default
kc3000/EPHEMERAL/usrtmp refquota none default
kc3000/EPHEMERAL/usrtmp refreservation none default
kc3000/EPHEMERAL/usrtmp guid 10378968521923463382 -
kc3000/EPHEMERAL/usrtmp primarycache all default
kc3000/EPHEMERAL/usrtmp secondarycache all default
kc3000/EPHEMERAL/usrtmp usedbysnapshots 0B -
kc3000/EPHEMERAL/usrtmp usedbydataset 7.31G -
kc3000/EPHEMERAL/usrtmp usedbychildren 0B -
kc3000/EPHEMERAL/usrtmp usedbyrefreservation 0B -
kc3000/EPHEMERAL/usrtmp logbias latency default
kc3000/EPHEMERAL/usrtmp objsetid 398 -
kc3000/EPHEMERAL/usrtmp dedup off default
kc3000/EPHEMERAL/usrtmp mlslabel none default
kc3000/EPHEMERAL/usrtmp sync standard default
kc3000/EPHEMERAL/usrtmp dnodesize legacy default
kc3000/EPHEMERAL/usrtmp refcompressratio 2.40x -
kc3000/EPHEMERAL/usrtmp written 7.31G -
kc3000/EPHEMERAL/usrtmp logicalused 13.9G -
kc3000/EPHEMERAL/usrtmp logicalreferenced 13.9G -
kc3000/EPHEMERAL/usrtmp volmode default default
kc3000/EPHEMERAL/usrtmp filesystem_limit none default
kc3000/EPHEMERAL/usrtmp snapshot_limit none default
kc3000/EPHEMERAL/usrtmp filesystem_count none default
kc3000/EPHEMERAL/usrtmp snapshot_count none default
kc3000/EPHEMERAL/usrtmp snapdev hidden default
kc3000/EPHEMERAL/usrtmp acltype posix inherited from kc3000
kc3000/EPHEMERAL/usrtmp context none default
kc3000/EPHEMERAL/usrtmp fscontext none default
kc3000/EPHEMERAL/usrtmp defcontext none default
kc3000/EPHEMERAL/usrtmp rootcontext none default
kc3000/EPHEMERAL/usrtmp relatime on inherited from kc3000
kc3000/EPHEMERAL/usrtmp redundant_metadata all default
kc3000/EPHEMERAL/usrtmp overlay on default
kc3000/EPHEMERAL/usrtmp encryption off default
kc3000/EPHEMERAL/usrtmp keylocation none default
kc3000/EPHEMERAL/usrtmp keyformat none default
kc3000/EPHEMERAL/usrtmp pbkdf2iters 0 default
kc3000/EPHEMERAL/usrtmp special_small_blocks 0 default
# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
Loop 2 - Corrupted files in /var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64:
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/compile: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/cover: data
/var/tmp/portage/dev-lang/go-1.22.0/image/usr/lib/go/pkg/tool/linux_amd64/vet: data
Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT.
Gross.
How about https://github.com/openzfs/zfs/pull/15588 - I wonder if that changes the outcome?
I tried, and no.
Thanks for trying - and damn!
Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT.
Gross.
"Really easily" as in "just as easily as before"?
Rebased and running this all night... https://github.com/numinit/nixpkgs/commit/c0712aad80c9e89a5004ec5133f74748787755f9
Kept the blocksize at 16k (rather than increasing it to 256k) and added more logging:
zfs_2_2_3_stock # [zhammer::1266] Uname: Linux zfs223stock 6.7.9 #1-NixOS SMP PREEMPT_DYNAMIC Wed Mar 6 14:54:01 UTC 2024 x86_64 GNU/Linux
zfs_2_2_3_stock # [zhammer::1266] Cmdline: initrd=\EFI\nixos\zi31gj6mb1lmp3w92wxl0n93s36kszq9-initrd-linux-6.7.9-initrd.efi init=/nix/store/gs2xw1yis46qx221h1jy0z4jv7z8w5sd-nixos-syst
em-zfs_2_2_3_stock-test/init console=ttyS0 panic=1 boot.panic_on_fail clock=acpi_pm zfs.zfs_bclone_enabled=1 nohibernate loglevel=7 net.ifnames=0
zfs_2_2_3_stock # [zhammer::1266] - zfs_bclone_enabled: 1
zfs_2_2_3_stock # [zhammer::1266] - zfs_bclone_wait_dirty: 0
zfs_2_2_3_stock # [zhammer::1266] - zfs_dmu_offset_next_sync: 1
zfs_2_2_3_stock # [zhammer::1266] ZFS userspace: zfs-2.2.3-1
zfs_2_2_3_stock # [zhammer::1266] ZFS kernel: zfs-kmod-2.2.3-1
zfs_2_2_3_stock # [zhammer::1266] Module: /run/booted-system/kernel-modules/lib/modules/6.7.9/extra/zfs.ko.xz
zfs_2_2_3_stock # [zhammer::1266] Srcversion: 62EA6EC3953A8524022765E
zfs_2_2_3_stock # [zhammer::1266] SHA256: 9f3dc6469b41a73ab0cf0c6c76c7ddce6e9ca0b1bc4da64f0e09206ee5d70f2b
zfs_2_2_3_stock # [zhammer::1266] ===
zfs_2_2_3_stock # [zhammer::1266] Work dir: /test
zfs_2_2_3_stock # [zhammer::1266] Count: 10000000 files
zfs_2_2_3_stock # [zhammer::1266] Block size: 16k
zfs_2_2_3_stock # [zhammer::1266] Check every: 5000 files
zfs_2_2_3_stock # [zhammer::1266] writing 5000 files at iteration 0
I'll update in case anything happens but it didn't repro immediately which is a good sign.
@thulle Are you able to reproduce this with zfs_bclone_enabled=0?
See #15994 for a possible explanation for cp stuckness when zfs_bclone_wait_dirty=1.
10,000,000 iterations of 16k blocks on each of 8 cores, and... no trouble.
(finished: must succeed: parallel --lb --halt-on-error now,fail=1 zhammer /test 10000000 16k 5000 ::: $(seq $(nproc)), in 81147.21 seconds)
(finished: run the VM test script, in 81156.74 seconds)
test script finished in 81156.79s
cleanup
kill machine (pid 9)
qemu-kvm: terminating on signal 15 from pid 6 (/nix/store/3v2ch16fkl50i85n05h5ckss8pxx6836-python3-3.11.8/bin/python3.11)
(finished: cleanup, in 0.08 seconds)
kill vlan (pid 7)
/nix/store/vxxi7mfvynzqpyavjszgqmjd5ni0l1xh-vm-test-run-zfs-zfs_2_2_3_stock
Is this another problem?
Yes.
So, not racing holes; probably very specifically pointing toward something else related to block cloning, or at least a second-order issue thereof, then. Maybe something ARC related?
I think this is zpl_clone_file_range_impl not flushing mapped pages before attempting to perform the clone. The file is miscopied if it has been synced out once, pages are dirty, and the kernel has not flushed any of those pages yet. (See zfs_holey_common for another code path that needs to flush dirty pages before examining the file.)
Here is a trivial reproducer:
#!/bin/bash
echo 1 >/sys/module/zfs/parameters/zfs_bclone_enabled
dd if=/dev/zero of=x bs=1M count=1 status=none
zpool sync
python -c "
from mmap import mmap
with open('x', 'r+b') as fh:
with mmap(fh.fileno(), 0) as m:
m[4000:5000] = b'z' * 1000
"
cp --reflink=always x y
diff -q x y
Output:
Files x and y differ
What's happening here? zpool sync flushes out the file and its block pointers; the mmap write then dirties pages in the page cache. None of those pages have been written back yet, so the file looks clean to the clone code, and the clone copies the last-synced state rather than the latest mmap'ed state. (Also note that merely reading the file without using mmap or clone will flush the dirty pages, hiding the bug.)
The golang build is a reproducer because it builds the toolchain multiple times in the work directory and overwrites the contents with mmap. This reliably triggers the sequence above.
The script above succeeds with no differences for zfs_bclone_enabled=0 or cp --reflink=never.
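The window described above can be demonstrated in miniature with plain Python: an explicit msync of the mapped range (mmap.flush() in Python, which wraps msync on Linux) pushes the dirty pages down before the copy, so there is nothing for a clone to miss. This is a generic POSIX-mmap sketch, not run against ZFS; the file layout mirrors the repro script's 1M file of zeros.

```python
import mmap
import os
import shutil
import tempfile

# Create a 1 MiB file of zeros, then overwrite part of it through mmap,
# mirroring the dd + python steps of the reproducer above.
path = os.path.join(tempfile.mkdtemp(), "x")
with open(path, "wb") as fh:
    fh.write(b"\0" * (1 << 20))

with open(path, "r+b") as fh:
    with mmap.mmap(fh.fileno(), 0) as m:
        m[4000:5000] = b"z" * 1000
        # Explicit msync: flush the dirty pages so no writeback is
        # pending when the copy (or, on ZFS, a clone) runs.
        m.flush()

shutil.copyfile(path, path + ".2")

with open(path, "rb") as a, open(path + ".2", "rb") as b:
    assert a.read() == b.read()
```

On ordinary page-cache-coherent filesystems the copy matches either way; the point of the sketch is only that the explicit flush removes the dirty-page window the clone code was missing.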
That tracks with what I privately found, which was that turning the conditional check for cached pages triggering a flush in zfs_holey_common into an unconditional flush made this bug vanish.
My theory when I last looked at this was that because we didn't hold the range lock until after we did this check, there was a nasty window for us to have passed that check, not trigger the flush, and then go bang. But attempting to move that check inside the range lock (in addition to causing fun with nested locks sometimes) seemed to still break where unconditional flushing did not, suggesting that there was more to it.
Then life got busy for the past few days and I haven't looked again since.
Another interesting question is whether this reproduces on FreeBSD. My theory for the issue I was observing (in my testing, at least, I ended up with everything past a certain point zeroed where it shouldn't have been) was that the file size was changing between when we read it and used it for the check, or between the check and when we actually did the probe, resulting in comedy. But since FreeBSD's mapping of the check doesn't actually use the range arguments, it might not have that problem, if that was indeed the problem.
Anyway, if someone wants to just slap a FIXME on this, I imagine you could do something like turning
/* Flush any mmap()'d data to disk */
if (zn_has_cached_data(zp, 0, file_sz - 1))
zn_flush_cached_data(zp, B_FALSE);
lr = zfs_rangelock_enter(&zp->z_rangelock, 0, UINT64_MAX, RL_READER);
error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
zfs_rangelock_exit(lr);
into
/* Flush any mmap()'d data to disk */
if (zn_has_cached_data(zp, 0, file_sz - 1) || (zfs_brt_force_flush))
zn_flush_cached_data(zp, B_FALSE);
lr = zfs_rangelock_enter(&zp->z_rangelock, 0, UINT64_MAX, RL_READER);
error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
zfs_rangelock_exit(lr);
with a new tunable plumbed in, and have a nice day for now. But this feels like the wrong layer: the problem is that a clone should be triggering that flush before it attempts the clone, not that everything else should flush before anything that might reference the file hits it later. (cf. #15772)
So, similar to how zfs_clone_range added:
error = dmu_brt_clone(outos, outzp->z_id, outoff, size, tx,
bps, nbps);
if (error != 0) {
dmu_tx_commit(tx);
break;
}
if (zn_has_cached_data(outzp, outoff, outoff + size - 1)) {
update_pages(outzp, outoff, size, outos);
}
after the clone, we probably want something like (untested):
if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
update_pages(inzp, inoff, size, inos);
}
nbps = maxblocks;
last_synced_txg = spa_last_synced_txg(dmu_objset_spa(inos));
error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
&nbps);
$ for i in `seq 1 50`;do sudo bash repro.sh;done;
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
Files x and y differ
^C
$ echo 1 | sudo tee /sys/module/zfs/parameters/zfs_brt_fix_this_mess
1
$ for i in `seq 1 50`;do sudo bash repro.sh;done;
$
Where that tunable is an if guard around the above patch.
Don't try it, probably.
This ameliorates the reproducer script listed, but trying the original report, of emerge go on Gentoo, produces mangled results more often with that proposed change, fascinatingly.
Wonder if what I'm seeing is this fix turning this test case into #15994.
Nah, just me rushing in trying things.
What I actually wanted was:
if (zn_has_cached_data(inzp, inoff, inoff + size - 1)) {
zn_flush_cached_data(inzp, B_FALSE);
}
nbps = maxblocks;
last_synced_txg = spa_last_synced_txg(dmu_objset_spa(inos));
error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
&nbps);
But I didn't think carefully about what update_pages did. I wanted to flush the cached pages, not rewrite them with the buffer's contents; that's the bad place.
...and if I do that in there, I deadlock on the rangelock. So let's move that around...
rrevans/zfs@ae5efbed360e9ba12ce38e5a66dfe14e0b12a77b fixes both emerge go (using @thulle's step-at-a-time testing script) and the reproducer script for me. I've run both in a loop for 30 minutes with no failures so far.
The dirty page flush happens before the rangelock in zfs_holey_common, and I'm just copying that approach.
I am not sure this is correct or sufficient, or whether there are other bclone/mmap gaps. (And it flushes the whole file when it could maybe flush only the cloned range.) That said, it does pass the ZTS sanity.run and bclone tests for me.
That doesn't fix it, for me, as I already said above.
Sorry, to be less terse: I wound up with https://github.com/rincebrain/zfs/commit/e55d3b03643f92500610e8e6066b59304e09b93b because you can't flush that under the rangelock without sometimes deadlocking, so there's going to be a window where the file might change between the flush and the lock again.
So I added a case for the check inside the lock as well, told it to retry once to hopefully catch more of the trivial cases of it being dirtied with inconvenient timing infrequently, and then fall back to not reflinking in the event that it's still dirty.
I confirmed that it sometimes did still reflink with that code (since I was slightly worried I might be missing some detail of how it worked that implied the check would always be true), and then pushed it to see whether the CI blew up.
It can probably be improved to avoid that window, but similarly, I could not produce incorrect behavior after hammering it for some time, and anything simpler still failed some of the time.
(Not, admittedly, very often, but "one in 50 runs" is still "not zero times".)
(Also, to be clear, I'm redoing my tests with your patch to see if my memory is faulty of having tried that, because I would certainly prefer a simpler solution...)
I'll need to look closer at the area, but a first thought: if it is impossible to reliably flush the caches, we could abort cloning in the zn_has_cached_data() case. I.e., we could flush the caches first before taking the lock, but if somebody is still writing the mmap()'ed region when we get the lock, give up completely. Though since we'd need to commit the txg before we'd be able to clone the file, we might just give up on it from the beginning and not bother to even flush the cache, leaving it to the fall-back copy routine.
@amotin +1
Looking more at this, I'm suspecting background writeback is a confounding factor (and then in turn also system load).
Staring at zn_flush_cached_data and zfs_putpage: if the pages are already in writeback by the time this is called, then it won't wait if sync == B_FALSE. It's then possible it returns before any pages get copied to the DMU/ARC state, and the file state looks clean. (@rincebrain's version of the patch uses sync == B_TRUE, which is more reliable vs. background WB.)
But note the fallback copy routine might seek holes and end up calling zn_flush_cached_data with sync == B_FALSE, which could in turn have the same problem? Maybe this explains why unconditional flushing in zfs_holey_common has some effect on the outcome?
p.s. I don't think it's correct to use rangelocks to coordinate with mmap writes, since page locking is entirely separate and mmap writes are inherently racy. IOW, if there are racing page writes there's no way to guarantee any outcome anyway, right? This is true even for filesystems that use the pagecache directly. That said, copying should always reliably observe page writes guaranteed to be dirtied before the clone starts (e.g. the build tool process wait is done before the clone begins).
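That happens-before guarantee can be illustrated outside ZFS: writes fully completed through mmap before the copy starts must be observed by the copy, while writes racing with it have no defined outcome. A small Python sketch of the "dirtied before clone starts" case follows; it runs against an ordinary page-cache-coherent filesystem (not ZFS), so here the copy is always correct, and the file and byte values are illustrative only.

```python
import mmap
import os
import shutil
import tempfile

# The ordering the build tools rely on: the mmap writer has completely
# finished (conceptually, the child process has been wait()ed on)
# before the copy starts, so its page writes happen-before the copy.
d = tempfile.mkdtemp()
src = os.path.join(d, "tool")
with open(src, "wb") as fh:
    fh.write(b"\0" * 65536)
    os.fsync(fh.fileno())            # contents synced out once

with open(src, "r+b") as fh, mmap.mmap(fh.fileno(), 0) as m:
    m[0:16] = b"rewritten header"    # dirty pages, no explicit msync
# The mapping is closed here: all writes are complete before the copy.

shutil.copyfile(src, src + ".2")

# A correct copy must observe the pre-copy mmap write.
with open(src + ".2", "rb") as fh:
    assert fh.read(16) == b"rewritten header"
```

The buggy bclone path violated exactly this: the write was complete before the clone began, yet the clone read block pointers that predated the page flush.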
@thulle Are you able to reproduce this with zfs_bclone_enabled=0?

It seems so, or at least something similar. I ran the go-building script overnight, and after 13h and ~500 compiles it couldn't loop anymore:
+ /usr/lib/go/bin/go build -o cmd/dist/dist ./cmd/dist
00x0x0x0x0x0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x<0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x00x0x0x0x0x0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x<0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x0x 0x 0x
* ERROR: dev-lang/go-1.22.1::gentoo failed (compile phase):
It seems like half the go-binary got zeroed out:
# hexdump -C /usr/lib/go/bin/go|tail
0065ffa0 16 28 99 00 00 00 00 00 20 f5 a0 00 00 00 00 00 |.(...... .......|
0065ffb0 20 00 00 00 00 00 00 00 23 64 99 00 00 00 00 00 | .......#d......|
0065ffc0 a0 f4 a0 00 00 00 00 00 30 00 00 00 00 00 00 00 |........0.......|
0065ffd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0065ffe0 40 00 00 00 00 00 00 00 28 00 00 00 00 00 00 00 |@.......(.......|
0065fff0 39 af b4 e3 07 08 08 19 00 00 00 00 00 00 00 00 |9...............|
00660000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00c18c20 00 00 00 00 00 00 00 00 |........|
00c18c28
Yeah, I don't think you can guarantee anything in the general case with true mmap direct-access semantics. Because we play games with indirection precisely to try to avoid things like users modifying a "live" buffer and causing exciting outcomes, we might be able to do better, but I haven't looked hard at how that code works, and it might only be true on Linux since FreeBSD is more or less a thin shim around UMA.
I haven't been able to reproduce what I said about it occasionally failing again with that patch so far, interestingly, but I wasn't kidding about how uncommonly it reproduced for me before, and I've had a number of changes where it looked like it helped and then "oops".
So personally, even if we don't opt for the "try twice" approach in my commit, I would encourage us to have the bailout case I have there, because I'd rather have a "this should never happen" check that's not used and we discard in a few years, than have to play games like this again in the next 6-12 months.
I would agree that having looked at this, it's almost certainly true that B_FALSE is the wrong thing in zfs_holey_common, even if in practice the window to demonstrate that is vanishingly small, but maybe @robn will have opinions on that.
e: Amusingly, @thulle, that's the way I've seen it fail sometimes with bclone_enabled=1, so perhaps I was just hitting the case you're seeing now.
How very...frustrating.
Looking deeper, it does look like zn_flush_cached_data(B_FALSE) may be unreliable; it's just hard to say whether that is the cause. It would be good to check whether setting it to B_TRUE fixes the go builds (either just in zfs_holey_common() with bclone_enabled = 0, or also in zfs_clone_range() as in https://github.com/rrevans/zfs/commit/ae5efbed360e9ba12ce38e5a66dfe14e0b12a77b). On the other side, it seems ZPL translates all B_TRUE/WB_SYNC_ALL flushes into ZIL commits, which would be much more expensive than we need for zfs_holey_common() and zfs_clone_range(). It would be nice to have something in between, but I don't see anything so far.
Another issue I see in the use of zn_has_cached_data() for zfs_holey_common() and zfs_clone_range() purposes: it only reports that some pages are cached, not that they are dirty. I suspect if we use it to abort cloning, we may get many unneeded aborts. Using it for zn_flush_cached_data(B_TRUE) should be safe, but I wonder if it may also cause ZIL commits even when there are no dirty pages. While looking through the Linux code I spotted filemap_range_needs_writeback(), which seems to report specifically whether we have any dirty or in-writeback pages; that is what we need for zfs_clone_range(), or even for zfs_holey_common() if zfs_dmu_offset_next_sync = 0.
@thulle Are you able to reproduce this with zfs_bclone_enabled=0?

It seems like so, or at least something similar. I ran the go-building script overnight and after 13h and ~500 compiles it couldn't loop anymore:
[...]
It seems like half the go-binary got zeroed out:
Just to make sure I understood you correctly: you were able to trigger silent data corruption even with the default setting of zfs_bclone_enabled=0?
If yes, is it possible to reproduce this behaviour on older versions like 2.1.x, too? Because then we'll have last year all over again.
Just to make sure I understood you correctly: you were able to trigger silent data corruption even with the default setting of zfs_bclone_enabled=0? If yes, is it possible to reproduce this behaviour on older versions like 2.1.x, too? Because then we'll have last year all over again.
Yes, though it took much longer.
Haven't tried it on 2.1, and I can't downgrade on the system I'm testing on since pool features are activated. I can try in a VM tonight, but considering the time it took to trigger with zfs_bclone_enabled=0, if just being in a VM makes it slightly rarer it might take weeks.
@Gendra13 I have been able to also reproduce with zfs_bclone_enabled=0 AND either sysctl vm.dirty_writeback_centisecs=100 vm.dirty_expire_centisecs=100 OR zfs_dmu_offset_next_sync=0. These tweaks trigger the underlying writeback race more often, but it's possible the default Linux writeback settings are enough to prevent the problem. @thulle if you do 2.1 testing in a VM, maybe try tweaking the writeback sysctls?
I can also reproduce the problem now with zfs_bclone_enabled=1 even with rrevans@ae5efbed360e9ba12ce38e5a66dfe14e0b12a77b, via the writeback race.
I'm still trying to reproduce it with default writeback and ZFS settings without bclone. With defaults, writeback is infrequent, and ZFS txg syncs due to lseek(SEEK_HOLE/SEEK_DATA) interfere with setting up that race.
(Unrelated TIL: cp --debug will print how it copied the file, which is handy for trying to repro this class of bug.)
I'm just going to add a tunable to kill non-dense seek results for now, as a mitigation.
It's obviously exceedingly rare in most use cases, but really, "exceedingly rare" doesn't mean "this is okay", to me, and that seems like a better plan than waiting for us to figure out all the edge cases.
@rincebrain I tried with the patch you posted. On loop 791, after ~22h:
# hexdump -C /usr/lib/go/bin/go|tail
0071ffa0 bb 5f 01 00 20 03 00 00 5a 00 00 00 00 00 00 00 |._.. ...Z.......|
0071ffb0 e5 5f 01 00 20 03 00 00 08 00 00 00 00 00 00 00 |._.. ...........|
0071ffc0 bb 5f 01 00 8c 00 00 00 5a 00 00 00 00 00 00 00 |._......Z.......|
0071ffd0 e5 5f 01 00 8c 00 00 00 08 00 00 00 00 00 00 00 |._..............|
0071ffe0 2d c1 01 00 16 00 00 00 72 00 00 00 00 00 00 00 |-.......r.......|
0071fff0 bd c0 01 00 51 00 00 00 22 00 00 00 00 00 00 00 |....Q...".......|
00720000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00c18c20 00 00 00 00 00 00 00 00 |........|
00c18c28
Yeah, I'm not completely astonished. Would you care to try with #16000?
I just realized that I've been running an unsupported kernel since 4 days ago: currently 6.8.0-gentoo-x86 + https://github.com/openzfs/zfs/pull/15931.
I assumed it's just a new interface, but worth noting. Building and trying #16000 now.
Would you care to try with #16000?
That resulted in some other issue: the initramfs hands over to the root-on-ZFS pool, OpenRC starts, it creates /run/openrc and /run/lock, and then freezes on:
* /run/lock: correcting owner
Both on 6.8.0 and 6.7.9.
Cute. Maybe I broke the world.
That'd be kind of weird though since the test suite seems to have mostly passed...
Regardless, I'll try and figure out why that's happening, assuming I can reproduce it here.
It doesn't seem broken for me on 6.7.6 in any way I've found so far, so I'll try building 6.8 and see.
Ah, yes, SEEK_SET. One moment.
Nah, I just missed the required semantics for SEEK_DATA. Silly me. I'll run the updated patch through more testing then push it.
...wild. I still got a mangled one with this patch.
Here's my reproducer:
#!/bin/bash
# kill background jobs on exit
trap 'trap - SIGTERM && kill -- -$$' SIGINT SIGTERM EXIT
tool="""
import mmap, random, sys
blk = 1<<17
with open(sys.argv[1], 'r+b') as fh, mmap.mmap(fh.fileno(), 0) as m:
    for o in range(0, len(m), blk):
        # N.B. last blk must be filled for bclone (see #15993)
        if random.randint(0, 99) < 10 or o+blk >= len(m):
            m[o:o+blk] = random.randbytes(blk)
        else:
            m[o:o+blk] = b'\0' * blk
"""
# write, copy, compare, and print an error on mismatch
write() {
    truncate --size=4M "${f?}"
    python3 -c "${tool}" "${f?}" &>/dev/null
    cpout=$(cp --debug "${f?}" "${f?}.2" 2>&1)
    diff=$(diff -q "${f?}" "${f?}.2")
    if (($?)); then
        echo "$diff" "($(echo "$cpout" | tail -1))"
    fi
}
# do copies for the same file in a loop
loop() {
    f="$1"
    while sleep 0.1; do
        write "${f?}"
    done
}
# run many copy loops at once
for i in $(seq 2000); do
    loop "x.${i?}" &
done
# trigger writeback flusher in a loop
while sleep 1; do
    sync -f .
done
This triggers the writeback flavor of this issue within 30 seconds for me most of the time regardless of tunables.
With zfs_bclone_enabled=0, zfs_dmu_offset_next_sync=1, I get this after 10 seconds:
Files x.495 and x.495.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.147 and x.147.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1132 and x.1132.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1387 and x.1387.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
With zfs_bclone_enabled=0, zfs_dmu_offset_next_sync=0, I get this after about 20 seconds:
Files x.1121 and x.1121.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.393 and x.393.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=0, zfs_dmu_offset_next_sync=1, I get this after about 20 seconds:
Files x.950 and x.950.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.457 and x.457.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1018 and x.1018.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.320 and x.320.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1604 and x.1604.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.554 and x.554.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.859 and x.859.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1031 and x.1031.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1470 and x.1470.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1187 and x.1187.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.271 and x.271.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=0, zfs_dmu_offset_next_sync=0, I get this after about 20 seconds:
Files x.1794 and x.1794.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1750 and x.1750.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1103 and x.1103.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1858 and x.1858.2 differ (copy offload: yes, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.1014 and x.1014.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.572 and x.572.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.544 and x.544.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1856 and x.1856.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=1, zfs_dmu_offset_next_sync=1, I get this almost immediately:
Files x.1507 and x.1507.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.900 and x.900.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1613 and x.1613.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.83 and x.83.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.312 and x.312.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
With zfs_bclone_enabled=1, zfs_bclone_wait_dirty=1, zfs_dmu_offset_next_sync=0, I get this almost immediately:
Files x.725 and x.725.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.651 and x.651.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.409 and x.409.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.545 and x.545.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.1191 and x.1191.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Files x.63 and x.63.2 differ (copy offload: unknown, reflink: yes, sparse detection: unknown)
Note that the cp debug output is telling us exactly which path is getting it wrong for each case: reflink: yes failures are bclone getting it wrong, and reflink: unsupported, sparse detection: SEEK_HOLE failures are zfs_holey_common getting it wrong on the fallback leg. (Edit: copy offload: yes cases are still using bclone when enabled.)
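Assuming output in the format above, a throwaway tally like this (hypothetical helper, not part of the reproducer) separates the two failure legs at a glance:

```python
import re
from collections import Counter

# Classify failing `cp --debug` summaries into the two legs described
# above: reflink: yes -> the bclone path returned stale blocks;
# reflink: unsupported with SEEK_HOLE sparse detection -> the
# zfs_holey_common fallback missed dirty data.
pat = re.compile(r'reflink: ([^,]+), sparse detection: ([^)]+)')

def classify(lines):
    counts = Counter()
    for line in lines:
        m = pat.search(line)
        if not m:
            continue
        reflink, sparse = m.groups()
        if reflink == 'yes':
            counts['bclone'] += 1
        elif sparse == 'SEEK_HOLE':
            counts['seek_hole_fallback'] += 1
    return counts
```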
So this is direct evidence that bclone without wait_dirty hits both cases as this thread has already hinted. (It also seems that wait_dirty makes this worse somehow with a lot more cases hit per second when I run those tests.)
This seems once again timing dependent, so try tuning the number of parallel writes if this doesn't work for you. Also, blk matches my dataset's recordsize of 128k, and I didn't try other values or recordsizes. YMMV. My exact setup is ZFS at rrevans@ae5efbe (DEBUG), local Fedora 39 VM (6.6.8-200.fc39.x86_64) with 32 vcpu, 64 GiB RAM, and an NVMe disk. I'm using default kernel writeback settings for these tests (vm.dirty_writeback_centisecs = 500, vm.dirty_expire_centisecs = 3000).
Lastly, note that the use of python3 -c in the reproducer seems to be required to trigger this; placing the same code in a script file on disk in /tmp will not reproduce the issue as readily, despite /tmp being a tmpfs on my system. This is a really sensitive case to try to reproduce.
Wonder if Python trying to compile the bytecode and write it out is doing something funny to timing without that.
I've got an improved version of the killswitch patch that I haven't broken yet that I'll push after giving it another 100 or 200 more rounds of "did it explode", as well as some experimenting with that reproducer.
What I think is happening, based on seeing failures even with that patch, is that we're racing the znode's file size being updated from the inode's file size. Since the hole logic always uses the znode's size, returning a stale one creates a narrow window where, externally, the file has been widened but the change hasn't been flushed back yet, so ZFS is out of sync with what the VFS layer thinks reality is.
That would also, to some extent, explain this being very finicky and not reproducing on FreeBSD, since we don't have this heavy pile of indirection between us and FreeBSD.
So I think doing i_size_read in zfs_holey would avoid this, but since that's a Linuxism, I'm currently testing with just moving the bailout logic under the zfs_enter and calling zfs_znode_update_vfs before we do anything, which should do something sane, I think?
If not, it'll certainly break more excitingly, and I'll evaluate different solutions.
FWIW - my repro is not changing the file size or unlinking inodes between passes, so the file size never changes after the first truncate, which also creates the file.
Also, I've been examining the diffs and they are (1) not only at the end, and (2) not recordsize-aligned (they are page-aligned, though).
Meanwhile, flushing with sync=B_TRUE is sufficient to stop the repro triggering for me without bclone, but bclone still triggers even with sync forced. (However, I can't repro if I change clone to bail out with EAGAIN in the presence of any mappings, dirty or otherwise. Doing that for both paths seems reasonable as a mitigation.)
I think that non-bclone is harder to hit because it would take all pages in a record hitting this at the same time to get seek to skip over it. And non-bclone reads are served from pagecache on mapped objects, so as long as cp tries to read the record it will be correct regardless of writeback.
Contrast with bclone which will copy the record directly via L0 bps. If we miss an in-progress writeback that should dirty the record, then contents will be wrong.
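The pagecache-coherence point above is easy to see in userspace. This is a minimal sketch illustrating the general VFS behavior on an ordinary Linux filesystem, not ZFS internals:

```python
import mmap, os, tempfile

# Dirty a page through a shared mapping and, without any flush or
# writeback, read the same range back via read(): on a unified
# pagecache the read is served from the dirty page, so it sees the
# new bytes even though nothing has reached disk yet.
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 4096)
    with mmap.mmap(fd, 4096) as m:
        m[:5] = b'dirty'            # no msync, no sync, nothing
        with open(path, 'rb') as f:
            seen = f.read(5)        # served from the (dirty) pagecache
finally:
    os.close(fd)
    os.unlink(path)
print(seen)  # b'dirty'
```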
Long story short, I think this is Linux writeback vs. zpl pagecache sync failing to reliably collect all dirty pages at sync time when bclone flushes, but I don't know why. I think it only happens to work without bclone because pages are smaller than records making it harder to hit. Otherwise there's no reason sync=TRUE fixes one path but not the other; IOW I think it's not a sufficient fix for either despite apparently fixing the non-bclone case.
Which means the culprit lies in writeback syncing. The best theory I have is that some page gets background writeback during the time userland is updating the mapped pages and so is immediately redirtied, but this sequence somehow causes the later sync flush to fail to block on the redirtied page.
From a cursory read of Linux writeback, I don't yet see how that could happen. write_cache_pages won't wait for pages it didn't tag on the sync leg, but __writeback_single_inode will do fdatawait on the whole mapping, which should block on any writeback started by that point. But this code is very difficult to follow because ZFS writeback has a lot of special handling for Linux. Are there any pagecache writeback experts who can help look here?
Entirely separately, neither bclone nor seek really needs to flush. Flushing is just how ZFS tries to convey dirty page state down to the DMU so the record dirty state matches, and it's really tricky to lean on synchronous writeback for this. IMO ZFS should maybe either inspect the mm state to see if any mapped pages for a specific record are dirty, or track the dirtied pages with page_mkwrite etc., instead of asking for sync on those pages just to detect dirtiness.
But then again writeback sync being unreliable is probably a differently serious bug because then it calls into question whether or not msync and friends are actually reliable for data integrity writes over mappings in the presence of this writeback race...
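For context, the msync-and-friends pattern whose reliability that would call into question is just the following (minimal Python sketch; mmap.flush() is msync under the hood):

```python
import mmap, os, tempfile

# The integrity pattern at stake: dirty pages through a mapping, then
# msync (mmap.flush) to request the data be durably written. The
# writeback race under discussion would mean a sync flush could
# complete while some dirtied pages were silently left behind.
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 4096)
    with mmap.mmap(fd, 4096) as m:
        m[:4] = b'ZFS!'
        m.flush()                   # msync(MS_SYNC) on the whole map
    with open(path, 'rb') as f:
        data = f.read(4)            # independent read back of the file
finally:
    os.close(fd)
    os.unlink(path)
print(data)  # b'ZFS!'
```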
I was trying to work with what's in front of me, and it was reproducing at recordsize granularity on the end for me, so I was trying to ameliorate that consistently and see what was left.
I've since moved on to seeing if filemap_write_and_wait_range will make it stop. And yes, rearchitecting how it handles this to not be indirect might work, though for at least block cloning that could be fun to reason about, since you're cloning things that aren't necessarily synced out yet and the whole design is around "clone existing block".
I did a few test runs with the reproducer from @rrevans on Ubuntu 22.04. All of the ZFS kernel parameters were left at their default settings (e.g. zfs_dmu_offset_next_sync=1).
TL;DR: Yes, this bug can be triggered on 2.1.x if you are using the new coreutils. EDIT2: Unfortunately, after more tests, I was able to trigger this with both zfs_dmu_offset_next_sync=0 and =1. See my later post.
(was before: In my additional tests I wasn't able to trigger the bug with zfs_dmu_offset_next_sync=0 on v2.1.x)
zfs-kmod-2.1.5-1ubuntu6~22.04.2
(that's the old one without the dnode fix since Ubuntu's bug fixes are slooooow)
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 8.32
diff: x.12.2: No such file or directory
(Try 'cp --help' for more information.)
diff: x.267.2: No such file or directory
diff: x.1143.2: No such file or directory
diff: x.747.2: No such file or directory
zfs-kmod-2.1.5-1ubuntu6~22.04.2
(that's the old one without the dnode fix since Ubuntu's bug fixes are slooooow)
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4
Files x.206 and x.206.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.244 and x.244.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.97 and x.97.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.504 and x.504.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.286 and x.286.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
zfs-kmod-2.1.5-1ubuntu6~22.04.3
(that one has the dirty dnode fix backported)
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4
Files x.657 and x.657.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.201 and x.201.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.374 and x.374.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
zfs-kmod-2.1.15-1
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4
Files x.1016 and x.1016.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.346 and x.346.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.415 and x.415.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.616 and x.616.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
Files x.450 and x.450.2 differ (copy offload: unknown, reflink: unsupported, sparse detection: SEEK_HOLE)
So in my tests I was able to reproduce the bug even with a rather old kernel version (5.15) and running 2.1.x without any of the block-cloning code baked in.
I did an additional test with 2.1.15 but setting zfs_dmu_offset_next_sync=0, and then I could run the reproducer.sh for at least 20 min without triggering the bug (the other cases failed within a few seconds).
ZFS-Version: zfs-kmod-2.1.15-1
uname -r: 5.15.0-100-generic
cp --version: cp (GNU coreutils) 9.4
See my later post where I was able to reproduce this even with zfs_dmu_offset_next_sync=0 on 2.1.15.
I'm not surprised, but thanks for the confirmation.
My current plan, having tried a bunch of other things:
generic_file_llseek hasn't broken for me, so I'm going to be aggressively testing a branch that changes nothing but that, and confirm that having it do that doesn't break anything - if so, I'll suggest people test that more broadly.
System information
Describe the problem you're observing
After seeing 2.2.3 released and #15526 closed I enabled blockcloning again and tried my unfailing reproducer: compiling go. It still seems to trigger the/some bug.
Worth noting is that I'm running on top of LUKS. Scrub reports no issues.
Describe how to reproduce the problem