terinjokes closed this issue 7 months ago:
I've been unable to reproduce after upgrading to 2.2.1 with the tunable set to disabled.
I was able to come up with a simple reproducer script based on @rincebrain's comment https://github.com/openzfs/zfs/issues/15554#issuecomment-1822154030. The script is here: reproducer.sh.
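For anyone who can't grab the attachment, the script is essentially the following (a sketch reconstructed from the FreeBSD-adapted copy quoted later in this thread, with the Linux-only --reflink=never flag restored):

```bash
#!/bin/bash
# Write one 1M random file, then copy it repeatedly (alternating a plain cp
# with cp --reflink=never) and diff every copy against the original.
prefix="reproducer_${BASHPID}_"
dd if=/dev/urandom of=${prefix}0 bs=1M count=1 status=none

echo "writing files"
end=1000
h=0
for i in `seq 1 2 $end` ; do
    let "j=$i+1"
    cp ${prefix}$h ${prefix}$i
    cp --reflink=never ${prefix}$i ${prefix}$j
    let "h++"
done

echo "checking files"
for i in `seq 1 $end` ; do
    diff ${prefix}0 ${prefix}$i
done
```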
Inside your pool mount, spawn off multiple copies of the script in parallel (3-4 copies worked for me):
fedora39:$ cd /tank
fedora39:/tank$ ~/reproducer.sh & ~/reproducer.sh & ~/reproducer.sh & ~/reproducer.sh && wait
[1] 194921
[2] 194922
[3] 194923
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
Binary files reproducer_194963_0 and reproducer_194963_388 differ
Binary files reproducer_194963_0 and reproducer_194963_777 differ
Binary files reproducer_194963_0 and reproducer_194963_778 differ
[2]- Done ~/reproducer.sh
[3]+ Done ~/reproducer.sh
[1]+ Done rm -f * && ~/reproducer.sh
fedora39:/tank$ hexdump reproducer_194963_0 | head -n 5
0000000 50b4 8a6a 77a8 681f d35f 061a 3a16 1587
0000010 73cc c42b f481 00b0 8ef7 e3ea f741 c5ec
0000020 3648 ee57 c0b8 3fcb 1cd4 992d 9e5a dc7e
0000030 3f6c b3e3 7359 59fd 0c86 600a eede 7c49
0000040 31c7 94bd 9936 3c46 9952 b0da 9caf 2c66
fedora39:/tank$ hexdump reproducer_194963_777 | head -n 5
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
Note that it will create tons of reproducer_*_* data files in there, so run it from a junk directory.
:arrow_up: This was running 2.2.1 on Fedora 39 with /sys/module/zfs/parameters/zfs_bclone_enabled=1.
But will it repro if you apply https://github.com/rincebrain/zfs/commit/3f9688eb36023f6f69b98ffbc30267ba24d33ad8 ? :)
@rincebrain unfortunately yes :disappointed:
Even better. By which I mean worse, of course.
I have a nastier solution, I suppose, since I didn't really think that alone would fix it, but I'll try that and report back.
@rincebrain hate to say it, but looks like it fails with /sys/module/zfs/parameters/zfs_bclone_enabled = 0
...
fedora39:/tank$ ~/reproducer-nocheck.sh & ~/reproducer-nocheck.sh & ~/reproducer-nocheck.sh & ~/reproducer-nocheck.sh & wait
[1] 304410
[2] 304411
[3] 304412
[4] 304413
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
Binary files reproducer_304412_0 and reproducer_304412_620 differ
[1] Done ~/reproducer-nocheck.sh
[4]+ Done ~/reproducer-nocheck.sh
[2]- Done ~/reproducer-nocheck.sh
[3]+ Done ~/reproducer-nocheck.sh
fedora39:/tank$ cat /sys/module/zfs/parameters/zfs_bclone_enabled
0
fedora39:/tank$ hexdump reproducer_304412_0 | head -n 5
0000000 9683 56ac ba05 5c35 ef52 4ecd 5fc9 0c39
0000010 2b5c d795 8e0d 49dd a9f7 dd67 6af5 9cab
0000020 c87a 04ec be89 ae9a f45f b84d a2fe bc9b
0000030 ef0c 2da9 7f44 95ce f6ac 1297 09f1 2df5
0000040 fb54 cad0 7a73 d34d f048 9c68 3ebe a988
fedora39:/tank$ hexdump reproducer_304412_620 | head -n 5
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000
fedora39:/tank$ sudo ~/zfs/zpool get all | grep bclone
tank bcloneused 0 -
tank bclonesaved 0 -
tank bcloneratio 1.00x -
So maybe it's not block cloning? I'll see if I can bisect master to the bad commit.
I'd be both happy and upset if it's not block cloning causing people issues. How...unfortunate.
That reproducer could never use BRT anyway unless you're using a coreutils newer than 9.0, which you might be, I didn't ask about the environment you're in.
I'm using Fedora 39 with coreutils-9.3-4.fc39.x86_64
If it's not BRT related, which would be nice but also bad, then I'd just go back to #11900 again, and assume the dirty check is failing to trigger even if it's not BRT related, and look at any delta that touches dirty state as a starting point to bisect around.
Script doesn't repro here. Linux 5.10.170, coreutils 9.1, on both memory-backed and file-backed raidz1, with and without cloning enabled. When enabled, clones are certainly generated.
If it's a race, as it seems, then it's probably not surprising that we get different results; it's gonna be sensitive to local performance characteristics.
Cannot reproduce with emerge dev-lang/go after 4 tries with 2.2.1 and /sys/module/zfs/parameters/zfs_bclone_enabled = 0.
Also cannot reproduce with @tonyhutter's script: for i in {1..8} ; do ./reproducer.sh & done
I'm able to reproduce in Fedora 37 with 6.5.11 kernel, coreutils 9.1 and zfs-2.1.13. So doesn't look like this is 2.2.x only.
Got it! I am able to trigger the bug nearly every time with 8 reproducer.sh instances running (7950X here + zpool on 3x NVMe modules in RAID-Z1):
I did not "zpool upgrade" from OpenZFS 2.1.13, so no 2.2.x feature flag is active on the zpool now (no block cloning, no BLAKE3 checksum, etc). So far, I have noticed absolutely nothing under normal daily usage, as my Gentoo box has /var/tmp/portage on tmpfs: no errors in scrubs, no abnormal crashes due to corrupt binaries, etc.
Kernel 6.6.2, coreutils 9.4, glibc 2.38.
Spicy.
Everything is awful forever.
Gentlemen, I am also able to trigger it on a TrueNAS Core storage box..... :( 8x rusty platter HDD with no ZIL in a RAID-Z2 layout.
# freebsd-version
13.1-RELEASE-p7
# zfs version
zfs-2.1.11-1
zfs-kmod-v2023072100-zfs_0eb787a7e
EDIT: tested twice with 16 instances running, bug triggered twice.
I wonder if it's the case that there's a bug with block cloning and another bug, and so disabling block cloning closes a very big window but not the other reason this can happen? Oy.
I feel fairly confident from code reads that there are still some locking problems in the block cloning code, but nothing I've been able to reproduce yet (and some other places in ZFS, if I'm honest). So I wouldn't say that there isn't a block cloning bug, or that we're not hitting it here (absence of evidence etc).
However, a clone is much faster than a content copy, so it's possible that just the timing differences are making it easier to hit the different bug. As noted, I couldn't reproduce it at all, but that's in a development VM that has some quite different latency characteristics to any real computer.
I guess I'm saying, I'd track down the other one first, then once that's nailed, retest the original bug here with cloning enabled and see what shakes out.
@tonyhutter holler if you need extra eyeballs, rubber duck, etc.
@robn yea any help would be great. Right now let's try to bisect it down to a specific commit.
An update to my previous update, which might be a bit late as of the past few minutes: I can reproduce with @tonyhutter's script with zfs_bclone_enabled = 0, despite my original test case no longer consistently reproducing.
Can't reproduce with tonyhutter's script, even with 24 instances. 16-HDD pool with 2 raidz2 vdevs. ZFS 2.1.13, kernel 5.15.139, cp 8.32.
Same system as 20 minutes ago (2.2.1, zfs_bclone_enabled = 0, Linux 6.5.12-gentoo-dist), but I downgraded coreutils to 8.32 to match the previous poster and I also can't reproduce.
As a data point, with zfs_dmu_offset_next_sync=0, I can no longer make the reproducer script reproduce on 2.1.x with coreutils 9.x in several thousand runs, when it reproduces within 2 or 3 runs with it =1.
So for those of us on 2.1.x, that might be useful.
Quick update - I've been testing on AlmaLinux 8 (RHEL clone) w/4.18 kernel using a custom-installed coreutils 9.1 from Fedora 37 (since Alma 8 comes with coreutils 8 by default). I can NOT reproduce on zfs-2.1.2, but can reproduce on zfs 2.1.5. Still bisecting...
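(For anyone following along, the bisect loop between those two endpoints looks roughly like this; the tag names are the release tags in the openzfs/zfs repo, and the build/load/test step is whatever your environment needs:)

```bash
# mark the known-bad and known-good endpoints from the tests above
git bisect start zfs-2.1.5 zfs-2.1.2
# at each step: build ZFS, load the module, run the reproducer, then either
git bisect bad     # reproducer printed "Binary files ... differ"
git bisect good    # reproducer output was clean
```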
Remember there was a bug with this fixed back then, Tony. Twice, even.
@rincebrain your data point was correct - it bisected down to 9f6943504aec36f897f814fb7ae5987425436b11 "Default to zfs_dmu_offset_next_sync=1" in the zfs-2.1-release branch. That commit got pulled into zfs-2.1.4. That same commit is 05b3eb6d2 in master.
I also tried setting /sys/module/zfs/parameters/zfs_dmu_offset_next_sync = 0 using that same commit, and could not reproduce the error.
So setting /sys/module/zfs/parameters/zfs_dmu_offset_next_sync = 0 might be the workaround for now.
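(For reference, a sketch of applying that workaround on Linux; the sysfs path is from this thread, and the modprobe.d stanza is the usual way to persist a ZFS module parameter across reboots:)

```bash
# apply at runtime
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# persist across module reloads/reboots
echo 'options zfs zfs_dmu_offset_next_sync=0' | sudo tee -a /etc/modprobe.d/zfs.conf
```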
I'm really fascinated because I swear we couldn't originally reproduce this with that =1 then, or I would have been very vociferous about turning it off again...
...plus, the original reason we found this was a lot of people having issues on Gentoo with the aforementioned emerge case, so I'm not entirely certain the original turning on of that tunable is why it's ruining everyone's day now, at least.
As a side note, we may want to try the reproducer using a variety of file sizes. When I test with the 1MB file size in the reproducer, I see all zeros for the bad files. I'm hoping that's the case for all corrupted files (all zeros), which would be easy to detect on existing datasets.
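(If that heuristic held, a quick scan for fully zeroed files would look something like the sketch below; note the very next comment suggests the corruption is not always all zeros, so treat this as a best-effort check only. GNU stat and cmp assumed.)

```bash
# flag regular files consisting entirely of zero bytes
find /tank -type f -size +0 -print0 | while IFS= read -r -d '' f; do
    size=$(stat -c %s "$f")
    cmp -s -n "$size" "$f" /dev/zero && echo "all zeros: $f"
done
```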
I would expect that the problem is not all zeroes, based on the gentoo reports saying the resulting binaries were "data" not "empty", and that this is just a consequence of touching the whole file at once, basically.
In particular, when 519851122b1703b8445ec17bc89b347cea965bb9 went in and got cherrypicked into 2.1.10 as https://github.com/openzfs/zfs/commit/4b3133e671b958fa2c915a4faf57812820124a7b, we IMMEDIATELY found horrible issues with this cropping up again, which is why 2.1.11 happened.
So I'm very surprised if something hasn't changed to make this crop up more, now, than it did previously, and I don't immediately know what that would be.
Well, the shape of the problem is that a dnode is not marked dirty when it should be, right?
I can totally believe that there is still a place where that was happening in 2.1, that still exists and was just incredibly hard to hit. Because I'm still reasonably sure that the case I mentioned in dmu_buf_will_clone() is real too.
So maybe the original post in this issue is triggered by the cloning case, which is why turning it off helped there. And maybe Tony wrote a test that just happened to tickle the other kind?
I still can't reproduce any of it anywhere, so I can imagine it's pretty sensitive to timing.
(Through this I've got some ideas for how we might be able to detect when we're unsafely undirtying a dnode, to try and identify any and all comers. No time to poke at that before the weekend though, so don't wait for me.)
Same here, zfs_dmu_offset_next_sync=0 => no more complaints, both with Linux and FreeBSD.
Perhaps a naive question, but in the case of ZFS acting as a Lustre backend, are there any potential corruption issues as well if zfs_dmu_offset_next_sync=1, or is this issue specific to ZFS filesystem datasets?
There's a little theory brewing over here. Request for information: if you're on Linux, and you hit the original bug, or tried the reproducer, could you please post the version of coreutils you have, and whether or not you hit the problem? Thanks!
(this is not to say coreutils is at fault; that this happens on FreeBSD proves that. coreutils has changed when and why it tries to detect holes multiple times in the 9.x series, and narrowing that down may help us see what's happening a little easier).
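(The quickest way to report that is probably:)

```bash
cp --version | head -n1    # e.g. "cp (GNU coreutils) 9.3"
```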
> Perhaps a naive question, but in the case of ZFS acting as a Lustre backend, are there any potential corruption issues as well if zfs_dmu_offset_next_sync=1, or is this issue specific to ZFS filesystem datasets?
@admnd I can only give you a pointer towards the answer: if Lustre calls the function zfs_holey(), then possibly - it's exported from zfs.ko, but I don't have the Lustre source nearby to look. For Linux and FreeBSD, it's used in the implementation of lseek().
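(For anyone with a Lustre tree handy, the check is a one-liner; @admnd does exactly this a few comments down. The directory name here is illustrative.)

```bash
# look for any caller of the exported ZFS symbol
grep -rn 'zfs_holey' lustre-2.15.3/
```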
Hi @robn, I tried the reproducer and was able to reproduce the bug.
Some details about my setup:
$ uname -a
Linux nas 6.6.2-gentoo #1 SMP Wed Nov 22 15:12:20 PST 2023 x86_64 AMD Ryzen 5 5600X 6-Core Processor AuthenticAMD GNU/Linux
$ zfs --version
zfs-2.2.1-r0-gentoo
zfs-kmod-2.2.1-r0-gentoo
$ equery l coreutils
* Searching for coreutils ...
[IP-] [ ] sys-apps/coreutils-9.3-r3:0
$ zpool status
pool: data
state: ONLINE
status: Some supported and requested features are not enabled on the pool.
The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(7) for details.
scan: scrub repaired 0B in 05:31:00 with 0 errors on Tue Nov 21 23:31:00 2023
config:
NAME STATE READ WRITE CKSUM
data ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
wwn-0x5000c500dcafce99 ONLINE 0 0 0
wwn-0x5000c500dc3b961c ONLINE 0 0 0
wwn-0x5000c500dcb07258 ONLINE 0 0 0
wwn-0x5000c500dc065425 ONLINE 0 0 0
wwn-0x5000c500e340bf49 ONLINE 0 0 0
errors: No known data errors
$ zpool get all data
NAME PROPERTY VALUE SOURCE
data size 72.8T -
data capacity 18% -
data altroot - default
data health ONLINE -
data guid 250402256521350630 -
data version - default
data bootfs - default
data delegation on default
data autoreplace off default
data cachefile - default
data failmode wait default
data listsnapshots off default
data autoexpand off default
data dedupratio 1.00x -
data free 59.0T -
data allocated 13.7T -
data readonly off -
data ashift 12 local
data comment - default
data expandsize - -
data freeing 0 -
data fragmentation 0% -
data leaked 0 -
data multihost off default
data checkpoint - -
data load_guid 1997265464620489251 -
data autotrim off default
data compatibility off default
data bcloneused 0 -
data bclonesaved 0 -
data bcloneratio 1.00x -
data feature@async_destroy enabled local
data feature@empty_bpobj active local
data feature@lz4_compress active local
data feature@multi_vdev_crash_dump enabled local
data feature@spacemap_histogram active local
data feature@enabled_txg active local
data feature@hole_birth active local
data feature@extensible_dataset active local
data feature@embedded_data active local
data feature@bookmarks enabled local
data feature@filesystem_limits enabled local
data feature@large_blocks enabled local
data feature@large_dnode enabled local
data feature@sha512 active local
data feature@skein enabled local
data feature@edonr enabled local
data feature@userobj_accounting active local
data feature@encryption active local
data feature@project_quota active local
data feature@device_removal enabled local
data feature@obsolete_counts enabled local
data feature@zpool_checkpoint enabled local
data feature@spacemap_v2 active local
data feature@allocation_classes enabled local
data feature@resilver_defer enabled local
data feature@bookmark_v2 enabled local
data feature@redaction_bookmarks enabled local
data feature@redacted_datasets enabled local
data feature@bookmark_written enabled local
data feature@log_spacemap active local
data feature@livelist enabled local
data feature@device_rebuild enabled local
data feature@zstd_compress active local
data feature@draid enabled local
data feature@zilsaxattr disabled local
data feature@head_errlog disabled local
data feature@blake3 disabled local
data feature@block_cloning disabled local
data feature@vdev_zaps_v2 disabled local
Reproducer output:
$ ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & wait
[1] 195954
[2] 195955
[3] 195956
[4] 195957
[5] 195958
[6] 195959
[7] 195960
[8] 195961
[9] 195962
[10] 195963
[11] 195964
[12] 195965
[13] 195969
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_195954_0 and reproducer_195954_74 differ
Binary files reproducer_195954_0 and reproducer_195954_149 differ
Binary files reproducer_195954_0 and reproducer_195954_150 differ
Binary files reproducer_195954_0 and reproducer_195954_299 differ
Binary files reproducer_195954_0 and reproducer_195954_300 differ
Binary files reproducer_195954_0 and reproducer_195954_301 differ
Binary files reproducer_195954_0 and reproducer_195954_302 differ
Binary files reproducer_195954_0 and reproducer_195954_599 differ
Binary files reproducer_195954_0 and reproducer_195954_600 differ
Binary files reproducer_195954_0 and reproducer_195954_601 differ
Binary files reproducer_195954_0 and reproducer_195954_602 differ
Binary files reproducer_195954_0 and reproducer_195954_603 differ
Binary files reproducer_195954_0 and reproducer_195954_604 differ
Binary files reproducer_195954_0 and reproducer_195954_605 differ
Binary files reproducer_195954_0 and reproducer_195954_606 differ
[1] Done ./reproducer.sh
[6] Done ./reproducer.sh
[8] Done ./reproducer.sh
[11] Done ./reproducer.sh
[12]- Done ./reproducer.sh
[2] Done ./reproducer.sh
[3] Done ./reproducer.sh
[7] Done ./reproducer.sh
[9] Done ./reproducer.sh
[10]- Done ./reproducer.sh
[4] Done ./reproducer.sh
[5]- Done ./reproducer.sh
[13]+ Done ./reproducer.sh
Ah, one more bit of info. I reproduced the problem above with zfs_bclone_enabled=1 zfs_dmu_offset_next_sync=1.
However, with zfs_bclone_enabled=0 zfs_dmu_offset_next_sync=1 it does not reproduce for me any longer.
zfs_bclone_enabled=0, in my limited observations, makes it much harder to hit a bug like this, but it appears there may be multiple bugs like this, one of which is much harder to hit but still possible, with coreutils >= 9 and zfs_dmu_offset_next_sync=1. I would strongly advise, based on my current incomplete understanding, that you use zfs_dmu_offset_next_sync=0, as that seems to be a complete avoidance of the problem, to the best of my ability to reproduce it at this time.
This advice is subject to change with new information, but that's the best understanding I've got at the moment.
Understood, I plan to leave both disabled. I enabled them briefly just to try the reproducer, since @robn wanted data points about reproductions vs. coreutils version.
> There's a little theory brewing over here. Request for information: if you're on Linux, and you hit the original bug, or tried the reproducer, could you please post the version of coreutils you have, and whether or not you hit the problem? Thanks!
> (this is not to say coreutils is at fault; that this happens on FreeBSD proves that. coreutils has changed when and why it tries to detect holes multiple times in the 9.x series, and narrowing that down may help us see what's happening a little easier).
I had sys-apps/coreutils-9.3-r3 when initially reporting; the latest tests have been done on sys-apps/coreutils-9.4. Also: the kernel was previously 6.1 with an uptime of a few months; now the box is on kernel 6.6 with an uptime of a few hours.
pool: B100
state: ONLINE
scan: scrub repaired 0B in 00:03:37 with 0 errors on Wed Nov 1 00:47:39 2023
config:
NAME STATE READ WRITE CKSUM
B100 ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme-SAMSUNG_MZQL21T9HCJR-00A07_S64GNN0W204179-part5 ONLINE 0 0 0
nvme-SAMSUNG_MZQL21T9HCJR-00A07_S64GNN0W204173-part5 ONLINE 0 0 0
NAME PROPERTY VALUE SOURCE
B100 size 1.64T -
B100 capacity 18% -
B100 altroot - default
B100 health ONLINE -
B100 guid 10713805351153124334 -
B100 version - default
B100 bootfs - default
B100 delegation on default
B100 autoreplace off default
B100 cachefile none local
B100 failmode wait default
B100 listsnapshots off default
B100 autoexpand off default
B100 dedupratio 1.00x -
B100 free 1.34T -
B100 allocated 312G -
B100 readonly off -
B100 ashift 12 local
B100 comment - default
B100 expandsize - -
B100 freeing 0 -
B100 fragmentation 6% -
B100 leaked 0 -
B100 multihost off default
B100 checkpoint - -
B100 load_guid 7239006034030037162 -
B100 autotrim on local
B100 compatibility off default
B100 bcloneused 8.13M -
B100 bclonesaved 8.18M -
B100 bcloneratio 2.00x -
B100 feature@async_destroy enabled local
B100 feature@empty_bpobj active local
B100 feature@lz4_compress active local
B100 feature@multi_vdev_crash_dump enabled local
B100 feature@spacemap_histogram active local
B100 feature@enabled_txg active local
B100 feature@hole_birth active local
B100 feature@extensible_dataset active local
B100 feature@embedded_data active local
B100 feature@bookmarks enabled local
B100 feature@filesystem_limits enabled local
B100 feature@large_blocks enabled local
B100 feature@large_dnode active local
B100 feature@sha512 enabled local
B100 feature@skein enabled local
B100 feature@edonr enabled local
B100 feature@userobj_accounting active local
B100 feature@encryption enabled local
B100 feature@project_quota active local
B100 feature@device_removal enabled local
B100 feature@obsolete_counts enabled local
B100 feature@zpool_checkpoint enabled local
B100 feature@spacemap_v2 active local
B100 feature@allocation_classes enabled local
B100 feature@resilver_defer enabled local
B100 feature@bookmark_v2 enabled local
B100 feature@redaction_bookmarks enabled local
B100 feature@redacted_datasets enabled local
B100 feature@bookmark_written enabled local
B100 feature@log_spacemap active local
B100 feature@livelist active local
B100 feature@device_rebuild enabled local
B100 feature@zstd_compress active local
B100 feature@draid enabled local
B100 feature@zilsaxattr active local
B100 feature@head_errlog active local
B100 feature@blake3 enabled local
B100 feature@block_cloning active local
B100 feature@vdev_zaps_v2 active local
B102 size 5.19T -
B102 capacity 35% -
B102 altroot - default
B102 health ONLINE -
B102 guid 11661773680785260975 -
B102 version - default
B102 bootfs - default
B102 delegation on default
B102 autoreplace off default
B102 cachefile none local
B102 failmode wait default
B102 listsnapshots off default
B102 autoexpand off default
B102 dedupratio 1.00x -
B102 free 3.34T -
B102 allocated 1.85T -
B102 readonly off -
B102 ashift 12 local
B102 comment - default
B102 expandsize - -
B102 freeing 0 -
B102 fragmentation 2% -
B102 leaked 0 -
B102 multihost off default
B102 checkpoint - -
B102 load_guid 9340586567662951585 -
B102 autotrim on local
B102 compatibility off default
B102 bcloneused 0 -
B102 bclonesaved 0 -
B102 bcloneratio 1.00x -
B102 feature@async_destroy enabled local
B102 feature@empty_bpobj active local
B102 feature@lz4_compress active local
B102 feature@multi_vdev_crash_dump enabled local
B102 feature@spacemap_histogram active local
B102 feature@enabled_txg active local
B102 feature@hole_birth active local
B102 feature@extensible_dataset active local
B102 feature@embedded_data active local
B102 feature@bookmarks enabled local
B102 feature@filesystem_limits enabled local
B102 feature@large_blocks enabled local
B102 feature@large_dnode active local
B102 feature@sha512 enabled local
B102 feature@skein enabled local
B102 feature@edonr active local
B102 feature@userobj_accounting active local
B102 feature@encryption enabled local
B102 feature@project_quota active local
B102 feature@device_removal enabled local
B102 feature@obsolete_counts enabled local
B102 feature@zpool_checkpoint enabled local
B102 feature@spacemap_v2 active local
B102 feature@allocation_classes enabled local
B102 feature@resilver_defer enabled local
B102 feature@bookmark_v2 enabled local
B102 feature@redaction_bookmarks enabled local
B102 feature@redacted_datasets enabled local
B102 feature@bookmark_written enabled local
B102 feature@log_spacemap active local
B102 feature@livelist enabled local
B102 feature@device_rebuild enabled local
B102 feature@zstd_compress active local
B102 feature@draid enabled local
B102 feature@zilsaxattr enabled local
B102 feature@head_errlog active local
B102 feature@blake3 active local
B102 feature@block_cloning enabled local
B102 feature@vdev_zaps_v2 active local
Checking out reproducer.sh on macOS, just to see if I can trigger it there.
Interestingly, it does not work as-is:
cp: reproducer__1: clonefile failed: Resource temporarily unavailable
which is returned from:
```c
error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps, &nbps);
if (error != 0) {
    /*
     * If we are trying to clone a block that was created
     * in the current transaction group, error will be
     * EAGAIN here, which we can just return to the caller
     * so it can fallback if it likes.
     */
    break;
```
So I have to do this:

```diff
 let "j=$i+1"
+zpool sync
 cp ${prefix}$h ${prefix}$i
 cp --reflink=never ${prefix}$i ${prefix}$j
```
and I cannot trigger the issue. Could a Linux chap try adding zpool sync, just to rule that out, so we can resume digging deeper?
@lundman I think your problem is going to be on the other side: whatever your cp is doing, it's not doing the equivalent of copy_file_range, that is, try to clone, then fall back to content copy.
zpool sync is probably just gonna make everything work nice, or at least much harder to hit, because the whole thing is about dirty buffers that appear clean, and sync is gonna clean them all for reals? Hmm, maybe. In any case, I can't reproduce it myself, but I reckon you're better off getting your cp to do the right thing so it's at least comparable with the other tests we've done today.
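(On Linux, coreutils cp exposes that choice directly, which is why the reproducer exercises both paths; for illustration, on a clone-capable dataset:)

```bash
cp --reflink=always src dst   # clone or fail; no content-copy fallback
cp --reflink=auto   src dst   # try to clone, fall back to a content copy
cp --reflink=never  src dst   # always do a plain content copy
```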
It won't fix the other half (whatever was happening in 2.1), but it's something, and maybe I've got a bit more of a feel for the shape of it. Not sure I'll have any time in the next few days to look further (got a day of work to catch up on and some conference submissions to do). Hopefully it's useful to build on!
> Gentlemen, I am also able to trigger it on a TrueNAS Core storage box..... :( 8x rusty platter HDD with no ZIL in a RAID-Z2 layout.
> # freebsd-version 13.1-RELEASE-p7 # zfs version zfs-2.1.11-1 zfs-kmod-v2023072100-zfs_0eb787a7e
> EDIT: tested twice with 16 instances running, bug triggered twice.
@admnd: Quick question: I'm testing this on my end, on TrueNAS (Core/FreeBSD) 13.0-U5.3, with no positive test results so far. Since the script doesn't work as-is on FreeBSD, what I'm doing is removing the --reflink=never flag (not supported in BSD cp, not relevant in OpenZFS 2.1), in addition to commenting out the check for block cloning being enabled. Does this match your test scenario?
> @admnd I can only give you a pointer towards the answer: if Lustre calls the function zfs_holey(), then possibly - it's exported from zfs.ko, but I don't have the Lustre source nearby to look. For Linux and FreeBSD, it's used in the implementation of lseek().
@robn Thanks for the pointer. A grep through the Lustre 2.15.3 source code shows no call to zfs_holey(). So for the moment I would tend to consider Lustre to be on the "safe side", unless some new piece of information pops up to contradict that hypothesis.
So the reproducer script suggests that the silent data corruption bug has been in 2.1.x as well, and has possibly been around for years? Is there no way to figure out whether any files are corrupted?
I know at least one organization that is frantically trying to roll their systems back to 0.8.6, but I don't think we even know whether that's affected either.
We do, but thanks for fearmongering.
It's #11900 that was never fixed correctly, apparently.
> @admnd: Quick question: I'm testing this on my end, on TrueNAS (Core/FreeBSD) 13.0-U5.3, with no positive test results so far. Since the script doesn't work as-is on FreeBSD, what I'm doing is removing the --reflink=never flag (not supported in BSD cp, not relevant in OpenZFS 2.1), in addition to commenting out the check for block cloning being enabled. Does this match your test scenario?
@ericloewe: Absolutely, here is what I have:
```bash
#!/bin/bash
prefix="reproducer_${BASHPID}_"
dd if=/dev/urandom of=${prefix}0 bs=1M count=1 status=none

echo "writing files"
end=1000
h=0
for i in `seq 1 2 $end` ; do
    let "j=$i+1"
    cp ${prefix}$h ${prefix}$i
    cp ${prefix}$i ${prefix}$j
    let "h++"
done

echo "checking files"
for i in `seq 1 $end` ; do
    diff ${prefix}0 ${prefix}$i
done
```
I managed, with some luck, to trigger it again this evening. It does not show up every time, perhaps because of the storage "slowness" (rusty platters vs NVMe). For the Fortune and Glory:
(Intel Tiger Lake, 6 cores + HT.)
For what it's worth, I tested the reproducer a couple of times on a zpool backed by a single SSD and the issue seems much harder to trigger. I double-checked and yes, vfs.zfs.dmu_offset_next_sync is set to 1; nothing special seems to happen when vfs.zfs.dmu_offset_next_sync is set to 0.
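(On FreeBSD the tunable is a sysctl, so checking and flipping it looks roughly like:)

```bash
# check the current value
sysctl vfs.zfs.dmu_offset_next_sync
# disable at runtime
sysctl vfs.zfs.dmu_offset_next_sync=0
# persist across reboots
echo 'vfs.zfs.dmu_offset_next_sync=0' >> /etc/sysctl.conf
```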
Hello,
I am in the "happy" situation that I can reproduce this issue pretty reliably on my Debian sid box running 2.1.13. Running 4 reproducer.sh instances in parallel more or less always triggers the issue for me. I then tried setting zfs_dmu_offset_next_sync to 0 and noticed two things.
With zfs_dmu_offset_next_sync set to 0, one run of a reproducer instance on an NVMe took a pretty stable 6.2 seconds. With it enabled, on the other hand, the same run took between 16 and 23 seconds....
On the second pool, with 4 drives set up as two mirror vdevs, the difference was a little bit, uhhh, bigger: with it enabled it took 2 minutes and 48 seconds. When I disabled it, it finished in 7.7 seconds.... and no errors.
Forgot to mention that coreutils is version 9.4.2 on this machine.
System information
Describe the problem you're observing
When installing the Go compiler with Portage, many of the internal compiler commands are corrupted: most of the file contents have been replaced by zeros.
I'm able to reproduce on two separate machines running 6.5.11 and ZFS 2.2.0.
ZFS does not see any errors with the pool.
Describe how to reproduce the problem
1. emerge -1 dev-lang/go, where Portage's TMPDIR is on ZFS.
2. Files such as /usr/lib/go/pkg/tool/linux_amd64/compile are corrupted.

I was able to reproduce with and without Portage's "native-extensions" feature. I was unable to reproduce after changing Portage's TMPDIR to another filesystem (such as tmpfs).
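(A quick way to spot this kind of damage, assuming the corrupted tools show up as raw data rather than ELF binaries, as the Gentoo reports quoted earlier in the thread describe:)

```bash
# healthy tools report as ELF executables; corrupted ones as "data"
file /usr/lib/go/pkg/tool/linux_amd64/*
```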
Include any warning/errors/backtraces from the system logs