openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

some copied files are corrupted (chunks replaced by zeros) #15526

Closed · terinjokes closed this issue 7 months ago

terinjokes commented 7 months ago

System information

| Type | Version/Name |
| --- | --- |
| Distribution Name | Gentoo |
| Distribution Version | (rolling) |
| Kernel Version | 6.5.11 |
| Architecture | amd64 |
| OpenZFS Version | 2.2.0 |
| Reference | https://bugs.gentoo.org/917224 |

Describe the problem you're observing

When installing the Go compiler with Portage, many of the internal compiler commands end up corrupted: most of each file's contents is replaced by zeros.

$  file /usr/lib/go/pkg/tool/linux_amd64/* | grep data
/usr/lib/go/pkg/tool/linux_amd64/asm:       data
/usr/lib/go/pkg/tool/linux_amd64/cgo:       data
/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/usr/lib/go/pkg/tool/linux_amd64/covdata:   ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=xHCzRQtrkEP-Bbxql0SF/zxsofCJFlBoPlUclgwBG/TrsgK6SKiY4q6TIhyBjU/UwcISvZgqfQaEf3Kr_Tq, not stripped
/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/usr/lib/go/pkg/tool/linux_amd64/link:      data
/usr/lib/go/pkg/tool/linux_amd64/vet:       data

$ hexdump /usr/lib/go/pkg/tool/linux_amd64/compile
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000fa0 0000 0000 0000 0000 0000 0000 5a41 3447
0000fb0 336a 3933 5a49 4f2d 6641 6342 7a6d 3646
0000fc0 582f 5930 5a4d 6761 5659 6f34 6d39 4130
0000fd0 4957 6555 2f67 686d 6a63 6675 5976 4e6a
0000fe0 346c 3070 5157 494e 5f41 5a2f 336d 6342
0000ff0 4e6d 4a4f 306c 4277 4a72 774d 4d41 006c
0001000 0000 0000 0000 0000 0000 0000 0000 0000
*
0ac9280 5a41 3447 336a 3933 5a49 4f2d 6641 6342
0ac9290 7a6d 3646 582f 5930 5a4d 6761 5659 6f34
0ac92a0 6d39 4130 4957 6555 2f67 686d 6a63 6675
0ac92b0 5976 4e6a 346c 3070 5157 494e 5f41 5a2f
0ac92c0 336d 6342 4e6d 4a4f 306c 4277 4a72 774d
0ac92d0 4d41 006c 0000 0000 0000 0000 0000 0000
0ac92e0 0000 0000 0000 0000 0000 0000 0000 0000
*
1139380 0000 0000 0000 0000 0000
1139389

I'm able to reproduce on two separate machines running 6.5.11 and ZFS 2.2.0.

ZFS does not see any errors with the pool.

$ zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:07:24 with 0 errors on Wed Nov  1 00:06:45 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        zroot                                         ONLINE       0     0     0
          nvme-WDS100T1X0E-XXXXXX_XXXXXXXXXXXX-part2  ONLINE       0     0     0

errors: No known data errors

$ zpool status -t
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:07:24 with 0 errors on Wed Nov  1 00:06:45 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        zroot                                         ONLINE       0     0     0
          nvme-WDS100T1X0E-XXXXXX_XXXXXXXXXXXX-part2  ONLINE       0     0     0  (100% trimmed, completed at Tue 31 Oct 2023 11:15:47 PM GMT)

errors: No known data errors

Describe how to reproduce the problem

  1. On a system running ZFS 2.2.0, upgrade pools to enable the block cloning feature.
  2. emerge -1 dev-lang/go, where Portage's TMPDIR is on ZFS.
  3. After a successful install of Go, files in /usr/lib/go/pkg/tool/linux_amd64/ (such as compile) are corrupted.

I was able to reproduce with and without Portage's "native-extensions" feature. I was unable to reproduce after changing Portage's TMPDIR to another filesystem (such as tmpfs).

Include any warning/errors/backtraces from the system logs

terinjokes commented 7 months ago

I've been unable to reproduce after upgrading to 2.2.1 with the zfs_bclone_enabled tunable disabled.

tonyhutter commented 7 months ago

I was able to come up with a simple reproducer script based on @rincebrain's comment https://github.com/openzfs/zfs/issues/15554#issuecomment-1822154030. The script is here: reproducer.sh.

Inside your pool mount, spawn off multiple copies of the script in parallel (3-4 copies worked for me):

fedora39:$ cd /tank
fedora39:/tank$ ~/reproducer.sh & ~/reproducer.sh & ~/reproducer.sh & ~/reproducer.sh && wait
[1] 194921
[2] 194922
[3] 194923
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
Binary files reproducer_194963_0 and reproducer_194963_388 differ
Binary files reproducer_194963_0 and reproducer_194963_777 differ
Binary files reproducer_194963_0 and reproducer_194963_778 differ
[2]-  Done                    ~/reproducer.sh
[3]+  Done                    ~/reproducer.sh
[1]+  Done                    rm -f * && ~/reproducer.sh

fedora39:/tank$ hexdump  reproducer_194963_0 | head -n 5
0000000 50b4 8a6a 77a8 681f d35f 061a 3a16 1587
0000010 73cc c42b f481 00b0 8ef7 e3ea f741 c5ec
0000020 3648 ee57 c0b8 3fcb 1cd4 992d 9e5a dc7e
0000030 3f6c b3e3 7359 59fd 0c86 600a eede 7c49
0000040 31c7 94bd 9936 3c46 9952 b0da 9caf 2c66

fedora39:/tank$ hexdump reproducer_194963_777 | head -n 5
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000

Note that it will create tons of reproducer_*_* data files in there, so run it from a junk directory.
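The script itself isn't inlined above, so for readers following along, here is a minimal sketch of the pattern it exercises: write one file, immediately copy it many times with coreutils cp (whose hole probing races against the still-dirty writes), then compare the copies. The file size, copy count, and names are illustrative, not the exact values from reproducer.sh.

```shell
#!/bin/sh
# Hedged sketch of the reproducer pattern (not the exact reproducer.sh above).
# Run from a scratch directory; here we cd into a temp dir so the demo is
# self-cleaning on an ordinary filesystem (point it at the pool under test).
cd "$(mktemp -d)" || exit 1

prefix="reproducer_$$"

echo "writing files"
head -c 1M /dev/urandom > "${prefix}_0"
i=1
while [ "$i" -le 100 ]; do
    # coreutils cp probes the source for holes; racing many of these
    # against recently-written (still dirty) data is what tickles the bug.
    cp "${prefix}_0" "${prefix}_${i}"
    i=$((i + 1))
done

echo "checking files"
i=1
while [ "$i" -le 100 ]; do
    cmp -s "${prefix}_0" "${prefix}_${i}" \
        || echo "Binary files ${prefix}_0 and ${prefix}_${i} differ"
    i=$((i + 1))
done
```

On a healthy filesystem the check loop prints nothing; on an affected ZFS setup, running several instances in parallel produces "differ" lines like the output above.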

tonyhutter commented 7 months ago

:arrow_up: this was running 2.2.1 on Fedora 39 with /sys/module/zfs/parameters/zfs_bclone_enabled=1

rincebrain commented 7 months ago

But will it repro if you apply https://github.com/rincebrain/zfs/commit/3f9688eb36023f6f69b98ffbc30267ba24d33ad8 ? :)

tonyhutter commented 7 months ago

@rincebrain unfortunately yes :disappointed:

rincebrain commented 7 months ago

Even better. By which I mean worse, of course.

rincebrain commented 7 months ago

I have a nastier solution, I suppose, since I didn't really think that alone would fix it, but I'll try that and report back.

tonyhutter commented 7 months ago

@rincebrain hate to say it, but looks like it fails with /sys/module/zfs/parameters/zfs_bclone_enabled = 0 ...

fedora39:/tank$ ~/reproducer-nocheck.sh & ~/reproducer-nocheck.sh & ~/reproducer-nocheck.sh & ~/reproducer-nocheck.sh & wait
[1] 304410
[2] 304411
[3] 304412
[4] 304413
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
Binary files reproducer_304412_0 and reproducer_304412_620 differ
[1]   Done                    ~/reproducer-nocheck.sh
[4]+  Done                    ~/reproducer-nocheck.sh
[2]-  Done                    ~/reproducer-nocheck.sh
[3]+  Done                    ~/reproducer-nocheck.sh

fedora39:/tank$ cat /sys/module/zfs/parameters/zfs_bclone_enabled 
0

fedora39:/tank$ hexdump reproducer_304412_0 | head -n 5
0000000 9683 56ac ba05 5c35 ef52 4ecd 5fc9 0c39
0000010 2b5c d795 8e0d 49dd a9f7 dd67 6af5 9cab
0000020 c87a 04ec be89 ae9a f45f b84d a2fe bc9b
0000030 ef0c 2da9 7f44 95ce f6ac 1297 09f1 2df5
0000040 fb54 cad0 7a73 d34d f048 9c68 3ebe a988

fedora39:/tank$ hexdump reproducer_304412_620 | head -n 5
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0100000

fedora39:/tank$ sudo ~/zfs/zpool get all | grep bclone
tank  bcloneused                     0                              -
tank  bclonesaved                    0                              -
tank  bcloneratio                    1.00x                          -

So maybe it's not block cloning? I'll see if I can bisect master to the bad commit.

rincebrain commented 7 months ago

I'd be both happy and upset if it's not block cloning causing people issues. How...unfortunate.

rincebrain commented 7 months ago

That reproducer could never use BRT anyway unless you're using a coreutils version newer than 9.0, which you might be; I didn't ask about the environment you're in.

tonyhutter commented 7 months ago

I'm using Fedora 39 with coreutils-9.3-4.fc39.x86_64

rincebrain commented 7 months ago

If it's not BRT-related, which would be nice but also bad, then I'd just go back to #11900 again, assume the dirty check is failing to trigger even when BRT isn't involved, and look at any delta that touches dirty state as a starting point to bisect around.

robn commented 7 months ago

Script doesn't repro here. Linux 5.10.170, coreutils 9.1, on both memory-backed and file-backed raidz1, with and without cloning enabled. When enabled, clones are certainly generated.

If it's a race, as it seems, then it's probably not surprising that we get different results; it's going to be sensitive to local performance characteristics.

vivo75 commented 7 months ago

Cannot reproduce with emerge dev-lang/go after 4 tries with 2.2.1 and /sys/module/zfs/parameters/zfs_bclone_enabled = 0.

Also cannot reproduce with @tonyhutter's script: for i in {1..8} ; do ./reproducer.sh & done

tonyhutter commented 7 months ago

I'm able to reproduce on Fedora 37 with the 6.5.11 kernel, coreutils 9.1, and zfs-2.1.13. So it doesn't look like this is 2.2.x-only.

admnd commented 7 months ago

Got it! I am able to trigger the bug nearly every time with 8 instances of reproducer.sh running (7950X here + zpool on 3 NVMe modules in RAID-Z1).

I did not "zpool upgrade" from OpenZFS 2.1.13, so no 2.2.x feature flag is active on the zpool (no block cloning, no BLAKE3 checksums, etc.). So far I have noticed absolutely nothing under normal daily usage, as my Gentoo box has /var/tmp/portage on tmpfs: no errors in scrubs, nothing crashing abnormally due to corrupt binaries, etc.

Kernel 6.6.2, coreutils 9.4, glibc 2.38.

rincebrain commented 7 months ago

Spicy.

Everything is awful forever.

admnd commented 7 months ago

Gentlemen, I am also able to trigger it on a TrueNAS Core storage box. :( 8x rusty platter HDDs with no ZIL in a RAID-Z2 layout.

# freebsd-version 
13.1-RELEASE-p7
# zfs version
zfs-2.1.11-1
zfs-kmod-v2023072100-zfs_0eb787a7e

EDIT: tested twice with 16 instances running, bug triggered twice.

rincebrain commented 7 months ago

I wonder if it's the case that there's a bug with block cloning and another bug, and so disabling block cloning closes a very big window but not the other reason this can happen? Oy.

robn commented 7 months ago

I feel fairly confident from code reads that there are still some locking problems in the block cloning code (and in some other places in ZFS, if I'm honest), but nothing I've been able to reproduce yet. So I wouldn't say that there's no block cloning bug, or that we're not hitting one here (absence of evidence, etc.).

However, a clone is much faster than a content copy, so it's possible that just the timing differences are making it easier to hit the other bug. As noted, I couldn't reproduce it at all, but that's in a development VM that has quite different latency characteristics from any real computer.

I guess I'm saying, I'd track down the other one first, then once that's nailed, retest the original bug here with cloning enabled and see what shakes out.

@tonyhutter holler if you need extra eyeballs, rubber duck, etc.

tonyhutter commented 7 months ago

@robn yea any help would be great. Right now let's try to bisect it down to a specific commit.

terinjokes commented 7 months ago

An update to my previous comment, which may already be moot as of the past few minutes: I can reproduce with @tonyhutter's script with zfs_bclone_enabled = 0, even though my original test case no longer reproduces consistently.

AllKind commented 7 months ago

Can't reproduce with tonyhutter's script, even with 24 instances. 16-HDD pool with 2 raidz2 vdevs. ZFS 2.1.13, kernel 5.15.139, cp 8.32.

terinjokes commented 7 months ago

Same system as 20 minutes ago (2.2.1, zfs_bclone_enabled = 0, Linux 6.5.12-gentoo-dist), but with coreutils downgraded to 8.32 to match the previous poster, I also can't reproduce.

rincebrain commented 7 months ago

As a data point: with zfs_dmu_offset_next_sync=0, I can no longer make the reproducer script trigger on 2.1.x with coreutils 9.x in several thousand runs, whereas with it =1 it reproduces within 2 or 3 runs.

So for those of us on 2.1.x, that might be useful.

tonyhutter commented 7 months ago

Quick update - I've been testing on AlmaLinux 8 (RHEL clone) w/4.18 kernel using a custom-installed coreutils 9.1 from Fedora 37 (since Alma 8 comes with coreutils 8 by default). I can NOT reproduce on zfs-2.1.2, but can reproduce on zfs 2.1.5. Still bisecting...

rincebrain commented 7 months ago

Remember there was a bug with this fixed back then, Tony. Twice, even.

tonyhutter commented 7 months ago

@rincebrain your data point was correct - it bisected down to 9f6943504aec36f897f814fb7ae5987425436b11 "Default to zfs_dmu_offset_next_sync=1" in the zfs-2.1-release branch. That commit got pulled into zfs-2.1.4. That same commit is 05b3eb6d2 in master.

I also tried setting /sys/module/zfs/parameters/zfs_dmu_offset_next_sync = 0 using that same commit, and could not reproduce the error.

So setting /sys/module/zfs/parameters/zfs_dmu_offset_next_sync = 0 might be the workaround for now.
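For anyone wanting to try this workaround now, a sketch of applying it (the modprobe.d filename below is arbitrary; the tunable path is as shown earlier in this thread):

```
# Apply at runtime (takes effect immediately, lost on reboot):
echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

# Persist across reboots via module options:
echo 'options zfs zfs_dmu_offset_next_sync=0' | \
    sudo tee /etc/modprobe.d/zfs-workaround.conf

# Verify:
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
```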

rincebrain commented 7 months ago

I'm really fascinated because I swear we couldn't originally reproduce this with that =1 then, or I would have been very vociferous about turning it off again...

...plus, the original reason we found this was a lot of people having issues on Gentoo with the aforementioned emerge case, so I'm not entirely certain that originally turning that tunable on is why it's ruining everyone's day now, at least.

tonyhutter commented 7 months ago

As a side note we may want to try the reproducer using a variety of file sizes. When I test with the 1MB file size in the reproducer, I see all zeros for the bad files. I'm hoping that's the case for all corrupted files (all zeros), which would be easy to detect on existing datasets.
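If the all-zeros pattern does hold, a rough scan for suspect files might look like the sketch below (assumptions: GNU stat and cmp; and note this only finds fully-zeroed files, so it would miss any partially-corrupted ones):

```shell
# Demo in a scratch directory: one all-zero file, one with real content.
d="$(mktemp -d)"
head -c 4096 /dev/zero    > "$d/suspect"
head -c 4096 /dev/urandom > "$d/healthy"

# The scan: compare each regular file against /dev/zero for its full length.
find "$d" -type f -size +0c | while read -r f; do
    sz="$(stat -c%s "$f")"              # GNU stat; FreeBSD would need stat -f%z
    cmp -s -n "$sz" "$f" /dev/zero && echo "all zeros: $f"
done
```

Point the find at a real dataset instead of the demo directory to sweep it; anything printed is a candidate for restore from backup.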

rincebrain commented 7 months ago

I would expect that the problem is not all zeroes, based on the Gentoo reports saying the resulting binaries were "data" not "empty", and that this is just a consequence of touching the whole file at once, basically.

rincebrain commented 7 months ago

In particular, when 519851122b1703b8445ec17bc89b347cea965bb9 went in and got cherrypicked into 2.1.10 as https://github.com/openzfs/zfs/commit/4b3133e671b958fa2c915a4faf57812820124a7b, we IMMEDIATELY found horrible issues with this cropping up again, which is why 2.1.11 happened.

So I'm very surprised if something hasn't changed to make this crop up more, now, than it did previously, and I don't immediately know what that would be.

robn commented 7 months ago

Well, the shape of the problem is that a dnode is not marked dirty when it should be, right?

I can totally believe that there is still a place where that was happening in 2.1, that still exists and was just incredibly hard to hit. Because I'm still reasonably sure that the case I mentioned in dmu_buf_will_clone() is real too.

So maybe the original post in this issue is triggered by the cloning case, which is why turning it off helped there. And maybe Tony wrote a test that just happened to tickle the other kind?

I still can't reproduce any of it anywhere, so I can imagine it's pretty sensitive to timing.

(Through this I've got some ideas for how we might be able to detect when we're unsafely undirtying a dnode, to try and identify any and all comers. No time to poke at that before the weekend though, so don't wait for me.)

admnd commented 7 months ago

Same here, zfs_dmu_offset_next_sync=0 => no more complaints, both on Linux and FreeBSD.

Perhaps a naive question, but in the case of ZFS acting as a Lustre backend, are there any potential corruption issues as well if zfs_dmu_offset_next_sync=1, or is this issue specific to ZFS filesystem datasets?

robn commented 7 months ago

There's a little theory brewing over here. Request for information: if you're on Linux, and you hit the original bug or tried the reproducer, could you please post the version of coreutils you have, and whether or not you hit the problem? Thanks!

(This is not to say coreutils is at fault; that this happens on FreeBSD proves that. But coreutils has changed when and why it tries to detect holes multiple times in the 9.x series, and narrowing that down may help us see what's happening a little more easily.)

robn commented 7 months ago

Perhaps a naive question, but in the case of ZFS acting as a Lustre backend, are there any potential corruption issues as well if zfs_dmu_offset_next_sync=1, or is this issue specific to ZFS filesystem datasets?

@admnd I can only give you a pointer towards the answer: if Lustre calls the function zfs_holey(), then possibly; it's exported from zfs.ko, but I don't have the Lustre source nearby to check. On Linux and FreeBSD, it's used in the implementation of lseek().
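For anyone less familiar with the mechanism being discussed: lseek(SEEK_HOLE/SEEK_DATA) lets a reader ask the filesystem where a file's holes are, and GNU cp uses it to avoid copying holes; zfs_holey() is what answers that question on ZFS. A quick shell illustration of a sparse file and a hole-preserving copy (nothing ZFS-specific here; runs on any hole-aware filesystem):

```shell
# Build a sparse file: a 1 MiB hole, then a little real data at the end.
f="$(mktemp)"
truncate -s 1M "$f"
printf 'tail data\n' >> "$f"

# Apparent size vs. blocks actually allocated: the hole occupies no space.
ls -ls "$f"

# GNU cp detects the hole (via lseek SEEK_DATA/SEEK_HOLE or FIEMAP)
# and reproduces it in the copy, so the copy stays sparse too:
cp --sparse=auto "$f" "$f.copy"
du -k "$f" "$f.copy"
```

The bug under discussion is what happens when that hole report is wrong: cp is told a region is a hole while it actually holds not-yet-synced data, and the copy comes out zero-filled.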

CmdrMoozy commented 7 months ago

Hi @robn , I tried the reproducer and was able to reproduce the bug.

Some details about my setup:

$ uname -a
Linux nas 6.6.2-gentoo #1 SMP Wed Nov 22 15:12:20 PST 2023 x86_64 AMD Ryzen 5 5600X 6-Core Processor AuthenticAMD GNU/Linux

$ zfs --version
zfs-2.2.1-r0-gentoo
zfs-kmod-2.2.1-r0-gentoo

$ equery l coreutils
 * Searching for coreutils ...
[IP-] [  ] sys-apps/coreutils-9.3-r3:0

$ zpool status
  pool: data
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 05:31:00 with 0 errors on Tue Nov 21 23:31:00 2023
config:

        NAME                        STATE     READ WRITE CKSUM
        data                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x5000c500dcafce99  ONLINE       0     0     0
            wwn-0x5000c500dc3b961c  ONLINE       0     0     0
            wwn-0x5000c500dcb07258  ONLINE       0     0     0
            wwn-0x5000c500dc065425  ONLINE       0     0     0
            wwn-0x5000c500e340bf49  ONLINE       0     0     0

errors: No known data errors

$ zpool get all data
NAME  PROPERTY                       VALUE                          SOURCE
data  size                           72.8T                          -
data  capacity                       18%                            -
data  altroot                        -                              default
data  health                         ONLINE                         -
data  guid                           250402256521350630             -
data  version                        -                              default
data  bootfs                         -                              default
data  delegation                     on                             default
data  autoreplace                    off                            default
data  cachefile                      -                              default
data  failmode                       wait                           default
data  listsnapshots                  off                            default
data  autoexpand                     off                            default
data  dedupratio                     1.00x                          -
data  free                           59.0T                          -
data  allocated                      13.7T                          -
data  readonly                       off                            -
data  ashift                         12                             local
data  comment                        -                              default
data  expandsize                     -                              -
data  freeing                        0                              -
data  fragmentation                  0%                             -
data  leaked                         0                              -
data  multihost                      off                            default
data  checkpoint                     -                              -
data  load_guid                      1997265464620489251            -
data  autotrim                       off                            default
data  compatibility                  off                            default
data  bcloneused                     0                              -
data  bclonesaved                    0                              -
data  bcloneratio                    1.00x                          -
data  feature@async_destroy          enabled                        local
data  feature@empty_bpobj            active                         local
data  feature@lz4_compress           active                         local
data  feature@multi_vdev_crash_dump  enabled                        local
data  feature@spacemap_histogram     active                         local
data  feature@enabled_txg            active                         local
data  feature@hole_birth             active                         local
data  feature@extensible_dataset     active                         local
data  feature@embedded_data          active                         local
data  feature@bookmarks              enabled                        local
data  feature@filesystem_limits      enabled                        local
data  feature@large_blocks           enabled                        local
data  feature@large_dnode            enabled                        local
data  feature@sha512                 active                         local
data  feature@skein                  enabled                        local
data  feature@edonr                  enabled                        local
data  feature@userobj_accounting     active                         local
data  feature@encryption             active                         local
data  feature@project_quota          active                         local
data  feature@device_removal         enabled                        local
data  feature@obsolete_counts        enabled                        local
data  feature@zpool_checkpoint       enabled                        local
data  feature@spacemap_v2            active                         local
data  feature@allocation_classes     enabled                        local
data  feature@resilver_defer         enabled                        local
data  feature@bookmark_v2            enabled                        local
data  feature@redaction_bookmarks    enabled                        local
data  feature@redacted_datasets      enabled                        local
data  feature@bookmark_written       enabled                        local
data  feature@log_spacemap           active                         local
data  feature@livelist               enabled                        local
data  feature@device_rebuild         enabled                        local
data  feature@zstd_compress          active                         local
data  feature@draid                  enabled                        local
data  feature@zilsaxattr             disabled                       local
data  feature@head_errlog            disabled                       local
data  feature@blake3                 disabled                       local
data  feature@block_cloning          disabled                       local
data  feature@vdev_zaps_v2           disabled                       local

Reproducer output:

$ ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & ./reproducer.sh & wait
[1] 195954
[2] 195955
[3] 195956
[4] 195957
[5] 195958
[6] 195959
[7] 195960
[8] 195961
[9] 195962
[10] 195963
[11] 195964
[12] 195965
[13] 195969
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
writing files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
checking files
Binary files reproducer_195954_0 and reproducer_195954_74 differ
Binary files reproducer_195954_0 and reproducer_195954_149 differ
Binary files reproducer_195954_0 and reproducer_195954_150 differ
Binary files reproducer_195954_0 and reproducer_195954_299 differ
Binary files reproducer_195954_0 and reproducer_195954_300 differ
Binary files reproducer_195954_0 and reproducer_195954_301 differ
Binary files reproducer_195954_0 and reproducer_195954_302 differ
Binary files reproducer_195954_0 and reproducer_195954_599 differ
Binary files reproducer_195954_0 and reproducer_195954_600 differ
Binary files reproducer_195954_0 and reproducer_195954_601 differ
Binary files reproducer_195954_0 and reproducer_195954_602 differ
Binary files reproducer_195954_0 and reproducer_195954_603 differ
Binary files reproducer_195954_0 and reproducer_195954_604 differ
Binary files reproducer_195954_0 and reproducer_195954_605 differ
Binary files reproducer_195954_0 and reproducer_195954_606 differ
[1]   Done                    ./reproducer.sh
[6]   Done                    ./reproducer.sh
[8]   Done                    ./reproducer.sh
[11]   Done                    ./reproducer.sh
[12]-  Done                    ./reproducer.sh
[2]   Done                    ./reproducer.sh
[3]   Done                    ./reproducer.sh
[7]   Done                    ./reproducer.sh
[9]   Done                    ./reproducer.sh
[10]-  Done                    ./reproducer.sh
[4]   Done                    ./reproducer.sh
[5]-  Done                    ./reproducer.sh
[13]+  Done                    ./reproducer.sh

CmdrMoozy commented 7 months ago

Ah, one more bit of info. I reproduced the problem above with zfs_bclone_enabled=1 and zfs_dmu_offset_next_sync=1.

However, with zfs_bclone_enabled=0 and zfs_dmu_offset_next_sync=1 it no longer reproduces for me.

rincebrain commented 7 months ago

zfs_bclone_enabled=0, in my limited observations, makes it much harder to hit a bug like this. But there appear to be multiple bugs like this, one of which is much harder to hit but still possible with coreutils >= 9 and zfs_dmu_offset_next_sync=1. Based on my current, incomplete understanding, I would strongly advise using zfs_dmu_offset_next_sync=0: to the best of my ability to reproduce the problem at this time, that avoids it completely.

This advice is subject to change with new information, but that's the best understanding I've got at the moment.

CmdrMoozy commented 7 months ago

Understood, I plan to leave both disabled. I enabled them briefly just to try the reproducer, since @robn wanted data points about reproductions vs. coreutils version.

vivo75 commented 7 months ago

There's a little theory brewing over here. Request for information: if you're on Linux, and you hit the original bug, or tried the reproducer, could you please post the version of coreutils you have, and whether or not you hit the problem? Thanks!

(this is not to say coreutils is at fault; that this happens on FreeBSD proves that. coreutils has changed when and why it tries to detect holes multiple times in the 9.x series, and narrowing that down may help us see what's happening a little easier).

I had sys-apps/coreutils-9.3-r3 when initially reporting; the latest tests have been done on sys-apps/coreutils-9.4. Also: the kernel was previously 6.1 with an uptime of a few months; now the box is on kernel 6.6 with an uptime of a few hours.

  pool: B100
 state: ONLINE
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Wed Nov  1 00:47:39 2023
config:

        NAME                                                      STATE     READ WRITE CKSUM
        B100                                                      ONLINE       0     0     0
          mirror-0                                                ONLINE       0     0     0
            nvme-SAMSUNG_MZQL21T9HCJR-00A07_S64GNN0W204179-part5  ONLINE       0     0     0
            nvme-SAMSUNG_MZQL21T9HCJR-00A07_S64GNN0W204173-part5  ONLINE       0     0     0
NAME  PROPERTY                       VALUE                          SOURCE
B100  size                           1.64T                          -
B100  capacity                       18%                            -
B100  altroot                        -                              default
B100  health                         ONLINE                         -
B100  guid                           10713805351153124334           -
B100  version                        -                              default
B100  bootfs                         -                              default
B100  delegation                     on                             default
B100  autoreplace                    off                            default
B100  cachefile                      none                           local
B100  failmode                       wait                           default
B100  listsnapshots                  off                            default
B100  autoexpand                     off                            default
B100  dedupratio                     1.00x                          -
B100  free                           1.34T                          -
B100  allocated                      312G                           -
B100  readonly                       off                            -
B100  ashift                         12                             local
B100  comment                        -                              default
B100  expandsize                     -                              -
B100  freeing                        0                              -
B100  fragmentation                  6%                             -
B100  leaked                         0                              -
B100  multihost                      off                            default
B100  checkpoint                     -                              -
B100  load_guid                      7239006034030037162            -
B100  autotrim                       on                             local
B100  compatibility                  off                            default
B100  bcloneused                     8.13M                          -
B100  bclonesaved                    8.18M                          -
B100  bcloneratio                    2.00x                          -
B100  feature@async_destroy          enabled                        local
B100  feature@empty_bpobj            active                         local
B100  feature@lz4_compress           active                         local
B100  feature@multi_vdev_crash_dump  enabled                        local
B100  feature@spacemap_histogram     active                         local
B100  feature@enabled_txg            active                         local
B100  feature@hole_birth             active                         local
B100  feature@extensible_dataset     active                         local
B100  feature@embedded_data          active                         local
B100  feature@bookmarks              enabled                        local
B100  feature@filesystem_limits      enabled                        local
B100  feature@large_blocks           enabled                        local
B100  feature@large_dnode            active                         local
B100  feature@sha512                 enabled                        local
B100  feature@skein                  enabled                        local
B100  feature@edonr                  enabled                        local
B100  feature@userobj_accounting     active                         local
B100  feature@encryption             enabled                        local
B100  feature@project_quota          active                         local
B100  feature@device_removal         enabled                        local
B100  feature@obsolete_counts        enabled                        local
B100  feature@zpool_checkpoint       enabled                        local
B100  feature@spacemap_v2            active                         local
B100  feature@allocation_classes     enabled                        local
B100  feature@resilver_defer         enabled                        local
B100  feature@bookmark_v2            enabled                        local
B100  feature@redaction_bookmarks    enabled                        local
B100  feature@redacted_datasets      enabled                        local
B100  feature@bookmark_written       enabled                        local
B100  feature@log_spacemap           active                         local
B100  feature@livelist               active                         local
B100  feature@device_rebuild         enabled                        local
B100  feature@zstd_compress          active                         local
B100  feature@draid                  enabled                        local
B100  feature@zilsaxattr             active                         local
B100  feature@head_errlog            active                         local
B100  feature@blake3                 enabled                        local
B100  feature@block_cloning          active                         local
B100  feature@vdev_zaps_v2           active                         local
B102  size                           5.19T                          -
B102  capacity                       35%                            -
B102  altroot                        -                              default
B102  health                         ONLINE                         -
B102  guid                           11661773680785260975           -
B102  version                        -                              default
B102  bootfs                         -                              default
B102  delegation                     on                             default
B102  autoreplace                    off                            default
B102  cachefile                      none                           local
B102  failmode                       wait                           default
B102  listsnapshots                  off                            default
B102  autoexpand                     off                            default
B102  dedupratio                     1.00x                          -
B102  free                           3.34T                          -
B102  allocated                      1.85T                          -
B102  readonly                       off                            -
B102  ashift                         12                             local
B102  comment                        -                              default
B102  expandsize                     -                              -
B102  freeing                        0                              -
B102  fragmentation                  2%                             -
B102  leaked                         0                              -
B102  multihost                      off                            default
B102  checkpoint                     -                              -
B102  load_guid                      9340586567662951585            -
B102  autotrim                       on                             local
B102  compatibility                  off                            default
B102  bcloneused                     0                              -
B102  bclonesaved                    0                              -
B102  bcloneratio                    1.00x                          -
B102  feature@async_destroy          enabled                        local
B102  feature@empty_bpobj            active                         local
B102  feature@lz4_compress           active                         local
B102  feature@multi_vdev_crash_dump  enabled                        local
B102  feature@spacemap_histogram     active                         local
B102  feature@enabled_txg            active                         local
B102  feature@hole_birth             active                         local
B102  feature@extensible_dataset     active                         local
B102  feature@embedded_data          active                         local
B102  feature@bookmarks              enabled                        local
B102  feature@filesystem_limits      enabled                        local
B102  feature@large_blocks           enabled                        local
B102  feature@large_dnode            active                         local
B102  feature@sha512                 enabled                        local
B102  feature@skein                  enabled                        local
B102  feature@edonr                  active                         local
B102  feature@userobj_accounting     active                         local
B102  feature@encryption             enabled                        local
B102  feature@project_quota          active                         local
B102  feature@device_removal         enabled                        local
B102  feature@obsolete_counts        enabled                        local
B102  feature@zpool_checkpoint       enabled                        local
B102  feature@spacemap_v2            active                         local
B102  feature@allocation_classes     enabled                        local
B102  feature@resilver_defer         enabled                        local
B102  feature@bookmark_v2            enabled                        local
B102  feature@redaction_bookmarks    enabled                        local
B102  feature@redacted_datasets      enabled                        local
B102  feature@bookmark_written       enabled                        local
B102  feature@log_spacemap           active                         local
B102  feature@livelist               enabled                        local
B102  feature@device_rebuild         enabled                        local
B102  feature@zstd_compress          active                         local
B102  feature@draid                  enabled                        local
B102  feature@zilsaxattr             enabled                        local
B102  feature@head_errlog            active                         local
B102  feature@blake3                 active                         local
B102  feature@block_cloning          enabled                        local
B102  feature@vdev_zaps_v2           active                         local
lundman commented 7 months ago

Checking out reproducer.sh on macOS, just to see if I can trigger it there.

Interestingly, it does not work as-is:

cp: reproducer__1: clonefile failed: Resource temporarily unavailable

which is returned from:

        error = dmu_read_l0_bps(inos, inzp->z_id, inoff, size, bps,
            &nbps);
        if (error != 0) {
            /*
             * If we are trying to clone a block that was created
             * in the current transaction group, error will be
             * EAGAIN here, which we can just return to the caller
             * so it can fallback if it likes.
             */
            break;

So I have to do this:

    let "j=$i+1"
+   zpool sync
    cp  ${prefix}$h ${prefix}$i
    cp --reflink=never ${prefix}$i ${prefix}$j

and with that change I can no longer trigger the issue.

Could a Linux chap try adding zpool sync just to rule that out, so we can resume digging deeper?

robn commented 7 months ago

@lundman I think your problem is going to be on the other side: whatever your cp is doing, it's not doing the equivalent of copy_file_range, that is, try to clone, then fall back to a content copy.

zpool sync is probably just gonna make everything work nice, or at least much harder to hit, because the whole thing is about dirty buffers that appear clean, and sync is gonna clean them all for reals? Hmm, maybe. In any case, I can't reproduce it myself, but I reckon you're better off getting your cp to do the right thing so it's at least comparable with the other tests we've done today.

robn commented 7 months ago

#15566 might fix the cloning half of this (exactly what I pondered in https://github.com/openzfs/zfs/issues/15526#issuecomment-1811508893). At least I've done the analysis on it properly now, so I feel reasonably good about it (as much as good feelings are even possible at all after this).

It won't fix the other half (whatever was happening in 2.1), but it's something, and maybe I've got a bit more of a feel for the shape of it. Not sure I'll have any time in the next few days to look further (got a day of work to catch up on and some conference submissions to do). Hopefully it's useful to build on!

ericloewe commented 7 months ago

Gentlemen, I am also able to trigger it on a TrueNAS Core storage box..... :( 8x rusty-platter HDDs in a RAID-Z2 layout, with no dedicated log device.

[screenshot]

# freebsd-version 
13.1-RELEASE-p7
# zfs version
zfs-2.1.11-1
zfs-kmod-v2023072100-zfs_0eb787a7e

EDIT: tested twice with 16 instances running, bug triggered twice.

@admnd: Quick question: I'm testing this on my end, on TrueNAS (Core/FreeBSD) 13.0-U5.3, with no positive test results so far. Since the script doesn't work as-is on FreeBSD, what I'm doing is removing the --reflink=never flag (not supported in BSD cp, not relevant in OpenZFS 2.1), in addition to commenting out the check for block cloning being enabled. Does this match your test scenario?

admnd commented 7 months ago

@admnd I can only give you a pointer towards the answer: if Lustre calls the function zfs_holey(), then possibly; it's exported from zfs.ko, but I don't have the Lustre source nearby to look. On Linux and FreeBSD, it's used in the implementation of lseek().

@robn Thanks for the pointer. A grep through the Lustre 2.15.3 source code shows no calls to zfs_holey(). So for the moment I would tend to consider Lustre to be on the "safe side", unless some new piece of information pops up to contradict that hypothesis.
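For context on what zfs_holey() serves: it backs SEEK_HOLE/SEEK_DATA in lseek(), which any consumer (cp, Lustre, or otherwise) could exercise from userspace. A minimal sketch, in Python for brevity, that walks a file's data regions; note that filesystems without hole support simply report the whole file as one data region:

```python
import os

def enumerate_data_ranges(path: str) -> list:
    """Return (start, end) byte ranges that the filesystem reports as
    data, using the SEEK_DATA/SEEK_HOLE lseek() interface that
    zfs_holey() implements on ZFS."""
    ranges = []
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                # ENXIO: no more data past this offset
                break
            end = os.lseek(fd, start, os.SEEK_HOLE)
            ranges.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return ranges
```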

RichardBelzer commented 7 months ago

So the reproducer script suggests that the silent data corruption bug has been present in 2.1.x as well, and has possibly been around for years? Is there no way to figure out whether any files are corrupted?

I know at least one organization that is frantically trying to roll their systems back to 0.8.6, but I don't think we even know whether that version is affected either.
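Short of checksum-level verification against known-good copies, about the best one can do is a heuristic scan for the zero-filled signature shown in the hexdump at the top of this issue. A rough sketch (a hypothetical helper, not an official tool; legitimately sparse files or files that genuinely contain long zero runs will false-positive):

```python
ZERO_RUN = 4096  # flag files containing at least one zeroed 4 KiB run

def looks_zero_corrupted(path: str) -> bool:
    """Heuristic: report files containing a long run of NUL bytes,
    matching the hexdump signature shown earlier in this issue."""
    run = 0
    with open(path, "rb") as f:
        while chunk := f.read(65536):
            for b in chunk:
                if b == 0:
                    run += 1
                    if run >= ZERO_RUN:
                        return True
                else:
                    run = 0
    return False
```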

rincebrain commented 7 months ago

We do, but thanks for fearmongering.

It's #11900 that was never fixed correctly, apparently.

admnd commented 7 months ago

@admnd: Quick question: I'm testing this on my end, on TrueNAS (Core/FreeBSD) 13.0-U5.3, with no positive test results so far. Since the script doesn't work as-is on FreeBSD, what I'm doing is removing the --reflink=never flag (not supported in BSD cp, not relevant in OpenZFS 2.1), in addition to commenting out the check for block cloning being enabled. Does this match your test scenario?

@ericloewe: Absolutely, here is what I have:

#!/bin/bash
prefix="reproducer_${BASHPID}_"
dd if=/dev/urandom of=${prefix}0 bs=1M count=1 status=none

echo "writing files"
end=1000
h=0
# Copy an earlier file $h to $i, then immediately copy the fresh $i to $j;
# the second cp reads $i while its blocks may still be dirty.
for i in `seq 1 2 $end` ; do
        let "j=$i+1"
        cp ${prefix}$h ${prefix}$i
        cp ${prefix}$i ${prefix}$j
        let "h++"
done

echo "checking files"
# Every copy must be byte-identical to the original random file;
# any diff output here means corruption.
for i in `seq 1 $end` ; do
        diff ${prefix}0 ${prefix}$i
done

I managed, with some luck, to trigger it again this evening. It does not show up every time, perhaps because of the storage slowness (rusty platters vs. NVMe). For the Fortune and Glory:

[screenshot]

(Intel Tiger Lake 6 cores + HT.)

For what it's worth, I tested the reproducer a couple of times on a zpool backed by a single SSD, and the issue seems much harder to trigger there. I double-checked: vfs.zfs.dmu_offset_next_sync is indeed set to 1, and nothing special seems to happen when vfs.zfs.dmu_offset_next_sync is set to 0.

maru-sama commented 7 months ago

Hello,

I am in the "happy" situation that I can reproduce this issue pretty reliably on my Debian sid box running 2.1.13. Running 4 reproducer.sh instances in parallel more or less always triggers the issue for me. I then tried setting zfs_dmu_offset_next_sync to 0 and noticed two things.

With zfs_dmu_offset_next_sync set to 0, one run of a reproducer instance on an NVMe pool took a pretty stable 6.2 seconds. With it enabled, on the other hand, the same run took between 16 and 23 seconds....

On the second pool, four drives set up as two two-way mirrors, the difference was, uhhh, a bit bigger:

With it enabled it took 2 minutes and 48 seconds. With it disabled it finished in 7.7 seconds, and with no errors.

Forgot to mention that coreutils is version 9.4.2 on this machine.