openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

some copied files are corrupted (chunks replaced by zeros) #15526

Closed: terinjokes closed this 7 months ago

terinjokes commented 7 months ago

System information

Type                  Version/Name
Distribution Name     Gentoo
Distribution Version  (rolling)
Kernel Version        6.5.11
Architecture          amd64
OpenZFS Version       2.2.0
Reference             https://bugs.gentoo.org/917224

Describe the problem you're observing

When installing the Go compiler with Portage, many of the internal compiler commands are corrupted: most of each file's contents have been replaced by zeros.

$  file /usr/lib/go/pkg/tool/linux_amd64/* | grep data
/usr/lib/go/pkg/tool/linux_amd64/asm:       data
/usr/lib/go/pkg/tool/linux_amd64/cgo:       data
/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/usr/lib/go/pkg/tool/linux_amd64/covdata:   ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=xHCzRQtrkEP-Bbxql0SF/zxsofCJFlBoPlUclgwBG/TrsgK6SKiY4q6TIhyBjU/UwcISvZgqfQaEf3Kr_Tq, not stripped
/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/usr/lib/go/pkg/tool/linux_amd64/link:      data
/usr/lib/go/pkg/tool/linux_amd64/vet:       data

$ hexdump /usr/lib/go/pkg/tool/linux_amd64/compile
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0000fa0 0000 0000 0000 0000 0000 0000 5a41 3447
0000fb0 336a 3933 5a49 4f2d 6641 6342 7a6d 3646
0000fc0 582f 5930 5a4d 6761 5659 6f34 6d39 4130
0000fd0 4957 6555 2f67 686d 6a63 6675 5976 4e6a
0000fe0 346c 3070 5157 494e 5f41 5a2f 336d 6342
0000ff0 4e6d 4a4f 306c 4277 4a72 774d 4d41 006c
0001000 0000 0000 0000 0000 0000 0000 0000 0000
*
0ac9280 5a41 3447 336a 3933 5a49 4f2d 6641 6342
0ac9290 7a6d 3646 582f 5930 5a4d 6761 5659 6f34
0ac92a0 6d39 4130 4957 6555 2f67 686d 6a63 6675
0ac92b0 5976 4e6a 346c 3070 5157 494e 5f41 5a2f
0ac92c0 336d 6342 4e6d 4a4f 306c 4277 4a72 774d
0ac92d0 4d41 006c 0000 0000 0000 0000 0000 0000
0ac92e0 0000 0000 0000 0000 0000 0000 0000 0000
*
1139380 0000 0000 0000 0000 0000
1139389

I'm able to reproduce on two separate machines running 6.5.11 and ZFS 2.2.0.

ZFS does not see any errors with the pool.

$ zpool status
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:07:24 with 0 errors on Wed Nov  1 00:06:45 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        zroot                                         ONLINE       0     0     0
          nvme-WDS100T1X0E-XXXXXX_XXXXXXXXXXXX-part2  ONLINE       0     0     0

errors: No known data errors

$ zpool status -t
  pool: zroot
 state: ONLINE
  scan: scrub repaired 0B in 00:07:24 with 0 errors on Wed Nov  1 00:06:45 2023
config:

        NAME                                          STATE     READ WRITE CKSUM
        zroot                                         ONLINE       0     0     0
          nvme-WDS100T1X0E-XXXXXX_XXXXXXXXXXXX-part2  ONLINE       0     0     0  (100% trimmed, completed at Tue 31 Oct 2023 11:15:47 PM GMT)

errors: No known data errors

Describe how to reproduce the problem

  1. On a system running ZFS 2.2.0, upgrade pools to enable the block cloning feature.
  2. emerge -1 dev-lang/go, where Portage's TMPDIR is on ZFS.
  3. After a successful install of Go, files under /usr/lib/go/pkg/tool/linux_amd64/ (such as compile) are corrupted.

I was able to reproduce with and without Portage's "native-extensions" feature. I was unable to reproduce after changing Portage's TMPDIR to another filesystem (such as tmpfs).
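
For illustration, here is a standalone sketch of the same write-copy-verify pattern (a hypothetical program, not part of the Gentoo report): it writes a non-zero file, copies it immediately with copy_file_range(2) (which ZFS may satisfy with a block clone), and then compares the two files block by block. Whether it ever trips the race depends heavily on timing, as the discussion below shows.

/* copy_verify.c - hypothetical stress-test sketch, not from this report.
 * Write a non-zero file, immediately copy it with copy_file_range(2),
 * then compare the two files block by block.
 * Build: cc -O2 -o copy_verify copy_verify.c */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLK   4096
#define NBLK  2048                       /* 8 MiB test file */

int main(void)
{
    char buf[BLK], a[BLK], b[BLK];
    memset(buf, 0x5a, sizeof(buf));      /* payload is never zero */

    int src = open("src.bin", O_CREAT | O_TRUNC | O_RDWR, 0644);
    int dst = open("dst.bin", O_CREAT | O_TRUNC | O_RDWR, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    for (int i = 0; i < NBLK; i++)
        if (write(src, buf, BLK) != BLK) { perror("write"); return 1; }

    /* Copy immediately, while the source data is likely still dirty. */
    off_t in = 0, out = 0;
    ssize_t left = (ssize_t)BLK * NBLK;
    while (left > 0) {
        ssize_t n = copy_file_range(src, &in, dst, &out, left, 0);
        if (n <= 0) { perror("copy_file_range"); return 1; }
        left -= n;
    }

    /* Compare source and copy block by block. */
    lseek(src, 0, SEEK_SET);
    lseek(dst, 0, SEEK_SET);
    for (int i = 0; i < NBLK; i++) {
        if (read(src, a, BLK) != BLK || read(dst, b, BLK) != BLK) {
            perror("read");
            return 1;
        }
        if (memcmp(a, b, BLK) != 0)
            printf("mismatch in block %d of dst.bin\n", i);
    }
    return 0;
}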

Include any warning/errors/backtraces from the system logs

thulle commented 7 months ago

I'm also hit by this bug; it happened after upgrading to ZFS 2.2.0. Mounting a tmpfs on Portage's TMPDIR (the build directory for go) solves the issue. Currently on kernel 6.5.9; it originally happened on 6.5.7.

No idea if it's worth noting, but the remaining data in the files seems to be a repetition of the same base64 encoded data. Decoding it gave me nothing comprehensible.

terinjokes commented 7 months ago

After downgrading coreutils from 9.3 to 8.32, I am no longer able to reproduce this corruption. My understanding is 8.32 predates the switch to automatically using reflink?

No idea if it's worth noting, but the remaining data in the files seems to be a repetition of the same base64 encoded data. Decoding it gave me nothing comprehensible.

This is the Go BuildID. Compare the output of a non-corrupted build:

$ file /usr/lib/go/pkg/tool/linux_amd64/compile
/usr/lib/go/pkg/tool/linux_amd64/compile: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, Go BuildID=AZG4j339IZ-OAfBcmzF6/X0YMZagYV4o9m0AWIUeg/mhcjufvYjNl4p0WQNIA_/Zm3BcmNOJl0wBrJMwAMl, not stripped

With the data remaining in the file:

$ hexdump -C /var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/compile
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 41 5a 47 34  |............AZG4|
00000fb0  6a 33 33 39 49 5a 2d 4f  41 66 42 63 6d 7a 46 36  |j339IZ-OAfBcmzF6|
00000fc0  2f 58 30 59 4d 5a 61 67  59 56 34 6f 39 6d 30 41  |/X0YMZagYV4o9m0A|
00000fd0  57 49 55 65 67 2f 6d 68  63 6a 75 66 76 59 6a 4e  |WIUeg/mhcjufvYjN|
00000fe0  6c 34 70 30 57 51 4e 49  41 5f 2f 5a 6d 33 42 63  |l4p0WQNIA_/Zm3Bc|
00000ff0  6d 4e 4f 4a 6c 30 77 42  72 4a 4d 77 41 4d 6c 00  |mNOJl0wBrJMwAMl.|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00ac9280  41 5a 47 34 6a 33 33 39  49 5a 2d 4f 41 66 42 63  |AZG4j339IZ-OAfBc|
00ac9290  6d 7a 46 36 2f 58 30 59  4d 5a 61 67 59 56 34 6f  |mzF6/X0YMZagYV4o|
00ac92a0  39 6d 30 41 57 49 55 65  67 2f 6d 68 63 6a 75 66  |9m0AWIUeg/mhcjuf|
00ac92b0  76 59 6a 4e 6c 34 70 30  57 51 4e 49 41 5f 2f 5a  |vYjNl4p0WQNIA_/Z|
00ac92c0  6d 33 42 63 6d 4e 4f 4a  6c 30 77 42 72 4a 4d 77  |m3BcmNOJl0wBrJMw|
00ac92d0  41 4d 6c 00 00 00 00 00  00 00 00 00 00 00 00 00  |AMl.............|
00ac92e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01139380  00 00 00 00 00 00 00 00  00                       |.........|
01139389
RichardBelzer commented 7 months ago

Had the same issue here... Upgraded from 2.1.13 to 2.2.0 last month right after it came out. I'm on Ubuntu and built the ZFS package from the 2.2.0 tag.

I am discovering that random files written after the 2.2.0 upgrade have suffered silent corruption: they either have repetitious data like @thulle reported, or have large blocks of contiguous zeroes instead of the data they should contain. I don't use or build go like the other posters.

zpool scrub reports no errors, zpool status reports no errors. Let me know if there is any other information needed.

thulle commented 7 months ago

After downgrading coreutils from 9.3 to 8.32, I am no longer able to reproduce this corruption.

This seems to solve the corruption issue on my end too.

thesamesam commented 7 months ago

https://github.com/openzfs/zfs/issues/11900#issuecomment-927568640 onwards is highly relevant. See also #14753 and #11900 as a whole.

It would be interesting to know if people here have all upgraded their pool for the new block cloning feature, or if there's a mix.

thulle commented 7 months ago

feature@block_cloning active here

RichardBelzer commented 7 months ago

My understanding is that block cloning is enabled by default when upgrading to 2.2.0, but may not be getting used unless an application like coreutils is actually leveraging it.

Can people post the result of this:

zpool get all tank | grep bclone

(where tank is the name of their pool with the corruption)

thulle commented 7 months ago
kc3000    bcloneused                     442M                           -
kc3000    bclonesaved                    1.42G                          -
kc3000    bcloneratio                    4.30x                          -
RichardBelzer commented 7 months ago

My understanding is this: If the result is 0 for both bcloneused and bclonesaved then it's safe to say that you don't have silent corruption.

Any non-zero number for either field and you may OR may not have silent corruption.

rincebrain commented 7 months ago

That is likely, but not really safe to say yet; we don't have enough data.

terinjokes commented 7 months ago

but may not be getting used unless an application like coreutils is actually leveraging it.

Is it accurate to say it's being used when the FICLONE ioctl is used, and possibly when copy_file_range is?

rincebrain commented 7 months ago

I wouldn't even feel comfortable saying that yet, I agree it's likely, but I'd really like more certainty before declaring "you're safe if not this".

RichardBelzer commented 7 months ago

Agreed, we don't know what we don't know, we need to:

  1. Get a machine / test case to repro it most of the time (ideally 100%). I think we're already here with newer coreutils and building go.
  2. Determine what we think is the bug and produce a patch
  3. Patch the source, rerun the test(s), verify that this now happens 0% of the time

I don't know if there's a more robust way of figuring this out outside of that. And of course I'm handwaving step 2, which could be a heavy lift.

rincebrain commented 7 months ago

I'd assume, not having looked, that it's something like this: copy_file_range isn't dirtying things, so the check that forces a txg sync on SEEK_DATA/SEEK_HOLE when the dnode is dirty isn't firing.
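
To make that concrete, here is a rough sketch (mine, not code from cp or Go) of the hole-skipping copy loop that SEEK_DATA/SEEK_HOLE consumers use; if dirty, not-yet-synced data is misreported as a hole, the loop never copies it and the destination keeps zeros there.

/* Rough sketch (not from cp or Go) of a hole-aware copy loop built on
 * lseek(2) SEEK_DATA/SEEK_HOLE. Data runs are copied; holes are skipped
 * and left as zeros in the destination. If the filesystem misreports
 * dirty, unsynced data as a hole, that data is silently left out. */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <unistd.h>

static int sparse_copy(int src, int dst, off_t size)
{
    if (ftruncate(dst, size) < 0)        /* destination starts as one big hole */
        return -1;

    off_t pos = 0;
    while (pos < size) {
        off_t data = lseek(src, pos, SEEK_DATA);
        if (data < 0)                    /* ENXIO: only holes remain */
            break;
        off_t hole = lseek(src, data, SEEK_HOLE);
        if (hole < 0)
            return -1;

        /* Copy just the data run [data, hole); the gap before it stays zero. */
        off_t in = data, out = data, len = hole - data;
        while (len > 0) {
            ssize_t n = copy_file_range(src, &in, dst, &out, (size_t)len, 0);
            if (n <= 0)
                return -1;
            len -= n;
        }
        pos = hole;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 3)
        return 2;
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_CREAT | O_TRUNC | O_WRONLY, 0644);
    off_t size = lseek(src, 0, SEEK_END);
    return (src < 0 || dst < 0 || sparse_copy(src, dst, size) < 0) ? 1 : 0;
}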

But as I'm somewhat occupied recovering from being badly ill at home, my time to test and debug this is rather finite, so I wouldn't assume I'll be doing that in a timely fashion.

robn commented 7 months ago

I agree with @rincebrain on the rough theory. I did look, and landed on code I've looked at before while trying to get my head around corner cases in block cloning. I have been suspicious of it for a while now, and remain so.

When we start to clone a block, we indicate intent to modify the target dbuf by calling dmu_buf_will_clone, which is a special case that unwinds any changes on the dbuf, and then sets it to DB_NOFILL.

In dbuf.c:

void
dmu_buf_will_clone(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
    dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

    /*
     * Block cloning: We are going to clone into this block, so undirty
     * modifications done to this block so far in this txg. This includes
     * writes and clones into this block.
     */
    mutex_enter(&db->db_mtx);
    VERIFY(!dbuf_undirty(db, tx));
    ASSERT0P(dbuf_find_dirty_eq(db, tx->tx_txg));
    if (db->db_buf != NULL) {
        arc_buf_destroy(db->db_buf, db);
        db->db_buf = NULL;
    }
    mutex_exit(&db->db_mtx);

    dmu_buf_will_not_fill(db_fake, tx);
}

void
dmu_buf_will_not_fill(dmu_buf_t *db_fake, dmu_tx_t *tx)
{
    dmu_buf_impl_t *db = (dmu_buf_impl_t *)db_fake;

    mutex_enter(&db->db_mtx);
    db->db_state = DB_NOFILL;
    DTRACE_SET_STATE(db, "allocating NOFILL buffer");
    mutex_exit(&db->db_mtx);

    dbuf_noread(db);
    (void) dbuf_dirty(db, tx);
}

It seems there's a window there where the lock is dropped and the dbuf is not dirty. That may be a place where a second thread can add a change to that block, which then gets trampled.

Unfortunately dbuf_dirty and dbuf_undirty are extremely complex, and the gap is tiny, and without a tight test case it's gonna take me a lot of time to track this down, which I don't have at the moment[^1]. So this is an unhelpful comment, but maybe it gives someone else an idea of where to look.

[^1]: but you can buy some of my time :wink:
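
As an illustration only, the suspected interleaving paraphrased as a timeline (not code from the tree):

/*
 * Suspected interleaving, paraphrasing the comment above (illustrative
 * only; thread A is cloning into the block, thread B is a concurrent
 * writer to the same block):
 *
 *   A: dmu_buf_will_clone()
 *        mutex_enter(&db->db_mtx)
 *        dbuf_undirty(db, tx)        block is now clean
 *        mutex_exit(&db->db_mtx)     lock dropped, dbuf still clean
 *   B: writes to the same block      lands in the unlocked, undirtied gap
 *   A: dmu_buf_will_not_fill()
 *        db->db_state = DB_NOFILL
 *        dbuf_dirty(db, tx)          B's change can now be lost when the
 *                                    clone replaces the block
 */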

grahamperrin commented 7 months ago

https://github.com/openzfs/zfs/issues/15526#issuecomment-1810800004

My understanding is that block cloning is enabled by default when upgrading to 2.2.0, …

With FreeBSD 14.0-RELEASE, there's a complementary vfs.zfs.bclone_enabled kernel state, which here is 0 (zero):

root@mowa219-gjp4-freebsd-14-zfs-vm:~ # sysctl -d vfs.zfs.bclone_enabled
vfs.zfs.bclone_enabled: Enable block cloning
root@mowa219-gjp4-freebsd-14-zfs-vm:~ # sysctl vfs.zfs.bclone_enabled
vfs.zfs.bclone_enabled: 0
root@mowa219-gjp4-freebsd-14-zfs-vm:~ # freebsd-version -kru ; uname -aKU
14.0-RELEASE
14.0-RELEASE
14.0-RELEASE
FreeBSD mowa219-gjp4-freebsd-14-zfs-vm 14.0-RELEASE FreeBSD 14.0-RELEASE releng/14.0-n265380-f9716eee8ab GENERIC amd64 1400097 1400097
root@mowa219-gjp4-freebsd-14-zfs-vm:~ # pkg upgrade -r FreeBSD-base
Updating FreeBSD-base repository catalogue...
FreeBSD-base repository is up to date.
All repositories are up to date.
Checking for upgrades (0 candidates): 100%
Processing candidates (0 candidates): 100%
Checking integrity... done (0 conflicting)
Your packages are up to date.
root@mowa219-gjp4-freebsd-14-zfs-vm:~ # exit
logout
grahamperrin@mowa219-gjp4-freebsd-14-zfs-vm:~ % pkg -vv | grep -e url -e enabled
    url             : "pkg+https://pkg.freebsd.org/FreeBSD:14:amd64/latest",
    enabled         : yes,
    url             : "pkg+https://pkg.freebsd.org/FreeBSD:14:amd64/base_release_0",
    enabled         : yes,
grahamperrin@mowa219-gjp4-freebsd-14-zfs-vm:~ % 

sysctl(8)

Whilst 14.0 is not yet announced, I can't imagine a change to the value at this time.

rincebrain commented 7 months ago

Time to upstream that and add Linux support, with how broken this is.

KungFuJesus commented 7 months ago

I am curious if it is broken in the same way on FreeBSD. Has anyone managed to reproduce this with intentional reflink copies? From what I recall the implementer's main platform was FreeBSD.

Also, some of these stack traces are blown assertions in the ZIL. Are we certain all of these are related to the BRT feature? I know a lot of the more recent optimizations around the ZIL messed with lock granularity; do we know that these things haven't happened without BRT being leveraged?

rincebrain commented 7 months ago

Here is probably not the right place to debate whether the other bugs around the ZIL are or are not from BRT. At least in #15529, I only tagged the ones where it was explicitly the case that block cloning was happening.

I'm also not sure what you mean by "intentional reflink support" here: in both the FreeBSD and Linux cases, they're calling a copy_file_range analogue that is going to reflink if it can and do a boring copy otherwise; nobody here is explicitly trying FICLONE, which, last I checked, has no analogue on FreeBSD at all.

terinjokes commented 7 months ago

The cp from coreutils-9.3 is using FICLONE, and falling back to copy_file_range if it fails.
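
For reference, that clone-then-fall-back pattern looks roughly like this (a hypothetical helper of mine, not coreutils source); error handling is abbreviated:

/* Rough sketch of the clone-then-fall-back pattern described above
 * (hypothetical helper, not coreutils source). Try a whole-file FICLONE
 * first; if that fails, fall back to copy_file_range(2). */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>          /* FICLONE */
#include <unistd.h>

static int clone_or_copy(int src, int dst, size_t len)
{
    if (ioctl(dst, FICLONE, src) == 0)
        return 0;                       /* reflinked in one call */

    /* Fallback: byte-range copy; the kernel may still clone internally. */
    off_t in = 0, out = 0;
    while (len > 0) {
        ssize_t n = copy_file_range(src, &in, dst, &out, len, 0);
        if (n <= 0)
            return -1;
        len -= (size_t)n;
    }
    return 0;
}

int main(int argc, char **argv)
{
    if (argc != 3) { fprintf(stderr, "usage: %s SRC DST\n", argv[0]); return 2; }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_CREAT | O_TRUNC | O_WRONLY, 0644);
    struct stat st;
    if (src < 0 || dst < 0 || fstat(src, &st) < 0) { perror("open"); return 1; }
    return clone_or_copy(src, dst, (size_t)st.st_size) == 0 ? 0 : 1;
}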

KungFuJesus commented 7 months ago

By intentional I mean doing cp with a reflink argument, assuming the BSD variant of cp has it. I'd expect that by default it doesn't try to use it automatically like coreutils' --reflink=auto behavior. When something like make is copying around quickly generated build artifacts, I expect you're dealing with code that is far more likely to hit race conditions than a file that's been stable on disk for a few seconds and is being copied once with the BRT-backed clone feature.

rincebrain commented 7 months ago

My understanding is that cp on FreeBSD just invokes copy_file_range in every case, and as I said, has no FICLONE analogue.

KungFuJesus commented 7 months ago

Ah, gotcha: so the benefit of avoiding repeated userspace/kernelspace round trips is there, but the BRT-based shortcut and space savings are gated behind that sysctl. Makes sense. I am curious whether FICLONE is the issue here and copy_file_range works just fine.

rincebrain commented 7 months ago

Since the code, if I'm reading it right, for FICLONE just invokes the same backend function, it shouldn't be FICLONE specifically. I would guess differing semantics about how easy races are to trigger between the two platforms' memory management, but I really don't know.

tonyhutter commented 7 months ago

Pinging @pjd on block cloning.

I'm seeing if I can make a non-gentoo reproducer.

thulle commented 7 months ago

Since I have 1.42GB of possibly affected files, I thought I'd check for stretches of zeroes in the other files to see if anything other than compiling go might have triggered this. The block cloning commit only seems to change zdb code regarding the ZIL, so I just want to verify: we have no way of dumping the BRT to check which files contain cloned blocks, right?

Currently dumping a list of all file blocks to check for multiple references to the same block.

rincebrain commented 7 months ago

I don't think zdb knows how to dump it as it is now, no. I don't think it'd be hard to teach it, though.

robn commented 7 months ago

It's not hard to dump, but it also doesn't really show what you want. Each entry is just the offset within a vdev (half a DVA) and the number of references to it. It doesn't know what a file is, nor even really what a block pointer is.
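
For reference, an in-memory BRT entry is roughly this shape; the field names below are approximations based on the description above, not a verbatim copy of the OpenZFS source:

/* Approximate shape of a BRT entry, per the description above: just a
 * vdev-relative offset (half a DVA) and a reference count. Names here
 * are illustrative, not copied from brt.c. */
#include <stdint.h>

typedef struct brt_entry {
    uint64_t    bre_offset;     /* offset of the cloned block within its vdev */
    uint64_t    bre_refcount;   /* number of references to that block */
} brt_entry_t;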

Probably the dumbest version is to extend the in-memory mini-BRT used in zdb -b to note the object in which cloned blocks were seen, and then dump those out at the end. It does mean a full scan though (about the same heft as a scrub).

Hmm, now that I think about it, it might not help anyway. If the problem is that there was a race and we applied a change in the wrong order, such that zeroes got written, then the clone itself was never performed: we did a real write, just with zero content. So there's no clone to look for.

thulle commented 7 months ago

@robn The paths can be shown in the output with a list of all the DVAs though (-ddddd), and then it's just a matter of working backwards to find all the files for them. Having a dump of the BRT would've helped in the process. The idea was that whatever produced those files is using a syscall that is susceptible to this bug.

I've checked most files with cloned blocks now though, and found nothing incorrect. The only ones remaining to verify are all copies made by Steam, but since they're Windows .dlls and the like, chunks of zeroes are quite common, so I'll have to verify them some other way.

The next idea is to check all files modified since enabling block cloning and compare them with earlier versions in snapshots to see if chunks of zeroes have appeared in files that used to have none.
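
For that kind of sweep, a trivial scanner along these lines could flag candidate files (a hypothetical helper, not something from the thread):

/* zero_runs.c - hypothetical helper for the check described above: scan a
 * file for 4 KiB blocks that are entirely zero, printing their offsets so
 * they can be compared against an older snapshot copy of the same file.
 * Build: cc -O2 -o zero_runs zero_runs.c */
#include <stdio.h>
#include <string.h>

#define BLK 4096

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 2;
    }
    FILE *f = fopen(argv[1], "rb");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }
    static const char zero[BLK];        /* all-zero reference block */
    char buf[BLK];
    size_t n;
    long long off = 0;
    while ((n = fread(buf, 1, BLK, f)) > 0) {
        if (n == BLK && memcmp(buf, zero, BLK) == 0)
            printf("all-zero block at offset %lld\n", off);
        off += (long long)n;
    }
    fclose(f);
    return 0;
}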

behlendorf commented 7 months ago

Unfortunately, I've been unable to reproduce this by rebuilding go on Fedora 39 (kernel 6.5.11-300) with coreutils-9.3 and OpenZFS 2.2. Observing the builds, BRT is definitely being exercised, but I haven't been able to trigger this in over 50 builds. @terinjokes @thulle could you share some more details about the exact procedure you're using to trigger this?

terinjokes commented 7 months ago

I don't think I'm doing anything unusual. I'm building go with Gentoo Portage and can reliably reproduce with:

  1. emerge -C go (to remove the current version of Go; not strictly necessary if the current package works).
  2. emerge -1av go, which fetches go-bootstrap-1.18 and uses it to build Go 1.21.4.
  3. At the end of the build, I can check the installed files and see they're corrupted.

I was also able to verify that navigating to the ebuild and executing ebuild go-1.21.4.ebuild clean install in place of step 2 above results in corrupted files in the image DESTDIR directory under /var/tmp/portage, so these are being corrupted as part of the build, and not part of the final copy into the system directories.

I'm using the pre-built "gentoo-kernel-bin" kernel on both of the machines I can reproduce this with.

KungFuJesus commented 7 months ago

I wonder if portage's sandboxing for the build process does anything special / weird with hardlinks that could cause this.

rincebrain commented 7 months ago

It's important to note that it's sometimes very hard to convince ZFS to enable the BRT feature on the pool, and it'll just fail to reflink things then. So I'd check that the feature is actually marked active before trying to reproduce this.

(Specifically I mean that a lot of the time if the pool is even slightly busy it'll return EAGAIN trying to go from enabled to active initially. Or sometimes even if it's already active it'll think the file is busy and return EAGAIN and fall back to a normal copy if you used copy_file_range. I tried making a reproducer out of the existing SEEK_HOLE test cases but gave up after about an hour because it was so inconsistent at enabling the feature and kept mysteriously erroring out trying to make clones.)

thesamesam commented 7 months ago

I wonder if portage's sandboxing for the build process does anything special / weird with hardlinks that could cause this.

I don't think so (although I wouldn't say it's impossible) -- someone can try FEATURES="-sandbox -usersandbox" to confirm. Further, at least that wasn't the case in the other bug. But that other bug was extremely timing sensitive for me and I could only ever hit it on one machine, and not many other Gentoo ones.

terinjokes commented 7 months ago

I discussed with @thesamesam on IRC, and rebuilt go with FEATURES="-usersandbox -sandbox" to disable sandboxing, and I still have the same corruption.

KungFuJesus commented 7 months ago

I'm trying to think of anything else portage does that might be different from a usual compilation outside of portage, but not a whole lot comes to mind. It could just be that you are reliably hitting a data race with reliably bad timing.

Portage:

@thesamesam I'm sure this is leaving a lot out; these are just things off the top of my head that it does.

thulle commented 7 months ago

I'm not doing anything in particular either; the only difference from terinjokes is that instead of removing go and rebuilding, I've just emerged an older binary package to recover.

I'm not using the prebuilt gentoo-kernel, but my own build of gentoo-sources-6.5.9 currently.

Setting FEATURES="-sandbox -usersandbox" and xattr=off doesn't improve things.

thulle commented 7 months ago

Setting sync=always seems to change something. Using file usr/lib/go/pkg/tool/linux_amd64/* | cut -c 1-72 to verify, I've now successfully built twice.

edit: 4 successful builds, 2 while having re-enabled sandbox and usersandbox. But as soon as I set xattr=sa again I got a failure. Doing more builds to check if this holds true.

rincebrain commented 7 months ago

That makes sense.

IIRC Go builds do a lot of cp-ing things around, which is why this was so good at triggering issues, from looking into why this came up in #11900.

terinjokes commented 7 months ago

I made a new fs with sync=always and mounted it as the Portage TMPDIR, and while it improved the situation (likely by changing the timing), I was still able to reproduce this issue.

/var/tmp/portage/dev-lang/go-1.21.4/image # file usr/lib/go/pkg/tool/linux_amd64/vet
usr/lib/go/pkg/tool/linux_amd64/vet: data

/var/tmp/portage/dev-lang/go-1.21.4/image # hexdump -C usr/lib/go/pkg/tool/linux_amd64/vet
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fa0  00 00 00 00 00 00 00 00  00 00 00 00 69 74 6a 6e  |............itjn|
00000fb0  6f 69 72 46 39 78 39 66  6b 4c 73 59 4d 77 47 50  |oirF9x9fkLsYMwGP|
00000fc0  2f 59 69 50 4c 58 72 47  65 55 53 57 56 65 4c 76  |/YiPLXrGeUSWVeLv|
00000fd0  4d 4c 6f 74 36 2f 44 53  30 46 37 55 4d 66 5f 6c  |MLot6/DS0F7UMf_l|
00000fe0  45 66 72 32 68 6e 6b 62  71 32 2f 47 57 4e 70 78  |Efr2hnkbq2/GWNpx|
00000ff0  5f 68 2d 43 78 38 69 6d  63 34 76 6b 37 74 62 00  |_h-Cx8imc4vk7tb.|
00001000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003954a0  69 74 6a 6e 6f 69 72 46  39 78 39 66 6b 4c 73 59  |itjnoirF9x9fkLsY|
003954b0  4d 77 47 50 2f 59 69 50  4c 58 72 47 65 55 53 57  |MwGP/YiPLXrGeUSW|
003954c0  56 65 4c 76 4d 4c 6f 74  36 2f 44 53 30 46 37 55  |VeLvMLot6/DS0F7U|
003954d0  4d 66 5f 6c 45 66 72 32  68 6e 6b 62 71 32 2f 47  |Mf_lEfr2hnkbq2/G|
003954e0  57 4e 70 78 5f 68 2d 43  78 38 69 6d 63 34 76 6b  |WNpx_h-Cx8imc4vk|
003954f0  37 74 62 00 00 00 00 00  00 00 00 00 00 00 00 00  |7tb.............|
00395500  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00595460  00 00 00 00 00 00 00                              |.......|
00595467

Edit to add:

/var/tmp/portage/dev-lang/go-1.21.4/image # findmnt /var/tmp/portage/
TARGET           SOURCE              FSTYPE OPTIONS
/var/tmp/portage zroot/local/portage zfs    rw,relatime,xattr,posixacl,casesensitive

/var/tmp/portage/dev-lang/go-1.21.4/image # zfs get sync zroot/local/portage
NAME                 PROPERTY  VALUE     SOURCE
zroot/local/portage  sync      always    local
vivo75 commented 7 months ago

A delay between writing files to disk and copying them seems to affect the bug.

The following bash for loops mimic the Portage behaviour with and without a delay between phases. Log files are written listing the broken files; from them we can deduce that:

1. It happens only when files are copied from the build directory to the image directory.
2. (struck out: "a mere 10 seconds delay is enough to remove the bug entirely") false: even 30 seconds does not make a difference.
   2.1. When using emerge rather than this mimic, more files are broken, so this is very timing-sensitive.

Storage is NVME

> pippo
for i in clean compile install merge ; do 
  ebuild  /var/db/repos/gentoo/dev-lang/go/go-1.21.4.ebuild $i
  file /var/tmp/portage/dev-lang/go-1.21.4/{work,image/usr/lib}/go/pkg/tool/linux_amd64/* \
  | grep -e '\bdata\b' >> pippo
done
cat pippo
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data
> pluto
for i in clean compile install merge ; do 
  ebuild  /var/db/repos/gentoo/dev-lang/go/go-1.21.4.ebuild $i
  file /var/tmp/portage/dev-lang/go-1.21.4/{work,image/usr/lib}/go/pkg/tool/linux_amd64/* \
  | grep -e '\bdata\b' >> pluto
  sleep 10
done
wc pluto
0 0 0 pluto
rincebrain commented 7 months ago

txg sync timeout default is 5 seconds, so 10 seconds making it not reproduce tracks.

behlendorf commented 7 months ago

That's helpful, thanks. Since you're able to reproduce it, could you try setting zfs_dmu_offset_next_sync=0 and see if it reproduces? This will effectively disable hole reporting, which will hopefully help us narrow this down.

thulle commented 7 months ago

Did a loop of 5 builds for each setting:

with sync=always & xattr=off, all 5 are ok (on top of the 4 successful builds in the edit above).
with sync=always & xattr=sa, 3 builds failed (on top of the 1 failed build in the edit above).

vivo75 commented 7 months ago

txg sync timeout default is 5 seconds, so 10 seconds making it not reproduce tracks.

Mmh, I've lowered the txg timeout to 1s and tried again, also lowering the sleep to 2s. Confusing results at best.

cat /sys/module/zfs/parameters/zfs_txg_timeout
1

> pippo
for i in clean compile install ; do 
  ebuild  /var/db/repos/gentoo/dev-lang/go/go-1.21.4.ebuild $i
  file /var/tmp/portage/dev-lang/go-1.21.4/{work,image/usr/lib}/go/pkg/tool/linux_amd64/* \
  | grep -e '\bdata\b' >> pippo
done

cat pippo
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

> pluto
for i in clean compile install ; do 
  ebuild  /var/db/repos/gentoo/dev-lang/go/go-1.21.4.ebuild $i
  file /var/tmp/portage/dev-lang/go-1.21.4/{work,image/usr/lib}/go/pkg/tool/linux_amd64/* \
  | grep -e '\bdata\b' >> pluto
  sleep 2
done

cat pluto
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/asm:       data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/cgo:       data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/compile:   data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/link:      data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data
vivo75 commented 7 months ago

That's helpful thanks. Since you're able to reproduce it could you try setting zfs_dmu_offset_next_sync=0 and see if it reproduces. This will effectively disable hole reporting while will hopefully help us narrow this down.

Sorry; same loops as before: pippo has no delay, pluto has a 10-second delay.

cat /sys/module/zfs/parameters/zfs_txg_timeout
5
cat /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
0
cat pippo
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
cat pluto
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/cover:     data
/var/tmp/portage/dev-lang/go-1.21.4/image/usr/lib/go/pkg/tool/linux_amd64/vet:       data

rincebrain commented 7 months ago

I've got a Gentoo VM built that reproduces this readily building Go, so,

And Brian, that tunable won't save us here, I think, because the bug is, I believe, that it's marking the buffer as no longer dirty, so in this block:

        rw_enter(&dn->dn_struct_rwlock, RW_READER);

        if (dnode_is_dirty(dn)) {
                /*
                 * If the zfs_dmu_offset_next_sync module option is enabled
                 * then hole reporting has been requested.  Dirty dnodes
                 * must be synced to disk to accurately report holes.
                 *
                 * Provided a RL_READER rangelock spanning 0-UINT64_MAX is
                 * held by the caller only a single restart will be required.
                 * We tolerate callers which do not hold the rangelock by
                 * returning EBUSY and not reporting holes after one restart.
                 */
                if (zfs_dmu_offset_next_sync) {
                        rw_exit(&dn->dn_struct_rwlock);
                        dnode_rele(dn, FTAG);

                        if (restarted)
                                return (SET_ERROR(EBUSY));

                        txg_wait_synced(dmu_objset_pool(os), 0);
                        restarted = 1;
                        goto restart;
                }

                err = SET_ERROR(EBUSY);
        } else {
                err = dnode_next_offset(dn, DNODE_FIND_HAVELOCK |
                    (hole ? DNODE_FIND_HOLE : 0), off, 1, 1, 0);
        }

        rw_exit(&dn->dn_struct_rwlock);

I believe the first check, dnode_is_dirty(), returns false even though the block has unsynced changes, so the tunable is academic.

I think that adding a bit for "was cloned" to the object and checking that here would do the right thing, but I don't know if that's the only problem, and I'll test it.

rincebrain commented 7 months ago

I can no longer easily reproduce it with https://github.com/rincebrain/zfs/commit/3f9688eb36023f6f69b98ffbc30267ba24d33ad8, so have fun testing that, if you like; I make absolutely no promises it won't burn your house down and salt the earth underneath.

robn commented 7 months ago

Nice, that helps to validate the theory. I'll try to tighten it up a little and throw it at the whole test suite over the weekend.

robn commented 7 months ago

@thulle Having a dump of the BRT would've helped in the process.

FWIW, I just posted #15541, adding an option to zdb to dump the BRT.