openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

SEEK_DATA fails randomly #11900

Closed ismell closed 2 years ago

ismell commented 3 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version rodete 20210324.05.05RL
Linux Kernel 5.10.19-1rodete1
Architecture amd64
ZFS Version zfs-2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

lseek + SEEK_DATA is randomly returning -ENXIO.

Code showing the problem: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/portage_tool/src/portage_util_file_copy_reflink_linux.c;l=145

I patched it with the following:

$ git diff
diff --git a/src/portage_util_file_copy_reflink_linux.c b/src/portage_util_file_copy_reflink_linux.c
index 352342c06..f261777fc 100644
--- a/src/portage_util_file_copy_reflink_linux.c
+++ b/src/portage_util_file_copy_reflink_linux.c
@@ -140,8 +140,9 @@ do_lseek_data(int fd_out, int fd_in, off_t *off_out) {
     /* Use lseek SEEK_DATA/SEEK_HOLE for sparse file support,
      * as suggested in the copy_file_range man page.
      */
-    off_t offset_data, offset_hole;
+    off_t offset_data, offset_hole, current;

+    current = lseek(fd_in, 0, SEEK_CUR);
     offset_data = lseek(fd_in, *off_out, SEEK_DATA);
     if (offset_data < 0) {
         if (errno == ENXIO) {
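
For context, the copy loop in that helper is roughly the following (a minimal sketch loosely based on the linked source, with error handling and the reflink fallback trimmed; sparse_copy is just an illustrative name):

/* Minimal sketch of a SEEK_DATA/SEEK_HOLE copy loop like the one in the
 * linked portage helper (illustrative only; not the verbatim source). */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int
sparse_copy(int fd_in, int fd_out)
{
    off_t off = 0;

    for (;;) {
        /* Find the next data region at or after the current offset. */
        off_t data = lseek(fd_in, off, SEEK_DATA);
        if (data < 0)
            return (errno == ENXIO) ? 0 : -1;   /* ENXIO: no data left */

        /* Find where that data region ends. */
        off_t hole = lseek(fd_in, data, SEEK_HOLE);
        if (hole < 0)
            return -1;

        /* Copy only the data region; the caller recreates the overall
         * file length (and any trailing hole) with ftruncate(). */
        off_t in_off = data, out_off = data;
        ssize_t n = copy_file_range(fd_in, &in_off, fd_out, &out_off,
            hole - data, 0);
        if (n <= 0)
            return -1;
        off = data + n;
    }
}

With the bug, the very first SEEK_DATA at offset 0 returns ENXIO even though the file contains data, so a loop like this copies nothing and the caller just truncates the destination to the source length, leaving a hole-only file. That matches the failing trace below.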

On a successful run, this is what the strace looks like (depthcharge.elf):

[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = 0
[pid 3014182] lseek(3, 0, SEEK_HOLE)    = 2439852
[pid 3014182] copy_file_range(3, [0], 4, [0], 2439852, 0) = 2439852
[pid 3014182] lseek(3, 0, SEEK_CUR)     = 2439852
[pid 3014182] lseek(3, 2439852, SEEK_DATA) = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 2439852
[pid 3014182] ftruncate(4, 2439852)     = 0
[pid 3014182] close(4)                  = 0
[pid 3014182] close(3)                  = 0

When a failure happens, this is what it looks like (netboot.elf):

[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 1824516
[pid 3014182] lseek(4, 1824516, SEEK_SET) = 1824516
[pid 3014182] ftruncate(4, 1824516)     = 0
[pid 3014182] close(4)                  = 0
[pid 3014182] close(3)                  = 0

man lseek shows:

SEEK_DATA
              Adjust the file offset to the next location in the file greater than or equal to offset containing data.  If offset points to data, then the file offset is set to offset.

ENXIO  whence is SEEK_DATA or SEEK_HOLE, and offset is beyond the end of the file, or whence is SEEK_DATA and offset is within a hole at the end of the file.
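
Put differently: for a file that is entirely data, SEEK_DATA at offset 0 must return 0, and ENXIO is only legitimate once the offset is at or past EOF or inside a trailing hole. A quick illustration of that expectation (just a sketch with a made-up filename, not a reproducer for this bug):

/* Illustration of the man-page semantics quoted above (not a reproducer). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[4096];
    int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return 1;
    memset(buf, 'x', sizeof (buf));
    write(fd, buf, sizeof (buf));   /* the file is now 4096 bytes of data */

    /* Expected: 0, because offset 0 points at data. */
    printf("SEEK_DATA at 0:    %ld\n", (long)lseek(fd, 0, SEEK_DATA));
    /* Expected: -1 with errno == ENXIO, because the offset is at EOF. */
    printf("SEEK_DATA at 4096: %ld\n", (long)lseek(fd, 4096, SEEK_DATA));

    close(fd);
    return 0;
}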

Describe how to reproduce the problem

I don't have a simple way to reproduce this yet. It happens as part of a portage build. I have attached a full strace log. The problem shows up randomly, so running the same command multiple times will cause different files to fail to copy.

Include any warning/errors/backtraces from the system logs

depthcharge-strace-patch1.log

rincebrain commented 2 years ago

I see I should have been more precise.

I have two identical copies of a Gentoo root filesystem running on identical VMs on an identical host, with identical kernels. With the Gentoo kernel booting an Ubuntu 18.04 userland and then chrooting into one copy, it does not reproduce, ever. With the Gentoo kernel booting a Gentoo userland and then chrooting into the other copy, it does.

thesamesam commented 2 years ago

I'm hitting a different issue since running with the patch committed to master on top of 2.1.1 (note that this didn't seem to happen with https://github.com/openzfs/zfs/commit/93e19c7b35b62c238839536fed0d0f23c0466136 when I was running it in the past, but I've not (yet?) tried downgrading to it):

The same messages occur for other kernel modules, like libata.ko.

Running file on a bad module yields:

# file /lib/modules/5.14.17-adry/kernel/drivers/ata/libata.ko
/lib/modules/5.14.17-adry/kernel/drivers/ata/libata.ko: data

Running file on a good module (from a previous kernel build) yields:

# file /lib/modules/5.14.16-adry/kernel/drivers/ata/libata.ko
/lib/modules/5.14.16-adry/kernel/drivers/ata/libata.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=34d7fc7363ebace6b3bd5f1c6d124b8909da6806, not stripped

If I use a tmpfs to build the kernel and ZFS, these errors do not occur. This is all with coreutils 9 too. I'll try to reproduce with no-tmpfs + older coreutils now.

EDIT: If I downgrade coreutils, the modules aren't corrupted. EDIT: This seems to happen reliably (read: every time so far) with coreutils-9.0 and not at all with coreutils-8.32.

The modules are broken during the process of copying from the 'work dir' (where building is done) to the image. Same issue as with Go, IIRC.

# file /var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/work/zfs-2.1.1/module/zstd/zzstd.ko 
/var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/work/zfs-2.1.1/module/zstd/zzstd.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=1801c62400d21bcaea7409e1eca7126133252bea, not stripped

# file /var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/image/lib/modules/5.14.17-adry/extra/zstd/zzstd.ko 
/var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/image/lib/modules/5.14.17-adry/extra/zstd/zzstd.ko: data

EDIT: Summary so far:

I should've tried the PR before it was merged, but the PR didn't seem to apply (when I tried) to 2.1.1 and I was happy with the commit I mentioned given it seemed to be working for me. Apologies.

rincebrain commented 2 years ago

Time to wake up the experiment VM again...

(Maybe #12710 would be the safe option for shipping 2.1.2/2.0.7 after all?)

rincebrain commented 2 years ago

One curious thing I just had happen.

Running 2.1.1 with that patch cherry-picked, I did an emerge of smartmontools, ran file [tmp path to sbin]; du [tmp path to sbin], and got back:

/var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartctl:             ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
/var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartd:               ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
/var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/update-smart-drivedb: POSIX shell script, ASCII text executable
1       /var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartctl
1       /var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartd

!?

So I ran du again and got:

459     /var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartctl

So there's at least a race with the reported size if you just cherry-pick that patch to 2.1.1, I guess...

(This could be pre-existing, I just noticed it and went "...that's very weird.")

rincebrain commented 2 years ago

Yup, reproduces great: permanently (and differently each time) mangled files with zfs_dmu_offset_next_sync=1 (though not 0, so far, so that's something), using ebuild zfs-kmod-2.1.1.ebuild clean install with the portage tmpdir on ZFS.

The new files are entirely empty:

# ~rich/zfs_cp/cmd/zdb/zdb -dbdbdbdbdbdb testpool/portage_tmp 112048
Dataset testpool/portage_tmp [ZPL], ID 259, cr_txg 12, 981M, 149884 objects, rootbp DVA[0]=<0:341076600:200> DVA[1]=<0:380095400:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=64564L/64564P fill=149884 cksum=1ae04a529f:73bebec3954:132f6162a5765:26758bb97eb8b2

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    112048    1   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               176   bonus  System attributes
        dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /portage/sys-fs/zfs-kmod-2.1.1/image/lib/modules/5.10.74-gentoo/extra/lua/zlua.ko
        uid     0
        gid     0
        atime   Mon Nov  8 04:47:42 2021
        mtime   Mon Nov  8 04:47:41 2021
        ctime   Mon Nov  8 04:47:41 2021
        crtime  Mon Nov  8 04:47:41 2021
        gen     64561
        mode    100644
        size    320496
        parent  111907
        links   1
        pflags  840800000004
Indirect blocks:

#

(and they stay that way even across a pool export+import, so this isn't just "zdb is catching state that hasn't been flushed yet".)

My guess is that whatever is causing the short reported sizes also results in seeing an empty length when deciding how much to copy, but that's just a guess.

@thesamesam reported (IIRC) that he couldn't reproduce this before 93e19c7 in limited testing, with either setting, so this seems like a worse outcome...

I'm...going to test it after I've exhaustively convinced myself this doesn't reproduce with =0, but unless that's broken in novel ways I'm not expecting, I'm pretty strongly going to suggest reverting 93e19c7 and pulling in #12710 until this gets run down more.

Gonna nominate this for the top of the bug blitz on Tuesday unless someone knocks it out first. :D

behlendorf commented 2 years ago

One possible explanation for this would be the slight change to how we're checking for a dirty dnode. Could someone who is already setup to test this try the following small change (on top of the previous fix). It effectively reverts the dirty check to the previous logic while preserving the rest of the mmap fix.

diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index 6f87f49f8..db1a5d71d 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1657,7 +1657,7 @@ dnode_is_dirty(dnode_t *dn)
        mutex_enter(&dn->dn_mtx);

        for (int i = 0; i < TXG_SIZE; i++) {
-               if (list_head(&dn->dn_dirty_records[i]) != NULL) {
+               if (multilist_link_active(&dn->dn_dirty_link[i])) {
                        mutex_exit(&dn->dn_mtx);
                        return (B_TRUE);
                }
thesamesam commented 2 years ago

@behlendorf I haven't been able to reproduce the issue with this new patch so far. Previously it would reliably happen within 3-4 attempts with one case and every attempt in the other cases, so it looks promising. I'll keep running with it.

rincebrain commented 2 years ago

Well, my spreadsheet says "2/10 builds of smartmontools with dmu_sync=1 and current master corrupted (and 10/10 for zfs-kmod)", and 0/10 and 0/10 with the change applied, so that seems promising...

(Space is still wrong for all the zfs-kmod runs and none of the smartmontools runs, so that's unfortunate, but for another day.)

edit to add: I think I'd advocate defaulting to dmu_offset_next_sync=1 for the tunable, not 0, since "this feature works but it's stubbed out silently by default" will rapidly become "this feature doesn't work", IMO...

behlendorf commented 2 years ago

Thanks for the quick testing. Do I understand correctly, then, that with the patch in #12745 you're no longer able to reproduce the issue?

@rincebrain I tend to agree about the default. This was originally done to minimize the performance impact when checking for holes, but I think in practice it ended up being too confusing. I've gone ahead and opened #12746 to change the default.

rincebrain commented 2 years ago

I can absolutely not reproduce it any more with the patch in #12745 on any case I've tested.

thesamesam commented 2 years ago

Same here -- no issues with master + #12745. No opinion on the default although it may be worth revisiting it once things have calmed down.

robszy commented 2 years ago

For sure we need tests to cover these scenarios, as they are hard to reproduce on live systems.

thesamesam commented 2 years ago

Bit of an out-there idea: it would be interesting to see if we could get a CI box fast enough to do emerges (in Gentoo) that could try to hit such problems.

Of course, a more isolated test case would be great, even if it were just calling cp in a loop?
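
One possible shape for such a test, purely as a sketch (hypothetical filename and all; nothing in this thread confirms this alone triggers the bug, since the real failures go through the portage/coreutils copy path):

/* Hypothetical stress loop: write a file, then immediately ask SEEK_DATA
 * where the data starts. For a fully written file the answer should always
 * be 0; ENXIO would mean fresh data is being reported as a hole. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[1 << 16];
    memset(buf, 'x', sizeof (buf));

    for (long i = 0; ; i++) {
        int fd = open("seekdata.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof (buf)) != (ssize_t)sizeof (buf)) {
            perror("open/write");
            return 1;
        }
        if (lseek(fd, 0, SEEK_DATA) < 0) {
            fprintf(stderr, "iteration %ld: SEEK_DATA: %s\n",
                i, strerror(errno));
            return 1;
        }
        close(fd);
        unlink("seekdata.tmp");
    }
}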

robszy commented 2 years ago

It would be very interesting to work out what actually happens in the emerge case and build a test case around that, rather than just guesstimating what could be wrong in the code and fixing every possible slip, probably introducing regressions.

rincebrain commented 2 years ago

> It would be very interesting to work out what actually happens in the emerge case and build a test case around that, rather than just guesstimating what could be wrong in the code and fixing every possible slip, probably introducing regressions.

I presume you mean something much more constrained than what I'm taking from your comment, otherwise you seem to be proposing solving a somewhat general open AI problem in order to avoid this regression.

robszy commented 2 years ago

I mean that if we know where the bug is in the code, we won't go fixing other parts that merely look like the source of the bug (of course we could find bugs in other places, but they'd be irrelevant to the one we're chasing) and could therefore introduce regressions if we are wrong.

As a bonus, we'd have a real-world test case to implement, instead of guesstimating what a test case for this bug should look like.

rincebrain commented 2 years ago

This can probably be closed now, unless someone's reproduced it since the fixes?