openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

SEEK_DATA fails randomly #11900

Closed ismell closed 2 years ago

ismell commented 3 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version rodete 20210324.05.05RL
Linux Kernel 5.10.19-1rodete1
Architecture amd64
ZFS Version zfs-2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

lseek + SEEK_DATA is randomly returning -ENXIO.

Code showing the problem: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/portage_tool/src/portage_util_file_copy_reflink_linux.c;l=145

I patched it with the following:

$ git diff
diff --git a/src/portage_util_file_copy_reflink_linux.c b/src/portage_util_file_copy_reflink_linux.c
index 352342c06..f261777fc 100644
--- a/src/portage_util_file_copy_reflink_linux.c
+++ b/src/portage_util_file_copy_reflink_linux.c
@@ -140,8 +140,9 @@ do_lseek_data(int fd_out, int fd_in, off_t *off_out) {
     /* Use lseek SEEK_DATA/SEEK_HOLE for sparse file support,
      * as suggested in the copy_file_range man page.
      */
-    off_t offset_data, offset_hole;
+    off_t offset_data, offset_hole, current;

+    current = lseek(fd_in, 0, SEEK_CUR);
     offset_data = lseek(fd_in, *off_out, SEEK_DATA);
     if (offset_data < 0) {
         if (errno == ENXIO) {
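
For context, the copy loop in that helper is roughly the following (a minimal sketch loosely based on the linked source, with error handling and the reflink fallback trimmed; sparse_copy is just an illustrative name):

/* Minimal sketch of a SEEK_DATA/SEEK_HOLE copy loop like the one in the
 * linked portage helper (illustrative only; not the verbatim source). */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

static int
sparse_copy(int fd_in, int fd_out)
{
    off_t off = 0;

    for (;;) {
        /* Find the next data region at or after the current offset. */
        off_t data = lseek(fd_in, off, SEEK_DATA);
        if (data < 0)
            return (errno == ENXIO) ? 0 : -1;   /* ENXIO: no data left */

        /* Find where that data region ends. */
        off_t hole = lseek(fd_in, data, SEEK_HOLE);
        if (hole < 0)
            return -1;

        /* Copy only the data region; the caller recreates the overall
         * file length (and any trailing hole) with ftruncate(). */
        off_t in_off = data, out_off = data;
        ssize_t n = copy_file_range(fd_in, &in_off, fd_out, &out_off,
            hole - data, 0);
        if (n <= 0)
            return -1;
        off = data + n;
    }
}

With the bug, the very first SEEK_DATA at offset 0 returns ENXIO even though the file contains data, so a loop like this copies nothing and the caller just truncates the destination to the source length, leaving a hole-only file. That matches the failing trace below.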

On a successful run, this is what the strace looks like (depthcharge.elf):

[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = 0
[pid 3014182] lseek(3, 0, SEEK_HOLE)    = 2439852
[pid 3014182] copy_file_range(3, [0], 4, [0], 2439852, 0) = 2439852
[pid 3014182] lseek(3, 0, SEEK_CUR)     = 2439852
[pid 3014182] lseek(3, 2439852, SEEK_DATA) = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 2439852
[pid 3014182] ftruncate(4, 2439852)     = 0
[pid 3014182] close(4)                  = 0
[pid 3014182] close(3)                  = 0

When a failure happens, this is what it looks like (netboot.elf):

[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 1824516
[pid 3014182] lseek(4, 1824516, SEEK_SET) = 1824516
[pid 3014182] ftruncate(4, 1824516)     = 0
[pid 3014182] close(4)                  = 0
[pid 3014182] close(3)                  = 0

man lseek shows:

SEEK_DATA
              Adjust the file offset to the next location in the file greater than or equal to offset containing data.  If offset points to data, then the file offset is set to offset.

ENXIO  whence is SEEK_DATA or SEEK_HOLE, and offset is beyond the end of the file, or whence is SEEK_DATA and offset is within a hole at the end of the file.
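
Put differently: for a file that is entirely data, SEEK_DATA at offset 0 must return 0, and ENXIO is only legitimate once the offset is at or past EOF or inside a trailing hole. A quick illustration of that expectation (just a sketch with a made-up filename, not a reproducer for this bug):

/* Illustration of the man-page semantics quoted above (not a reproducer). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[4096];
    int fd = open("demo.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return 1;
    memset(buf, 'x', sizeof (buf));
    write(fd, buf, sizeof (buf));   /* the file is now 4096 bytes of data */

    /* Expected: 0, because offset 0 points at data. */
    printf("SEEK_DATA at 0:    %ld\n", (long)lseek(fd, 0, SEEK_DATA));
    /* Expected: -1 with errno == ENXIO, because the offset is at EOF. */
    printf("SEEK_DATA at 4096: %ld\n", (long)lseek(fd, 4096, SEEK_DATA));

    close(fd);
    return 0;
}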

Describe how to reproduce the problem

I don't have a simple way to reproduce this yet. It happens as part of a portage build. I have attached a full strace log. The problem shows up randomly, so running the same command multiple times will cause different files to fail to copy.

Include any warning/errors/backtraces from the system logs

depthcharge-strace-patch1.log

rincebrain commented 2 years ago

I see I should have been more precise.

I have two identical copies of a Gentoo root filesystem running on identical VMs on an identical host, with identical kernels. With the Gentoo kernel booting an Ubuntu 18.04 userland and then chrooting into one copy, it does not reproduce, ever. With the Gentoo kernel booting a Gentoo userland and then chrooting into the other copy, it does.

thesamesam commented 2 years ago

I'm hitting a different issue since running with the patch committed to master on top of 2.1.1 (note that this didn't seem to happen with https://github.com/openzfs/zfs/commit/93e19c7b35b62c238839536fed0d0f23c0466136 when I was running it in the past, but I've not (yet?) tried downgrading to it):

The same messages occur for other kernel modules, like libata.ko.

Running file on a bad module yields:

# file /lib/modules/5.14.17-adry/kernel/drivers/ata/libata.ko
/lib/modules/5.14.17-adry/kernel/drivers/ata/libata.ko: data

Running file on a good module (from a previous kernel build) yields:

# file /lib/modules/5.14.16-adry/kernel/drivers/ata/libata.ko
/lib/modules/5.14.16-adry/kernel/drivers/ata/libata.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=34d7fc7363ebace6b3bd5f1c6d124b8909da6806, not stripped

If I use a tmpfs to build the kernel and ZFS, these errors do not occur. This is all with coreutils 9 too. I'll try to reproduce with no-tmpfs + older coreutils now.

EDIT: If I downgrade coreutils, the modules aren't corrupted. EDIT: This seems to happen reliably (read: every time so far) with coreutils-9.0 and not at all with coreutils-8.32.

The modules are broken during the process of copying from the 'work dir' (where building is done) to the image. Same issue as with Go, IIRC.

# file /var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/work/zfs-2.1.1/module/zstd/zzstd.ko 
/var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/work/zfs-2.1.1/module/zstd/zzstd.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=1801c62400d21bcaea7409e1eca7126133252bea, not stripped

# file /var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/image/lib/modules/5.14.17-adry/extra/zstd/zzstd.ko 
/var/tmp/portage/sys-fs/zfs-kmod-2.1.1-r1/image/lib/modules/5.14.17-adry/extra/zstd/zzstd.ko: data

EDIT: Summary so far:

I should've tried the PR before it was merged, but the PR didn't seem to apply (when I tried) to 2.1.1 and I was happy with the commit I mentioned given it seemed to be working for me. Apologies.

rincebrain commented 2 years ago

Time to wake up the experiment VM again...

(Maybe #12710 would be the safe option for shipping 2.1.2/2.0.7 after all?)

rincebrain commented 2 years ago

One curious thing I just had happen.

Running 2.1.1 with that patch cherry-picked, I did an emerge of smartmontools, ran file [tmp path to sbin]; du [tmp path to sbin], and got back:

/var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartctl:             ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
/var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartd:               ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, stripped
/var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/update-smart-drivedb: POSIX shell script, ASCII text executable
1       /var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartctl
1       /var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartd

!?

So I ran du again and got:

459     /var/portage_tmp/portage/sys-apps/smartmontools-7.2-r1/image/usr/sbin/smartctl

So there's at least a race with the reported size if you just cherry-pick that patch to 2.1.1, I guess...

(This could be pre-existing, I just noticed it and went "...that's very weird.")

rincebrain commented 2 years ago

Yup, reproduces great: permanently (and differently each time) mangled files with zfs_dmu_offset_next_sync=1 (though not 0, so far, so that's something), using ebuild zfs-kmod-2.1.1.ebuild clean install with the portage tmpdir on ZFS.

The new files are entirely empty:

# ~rich/zfs_cp/cmd/zdb/zdb -dbdbdbdbdbdb testpool/portage_tmp 112048
Dataset testpool/portage_tmp [ZPL], ID 259, cr_txg 12, 981M, 149884 objects, rootbp DVA[0]=<0:341076600:200> DVA[1]=<0:380095400:200> [L0 DMU objset] fletcher4 lz4 unencrypted LE contiguous unique double size=1000L/200P birth=64564L/64564P fill=149884 cksum=1ae04a529f:73bebec3954:132f6162a5765:26758bb97eb8b2

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    112048    1   128K   128K      0     512   128K    0.00  ZFS plain file (K=inherit) (Z=inherit=lz4)
                                               176   bonus  System attributes
        dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 0
        path    /portage/sys-fs/zfs-kmod-2.1.1/image/lib/modules/5.10.74-gentoo/extra/lua/zlua.ko
        uid     0
        gid     0
        atime   Mon Nov  8 04:47:42 2021
        mtime   Mon Nov  8 04:47:41 2021
        ctime   Mon Nov  8 04:47:41 2021
        crtime  Mon Nov  8 04:47:41 2021
        gen     64561
        mode    100644
        size    320496
        parent  111907
        links   1
        pflags  840800000004
Indirect blocks:

#

(and they stay that way even across a pool export+import, so this isn't just "zdb is catching state that hasn't been flushed yet".)

My guess is that whatever is causing the short reported sizes also results in seeing an empty length when deciding how much to copy, but that's just a guess.

@thesamesam reported (IIRC) that he couldn't reproduce this before 93e19c7 in limited testing, with either setting, so this seems like a worse outcome...

I'm...going to test it after I've exhaustively convinced myself this doesn't reproduce with =0, but unless that's broken in novel ways I'm not expecting, I'm pretty strongly going to suggest reverting 93e19c7 and pulling in #12710 until this gets run down more.

Gonna nominate this for the top of the bug blitz on Tuesday unless someone knocks it out first. :D

behlendorf commented 2 years ago

One possible explanation for this would be the slight change to how we're checking for a dirty dnode. Could someone who is already setup to test this try the following small change (on top of the previous fix). It effectively reverts the dirty check to the previous logic while preserving the rest of the mmap fix.

diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index 6f87f49f8..db1a5d71d 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1657,7 +1657,7 @@ dnode_is_dirty(dnode_t *dn)
        mutex_enter(&dn->dn_mtx);

        for (int i = 0; i < TXG_SIZE; i++) {
-               if (list_head(&dn->dn_dirty_records[i]) != NULL) {
+               if (multilist_link_active(&dn->dn_dirty_link[i])) {
                        mutex_exit(&dn->dn_mtx);
                        return (B_TRUE);
                }
thesamesam commented 2 years ago

@behlendorf I haven't been able to reproduce the issue with this new patch so far. Previously it would reliably happen within 3-4 attempts with one case and every attempt in the other cases, so it looks promising. I'll keep running with it.

rincebrain commented 2 years ago

Well, my spreadsheet says "2/10 builds of smartmontools with dmu_sync=1 and current master corrupted (and 10/10 for zfs-kmod)", and 0/10 and 0/10 with the change applied, so that seems promising...

(Space is still wrong for all the zfs-kmod runs and none of the smartmontools runs, so that's unfortunate, but for another day.)

edit to add: I think I'd advocate defaulting to dmu_offset_next_sync=1 for the tunable, not 0, since "this feature works but it's stubbed out silently by default" will rapidly become "this feature doesn't work", IMO...

behlendorf commented 2 years ago

Thanks for the quick testing. Do I understand correctly, then, that with the patch in #12745 you're no longer able to reproduce the issue?

@rincebrain I tend to agree about the default. This was originally done to minimize the performance impact when checking for holes, but I think in practice it ended up being too confusing. I've gone ahead and opened #12746 to change the default.

rincebrain commented 2 years ago

I can absolutely not reproduce it any more with the patch in #12745 on any case I've tested.

thesamesam commented 2 years ago

Same here -- no issues with master + #12745. No opinion on the default although it may be worth revisiting it once things have calmed down.

robszy commented 2 years ago

For sure we need tests to cover these scenarios, as they are hard to reproduce on live systems.

thesamesam commented 2 years ago

Bit of an out-there idea: it would be interesting to see if we could get a CI box fast enough to do emerges (in Gentoo) that could try to hit such problems.

Of course, a more isolated test case would be great, even if it were just calling cp in a loop?
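
One possible shape for such a test, purely as a sketch (hypothetical filename and all; nothing in this thread confirms this alone triggers the bug, since the real failures go through the portage/coreutils copy path):

/* Hypothetical stress loop: write a file, then immediately ask SEEK_DATA
 * where the data starts. For a fully written file the answer should always
 * be 0; ENXIO would mean fresh data is being reported as a hole. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char buf[1 << 16];
    memset(buf, 'x', sizeof (buf));

    for (long i = 0; ; i++) {
        int fd = open("seekdata.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || write(fd, buf, sizeof (buf)) != (ssize_t)sizeof (buf)) {
            perror("open/write");
            return 1;
        }
        if (lseek(fd, 0, SEEK_DATA) < 0) {
            fprintf(stderr, "iteration %ld: SEEK_DATA: %s\n",
                i, strerror(errno));
            return 1;
        }
        close(fd);
        unlink("seekdata.tmp");
    }
}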

robszy commented 2 years ago

It would be very interesting to work out what actually happens in the emerge case and build a test case around that, rather than just guesstimating what could be wrong in the code and fixing every possible slip, probably introducing regressions.

rincebrain commented 2 years ago

> It would be very interesting to work out what actually happens in the emerge case and build a test case around that, rather than just guesstimating what could be wrong in the code and fixing every possible slip, probably introducing regressions.

I presume you mean something much more constrained than what I'm taking from your comment, otherwise you seem to be proposing solving a somewhat general open AI problem in order to avoid this regression.

robszy commented 2 years ago

I mean that if we know where the bug is in the code, we won't go fixing other parts that merely look like the source of the bug (of course we could find bugs in other places, but they'd be irrelevant to the one we're chasing) and could therefore introduce regressions if we are wrong.

As a bonus, we'd have a real-world test case to implement, instead of guesstimating what a test case for this bug should look like.

rincebrain commented 2 years ago

This can probably be closed now, unless someone's reproduced it since the fixes?