openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.38k stars 1.72k forks source link

SEEK_DATA fails randomly #11900

Closed ismell closed 2 years ago

ismell commented 3 years ago

System information

Type Version/Name
Distribution Name Debian
Distribution Version rodete 20210324.05.05RL
Linux Kernel 5.10.19-1rodete1
Architecture amd64
ZFS Version zfs-2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

lseek + SEEK_DATA is randomly returning -ENXIO.

Code showing the problem: https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/portage_tool/src/portage_util_file_copy_reflink_linux.c;l=145

I patched it with the following:

$ git diff
diff --git a/src/portage_util_file_copy_reflink_linux.c b/src/portage_util_file_copy_reflink_linux.c
index 352342c06..f261777fc 100644
--- a/src/portage_util_file_copy_reflink_linux.c
+++ b/src/portage_util_file_copy_reflink_linux.c
@@ -140,8 +140,9 @@ do_lseek_data(int fd_out, int fd_in, off_t *off_out) {
     /* Use lseek SEEK_DATA/SEEK_HOLE for sparse file support,
      * as suggested in the copy_file_range man page.
      */
-    off_t offset_data, offset_hole;
+    off_t offset_data, offset_hole, current;

+    current = lseek(fd_in, 0, SEEK_CUR);
     offset_data = lseek(fd_in, *off_out, SEEK_DATA);
     if (offset_data < 0) {
         if (errno == ENXIO) {

On a successful run this is what the strace looks like(depthcharge.elf):

[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = 0
[pid 3014182] lseek(3, 0, SEEK_HOLE)    = 2439852
[pid 3014182] copy_file_range(3, [0], 4, [0], 2439852, 0) = 2439852
[pid 3014182] lseek(3, 0, SEEK_CUR)     = 2439852
[pid 3014182] lseek(3, 2439852, SEEK_DATA) = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 2439852
[pid 3014182] ftruncate(4, 2439852)     = 0
[pid 3014182] close(4)                  = 0
[pid 3014182] close(3)                  = 0

When a failure happens this is what it looks like (netboot.elf):

[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 1824516
[pid 3014182] lseek(4, 1824516, SEEK_SET) = 1824516
[pid 3014182] ftruncate(4, 1824516)     = 0
[pid 3014182] close(4)                  = 0
[pid 3014182] close(3)                  = 0

man lseek shows:

SEEK_DATA
              Adjust the file offset to the next location in the file greater than or equal to offset containing data.  If offset points to data, then the file offset is set to offset.

ENXIO  whence is SEEK_DATA or SEEK_HOLE, and offset is beyond the end of the file, or whence is SEEK_DATA and offset is within a hole at the end of the file.

Describe how to reproduce the problem

I don't have a simple way to reproduce this yet. It happens as part of a portage build. I have attached a full strace log. The problem shows up randomly, so running the same command multiple times will cause different files to fail to copy.

Include any warning/errors/backtraces from the system logs

depthcharge-strace-patch1.log

vapier commented 3 years ago

@zmedico fyi

h1z1 commented 3 years ago

Any chance #11910 is related?

ismell commented 3 years ago

See https://bugs.chromium.org/p/chromium/issues/detail?id=1198679 for more of the nitty-gritty debugging.

behlendorf commented 3 years ago

@ismell it's not clear to me from the comment above, but does the patch you posted above avoid the issue in your testing?

One plausible way you might try and reproduce this is to write to the file while performing the lseek(). This case is handled by the code but it's one of the less likely code paths so it's possible there might be an issue here.

ismell commented 3 years ago

No the patch above was just to confirm that the pointer was at the start of the file. I haven't yet been able to find an easy way to reproduce the problem. There should be no writers to the input file at this point. I'll keep working on trying to find an easy way to reproduce it. Is there anything else I should be tracing that will help?

behlendorf commented 3 years ago

One thing you could try to rule out the dirty data case would be to add an fsync(fd) before the the lseek() in the portage code. That would help us narrow it down a bit, if you no longer see failures we'll have a good idea where to look. Getting a good strace of exactly the IO pattern being issued to cause the problem would also be helpful.

ismell commented 3 years ago

Thanks for the tip.

So I wrote a script that calls emerge depthcharge in a loop and verifies the build files are not NULL.

No patch

$ test-zfs
>>> 10:51:18 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:51:38 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:51:42 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:51:42 Jobs: 1 of 1 complete                           Load avg: 2.13, 1.59, 1.44

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 1, Success: 2, Failure: 1
>>> 10:51:44 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:52:04 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:52:08 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:52:08 Jobs: 1 of 1 complete                           Load avg: 2.09, 1.62, 1.46

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 2, Success: 3, Failure: 3
>>> 10:52:10 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:52:30 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:52:34 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:52:34 Jobs: 1 of 1 complete                           Load avg: 1.85, 1.60, 1.45

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 3, Success: 4, Failure: 5
>>> 10:52:36 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:52:56 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:53:00 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:53:00 Jobs: 1 of 1 complete                           Load avg: 1.56, 1.55, 1.44

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 4, Success: 5, Failure: 7 <-- Everything's not awesome :(

With fsync

diff --git a/src/portage_util_file_copy_reflink_linux.c b/src/portage_util_file_copy_reflink_linux.c
index 352342c06..004b8618a 100644
--- a/src/portage_util_file_copy_reflink_linux.c
+++ b/src/portage_util_file_copy_reflink_linux.c
@@ -140,8 +140,10 @@ do_lseek_data(int fd_out, int fd_in, off_t *off_out) {
     /* Use lseek SEEK_DATA/SEEK_HOLE for sparse file support,
      * as suggested in the copy_file_range man page.
      */
-    off_t offset_data, offset_hole;
+    off_t offset_data, offset_hole, current;

+    fsync(fd_in);
+    current = lseek(fd_in, 0, SEEK_CUR);
     offset_data = lseek(fd_in, *off_out, SEEK_DATA);
     if (offset_data < 0) {
         if (errno == ENXIO) {
$ test-zfs
>>> 10:55:17 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:55:37 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:55:41 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:55:41 Jobs: 1 of 1 complete                           Load avg: 3.36, 2.60, 1.87

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 1, Success: 3, Failure: 0
>>> 10:55:43 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:56:06 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:56:10 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:56:10 Jobs: 1 of 1 complete                           Load avg: 2.57, 2.48, 1.85

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 2, Success: 6, Failure: 0
>>> 10:56:12 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:56:32 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:56:36 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:56:36 Jobs: 1 of 1 complete                           Load avg: 3.10, 2.59, 1.91

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 3, Success: 9, Failure: 0
>>> 10:56:39 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:56:58 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:57:02 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:57:02 Jobs: 1 of 1 complete                           Load avg: 4.32, 2.96, 2.05

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 4, Success: 12, Failure: 0
>>> 10:57:05 Emerging (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos for /build/guybrush/
>>> 10:57:24 Installing (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:57:28 Completed (1 of 1) sys-boot/depthcharge-0.0.1-r3237::chromiumos to /build/guybrush/
>>> 10:57:28 Jobs: 1 of 1 complete                           Load avg: 3.42, 2.86, 2.05

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 5, Success: 15, Failure: 0 <--- Everything is awesome!

It looks like adding fsync fixes the problem. straceing emerge results in a huge log. I'll try and find the relevant parts.

behlendorf commented 3 years ago

@ismell if you're comfortable rebuilding ZFS for the testing I could provide a debug patch which should give us a little more insight.

ismell commented 3 years ago

I'm sure I can figure it out :)

behlendorf commented 3 years ago

In that case, would you mind testing commit f73e0a63e9a33649509d27dad116afe9a6a22b59 from the https://github.com/behlendorf/zfs/tree/zfs-holey branch. It implements a possible fix by closing a small race. It's only been lightly tested since I wasn't able to reproduce the issue.

ismell commented 3 years ago

Sorry for the long cycle time. I was waiting until I was forced to reboot my computer.

So I tried f73e0a6~1 to verify I could reproduce the issue without your patch. I was not able to reproduce it....

$ zfs version
zfs-2.1.99-117_g44287ffa0
zfs-kmod-2.1.99-117_g44287ffa0

$ ./test-zfs
Iteration: 100, Success: 300, Failure: 0
Done
  Success: 300
  Failure: 0

This normally reproduced within a few iterations.

So maybe it was fixed already?

behlendorf commented 3 years ago

So I tried f73e0a6~1 to verify I could reproduce the issue without your patch. I was not able to reproduce it....

To be clear you're saying that even without the path you're unable to reproduce the issue now? There weren't any other recent changes in this area which I would have expected to fix this.

ismell commented 3 years ago

So I went back to the original version:

$ zfs version
zfs-2.0.3-1~bpo10+1
zfs-kmod-2.0.3-1~bpo10+1

It looks like I can't reproduce the issue anymore...

>>> 11:31:58 Emerging (1 of 1) sys-boot/depthcharge-9999::chromiumos for /build/guybrush/
>>> 11:32:20 Installing (1 of 1) sys-boot/depthcharge-9999::chromiumos to /build/guybrush/
>>> 11:32:25 Completed (1 of 1) sys-boot/depthcharge-9999::chromiumos to /build/guybrush/
>>> 11:32:25 Jobs: 1 of 1 complete                           Load avg: 12.2, 10.9, 8.3

>>> Using system located in ROOT tree /build/guybrush/

>>> No outdated packages were found on your system.
Iteration: 7, Success: 21, Failure: 0

I'm wondering if rebooting helped clear what ever bad state I was in...

thesamesam commented 2 years ago

I've started hitting a similar issue which manifests as go segfaulting immediately on launch (initially reported at https://github.com/golang/go/issues/48636 - I'd assumed the Go compiler was somehow spitting out a completely malformed binary).

More precisely:

I eventually somehow figured I should try downgrading coreutils from 9.0 -> 8.32(-r1 in Gentoo) which made the issue disappear: rebuilding Go worked! coreutils 9.0 changed some default behaviours around reflinking, which ZFS doesn't have, and sparse files, but I tried forcing them off in the cp invocation in the ebuilds with no luck anyway.

I then decided to bisect and arrived at:

a6eaee501f6ec0c152abe88640203a64c390993e is the first bad commit
commit a6eaee501f6ec0c152abe88640203a64c390993e
Author: Paul Eggert <eggert@cs.ucla.edu>
Date:   Thu Jun 25 16:31:44 2020 -0700

    cp: use SEEK_DATA/SEEK_HOLE if available

    If it works, prefer lseek with SEEK_DATA and SEEK_HOLE to FIEMAP,
    as lseek is simpler and more portable (will be in next POSIX).
    Problem reported in 2011 by Jeff Liu (Bug#8061).
    * NEWS: Mention this.
    * src/copy.c (lseek_copy) [SEEK_HOLE]: New function.
    (enum scantype): New constants ERROR_SCANTYPE, LSEEK_SCANTYPE.
    (union scan_inference): New type.
    (infer_scantype): Last arg is now union scan_inference *,
    not struct extent_scan *.  All callers changed.
    Prefer SEEK_HOLE to FIEMAP if both work, since
    SEEK_HOLE is simpler and more portable.
    (copy_reg): Do the fdadvise after initial scan, in case the scan
    fails.  Report an error if the initial scan fails.
    (copy_reg) [SEEK_HOLE]: Use lseek_copy if scantype says so.

 NEWS       |   3 +
 src/copy.c | 209 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 201 insertions(+), 11 deletions(-)
bisect run success

I've straced Portage (just for the lseek syscall) while installing Go and it is filled with:

[...]
11327 lseek(3, 5310, SEEK_DATA)         = -1 ENXIO (No such device or address)
11327 lseek(3, 9987, SEEK_DATA)         = -1 ENXIO (No such device or address)
11327 lseek(3, 8334, SEEK_DATA)         = -1 ENXIO (No such device or address)
11327 lseek(3, 18386, SEEK_DATA)        = -1 ENXIO (No such device or address)
11327 lseek(3, 33472, SEEK_DATA)        = -1 ENXIO (No such device or address)
11327 lseek(3, 15270, SEEK_DATA)        = -1 ENXIO (No such device or address)
11327 lseek(3, 9132, SEEK_DATA)         = -1 ENXIO (No such device or address)
11327 lseek(3, 15471, SEEK_DATA)        = -1 ENXIO (No such device or address)
11327 lseek(3, 7389, SEEK_DATA)         = -1 ENXIO (No such device or address)
11327 lseek(3, 7876, SEEK_DATA)         = -1 ENXIO (No such device or address)
[...]

TL;DR: I seem to be hitting this bug as a result of changes in coreutils which exposes the problem. Please let me know if you feel this is a different issue and I should open a new bug.

system information (emerge --info)
Portage 3.0.23 (python 3.10.0-candidate-2, default/linux/amd64/17.1/hardened, gcc-11.2.0, glibc-2.33-r7, 5.14.7-adry x86_64)
=================================================================
System uname: Linux-5.14.7-adry-x86_64-11th_Gen_Intel-R-_Core-TM-_i7-11800H_@_2.30GHz-with-glibc2.33
KiB Mem:    16105692 total,   1991680 free
KiB Swap:          0 total,         0 free
Timestamp of repository gentoo: Sun, 26 Sep 2021 20:51:27 +0000
Head commit of repository gentoo: fe49ef8e5b7b42f9fcb1f7bf36fbf4c571b8b120

Timestamp of repository kde: Thu, 23 Sep 2021 11:06:23 +0000
Head commit of repository kde: 0bd2f887a7bdfe01ae0c753908bf1132f253263a

sh dash 0.5.11.5
ld GNU ld (Gentoo 2.37_p1 p0) 2.37
ccache version 4.4.1 [disabled]
app-shells/bash:          5.1_p8::gentoo
dev-java/java-config:     2.3.1::gentoo
dev-lang/perl:            5.34.0-r2::gentoo
dev-lang/python:          3.8.12::gentoo, 3.9.7::gentoo, 3.10.0_rc2::gentoo
dev-lang/rust-bin:        1.55.0::gentoo
dev-util/ccache:          4.4.1::gentoo
dev-util/cmake:           3.21.3::gentoo
sys-apps/baselayout:      2.7-r3::gentoo
sys-apps/openrc:          0.44.5::gentoo
sys-apps/sandbox:         2.25::gentoo
sys-devel/autoconf:       2.13-r1::gentoo, 2.71-r1::gentoo
sys-devel/automake:       1.16.4::gentoo
sys-devel/binutils:       2.37_p1::gentoo
sys-devel/gcc:            10.3.0-r2::gentoo, 11.2.0::gentoo
sys-devel/gcc-config:     2.4::gentoo
sys-devel/libtool:        2.4.6-r6::gentoo
sys-devel/make:           4.3::gentoo
sys-kernel/linux-headers: 5.14::gentoo (virtual/os-headers)
sys-libs/glibc:           2.33-r7::gentoo
Repositories:

gentoo
    location: /var/db/repos/gentoo
    sync-type: git
    sync-uri: git://github.com/gentoo-mirror/gentoo.git
    priority: -1000
    sync-git-clone-extra-opts: -b stable -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now
    sync-git-verify-commit-signature: yes

kde
    location: /var/db/repos/kde
    sync-type: git
    sync-uri: https://github.com/gentoo-mirror/kde.git
    masters: gentoo

local
    location: /var/db/repos/local
    masters: gentoo

ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches"
DISTDIR="/var/cache/distfiles"
EMERGE_DEFAULT_OPTS="--keep-going=y --complete-graph --deep --changed-deps-report=y"
ENV_UNSET="CARGO_HOME DBUS_SESSION_BUS_ADDRESS DISPLAY GOBIN GOPATH PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR"
FCFLAGS="-O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches"
FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles ipc-sandbox mount-sandbox multilib-strict network-sandbox news parallel-fetch parallel-install pid-sandbox preserve-libs protect-owned qa-unresolved-soname-deps sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches"
GENTOO_MIRRORS="http://mirror.bytemark.co.uk/gentoo/ http://www.mirrorservice.org/sites/distfiles.gentoo.org/ http://mirrors.soeasyto.com/distfiles.gentoo.org/ http://mirrors.gethosted.online/gentoo"
LANG="en_GB.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,--defsym=__gentoo_check_ldflags__=0"
LINGUAS="en en_GB"
MAKEOPTS="-j8"
PKGDIR="/var/cache/binpkgs"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_TMPDIR="/var/tmp"
USE="PIC X a52 aac acl acpi activities aes alsa amd64 avx avx2 avx512bw avx512cd avx512dq avx512f avx512vbmi avx512vl bluetooth branding bzip2 cairo caps cdda cdr clang crypt dbus declarative dist-kernel dri dts dvd dvdr elogind emacs emboss encode exif f16c flac fma3 fortran gif gles2 gmp gpm graphite gtk gui hardened hunspell iconv icu ipv6 iwd jit jpeg kde kipi kwallet lcms libglvnd libnotify libtirpc mad mmx mmxext mng mp3 mp4 mpeg multilib ncurses networkmanager nptl ogg opengl openmp pam pango pclmul pcre pdf pic pie plasma png policykit popcnt ppds pulseaudio qml qt5 rdrand readline screencast sdl seccomp semantic-desktop sha smartcard spell split-usr sse sse2 sse3 sse4_1 sse4_2 ssl ssp ssse3 startup-notification svg syslog system-av1 system-binutils system-boost system-bootstrap system-cairo system-clang system-digest system-ffmpeg system-harfbuzz system-heimdal system-icu system-jpeg system-jsoncpp system-leveldb system-libevent system-libs system-libvpx system-libyaml system-lz4 system-mitkrb5 system-sqlite system-ssl system-tbb system-uulib system-webp system-zlib threads tiff truetype udev udisks unicode upower usb vaapi verify-sig vorbis wayland widgets x264 xattr xcb xml xtpax xv xvid zlib" ABI_X86="32 64" ADA_TARGET="gnat_2019" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon sheets words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="aes avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl avx512vbmi f16c fma3 mmx mmxext pclmul popcnt rdrand sha sse sse2 sse3 sse4_1 sse4_2 ssse3" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 timing tsip tripmate tnt ublox ubx" GRUB_PLATFORMS="efi-64" INPUT_DEVICES="libinput" KERNEL="linux" L10N="en en-GB" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LUA_SINGLE_TARGET="lua5-1" LUA_TARGETS="lua5-1" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php7-3 php7-4" POSTGRES_TARGETS="postgres12 postgres13" PYTHON_SINGLE_TARGET="python3_9" PYTHON_TARGETS="python3_9 python3_10 python3_8" RUBY_TARGETS="ruby26" USERLAND="GNU" VIDEO_CARDS="intel i965 iris nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq proto steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CC, CPPFLAGS, CTARGET, CXX, INSTALL_MASK, LC_ALL, PORTAGE_BINHOST, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, RUSTFLAGS

EDIT: Interestingly, upstream made a kind-of-related remark in a commit after the release:

Upstream commit mentioning ZFS tests: cp/sparse-perf: make more robust and add zfs comments * init.cfg (seek_data_capable_): Add a timeout to ensure failure for slow lseek(...SEEK_DATA) calls (even if that syscall isn't interrupted). * tests/cp/sparse-perf.sh: Run the SEEK_DATA check on the 1TiB empty file to exclude both FreeBSD 9.1 which takes 35s, and ZFS which requires a delay of about 5s between file creation and use of SEEK_DATA to correctly determine it's empty (return ENXIO). Also remove the stat size checks as they invalidate the test due to cp never writing data due to it being always zeros, and thus converted to holes in the output.

@behlendorf I've tried applying the patch you suggested to ZFS 2.1.1 but unfortunately, while it applies, it ends up failing at build time. Is there any chance you're able to please rebase it?

The failure looks like:

/bin/sh ../../libtool  --tag=CC --silent  --mode=compile x86_64-pc-linux-gnu-gcc -DHAVE_CONFIG_H -include ../../zfs_config.h -I../../include -I../../include -I../../module/icp/include -I../../lib/libspl/include -I../../lib/libspl/include/os/linux  -I../../include/os/linux/zfs  -D_GNU_SOURCE -D_REENTRANT -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE -DHAVE_LARGE_STACKS=1 -DLIBEXECDIR=\"/usr/libexec\" -DRUNSTATEDIR=\"/var/run\" -DSBINDIR=\"/sbin\" -DSYSCONFDIR=\"/etc\" -DPKGDATADIR=\"/usr/share/zfs\" -UDEBUG -DNDEBUG  -DTEXT_DOMAIN=\"zfs-linux-user\"  -DDEBUG -UNDEBUG -DZFS_DEBUG  -std=gnu99 -Wall -Wstrict-prototypes -Wmissing-prototypes -fno-strict-aliasing -fno-omit-frame-pointer    -Wno-format-zero-length  -Wno-unused-but-set-variable -Wframe-larger-than=4096  -DLIB_ZPOOL_BUILD -O2 -pipe -march=native -fdiagnostics-color=always -frecord-gcc-switches -c -o dnode.lo ../../module/zfs/dnode.c
../../module/zfs/dnode.c: In function ‘dnode_is_dirty’:
../../module/zfs/dnode.c:1661:29: error: incompatible types when assigning to type ‘multilist_t *’ {aka ‘struct multilist *’} from type ‘multilist_t’ {aka ‘struct multilist’}
 1661 |                 dirtylist = dn->dn_objset->os_dirty_dnodes[i];
      |                             ^~
make[3]: *** [Makefile:1326: dnode.lo] Error 1
make[3]: *** Waiting for unfinished jobs....
behlendorf commented 2 years ago

@thesamesam thanks for continuing the investigation of this issue.

I've rebased my https://github.com/behlendorf/zfs/tree/zfs-holey branch and resolved the build failure (commit 1722451cf7e0588632f116c488009e68dd1f8dcf). If you can give it a try that would be great.

thesamesam commented 2 years ago

@behlendorf Thanks!

(Note: I hadn't been able to reproduce this on other ZFS'd machines with pretty similar setups, so I/O speed might be a factor, which I guess is obvious given it's racey).

Applied to 2.1.1 and built fine (rebuilt initramfs and so on, and rebooted) but no luck. Same corrupted /usr/bin/go.

I'm happy to enable more debugging as necessary, especially given I know this is tricky given you can't reproduce it on your end.

Here's a partial strace during this cp line from the ebuild (I've just picked out the bits around copying the corrupted binaries, but I can get you more if it's useful):

8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/api", {st_mode=S_IFDIR|0755, st_size=23, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  newfstatat(AT_FDCWD, "bin", {st_mode=S_IFDIR|0755, st_size=4, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin", 0x7ffdc15b9730, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin", 0x7ffdc15b9850, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
8005  mkdir("/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin", 0755) = 0
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin", {st_mode=S_IFDIR|0755, st_size=2, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  openat(AT_FDCWD, "bin", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
8005  newfstatat(3, "", {st_mode=S_IFDIR|0755, st_size=4, ...}, AT_EMPTY_PATH) = 0
8005  getdents64(3, 0x5619626c61c0 /* 4 entries */, 131072) = 104
8005  getdents64(3, 0x5619626c61c0 /* 0 entries */, 131072) = 0
8005  close(3)                          = 0
8005  newfstatat(AT_FDCWD, "bin/gofmt", {st_mode=S_IFREG|0755, st_size=3318813, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  openat(AT_FDCWD, "bin/gofmt", O_RDONLY|O_NOFOLLOW) = 3
8005  newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=3318813, ...}, AT_EMPTY_PATH) = 0
8005  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin/gofmt", O_WRONLY|O_CREAT|O_EXCL, 0755) = 4
8005  newfstatat(4, "", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
8005  ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = -1 EOPNOTSUPP (Operation not supported)
8005  lseek(3, 0, SEEK_DATA)            = 0
8005  fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
8005  lseek(3, 0, SEEK_HOLE)            = 131072
8005  lseek(3, 0, SEEK_SET)             = 0
8005  read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\300\33F\0\0\0\0\0"..., 131072) = 131072
8005  write(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\300\33F\0\0\0\0\0"..., 131072) = 131072
8005  lseek(3, 131072, SEEK_DATA)       = -1 ENXIO (No such device or address)
8005  ftruncate(4, 3318813)             = 0
8005  close(4)                          = 0
8005  close(3)                          = 0
8005  newfstatat(AT_FDCWD, "bin/go", {st_mode=S_IFREG|0755, st_size=14013626, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  openat(AT_FDCWD, "bin/go", O_RDONLY|O_NOFOLLOW) = 3
8005  newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=14013626, ...}, AT_EMPTY_PATH) = 0
8005  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin/go", O_WRONLY|O_CREAT|O_EXCL, 0755) = 4
8005  newfstatat(4, "", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0
8005  ioctl(4, BTRFS_IOC_CLONE or FICLONE, 3) = -1 EOPNOTSUPP (Operation not supported)
8005  lseek(3, 0, SEEK_DATA)            = 0
8005  fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
8005  lseek(3, 0, SEEK_HOLE)            = 131072
8005  lseek(3, 0, SEEK_SET)             = 0
8005  read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\0jF\0\0\0\0\0"..., 131072) = 131072
8005  write(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\0jF\0\0\0\0\0"..., 131072) = 131072
8005  lseek(3, 131072, SEEK_DATA)       = -1 ENXIO (No such device or address)
8005  lseek(3, 131072, SEEK_DATA)       = -1 ENXIO (No such device or address)
8005  ftruncate(4, 14013626)            = 0
8005  close(4)                          = 0
8005  close(3)                          = 0
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/bin", {st_mode=S_IFDIR|0755, st_size=4, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  newfstatat(AT_FDCWD, "doc", {st_mode=S_IFDIR|0755, st_size=6, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/doc", 0x7ffdc15b96f0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/doc", 0x7ffdc15b9810, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
8005  mkdir("/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/doc", 0755) = 0
8005  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/image/usr/lib/go/doc", {st_mode=S_IFDIR|0755, st_size=2, ...}, AT_SYMLINK_NOFOLLOW) = 0
8005  openat(AT_FDCWD, "doc", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
8005  newfstatat(3, "", {st_mode=S_IFDIR|0755, st_size=6, ...}, AT_EMPTY_PATH) = 0
8005  getdents64(3, 0x5619626c62b0 /* 6 entries */, 131072) = 176
8005  getdents64(3, 0x5619626c62b0 /* 0 entries */, 131072) = 0
8005  close(3)                          = 0
8005  newfstatat(AT_FDCWD, "doc/go_mem.html", {st_mode=S_IFREG|0644, st_size=13565, ...}, AT_SYMLINK_NOFOLLOW) = 0

EDIT: I did out of curiosity try echo 1 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync but it didn't make any difference.

behlendorf commented 2 years ago

out of curiosity try echo 1 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync but it didn't make any difference.

That's very surprising. When set ZFS should always report the correct holes at the cost of performance. Were you able to reproduce this behavior with both the unmodified and patched code?

thesamesam commented 2 years ago

Yeah, I just tried again unpatched and it happens regardless of the value of the toggle (0 or 1).

Happy to try enabling more debugging if can think of a way to toggle it on/off easily during the merge.

behlendorf commented 2 years ago

Looking at the strace output what appears to be happening is:

The story is similar for bin/gofmt. Let me take another look and see if I can provide you a small debugging patch to determine what's going wrong. If you still have the strace logs it would be helpful to know how the bin/go file being copied here was originally written (mmap or write syscalls) and when.

thesamesam commented 2 years ago

Ah, it gets renamed during the build process:

# grep -rsin '/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out'
go_nuts_zfs.log:4923824:4931  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", 0xc0000a92e8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
go_nuts_zfs.log:4923825:4931  unlinkat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", 0) = -1 ENOENT (No such file or directory)
go_nuts_zfs.log:4923826:4931  unlinkat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", AT_REMOVEDIR <unfinished ...>
go_nuts_zfs.log:4923832:4931  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0775 <unfinished ...>
go_nuts_zfs.log:4936086:1309  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", O_RDONLY|O_CLOEXEC) = 3
go_nuts_zfs.log:4936699:1309  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", O_RDWR|O_CLOEXEC) = 3
go_nuts_zfs.log:4936733:1309  fchmodat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", 0755) = 0
go_nuts_zfs.log:4936735:1309  renameat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/work/go/bin/go") = 0

Here's the output for 10 lines after every instance of the path being mentioned:

# grep -rsin '/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out' -A 10
go_nuts_zfs.log:4923824:4931  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", 0xc0000a92e8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
go_nuts_zfs.log:4923825:4931  unlinkat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", 0) = -1 ENOENT (No such file or directory)
go_nuts_zfs.log:4923826:4931  unlinkat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", AT_REMOVEDIR <unfinished ...>
go_nuts_zfs.log-4923827-716   <... epoll_pwait resumed>[], 128, 5, NULL, 1033473606543) = 0
go_nuts_zfs.log-4923828-4932  <... nanosleep resumed>NULL)      = 0
go_nuts_zfs.log-4923829-4931  <... unlinkat resumed>)           = -1 ENOENT (No such file or directory)
go_nuts_zfs.log-4923830-716   epoll_pwait(4,  <unfinished ...>
go_nuts_zfs.log-4923831-4932  nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
go_nuts_zfs.log:4923832:4931  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0775 <unfinished ...>
go_nuts_zfs.log-4923833-716   <... epoll_pwait resumed>[], 128, 0, NULL, 140732691746960) = 0
go_nuts_zfs.log-4923834-716   epoll_pwait(4,  <unfinished ...>
go_nuts_zfs.log-4923835-4931  <... openat resumed>)             = 3
go_nuts_zfs.log-4923836-4931  epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3840094264, u64=140698378776632}}) = -1 EPERM (Operation not permitted)
go_nuts_zfs.log-4923837-4932  <... nanosleep resumed>NULL)      = 0
go_nuts_zfs.log-4923838-4931  mmap(0xc000400000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0 <unfinished ...>
go_nuts_zfs.log-4923839-4932  nanosleep({tv_sec=0, tv_nsec=20000},  <unfinished ...>
go_nuts_zfs.log-4923840-4931  <... mmap resumed>)               = 0xc000400000
go_nuts_zfs.log-4923841-4931  madvise(0xc000200000, 2097152, MADV_HUGEPAGE) = 0
go_nuts_zfs.log-4923842-4932  <... nanosleep resumed>NULL)      = 0
--
go_nuts_zfs.log:4936086:1309  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", O_RDONLY|O_CLOEXEC) = 3
go_nuts_zfs.log-4936087-1309  epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3136383272, u64=140470041798952}}) = -1 EPERM (Operation not permitted)
go_nuts_zfs.log-4936088-1309  pread64(3, "\177ELF", 4, 0)       = 4
go_nuts_zfs.log-4936089-1309  read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\2\0>\0\1\0\0\0\0jF\0\0\0\0\0"..., 31744) = 31744
go_nuts_zfs.log-4936090-1309  read(3, "H\205\322t$H\211\\$8H\213A H\211\336H\211\323H\211\361\350\4\323\0\0H\213\214$"..., 31744) = 31744
go_nuts_zfs.log-4936091-1309  read(3, "\0G\17\266\24\32I\203\372D\17\203\34\1\0\0L\215\35\311\24\222\0C\17\267\fS\353\"I\215"..., 31744) = 31744
go_nuts_zfs.log-4936092-1309  read(3, "L\211D$(L\211L$0L\211T$8L\211\\$@\350G\276\4\0H\213D$\10\213\\"..., 31744) = 31744
go_nuts_zfs.log-4936093-1309  read(3, "\0\273\21\0\0\0\350\225a\1\0L\213\204$\320\1\0\0M\213@(L\211\204$\320\1\0\0L"..., 31744) = 31744
go_nuts_zfs.log-4936094-1309  read(3, "i\1\0\0\0\0\0\203=\362X\231\0\0u\tH\211\220Xi\1\0\353\fH\215\270Xi\1\0"..., 31744) = 31744
go_nuts_zfs.log-4936095-1309  read(3, "\350{\344\375\377H\213l$\30H\203\304 \303\350LJ\3\0\351G\377\377\377\314\314\314\314\314\314\314"..., 31744) = 31744
go_nuts_zfs.log-4936096-1309  read(3, "\1H\213\5p\221Z\0\3503\271\1\0\350\16\306\1\0\211D$\24H\215\rS\221Z\0\204\1H"..., 31744) = 31744
--
go_nuts_zfs.log:4936699:1309  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", O_RDWR|O_CLOEXEC) = 3
go_nuts_zfs.log-4936700-1309  epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3136383272, u64=140470041798952}}) = -1 EPERM (Operation not permitted)
go_nuts_zfs.log-4936701-1309  pwrite64(3, "Zcbx5r99nvfz0xObj6_4/q7UMKiu3A_b"..., 83, 3984) = 83
go_nuts_zfs.log-4936702-1309  pread64(3, "\177ELF", 4, 0)       = 4
go_nuts_zfs.log-4936703-1309  close(3)                          = 0
go_nuts_zfs.log-4936704-1309  write(2, "", 0)                   = 0
go_nuts_zfs.log-4936705-1309  futex(0xc00089a550, FUTEX_WAKE_PRIVATE, 1) = 1
go_nuts_zfs.log-4936706-31945 <... futex resumed>)              = 0
go_nuts_zfs.log-4936707-1309  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/work/go/bin",  <unfinished ...>
go_nuts_zfs.log-4936708-31945 futex(0xc00089ad50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
go_nuts_zfs.log-4936709-1612  <... epoll_pwait resumed>[], 128, 1, NULL, 1033883350377) = 0
--
go_nuts_zfs.log:4936733:1309  fchmodat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", 0755) = 0
go_nuts_zfs.log-4936734-1309  newfstatat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/work/go/bin/go", 0xc00048b7c8, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
go_nuts_zfs.log:4936735:1309  renameat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/exe/a.out", AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/work/go/bin/go") = 0
go_nuts_zfs.log-4936736-1309  unlinkat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/", 0) = -1 EISDIR (Is a directory)
go_nuts_zfs.log-4936737-1309  unlinkat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169/b303/", AT_REMOVEDIR) = -1 ENOTEMPTY (Directory not empty)
go_nuts_zfs.log-4936738-1309  openat(AT_FDCWD, "/var/tmp/portage/dev-lang/go-1.17.1/temp/go-build2081884169", O_RDONLY|O_CLOEXEC) = 3
go_nuts_zfs.log-4936739-1309  epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3136383272, u64=140470041798952}}) = -1 EPERM (Operation not permitted)
go_nuts_zfs.log-4936740-1309  unlinkat(3, "b303", 0)            = -1 EISDIR (Is a directory)
go_nuts_zfs.log-4936741-1309  newfstatat(3, "b303", {st_mode=S_IFDIR|0755, st_size=6, ...}, AT_SYMLINK_NOFOLLOW) = 0
go_nuts_zfs.log-4936742-1309  openat(3, "b303", O_RDONLY|O_CLOEXEC) = 7
go_nuts_zfs.log-4936743-1309  epoll_ctl(4, EPOLL_CTL_ADD, 7, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3136383272, u64=140470041798952}}) = -1 EPERM (Operation not permitted)
go_nuts_zfs.log-4936744-1309  getdents64(7, 0xc000aa6000 /* 6 entries */, 8192) = 176
go_nuts_zfs.log-4936745-1309  getdents64(7, 0xc000aa6000 /* 0 entries */, 8192) = 0

You're welcome to the full 550MB of strace spew if you want it rather than me trying to clumsily filter, but I appreciate that's no holiday for you either!

behlendorf commented 2 years ago

@thesamesam would you mind trying this updated patch with your test setup, 63382271a36a69e074911eecd33bea500f9ac20a. It would be great if you could try it with zfs_dmu_offset_next_sync=0 first, if you still observe a failure then please try zfs_dmu_offset_next_sync=1.

thesamesam commented 2 years ago

@behlendorf No difference before/after with setting the toggle, but I'll try booting with it in case that somehow makes a difference.

I assume this won't make any difference but I've been applying these on top of 2.1.1. I can try master?

EDIT: Yeah, no difference with the option set from boot.

EDIT: ... and I tried with zfs_dmu_offset_next_sync=2 (noticed in the patch) set after booting which made the 'dangerous copy' a lot slower, but still ended up with corrupted go. Tried with master 0/1/2 as well but no luck.

eggert commented 2 years ago

One hypothesis is that the problem occurs when some other process creates a file via open+copy_file_range+ftruncate+close, and then soon afterwards GNU cp attempts to lseek(thatfile, 0, SEEK_DATA). Perhaps GNU cp sees the effect of the ftruncate without seeing the effect of the copy_file_range, due to a race in the kernel somewhere. For more details, please see my recent comment in GNU coreutils Bug#51433.

pixelb commented 2 years ago

I see a very similar issue and partial solution discussed with FreeBSD folks: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256205

eggert commented 2 years ago

If we can't see our way to a proper fix quickly, one possibility would be to temporarily modify OpenZFS so that it reports no holes in any files. I suggested this in GNU coreutils Bug#51433#26. Yet another possibility would be to modify ZFS lseek SEEK_DATA to fail with errno == EOPNOTSUPP; this latter approach might be better for GNU cp as cp would find holes in the data via other (slower) means. Of course it'd be better to fix the ZFS bug, whatever it is; but at least these workarounds would avoid data loss.

rincebrain commented 2 years ago

I've got a test patch to add a tunable to disable SEEK_HOLE/DATA I'm going to try running through its paces now...obviously not ideal, but better than "distros patching coreutils to disable using SEEK_HOLE citing this bug".

Does anyone have a pointer to a good reproducer for this that doesn't involve installing Gentoo?

e: Or a good reproducer with Gentoo - I've got it trying to emerge go and see if go segfaults in a loop at the moment on an SSD...

rincebrain commented 2 years ago

So, just returning EOPNOTSUPP causes one or two test failures.

I'll try just using the generic fallback case, but that still needs a workaround equivalent for FBSD. Will look into that once I can reproduce it and see it not reproduce...

thesamesam commented 2 years ago

[...] Does anyone have a pointer to a good reproducer for this that doesn't involve installing Gentoo?

e: Or a good reproducer with Gentoo - I've got it trying to emerge go and see if go segfaults in a loop at the moment on an SSD...

This only happens for me reliably on one machine with NVME. If you're using Gentoo, please make sure you forcefully install coreutils 9.0 by running e.g. emerge -v1O ">=sys-apps/coreutils-9.0".

I'm happy to try out whatever, although I do wish I could easily share a way for others to reproduce this :(

eggert commented 2 years ago

So, just returning EOPNOTSUPP causes one or two test failures.

If it doesn't work even when lseek fails with EOPNOTSUPP, that suggests that the problem affects read as well as lseek, which would be an even more-serious bug in OpenZFS.

The following code in coreutils/src/copy.c should mean that an EOPNOTSUPP failure from lseek should cause cp to look for zeros merely by reading and discarding zeros, without using SEEK_HOLE or SEEK_DATA. If you have the time, please investigate whether this bit of code is returning ZERO_SCANTYPE like it should, even when lseek returns -1 with errno == EOPNOTSUPP. If it does return ZERO_SCANTYPE, that suggests the more-serious bug (which I hope is not the case).

#ifdef SEEK_HOLE
  scan_inference->ext_start = lseek (fd, 0, SEEK_DATA);
  if (0 <= scan_inference->ext_start)
    return LSEEK_SCANTYPE;
  else if (errno != EINVAL && !is_ENOTSUP (errno))
    return errno == ENXIO ? LSEEK_SCANTYPE : ERROR_SCANTYPE;
#endif

  return ZERO_SCANTYPE;
rincebrain commented 2 years ago

When I quickly looked, the test failures I looked at had failed with not handling lseek returning EOPNOTSUPP, not corruption.

eggert commented 2 years ago

When I quickly looked, the test failures I looked at had failed with not handling lseek returning EOPNOTSUPP, not corruption.

Oh, were these test failures due to GNU cp doing the wrong thing when lseek with SEEK_HOLE or SEEK_DATA failed with EOPNOTSUPP, or were they due to other programs getting confused by lseek doing that? If the former, it's a GNU coreutils bug that we should fix; if the latter, that suggests we should work around the OpenZFS lseek bug in a different way (e.g., by having OpenZFS always pretend that the file has no holes).

rincebrain commented 2 years ago

OpenZFS has a fallback codepath on Linux (of just calling the generic Linux llseek handler) if SEEK{DATA,HOLE} are not implemented, which I was going to just make it use, but I was concerned about whether it would handle it correctly versus just being called with no SEEK* support in the system at all.

The other reason I was concerned is that FBSD has no such fallback codepath that I could immediately see, since SEEK_* support has been there since the initial ZFS import on FBSD, AFAICT. So the failure issue on FBSD may be a different facet or a different problem entirely, but it may well wind up requiring a separate fix.

I believe it was the latter failure mode, yes, not coreutils. But I didn't read all the test failures yet.

eggert commented 2 years ago

Perhaps simpler than using the fallback Linux codepath, would be to pretend that the file has no holes, which OpenZFS already does if dmu_offset_next returns EBUSY. This should work on all platforms, not just Linux. Perhaps something like the following untested patch:

From 34e9d06fb8ba2140c1156b60b5d7ce25649cc411 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Sun, 31 Oct 2021 19:08:55 -0700
Subject: [PATCH] Work around lseek+SEEK_HOLE bug

* module/zfs/zfs_vnops.c (zfs_holey_common):
For now, pretend that files never have holes.
This works around OpenZFS bug #11900.
---
 module/zfs/zfs_vnops.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 9bd75c011..cbad5e641 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -100,7 +100,13 @@ zfs_holey_common(znode_t *zp, ulong_t cmd, loff_t *off)
    else
        hole = B_FALSE;

+   /* Work around OpenZFS bug 11900
+      <https://github.com/openzfs/zfs/issues/11900>.  */
+# if 0
    error = dmu_offset_next(ZTOZSB(zp)->z_os, zp->z_id, hole, &noff);
+# else
+   error = EBUSY;
+# endif

    if (error == ESRCH)
        return (SET_ERROR(ENXIO));
-- 
2.31.1
rincebrain commented 2 years ago

I've given a spin on that a whirl, and am testing it now locally while I let GH Actions push it through the basic test suite runs.

edited to note: reproduces great for my VM on a virtio-scsi disk backed by NVMe, problem goes away when I twiddle the first workaround from off to on. Second workaround TBD.

eggert commented 2 years ago

Line 121 of that spin doesn't look right to me. It says *off = file_sz; but shouldn't that be if (hole) *off = file_sz;?

rincebrain commented 2 years ago

It's true, which is why there's a followup commit changing it, though to my surprise, nothing failed so far from the test run without that fix.

eggert commented 2 years ago

to my surprise, nothing failed so far from the test run without that fix.

Yes, that puzzles me too. Perhaps the test run was misconfigured to use a patched coreutils (i.e., coreutils with a patch that disables coreutils's use of SEEK_DATA)? That would mean the test would need to be redone.

rincebrain commented 2 years ago

Well, I'm talking about two types of testing here.

One is me running emerge in a loop with I promise it's broken[1] and not changing between runs coreutils. That one I have not tried with the aforementioned commit, just the fixed one.

The other is the test suite runs on Github. I'm more expecting them to tell me if there's something unconditionally assuming this should work in some other way than to outright fail unless I EOPNOTSUPP.

The latter are the ones I was a little surprised did not fail. As said, I haven't tried the former.

[1] - not to slander coreutils here I know it's not coreutils misbehaving

rincebrain commented 2 years ago

As a data point, in the process of trying to see when things went awry (assuming this wasn't always the case), I tried Linux all the way back to 4.9.100 and it still repros happily with OZFS git master.

Next experiment, with a kernel old enough, is jumping back in OZFS git history...

behlendorf commented 2 years ago

Some progress to report, I was able to identify a flaw in the lseek() data/hole reporting. It turns out the code was not flushing any memory mapped file regions prior to detecting the next data/hole. I'm not sure if mmap(2) was involved in any of the failures reported above, but if it was that could explain things.

@rincebrain if possible would you mind testing this 93e19c7b35b62c238839536fed0d0f23c0466136 WIP fix with the reproducer? The critical change is the modification in zfs_holey_common() to start writeback on any dirty memory regions.

rincebrain commented 2 years ago

As fast as my testbed will rebuild, sure.

thesamesam commented 2 years ago

Preliminary results are that I can't reproduce the bug with Go using the new patch. Will keep running it though and try more.

rincebrain commented 2 years ago

Likewise preliminarily, it's built and installed Go 12 times without a mangled binary spotted, when usually it took 1-5 runs before it crashed and my record was 8...

rincebrain commented 2 years ago

Passed 22 runs without crashing.

After seeing you changed zfs_dmu_offset_next_sync to default to 1, out of curiosity, I set it to 0, and it still failed to crash in 10 runs.

pixelb commented 2 years ago

@behlendorf Looks like you found the issue, great! For programs that might need to work around this issue on older openzfs instances, do you have any suggestions for how best to distinguish working from non working? thanks

behlendorf commented 2 years ago

Thanks for testing this fix, I'll see about wrapping up a few loose ends and getting a PR opened today.

Yes, either setting for zfs_dmu_offset_next_sync should be fine, I'm glad you tested both. I switched the default from 0 to 1 to test that holes were correctly reported even for dirty files. We can leave the default unchanged to preserve the existing behavior, or switch it if the performance hit isn't too bad. Did you happen to notice is there was a significant difference in how long the test took to run?

@pixelb once we have the fix merged to master we'll get it backported to previous releases. If you can do an msync before checking for holes that would avoid the issue until it's fixed.

zx2c4 commented 2 years ago

I'm not sure if mmap(2) was involved in any of the failures reported above, but if it was that could explain things.

Go's linker writes binaries using mmap'd files, so I think this would explain what @thesamesam was seeing.

rincebrain commented 2 years ago

I didn't time it, no. I can.

rincebrain commented 2 years ago

I reran it just to be sure, but here's what I got for the first line from time from 10 runs each doing emerge go on my VM with zfs_dmu_offset_next_sync=0 and 1, in that order:

266.24user 169.93system 2:08.31elapsed 339%CPU (0avgtext+0avgdata 905616maxresident)k
264.99user 172.41system 2:08.78elapsed 339%CPU (0avgtext+0avgdata 901584maxresident)k
265.31user 172.64system 2:09.59elapsed 337%CPU (0avgtext+0avgdata 896992maxresident)k
265.94user 174.37system 2:09.35elapsed 340%CPU (0avgtext+0avgdata 909688maxresident)k
264.74user 173.37system 2:10.33elapsed 336%CPU (0avgtext+0avgdata 911112maxresident)k
264.00user 174.28system 2:10.30elapsed 336%CPU (0avgtext+0avgdata 885208maxresident)k
265.28user 174.65system 2:09.34elapsed 340%CPU (0avgtext+0avgdata 920204maxresident)k
265.44user 176.09system 2:10.09elapsed 339%CPU (0avgtext+0avgdata 884280maxresident)k
265.80user 174.61system 2:09.75elapsed 339%CPU (0avgtext+0avgdata 896716maxresident)k
264.20user 172.86system 2:09.09elapsed 338%CPU (0avgtext+0avgdata 923608maxresident)k
265.63user 174.20system 2:09.88elapsed 338%CPU (0avgtext+0avgdata 904192maxresident)k
265.12user 172.38system 2:08.60elapsed 340%CPU (0avgtext+0avgdata 886440maxresident)k
265.28user 174.53system 2:09.83elapsed 338%CPU (0avgtext+0avgdata 916408maxresident)k
266.57user 175.39system 2:10.42elapsed 338%CPU (0avgtext+0avgdata 915536maxresident)k
266.55user 176.39system 2:10.41elapsed 339%CPU (0avgtext+0avgdata 903176maxresident)k
264.78user 171.82system 2:08.48elapsed 339%CPU (0avgtext+0avgdata 907724maxresident)k
266.99user 176.07system 2:09.90elapsed 341%CPU (0avgtext+0avgdata 915456maxresident)k
266.57user 173.26system 2:09.43elapsed 339%CPU (0avgtext+0avgdata 907820maxresident)k
264.94user 171.56system 2:09.42elapsed 337%CPU (0avgtext+0avgdata 900348maxresident)k
264.30user 175.13system 2:09.26elapsed 339%CPU (0avgtext+0avgdata 878324maxresident)k

So I guess very much not the bottleneck, at least on my setup?

(I'm also still pretty curious why this is only reproducible in some environments...)

zx2c4 commented 2 years ago

(I'm also still pretty curious why this is only reproducible in some environments...)

Some folks mount /var/tmp/portage as tmpfs to speed up compilation. In those cases, the go linker would be mmaping files there, and tmpfs doesn't have this problem. Later, when merging the temporary directory into the root filesystem, portage detects that it's a cross-filesystem move, and makes a copy using ordinary techniques without mmap.