delan opened 7 months ago
Are you getting any errors in dmesg from the controller when this happens? Because if this is that bug, then I would expect the controller or disks to be complaining a lot.
That seems like it's coming back from LUKS without going out to the disk, then, at which point, it sounds like #15533, I guess.
The only code in dm-crypt.c that returns EIO is two instances of:

```c
/* Reject unexpected unaligned bio. */
if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
	return -EIO;
```
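Since cc->sector_size is a power of two, that bitmask test is nonzero exactly when the bio length is not a whole multiple of the crypt sector size. A toy shell sketch of the same arithmetic (the lengths here are invented for illustration, not taken from the logs):

```shell
# Toy illustration of dm-crypt's alignment test: for a power-of-two
# sector size, (len & (sector_size - 1)) != 0 iff len is not a
# multiple of sector_size.
sector_size=4096
for len in 4096 8192 3584 512; do
  if (( len & (sector_size - 1) )); then
    echo "bv_len=$len: unaligned, dm-crypt would return EIO"
  else
    echo "bv_len=$len: aligned, ok"
  fi
done
```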
While not a smoking gun, that's good enough reason to believe this is indeed #15533, hoped to be solved by #15588.
I figured downgrading zfs to 2.1.x would help buy me some time until #15588 lands, but surprisingly I still see this issue going as far back as zfs 2.1.9.
I suspect the upgrade from NixOS 23.05 has changed some confounding variable like the kernel version or downstream patches. I’ll test with other distro versions and maybe even other distros.
— 2023-12-26
distro | linux | zfs{,-kmod} | affected?
---|---|---|---
NixOS 23.05 (https://github.com/NixOS/nixpkgs/commit/7790e078f8979a9fcd543f9a47427eeaba38f268) | 6.1.69 | 2.1.14-1 | yes
NixOS 23.05 (https://github.com/NixOS/nixpkgs/commit/18784aac1013da9b442adf29b6c7c228518b5d3f) | 6.1.44 | 2.1.12-1 | yes
Both the tip of NixOS 23.05 and the commit I was running before the upgrade in step 2 are affected. That isn't strictly contradictory, because I didn’t start receiving backups until step 8, but it smells weird to me, given that step 1 also involved receiving snapshots without any problems.
— 2023-12-28
I wanted to eliminate the possibility of a problem with pool “ocean” specifically, like the pool being somehow tainted by my testing, since that seemed more likely than the idea that #15533 affects zfs 2.1.x but no one before me knew.
In the end I wasn’t able to reproduce the problem on any test pool with the spare disks I had nearby, but I did discover a workaround that will at least help me get my backups going again in the meantime.
Results of non-incremental sends:
In short, you can work around the problem by sticking a losetup -f between the zpool and luks2. The loop device seems to be more forgiving of unaligned writes.
$ for i in ocean{0x0,0x1,1x0,1x1,2x0,2x2,3x0,3x1,4x0,4x1,Sx0,Sx1,.arc}; do sudo cryptsetup open -d path/to/passphrase {/dev/disk/by-partlabel/,}$i; done
$ for i in ocean{0x0,0x1,1x0,1x1,2x0,2x2,3x0,3x1,4x0,4x1,Sx0,Sx1,.arc}; do sudo losetup -f --show /dev/mapper/$i; done
$ set --; for i in /dev/loop{0..12}; do set -- "$@" -d $i; done; echo "$@"
-d /dev/loop0 -d /dev/loop1 -d /dev/loop2 -d /dev/loop3 -d /dev/loop4 -d /dev/loop5 -d /dev/loop6 -d /dev/loop7 -d /dev/loop8 -d /dev/loop9 -d /dev/loop10 -d /dev/loop11 -d /dev/loop12
$ sudo zpool import "$@" ocean
More testing to reproduce the problem on another pool is possible, but for now I need a break :)
FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting zfs_vdev_disk_classic=0 in your zfs module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
Thanks! I’ll test 2.2.4 and/or zfs_vdev_disk_classic=0 and let you know how I go :D
@delan you'll need to set zfs_vdev_disk_classic=0 to test this. 2.2.4 defaults to 1, which means it continues to use the previous code.
@delan see your installed man 4 zfs for more info, or eebf00be.
Thanks! So far, I’m unable to reproduce the original failure — CKSUM and WRITE errors when receiving snapshots — under zfs 2.2.4 with zfs_vdev_disk_classic=0, or even when I go back to the default zfs_vdev_disk_classic=1!
Last night was a bit more complicated, because although I received a few snapshots, I upgraded in the middle of rearranging and expanding my pool. Here’s the full timeline:
zpool history ocean | rg '^2024-05-(09|10|11)'
journalctl -S '2024-04-12 11:08:07' -t zed -t kernel | rg ' kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
journalctl -S '2024-04-12 11:08:07' -t zed -t kernel | rg ' kernel: zio '
[journalctl.summary.txt](https://github.com/openzfs/zfs/files/15281513/journalctl.summary.txt)
- zpool attach -s ocean loop109(ocean4x0) loop113(ocean4x2)
- class=resilver_finish pool='ocean', class=scrub_start pool='ocean'
- zfs recv from one machine, no errors(!)
- zpool detach ocean ocean4x1
- zpool add ocean mirror mapper/ocean5x0 mapper/ocean5x1
- pool=ocean vdev=/dev/mapper/ocean4x0 error=5 type=1
- pool=ocean vdev=/dev/mapper/ocean0x0 error=5 type=1
- zfs recv from three machines, no errors(!)
- zfs recv from one machine, no errors(!)
- class=scrub_finish pool='ocean'
Interestingly (to me), the errors didn’t happen anywhere near when I received snapshots, only in a half-hour window near the end of the scrub phase of the zpool attach -s. And this time, there were read errors, not just write errors.
READ/WRITE/CKSUM errors reproducible when scrubbing, where all of the read and write errors are EIO:
$ journalctl -t zed -t kernel | rg ' kernel: zio | kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
Should I create a separate issue for that and close this one, since receiving snapshots seems to work for me now?
I'm content to keep it here, since we're already here and I'm not sure that it's unrelated yet.
So this is interesting! If I'm reading the logs right, the change from classic to new gets rid of most of the noise, and what remains is scrub/repair IO that appears identical in both modes (same flags, offsets, sizes, etc). The small IO sizes suggest to me that it's not directly to do with splitting or misalignment - those seem way too small to trip those issues.
I assume there's nothing more useful in the kernel log, like errors from intermediate layers?
Does your trick of inserting a loop device still work?
Regardless of whether or not the loop helps, could you describe the layout of your devices, DM layers, etc, and then the pool on top? Possibly that's just the output from lsblk -t and zdb -C -e ocean. Mostly what I'm looking to understand is the start, size and block size of each layer in the stack, and the kind (driver) of each. Then I can read through the driver code and hopefully figure out where the IO went and what happened at each step.
(If the loop does help, then that output without and with would be great too!).
Thanks! Sorry this has gone on so long. I do know it sucks when you're the one person with the setup that tickles things just so.
> So this is interesting! If I'm reading the logs right, the change from classic to new gets rid of most of the noise, and what remains is scrub/repair IO that appears identical in both modes (same flags, offsets, sizes, etc). The small IO sizes suggest to me that it's not directly to do with splitting or misalignment - those seem way too small to trip those issues.
I accidentally included the zfs_vdev_disk_classic=0 logs (May 13 10:31:51 onwards) in the zfs_vdev_disk_classic=1 logs above, so they may have been a bit misleading. With the logs fixed (zfs 2.2.4, zfs_vdev_disk_classic=0):
$ < scrub.classic.txt rg -o 'offset=[^ ]+' | sort -u > offsets.classic.txt
$ < scrub.new.txt rg -o 'offset=[^ ]+' | sort -u > offsets.new.txt
$ < scrub.classic.txt rg -o 'offset=[^ ]+ size=[^ ]+ flags=[^ ]+' | sort -u > identical.classic.txt
$ < scrub.new.txt rg -o 'offset=[^ ]+ size=[^ ]+ flags=[^ ]+' | sort -u > identical.new.txt
$ comm -23 offsets.classic.txt offsets.new.txt | wc -l
1021 (offsets found in classic only)
$ comm -12 offsets.classic.txt offsets.new.txt | wc -l
48 (offsets found in both)
$ comm -13 offsets.classic.txt offsets.new.txt | wc -l
0 (offsets found in new only)
$ comm -23 identical.classic.txt identical.new.txt | wc -l
1112 ((offset,size,flag)s found in classic only)
$ comm -12 identical.classic.txt identical.new.txt | wc -l
50 ((offset,size,flag)s found in both)
$ comm -13 identical.classic.txt identical.new.txt | wc -l
46 ((offset,size,flag)s found in new only)
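For anyone following along, comm(1) expects sorted, unique input and splits lines into left-only, common, and right-only sets; here is the same technique on toy data (the offsets below are made up):

```shell
# Toy version of the comparison above: sort -u each side, then comm
# counts lines in classic only (-23), in both (-12), and in new only (-13).
dir=$(mktemp -d)
printf 'offset=100\noffset=200\noffset=300\n' | sort -u > "$dir/classic.txt"
printf 'offset=200\noffset=400\n'             | sort -u > "$dir/new.txt"
echo "classic only: $(comm -23 "$dir/classic.txt" "$dir/new.txt" | wc -l)"
echo "both:         $(comm -12 "$dir/classic.txt" "$dir/new.txt" | wc -l)"
echo "new only:     $(comm -13 "$dir/classic.txt" "$dir/new.txt" | wc -l)"
rm -r "$dir"
```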
> I assume there's nothing more useful in the kernel log, like errors from intermediate layers?
None from what I can see. Here are my unfiltered zed and kernel logs (journalctl -t zed -t kernel), though like the logs above, the “classic” logs include some surrounding boots for resilvering and changing settings:
> Does your trick of inserting a loop device still work?
Testing that now with zfs_vdev_disk_classic=0, I’ll keep you posted.
> Regardless of whether or not the loop helps, could you describe the layout of your devices, DM layers, etc, and then the pool on top? Possibly that's just the output from lsblk -t and zdb -C -e ocean. Mostly what I'm looking to understand is the start, size and block size of each layer in the stack, and the kind (driver) of each. Then I can read through the driver code and hopefully figure out where the IO went and what happened at each step.
zfs 2.2.4, zfs_vdev_disk_classic=0, with losetup workaround:
zdb -C -e ocean
lsblk --bytes -to +START,SIZE,TYPE,MODEL
cryptsetup status
losetup -l
> Thanks! Sorry this has gone on so long. I do know it sucks when you're the one person with the setup that tickles things just so.
No worries! I really appreciate the time you’ve spent investigating this.
No errors when scrubbing in zfs 2.2.4, zfs_vdev_disk_classic=0, with losetup workaround:
$ journalctl -b -t zed -t kernel | rg ' kernel: zio | kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
May 15 01:44:12 venus kernel: Linux version 6.1.90 (nixbld@localhost) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.40) #1-NixOS SMP PREEMPT_DYNAMIC Thu May 2 14:29:32 UTC 2024
May 15 01:44:14 venus zed[4231]: ZFS Event Daemon 2.2.4-1 (PID 4231)
May 15 01:44:14 venus zed[4335]: eid=13 class=resilver_start pool='ocean'
May 15 01:46:00 venus zed[7907]: eid=16 class=resilver_finish pool='ocean'
May 15 10:54:34 venus zed[67152]: eid=18 class=scrub_start pool='ocean'
May 16 03:00:35 venus zed[1724077]: eid=122 class=scrub_finish pool='ocean'
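As an aside, the filter pattern used with rg throughout also works with plain grep -E; a toy run over sample lines shaped like the ones above (the last line is deliberate noise and gets dropped):

```shell
# Toy demonstration of the log filter: the same pattern passed to rg
# above, run with grep -E over a few invented sample lines.
printf '%s\n' \
  'May 15 01:44:12 venus kernel: Linux version 6.1.90' \
  'May 15 01:44:14 venus zed[4231]: ZFS Event Daemon 2.2.4-1 (PID 4231)' \
  'May 15 01:44:14 venus zed[4335]: eid=13 class=resilver_start pool=ocean' \
  'May 15 01:45:00 venus sshd[99]: unrelated noise' |
grep -E ' kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
```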
> (If the loop does help, then that output without and with would be great too!).
zfs 2.2.4, zfs_vdev_disk_classic=0, without losetup workaround:
zdb -C -e ocean
lsblk --bytes -to +START,SIZE,TYPE,MODEL
cryptsetup status
losetup -l is empty

I can now reproduce the problem immediately, rather than having to wait >10 hours each time, by patching zfs to only scrub a dataset known to be especially vulnerable to this bug. Anything else I can do to help, let me know :)
Thanks for all the extra info.
I wasn't able to reproduce this, but also I couldn't exactly reproduce the structure either, because I don't have any drives that report physical 4096 / logical 512, and I couldn't persuade qemu to invent one that also worked with the partition offsets.
I don't know if that is significant, but changing block sizes in the stack causing IOs at one level to not have the right shape at another level feels like it might be on the right track. So I suspect something around that. I did a bit of a code read through the layers but without having a good sense of what the input looks like, it's hard to guess which way the branches fall out.
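One way to sanity-check that hunch by hand, with made-up numbers rather than anything measured from this pool: lsblk reports START in 512-byte sectors, and a layer's start offset only lines up with the layer beneath it when it is a whole multiple of that layer's physical block size.

```shell
# Hypothetical alignment check: convert a start offset in 512-byte
# sectors to bytes, then test divisibility by the lower layer's
# physical block size. Example values are invented.
check_align() {
  local start_sectors=$1 phys_bytes=$2
  local start_bytes=$((start_sectors * 512))
  if ((start_bytes % phys_bytes)); then
    echo "start ${start_bytes}B: NOT aligned to ${phys_bytes}B"
  else
    echo "start ${start_bytes}B: aligned to ${phys_bytes}B"
  fi
}
check_align 2048 4096   # common 1 MiB partition start
check_align 34 4096     # old-style GPT first usable sector (17408 B)
```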
The next step is likely to spend some time with blktrace/blkparse, looking for weird split/remap/merge ops on those failing IOs as they move up and down the stack. Possibly we'll also need something to show the page structure within the IO, which I think means an OpenZFS patch, unless we get really good at bpftrace on hardcore mode. Probably there's other/better ways, but I don't know them.
I'll work on getting a patch done and some commands to run as soon as I can. However, I'm hopping on a plane tomorrow and won't be very consistently available for the next couple of weeks. I'll try my best but it might be a little while before I have anything for you. (I mean, no warranty and all that, but still, there's a real problem here somewhere and you've been very patient and I don't wanna take that for granted :) ).
Hi, any updates on this? I ended up having to detach and reattach a few of the disks to get the resilvering to finish after my last round of tests, but now I should be ready for further testing.
Hi, sorry, I was away for a bit longer than expected, and am only just starting to get back onto this.
Right now I don't have anything else to give you. I'm part way through setting up a test to build semi-random DM stacks with different block sizes and then running various known-stressful workloads against them, trying to find a scenario that reproduces the problem that I can then study. I'm pretty swamped on dayjob and home stuff, but I'm trying to push it along a little each evening. I would hope to be running actual tests sometime on the weekend, but I'm not sure - lot going on with not-computers atm :grimacing:
I'll do my best to let you know something in the next week or so, hopefully more than "I didn't get to it yet". Sorry, and thanks again!
This is probably still #15533, but making a separate issue just in case.
System information
zfs-kmod-2.2.2-1
Describe the problem you're observing
When receiving snapshots, I get hundreds of WRITE errors and many CKSUM errors, even after updating to 2.2.2. While all of my leaf vdevs are on LUKS, the errors only happen on vdevs that are connected to one of my mpt3sas controllers, never on vdevs connected to the onboard ahci.
Before updating to 2.2.2, I was briefly on 2.2.1, and I got many WRITE and CKSUM errors from normal writes (probably qbittorrent). More context about my situation, starting around a week ago:
Describe how to reproduce the problem
I have backups, so I’m happy to split my pool and create a test pool over half as many disks if needed.
Include any warning/errors/backtraces from the system logs
The leaf vdev names look funky in these logs, because the pool was imported with -d /dev/disk/by-id but the vdevs were reattached from /dev/mapper:
The rest look nicer because I reimported “ocean” with -d /dev/mapper:
My disks should be fine, and the UDMA CRC error counts, while non-zero, are stable:
Aside from the special mirror /dev/mapper/oceanSx{0,1}, half of each mirror is connected to one mpt3sas (0000:01:00.0) and the other half to another mpt3sas (0000:06:00.0):