delan opened 7 months ago
Are you getting any errors in dmesg from the controller when this happens? Because if this is that bug, then I would expect the controller or disks to be complaining a lot.
That seems like it's coming back from LUKS without going out to the disk, then, at which point, it sounds like #15533, I guess.
The only code in dm-crypt.c that returns EIO is two instances of:

```c
/* Reject unexpected unaligned bio. */
if (unlikely(bv_in.bv_len & (cc->sector_size - 1)))
	return -EIO;
```
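Since cc->sector_size is a power of two, that bitmask test is nonzero exactly when the bio length is not a whole multiple of the crypt sector size. A toy shell sketch of the same arithmetic (the lengths here are invented for illustration, not taken from the logs):

```shell
# Toy illustration of dm-crypt's alignment test: for a power-of-two
# sector size, (len & (sector_size - 1)) != 0 iff len is not a
# multiple of sector_size.
sector_size=4096
for len in 4096 8192 3584 512; do
  if (( len & (sector_size - 1) )); then
    echo "bv_len=$len: unaligned, dm-crypt would return EIO"
  else
    echo "bv_len=$len: aligned, ok"
  fi
done
```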
While not a smoking gun, that's good enough reason to believe this is indeed #15533, hoped to be solved by #15588.
I figured downgrading zfs to 2.1.x would help buy me some time until #15588 lands, but surprisingly I still see this issue going as far back as zfs 2.1.9.
I suspect the upgrade from NixOS 23.05 has changed some confounding variable like the kernel version or downstream patches. I’ll test with other distro versions and maybe even other distros.
— 2023-12-26
distro | linux | zfs{,-kmod} | affected?
---|---|---|---
NixOS 23.05 (https://github.com/NixOS/nixpkgs/commit/7790e078f8979a9fcd543f9a47427eeaba38f268) | 6.1.69 | 2.1.14-1 | yes
NixOS 23.05 (https://github.com/NixOS/nixpkgs/commit/18784aac1013da9b442adf29b6c7c228518b5d3f) | 6.1.44 | 2.1.12-1 | yes
Both the tip of NixOS 23.05 and the commit I was running before the upgrade in step 2 are affected. That isn't strictly contradictory, because I didn’t start receiving backups until step 8, but it smells weird to me, given that step 1 also involved receiving snapshots without any problems.
— 2023-12-28
I wanted to eliminate the possibility of a problem with pool “ocean” specifically, like the pool being somehow tainted by my testing, since that seemed more likely than the idea that #15533 affects zfs 2.1.x but no one before me knew.
In the end I wasn’t able to reproduce the problem on any test pool with the spare disks I had nearby, but I did discover a workaround that will at least help me get my backups going again in the meantime.
Results of non-incremental sends:
In short, you can work around the problem by sticking a losetup -f between the zpool and luks2. The loop device seems to be more forgiving of unaligned writes.
$ for i in ocean{0x0,0x1,1x0,1x1,2x0,2x2,3x0,3x1,4x0,4x1,Sx0,Sx1,.arc}; do sudo cryptsetup open -d path/to/passphrase {/dev/disk/by-partlabel/,}$i; done
$ for i in ocean{0x0,0x1,1x0,1x1,2x0,2x2,3x0,3x1,4x0,4x1,Sx0,Sx1,.arc}; do sudo losetup -f --show /dev/mapper/$i; done
$ set --; for i in /dev/loop{0..12}; do set -- "$@" -d $i; done; echo "$@"
-d /dev/loop0 -d /dev/loop1 -d /dev/loop2 -d /dev/loop3 -d /dev/loop4 -d /dev/loop5 -d /dev/loop6 -d /dev/loop7 -d /dev/loop8 -d /dev/loop9 -d /dev/loop10 -d /dev/loop11 -d /dev/loop12
$ sudo zpool import "$@" ocean
More testing to reproduce the problem on another pool is possible, but for now I need a break :)
FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting zfs_vdev_disk_classic=0 in your zfs module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
Thanks! I’ll test 2.2.4 and/or zfs_vdev_disk_classic=0 and let you know how I go :D
@delan you'll need to set zfs_vdev_disk_classic=0 to test this. 2.2.4 defaults to 1, which means it continues to use the previous code.
@delan see your installed man 4 zfs for more info, or eebf00be.
Thanks! So far, I’m unable to reproduce the original failure — CKSUM and WRITE errors when receiving snapshots — under zfs 2.2.4 with zfs_vdev_disk_classic=0, or even when I go back to the default zfs_vdev_disk_classic=1!
Last night was a bit more complicated, because although I received a few snapshots, I upgraded in the middle of rearranging and expanding my pool. Here’s the full timeline:
zpool history ocean | rg '^2024-05-(09|10|11)'
journalctl -S '2024-04-12 11:08:07' -t zed -t kernel | rg ' kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
journalctl -S '2024-04-12 11:08:07' -t zed -t kernel | rg ' kernel: zio '
[journalctl.summary.txt](https://github.com/openzfs/zfs/files/15281513/journalctl.summary.txt)
- zpool attach -s ocean loop109(ocean4x0) loop113(ocean4x2)
- class=resilver_finish pool='ocean', class=scrub_start pool='ocean'
- zfs recv from one machine, no errors(!)
- zpool detach ocean ocean4x1
- zpool add ocean mirror mapper/ocean5x0 mapper/ocean5x1
- pool=ocean vdev=/dev/mapper/ocean4x0 error=5 type=1
- pool=ocean vdev=/dev/mapper/ocean0x0 error=5 type=1
- zfs recv from three machines, no errors(!)
- zfs recv from one machine, no errors(!)
- class=scrub_finish pool='ocean'
Interestingly (to me), the errors didn’t happen anywhere near when I received snapshots, only in a half-hour window near the end of the scrub phase of the zpool attach -s. And this time, there were read errors, not just write errors.
READ/WRITE/CKSUM errors reproducible when scrubbing, where all of the read and write errors are EIO:
$ journalctl -t zed -t kernel | rg ' kernel: zio | kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
Should I create a separate issue for that and close this one, since receiving snapshots seems to work for me now?
I'm content to keep it here, since we're already here and I'm not sure that it's unrelated yet.
So this is interesting! If I'm reading the logs right, the change from classic to new gets rid of most of the noise, and what remains is scrub/repair IO that appears identical in both modes (same flags, offsets, sizes, etc). The small IO sizes suggest to me that it's not directly to do with splitting or misalignment - those seem way too small to trip those issues.
I assume there's nothing more useful in the kernel log, like errors from intermediate layers?
Does your trick of inserting a loop device still work?
Regardless of whether or not the loop helps, could you describe the layout of your devices, DM layers, etc, and then the pool on top? Possibly that's just the output from lsblk -t and zdb -C -e ocean. Mostly what I'm looking to understand is the start, size and block size of each layer in the stack, and the kind (driver) of each. Then I can read through the driver code and hopefully figure out where the IO went and what happened at each step.
(If the loop does help, then that output without and with would be great too!).
Thanks! Sorry this has gone on so long. I do know it sucks when you're the one person with the setup that tickles things just so.
> So this is interesting! If I'm reading the logs right, the change from classic to new gets rid of most of the noise, and what remains is scrub/repair IO that appears identical in both modes (same flags, offsets, sizes, etc). The small IO sizes suggest to me that it's not directly to do with splitting or misalignment - those seem way too small to trip those issues.
I accidentally included the zfs_vdev_disk_classic=0 logs (May 13 10:31:51 onwards) in the zfs_vdev_disk_classic=1 logs above, so they may have been a bit misleading. With the logs fixed (zfs 2.2.4, zfs_vdev_disk_classic=0):
$ < scrub.classic.txt rg -o 'offset=[^ ]+' | sort -u > offsets.classic.txt
$ < scrub.new.txt rg -o 'offset=[^ ]+' | sort -u > offsets.new.txt
$ < scrub.classic.txt rg -o 'offset=[^ ]+ size=[^ ]+ flags=[^ ]+' | sort -u > identical.classic.txt
$ < scrub.new.txt rg -o 'offset=[^ ]+ size=[^ ]+ flags=[^ ]+' | sort -u > identical.new.txt
$ comm -23 offsets.classic.txt offsets.new.txt | wc -l
1021 (offsets found in classic only)
$ comm -12 offsets.classic.txt offsets.new.txt | wc -l
48 (offsets found in both)
$ comm -13 offsets.classic.txt offsets.new.txt | wc -l
0 (offsets found in new only)
$ comm -23 identical.classic.txt identical.new.txt | wc -l
1112 ((offset,size,flag)s found in classic only)
$ comm -12 identical.classic.txt identical.new.txt | wc -l
50 ((offset,size,flag)s found in both)
$ comm -13 identical.classic.txt identical.new.txt | wc -l
46 ((offset,size,flag)s found in new only)
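For anyone following along, comm(1) expects sorted, unique input and splits lines into left-only, common, and right-only sets; here is the same technique on toy data (the offsets below are made up):

```shell
# Toy version of the comparison above: sort -u each side, then comm
# counts lines in classic only (-23), in both (-12), and in new only (-13).
dir=$(mktemp -d)
printf 'offset=100\noffset=200\noffset=300\n' | sort -u > "$dir/classic.txt"
printf 'offset=200\noffset=400\n'             | sort -u > "$dir/new.txt"
echo "classic only: $(comm -23 "$dir/classic.txt" "$dir/new.txt" | wc -l)"
echo "both:         $(comm -12 "$dir/classic.txt" "$dir/new.txt" | wc -l)"
echo "new only:     $(comm -13 "$dir/classic.txt" "$dir/new.txt" | wc -l)"
rm -r "$dir"
```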
> I assume there's nothing more useful in the kernel log, like errors from intermediate layers?
None from what I can see. Here are my unfiltered zed and kernel logs (journalctl -t zed -t kernel), though like the logs above, the “classic” logs include some surrounding boots for resilvering and changing settings:
> Does your trick of inserting a loop device still work?
Testing that now with zfs_vdev_disk_classic=0, I’ll keep you posted.
> Regardless of whether or not the loop helps, could you describe the layout of your devices, DM layers, etc, and then the pool on top? Possibly that's just the output from lsblk -t and zdb -C -e ocean. Mostly what I'm looking to understand is the start, size and block size of each layer in the stack, and the kind (driver) of each. Then I can read through the driver code and hopefully figure out where the IO went and what happened at each step.
zfs 2.2.4, zfs_vdev_disk_classic=0, with losetup workaround:
zdb -C -e ocean
lsblk --bytes -to +START,SIZE,TYPE,MODEL
cryptsetup status
losetup -l
> Thanks! Sorry this has gone on so long. I do know it sucks when you're the one person with the setup that tickles things just so.
No worries! I really appreciate the time you’ve spent investigating this.
No errors when scrubbing in zfs 2.2.4, zfs_vdev_disk_classic=0, with losetup workaround:
$ journalctl -b -t zed -t kernel | rg ' kernel: zio | kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
May 15 01:44:12 venus kernel: Linux version 6.1.90 (nixbld@localhost) (gcc (GCC) 12.3.0, GNU ld (GNU Binutils) 2.40) #1-NixOS SMP PREEMPT_DYNAMIC Thu May 2 14:29:32 UTC 2024
May 15 01:44:14 venus zed[4231]: ZFS Event Daemon 2.2.4-1 (PID 4231)
May 15 01:44:14 venus zed[4335]: eid=13 class=resilver_start pool='ocean'
May 15 01:46:00 venus zed[7907]: eid=16 class=resilver_finish pool='ocean'
May 15 10:54:34 venus zed[67152]: eid=18 class=scrub_start pool='ocean'
May 16 03:00:35 venus zed[1724077]: eid=122 class=scrub_finish pool='ocean'
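As an aside, the filter pattern used with rg throughout also works with plain grep -E; a toy run over sample lines shaped like the ones above (the last line is deliberate noise and gets dropped):

```shell
# Toy demonstration of the log filter: the same pattern passed to rg
# above, run with grep -E over a few invented sample lines.
printf '%s\n' \
  'May 15 01:44:12 venus kernel: Linux version 6.1.90' \
  'May 15 01:44:14 venus zed[4231]: ZFS Event Daemon 2.2.4-1 (PID 4231)' \
  'May 15 01:44:14 venus zed[4335]: eid=13 class=resilver_start pool=ocean' \
  'May 15 01:45:00 venus sshd[99]: unrelated noise' |
grep -E ' kernel: Linux version |\]: ZFS Event Daemon | class=(vdev|resilver|scrub)_'
```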
> (If the loop does help, then that output without and with would be great too!).
zfs 2.2.4, zfs_vdev_disk_classic=0, without losetup workaround:
zdb -C -e ocean
lsblk --bytes -to +START,SIZE,TYPE,MODEL
cryptsetup status
losetup -l is empty

I can now reproduce the problem immediately, rather than having to wait >10 hours each time, by patching zfs to only scrub a dataset known to be especially vulnerable to this bug. Anything else I can do to help, let me know :)
Thanks for all the extra info.
I wasn't able to reproduce this, but also I couldn't exactly reproduce the structure either, because I don't have any drives that report physical 4096 / logical 512, and I couldn't persuade qemu to invent one that also worked with the partition offsets.
I don't know if that is significant, but changing block sizes in the stack causing IOs at one level to not have the right shape at another level feels like it might be on the right track. So I suspect something around that. I did a bit of a code read through the layers but without having a good sense of what the input looks like, it's hard to guess which way the branches fall out.
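One way to sanity-check that hunch by hand, with made-up numbers rather than anything measured from this pool: lsblk reports START in 512-byte sectors, and a layer's start offset only lines up with the layer beneath it when it is a whole multiple of that layer's physical block size.

```shell
# Hypothetical alignment check: convert a start offset in 512-byte
# sectors to bytes, then test divisibility by the lower layer's
# physical block size. Example values are invented.
check_align() {
  local start_sectors=$1 phys_bytes=$2
  local start_bytes=$((start_sectors * 512))
  if ((start_bytes % phys_bytes)); then
    echo "start ${start_bytes}B: NOT aligned to ${phys_bytes}B"
  else
    echo "start ${start_bytes}B: aligned to ${phys_bytes}B"
  fi
}
check_align 2048 4096   # common 1 MiB partition start
check_align 34 4096     # old-style GPT first usable sector (17408 B)
```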
The next step is likely to spend some time with blktrace/blkparse, looking for weird split/remap/merge ops on those failing IOs as they move up and down the stack. Possibly we'll also need something to show the page structure within the IO, which I think means an OpenZFS patch, unless we get really good at bpftrace on hardcore mode. Probably there's other/better ways, but I don't know them.
I'll work on getting a patch done and some commands to run as soon as I can. However, I'm hopping on a plane tomorrow and won't be very consistently available for the next couple of weeks. I'll try my best but it might be a little while before I have anything for you. (I mean, no warranty and all that, but still, there's a real problem here somewhere and you've been very patient and I don't wanna take that for granted :) ).
Hi, any updates on this? I ended up having to detach and reattach a few of the disks to get the resilvering to finish after my last round of tests, but now I should be ready for further testing.
Hi, sorry, I was away for a bit longer than expected, and am only just starting to get back onto this.
Right now I don't have anything else to give you. I'm part way through setting up a test to build semi-random DM stacks with different block sizes and then running various known-stressful workloads against them, trying to find a scenario that reproduces the problem that I can then study. I'm pretty swamped on dayjob and home stuff, but I'm trying to push it along a little each evening. I would hope to be running actual tests sometime on the weekend, but I'm not sure - lot going on with not-computers atm :grimacing:
I'll do my best to let you know something in the next week or so, hopefully more than "I didn't get to it yet". Sorry, and thanks again!
This is probably still #15533, but making a separate issue just in case.
System information
zfs-kmod-2.2.2-1
Describe the problem you're observing
When receiving snapshots, I get hundreds of WRITE errors and many CKSUM errors, even after updating to 2.2.2. While all of my leaf vdevs are on LUKS, the errors only happen on vdevs that are connected to one of my mpt3sas controllers, never on vdevs connected to the onboard ahci.
Before updating to 2.2.2, I was briefly on 2.2.1, and I got many WRITE and CKSUM errors from normal writes (probably qbittorrent). More context about my situation, starting around a week ago:
Describe how to reproduce the problem
I have backups, so I’m happy to split my pool and create a test pool over half as many disks if needed.
Include any warning/errors/backtraces from the system logs
The leaf vdev names look funky in these logs, because the pool was imported with -d /dev/disk/by-id but the vdevs were reattached from /dev/mapper:
The rest look nicer because I reimported “ocean” with -d /dev/mapper:
My disks should be fine, and the UDMA CRC error counts, while non-zero, are stable:
Aside from the special mirror /dev/mapper/oceanSx{0,1}, half of each mirror is connected to one mpt3sas (0000:01:00.0) and the other half to another mpt3sas (0000:06:00.0):