Open codyps opened 1 year ago
https://github.com/openzfs/zfs/issues/13431 describes an issue that might be the same or related (it involves likely newer versions of cryptsetup, which will use luks2). It shows similar zio errors.
https://github.com/openzfs/zfs/issues/13362 also involves luks, and has similar zio errors, but is observing other symptoms (io hangs, though perhaps this is from different operations causing the io).
Can you share what `zdb -C` thinks of the situation, and what `z4.2`'s settings are?

> Can you share what `zdb -C` thinks of the situation, and what `z4.2`'s settings are?
Here's the `zdb -C tank` (`tank` is the pool where the replace is occurring). Is there some other zdb info needed?

In the details, one can see the replacement of `z4.2` with `z14.3`. I previously replaced another disk (now `z14.0`) in the same way, which is why I know using the `--type luks1` option resolves the IO errors.

Here are the details for `z4.2` (`sdy` is the backing device for `z4.2`):
```
y@arnold ~ % sudo blockdev --report /dev/mapper/z4.2
RO    RA   SSZ BSZ StartSec            Size   Device
rw   256   512 512        0   4000751378944   /dev/mapper/z4.2
y@arnold ~ % sudo blockdev --report /dev/sdy
RO    RA   SSZ BSZ StartSec            Size   Device
rw   256   512 512        0   4000753476096   /dev/sdy
y@arnold ~ % sudo cryptsetup luksDump /dev/sdy
LUKS header information for /dev/sdy

Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha512
Payload offset: 4096
MK bits:        256
MK digest:      xxxx
MK salt:        xxxx
MK iterations:  24750
UUID:           3b6a0f4b-fd99-47e6-916c-4d3419ff8757

Key Slot 0: ENABLED
        Iterations:          99533
        Salt:                xxxx
        Key material offset: 8
        AF stripes:          4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table | grep z4.2
z4.2: 0 7813967537 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 65:128 4096 1 allow_discards
```
From the `zdb -C` output in the previous comment, one can see that some luks2 devices are in the pool (and have not been observed to emit io errors). Taking `z12.0` as an example device, here's some extra info about it:
```
y@arnold ~ % sudo blockdev --report /dev/mapper/z12.0
RO    RA   SSZ  BSZ StartSec            Size   Device
rw   256   512 4096        0  12000121847808   /dev/mapper/z12.0
y@arnold ~ % sudo blockdev --report /dev/sdr
RO    RA   SSZ  BSZ StartSec            Size   Device
rw   256   512 4096        0  12000138625024   /dev/sdr
y@arnold ~ % sudo cryptsetup luksDump /dev/sdr
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           bb711a67-822b-4a50-8e52-2664de603c12
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 512 [bytes]

Keyslots:
  0: luks2
        Key:         512 bits
        Priority:    normal
        Cipher:      aes-xts-plain64
        Cipher key:  512 bits
        PBKDF:       argon2i
        Time cost:   6
        Memory:      1048576
        Threads:     4
        Salt:        xxxx
        AF stripes:  4000
        AF hash:     sha256
        Area offset: 32768 [bytes]
        Area length: 258048 [bytes]
        Digest ID:   0
Tokens:
Digests:
  0: pbkdf2
        Hash:       sha256
        Iterations: 141241
        Salt:       xxxx
        Digest:     xxxx
y@arnold ~ % sudo dmsetup table | grep z12.0
z12.0: 0 23437737984 crypt aes-xts-plain64 :64:logon:cryptsetup:bb711a67-822b-4a50-8e52-2664de603c12-d0 0 65:16 32768 1 allow_discards
```
Note that while `z12.0`/`sdr` (the working-fine device) is using luks2, it has a sector size of 512B (and not 4096B like the devices that emit errors).
I did notice that, yes.
While trying to reproduce this, I discovered I can convince ZFS without much work to try replacing a 512n device with a 4kn device on an ashift 9 vdev, which, uh, Does Not End Well At All.
But that doesn't seem to be what happened to you, here.
I'm now wondering if somehow it stashed some things on the vdev not 4k aligned, and because it was a 512n device it went fine, but now trying to 1:1 mirror is going bonkers.
...o-oh. I had a bad idea, actually. I wonder if the partition isn't 4k aligned, and in trying to replicate the partition table it's resulting in non-4k aligned accesses...let me go read those IO errors you pasted again.
e: well, LUKS, so not exactly a partition, but like, the leading offset...anyway.
edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?
> edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?
Watching `dmesg -w | grep -v audit:` (to remove all the noisy audit messages on my system) shows only the zio errors occurring.
Here's another set: (detached z14.3 and re-luksFormatted it in luks2 to get this output)
```
y@arnold ~ % sudo dmesg -w | grep -v 'audit:'
[12188887.373045] sctp: Hash tables configured (bind 1024/1024)
[12188914.429696] kauditd_printk_skb: 10 callbacks suppressed
[12188919.981251] kauditd_printk_skb: 35 callbacks suppressed
[12191505.790358] kauditd_printk_skb: 126 callbacks suppressed
[12195028.411209] kauditd_printk_skb: 22 callbacks suppressed
[12198268.934797] kauditd_printk_skb: 27 callbacks suppressed
[12198402.341190] kauditd_printk_skb: 27 callbacks suppressed
[12198494.544891] kauditd_printk_skb: 27 callbacks suppressed
[12212089.961177] kauditd_printk_skb: 30 callbacks suppressed
[12212632.706882] kauditd_printk_skb: 6 callbacks suppressed
[12213959.402093] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa
[12213965.097787] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1784842436608 size=4096 flags=1808aa
[12213968.581657] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1778357694464 size=12288 flags=1808aa
[12213979.887437] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1779879501824 size=4096 flags=1808aa
[12214226.435283] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3849360289792 size=20480 flags=1808aa
[12214227.015763] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3850519625728 size=32768 flags=1808aa
[12214756.511994] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3808536850432 size=4096 flags=1808aa
```
So: no errors from the device, the sd layer, etc. about unaligned writes (or any other error of any kind). And the offsets listed in the zio messages are aligned to 4096, and all the sizes logged are also multiples of 4096.
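As a quick sanity check on that claim, here's a small standalone sketch (not from the thread) that parses the zio lines above and reports any offset or size that is not a multiple of 4096:

```python
import re

# A few of the zio error lines copied from the dmesg output above.
LINES = [
    "zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa",
    "zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1784842436608 size=4096 flags=1808aa",
    "zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1778357694464 size=12288 flags=1808aa",
]

PAT = re.compile(r"offset=(\d+) size=(\d+)")

def misaligned(lines, align=4096):
    """Return (offset, size) pairs where either value is not a multiple of `align`."""
    pairs = [tuple(map(int, PAT.search(line).groups())) for line in lines]
    return [(o, s) for o, s in pairs if o % align or s % align]

print(misaligned(LINES))  # → [] — every offset and size is 4096-aligned
```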
As far as offsets in the luks data, the `luksDump` output above includes the offset used for the data segment as 16777216, which is divisible by 4096.

The `dmsetup table` output (again, included in previous comments) shows a start of 32768, which is correct if the units for it are 512B sectors (which it might be; the documentation here is unclear, and I haven't been able to make out what the linux kernel intends).

All of the luks formatting is done on the entire disk (iow, running `cryptsetup luksFormat /dev/sdv`), without any partitions. So the offset luks provides is the only offset in the physical disk before it is handed off to zfs.
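If it helps: my understanding (worth verifying against the device-mapper docs) is that dm tables always count in 512-byte sectors, regardless of the device's logical sector size. Under that assumption, the two numbers above agree exactly:

```python
# `dmsetup table` reports the z12.0 data start as 32768;
# `cryptsetup luksDump` reports the data segment at byte offset 16777216.
DM_SECTOR = 512               # assumed device-mapper table unit
dm_offset_sectors = 32768     # from `dmsetup table`
luks_offset_bytes = 16777216  # from `cryptsetup luksDump`

# 32768 sectors * 512 B/sector = 16777216 B = 16 MiB, which is 4096-aligned.
assert dm_offset_sectors * DM_SECTOR == luks_offset_bytes
print(dm_offset_sectors * DM_SECTOR)  # 16777216
```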
An interesting data point would be to see if the first 100 or so errors you get when doing the replace with a LUKS2 header are the same every time, as that might tell us more about whether it's deterministic or something very strange...
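One way to run that comparison (a sketch; `run1`/`run2` stand in for zio lines saved from two separate replace attempts): strip the timestamps and compare just the (offset, size) sequences:

```python
import re

PAT = re.compile(r"offset=(\d+) size=(\d+)")

def signature(log_text, limit=100):
    """First `limit` (offset, size) pairs, ignoring timestamps and flags."""
    return [tuple(map(int, m.groups())) for m in PAT.finditer(log_text)][:limit]

# Hypothetical captures from two attempts (timestamps differ, errors match):
run1 = "[12213959.4] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa"
run2 = "[99999999.9] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa"

print(signature(run1) == signature(run2))  # True → same error sequence, i.e. looks deterministic
```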
@jmesmon thank you for this thread, I was really confused about whether I have a problem with XHCI, the disk, zfs, LUKS, or all of them together.

It seems though that the LUKS device sector size doesn't matter: I use disks with 4K physical/logical sector size, so my LUKS device has a 4K sector size even with `--type luks1`:
```
blockdev --report /dev/sdb
RO       RA  SSZ   BSZ StartSec            Size   Device
rw  1048544 4096 16384        0  16000900661248   /dev/sdb
blockdev --report /dev/mapper/crypt_disk-a
RO       RA  SSZ  BSZ StartSec            Size   Device
rw  1048544 4096 4096        0  16000632229888   /dev/mapper/crypt_disk-a
```
I use `-o ashift=12`, but the math is simple: LUKS1 header = no problem; LUKS2 header = constant stream of zio errors like:
```
[ 5816.334320] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591015936 size=32768 flags=40080c80
[ 5816.334337] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=8192 flags=40080c80
[ 5816.334353] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=8192 flags=40080c80
[ 5816.335251] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=4096 flags=188881
[ 5816.335556] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591028224 size=4096 flags=188881
[ 5816.335731] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=4096 flags=188881
[ 5816.342210] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591106048 size=49152 flags=40080c80
```
The problem, when I use a LUKS2 header, occurs when constantly writing to the device at max speed (i.e. copying a large set of data). It's funny that it is less likely to occur when writing at lower speed, but eventually it happens as well.

I really have no idea if this is a problem with zfs or LUKS. Any ideas?
> The problem, when I use a LUKS2 header, occurs when constantly writing to the device at max speed (i.e. copying a large set of data). It's funny that it is less likely to occur when writing at lower speed, but eventually it happens as well.
How full is your pool?
I hit this issue a good while back and thought it might be my SSD (WD SN850) randomly disconnecting since I could find some reports about that on the WD forums. Though it seemed to be related to how full the pool/drive was, with enough free space I couldn't trigger it. Switching to another (and larger) drive solved the issue, until now that I've filled it too.
The best way for me to trigger it also hasn't been writing at full (sequential) speed but decompressing and compiling chromium, which is ~906000 files over 19GB.
Currently, on a 3200GiB partition, the breakpoint where it starts occurring seems to be somewhere around 90% allocated, and it came to mind that maybe it's here that zfs changes its allocation method? Slightly over 200GB is reserved for zvols though, so the free space for filesystems is ~100GB. Unclear if this affects it.
```
# blockdev --report /dev/mapper/root
RO    RA  SSZ  BSZ  StartSec           Size   Device
rw   256 4096 4096         0  3435957059584   /dev/mapper/root
# blockdev --report /dev/nvme0n1p3
RO    RA  SSZ  BSZ  StartSec           Size   Device
rw   256 4096 4096 270534656  3435973836800   /dev/nvme0n1p3
```
The drive itself is formatted to 4K sectors, so misalignment shouldn't be possible (?):
```
# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 3,73 TiB, 4096805658624 bytes, 1000196694 sectors
Disk model: KINGSTON SKC3000D4096G
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 081BA126-DA19-42BD-AEAB-CF1A6583CA0B

Device              Start       End   Sectors  Size Type
/dev/nvme0n1p1        256    262399    262144    1G EFI System
/dev/nvme0n1p2     262400  33816831  33554432  128G Linux swap
/dev/nvme0n1p3   33816832 872677631 838860800  3,1T Linux filesystem

# cryptsetup luksDump /dev/nvme0n1p3
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
```
> How full is your pool?
Almost empty when triggering the bug. It's a new system (brand new disks). I admit it's a bit unusual: Gentoo arm64 w/ Asahi kernel on a Mac Mini M1 ;) But otherwise it is rock stable, zfs-2.1.13.

Also, I don't see any problems with the drives in dmesg.

Currently, with a LUKS1 header, I'm seeing 4TB usage (a couple of zvols and 10 697 081 files in the fs) and no zio errors whatsoever.
Hitting the recent bug 15533 I saw the same zio error=5 type=2 as in this bug, so I started to search for this issue to check similarities. While searching I happened upon this pull request:
The identified triggers there interestingly are:
- a write-heavy load, such that aggregation past 512K happened quite frequently
- a pool approaching 90% fragmentation, such that gang blocks were being produced (this is significant only insofar as gang blocks are backed by small memory allocations, which exacerbate the problem)
which seem to match the identified triggers in this issue too. The PR was closed in favour of an upcoming new take, though.
@robn I can't find any new PR, but I'd be interested in testing if what you got fixes this issue too.
I should be posting a significant rework of `vdev_disk` in a few days (it was written for a client, and is just finishing testing). That at least will fix up the problem described in #15414, and I suspect this too. But I don't really recommend waiting when a revert will sort it out for now; it's not a given that my patch will be right, or be accepted.
It's not totally clear to me that this is a result of misaligned aggregation, but you might try drastically lowering `zfs_vdev_aggregation_limit` (make it, say, `131072`; if it "fixes" it, try raising it; if not, try lowering it further). This will reduce throughput considerably, but things might still work.
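On Linux, this tunable can be changed at runtime through the OpenZFS module parameters under sysfs (the value is in bytes; the change does not persist across a module reload). A sketch:

```shell
# Lower the aggregation limit to 128 KiB (takes effect immediately):
echo 131072 | sudo tee /sys/module/zfs/parameters/zfs_vdev_aggregation_limit

# Confirm the current value:
cat /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
```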
(more in https://github.com/openzfs/zfs/issues/15533#issuecomment-1825326626).
PR with possible fix from robn: https://github.com/openzfs/zfs/pull/15588 (linking for my reference)
FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting `zfs_vdev_disk_classic=0` in your `zfs` module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
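One common way to set that persistently (the file name below is just an example) is a modprobe options file, applied the next time the `zfs` module loads:

```shell
# /etc/modprobe.d/zfs.conf
options zfs zfs_vdev_disk_classic=0
```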
on 6.8.10-asahi nixos, zfs 2.2.4, macbook air m2, zfs_vdev_disk_classic=0 and zfs_vdev_disk_classic=1 both result in several hundred zio error=5 type=2 with a luks2 header while trying to install.
LUKS1 results in no errors. fyi @robn
Please see here for a debugging patch that I hope will reveal more info about what's going on: https://github.com/openzfs/zfs/issues/15646#issuecomment-2283206150
(if possible, I would prefer to keep discussion going in #15646, so it's all in one place).
System information
Describe the problem you're observing
With an existing pool with ashift 12:

I create a new vdev with `cryptsetup luksFormat /dev/sdw`, with cryptsetup version 2.6.1. This results in a luks (luks2) device (used as a vdev) with a sector size of 4096. Note that `/dev/sdw` is the underlying device (a 14 TB hard drive), and `/dev/mapper/z14.3` is the cryptsetup device using `sdw`.

I then add it to my existing pool with `zpool replace tank z4.2 z14.3`. Eventually (before the replace/resilver completes), zio reports errors and the vdev is considered failed.

Next, after detaching the vdev from the pool and `cryptsetup close`, I use `cryptsetup luksFormat --type luks1` instead (to force the use of a luks1 header instead of luks2). This results in a vdev with 512 byte sectors.

With this vdev (luks1, 512B sectors), no zio errors are observed and the `zpool replace` completes successfully.