openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

pool with ashift 12 on luks2 devices with sector size 4k causes repeated io errors #14533

Open codyps opened 1 year ago

codyps commented 1 year ago

System information

Type Version/Name
Distribution Name arch
Distribution Version n/a
Kernel Version 5.19.13-arch1-1
Architecture x86_64
OpenZFS Version 2.1.6-1

Describe the problem you're observing

With an existing pool with ashift 12:

y@arnold % zpool get ashift tank
NAME  PROPERTY  VALUE   SOURCE
tank  ashift    12      local

I create a new vdev with cryptsetup luksFormat /dev/sdw, using cryptsetup version 2.6.1:

y@arnold % cryptsetup -V
cryptsetup 2.6.1 flags: UDEV BLKID KEYRING KERNEL_CAPI 

This results in a luks (luks2) device (used as a vdev) with a sector size of 4096. Note that /dev/sdw is the underlying device (a 14 TB hard drive), and /dev/mapper/z14.3 is the cryptsetup device using sdw.

y@arnold % sudo blockdev --report /dev/sdw
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  14000519643136   /dev/sdw
y@arnold % sudo blockdev --report /dev/mapper/z14.3
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256  4096  4096               0  14000502865920   /dev/mapper/z14.3
y@arnold % sudo dmsetup table | grep z14.3
z14.3: 0 27344732160 crypt aes-xts-plain64 :64:logon:cryptsetup:e5b69a87-e02c-4b14-be58-6109488e172a-d0 0 65:96 32768 2 allow_discards sector_size:4096
y@arnold console % sudo cryptsetup luksDump /dev/sdw
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           e5b69a87-e02c-4b14-be58-6109488e172a
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
        offset: 16777216 [bytes]
        length: (whole device)
        cipher: aes-xts-plain64
        sector: 4096 [bytes]

Keyslots:
  0: luks2
        Key:        512 bits
        Priority:   normal
        Cipher:     aes-xts-plain64
        Cipher key: 512 bits
        PBKDF:      argon2id
        Time cost:  4
        Memory:     1048576
        Threads:    4
        Salt:       xxxx
        AF stripes: 4000
        AF hash:    sha256
        Area offset:32768 [bytes]
        Area length:258048 [bytes]
        Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
        Hash:       sha256
        Iterations: 67632
        Salt:       xxxx
        Digest:     xxxxx
y@arnold console % 

I then add it to my existing pool with zpool replace tank z4.2 z14.3. Eventually (before the replace/resilver completes), zio reports errors and the vdev is considered failed:

y@arnold ~ % zpool status tank | grep -E '(z14.3|NAME)'
        NAME             STATE     READ WRITE CKSUM
              z14.3      FAULTED      0    20     0  too many errors
y@arnold ~ % journalctl -f | grep zio
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3772528418816 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3706396516352 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3790784475136 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3980643770368 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3338247639040 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3602675863552 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3504724443136 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3495903948800 size=4096 flags=1808aa
Feb 25 14:52:19 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3628836184064 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447128170496 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447125979136 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126523904 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126519808 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447126515712 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447130460160 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447139495936 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=2447147290624 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011713654784 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011713769472 size=4096 flags=1808aa
Feb 25 14:56:17 arnold kernel: zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3011714031616 size=4096 flags=1808aa

Next, after detaching the vdev from the pool and running cryptsetup close, I use cryptsetup luksFormat --type luks1 instead (to force a LUKS1 header rather than LUKS2).

This results in a vdev with 512 byte sectors:

y@arnold ~ % sudo cryptsetup luksDump /dev/sdw
LUKS header information for /dev/sdw

Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha256
Payload offset: 4096
MK bits:        512
MK digest:      xxxx
MK salt:        xxxx
MK iterations:  70468
UUID:           e5b69a87-e02c-4b14-be58-6109488e172a

Key Slot 0: ENABLED
        Iterations:             1034098
        Salt:                   xxxx
        Key material offset:    8
        AF stripes:             4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table | grep z14.3
z14.3: 0 27344760832 crypt aes-xts-plain64 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 0 65:96 4096 1 allow_discards
y@arnold ~ % sudo blockdev --report /dev/mapper/z14.3
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  14000517545984   /dev/mapper/z14.3

With this vdev (luks1, 512B sectors), no zio errors are observed and the zpool replace completes successfully.
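For reference, a minimal sketch of the two formatting variants (assuming cryptsetup ≥ 2.0, which added the --sector-size option for LUKS2). The second variant is untested here, but should keep a LUKS2 header while pinning the crypto sector size to 512 bytes:

# variant used above: LUKS1 header (always 512-byte crypto sectors)
sudo cryptsetup luksFormat --type luks1 /dev/sdw
sudo cryptsetup open /dev/sdw z14.3

# alternative (untested): LUKS2 header with an explicit 512-byte sector size
sudo cryptsetup luksFormat --type luks2 --sector-size 512 /dev/sdw
sudo cryptsetup open /dev/sdw z14.3

# confirm what the mapper device exposes
sudo blockdev --report /dev/mapper/z14.3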

codyps commented 1 year ago

https://github.com/openzfs/zfs/issues/13431 describes an issue that might be the same or related (it likely involves newer versions of cryptsetup, which default to luks2). It shows similar zio errors.

https://github.com/openzfs/zfs/issues/13362 also involves luks and has similar zio errors, but reports other symptoms (io hangs, though perhaps those come from different operations generating the io).

rincebrain commented 1 year ago

Can you share what zdb -C thinks of the situation, and what z4.2's settings are?

codyps commented 1 year ago

Can you share what zdb -C thinks of the situation, and what z4.2's settings are?

Here's the zdb -C tank output (tank is the pool where the replace is occurring). Is there any other zdb info needed?

`zdb -C tank`

```
MOS Configuration:
        version: 5000
        name: 'tank'
        state: 0
        txg: 402676914
        pool_guid: 17643633366051522574
        errata: 0
        hostname: 'arnold'
        com.delphix:has_per_vdev_zaps
        vdev_children: 11
        vdev_tree:
            type: 'root'
            id: 0
            guid: 17643633366051522574
            children[0]:
                type: 'mirror'
                id: 0
                guid: 2801673922231546480
                metaslab_array: 34
                metaslab_shift: 34
                ashift: 12
                asize: 8001556316160
                is_log: 0
                create_txg: 4
                com.delphix:vdev_zap_top: 177
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 9621628441388751899
                    path: '/dev/mapper/z8.3'
                    devid: 'dm-uuid-CRYPT-LUKS1-cb092159d4cb402fa46d68ecab77120a-d8.0'
                    whole_disk: 0
                    DTL: 44228
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 454
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16125960469162989774
                    path: '/dev/mapper/z8.1'
                    devid: 'dm-uuid-CRYPT-LUKS1-6d64f9c2c7c144b5be9958fdc2251c37-d8.1'
                    whole_disk: 0
                    DTL: 44226
                    create_txg: 4
                    com.delphix:vdev_zap_leaf: 388
            children[1]:
                type: 'mirror'
                id: 1
                guid: 15581715015551336503
                metaslab_array: 383
                metaslab_shift: 34
                ashift: 12
                asize: 8001556316160
                is_log: 0
                create_txg: 5671340
                com.delphix:vdev_zap_top: 180
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7035198730005854437
                    path: '/dev/mapper/z8.5'
                    devid: 'dm-uuid-CRYPT-LUKS1-b0d4731323ee47cba452ba4a65899afe-d8.2'
                    whole_disk: 0
                    DTL: 44234
                    create_txg: 5671340
                    com.delphix:vdev_zap_leaf: 459
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 516353021259712754
                    path: '/dev/mapper/z8.4'
                    devid: 'dm-uuid-CRYPT-LUKS1-b1981692b6d248658979dc934a772f50-d8.3'
                    whole_disk: 0
                    DTL: 44233
                    create_txg: 5671340
                    com.delphix:vdev_zap_leaf: 178
            children[2]:
                type: 'mirror'
                id: 2
                guid: 4796807142241909275
                metaslab_array: 260516
                metaslab_shift: 35
                ashift: 12
                asize: 4000746569728
                is_log: 0
                create_txg: 342952041
                com.delphix:vdev_zap_top: 260789
                children[0]:
                    type: 'replacing'
                    id: 0
                    guid: 11828178693859806967
                    whole_disk: 0
                    create_txg: 342952041
                    children[0]:
                        type: 'disk'
                        id: 0
                        guid: 11005066825353160123
                        path: '/dev/mapper/z4.2'
                        devid: 'dm-uuid-CRYPT-LUKS1-3b6a0f4bfd9947e6916c4d3419ff8757-d4.0'
                        whole_disk: 0
                        DTL: 44232
                        create_txg: 342952041
                        com.delphix:vdev_zap_leaf: 260790
                    children[1]:
                        type: 'disk'
                        id: 1
                        guid: 5900828425635517331
                        path: '/dev/mapper/z14.3'
                        whole_disk: 0
                        DTL: 6443865
                        create_txg: 342952041
                        com.delphix:vdev_zap_leaf: 6444336
                        com.datto:resilver_defer
                        resilver_txg: 402676914
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 4100891465398702400
                    path: '/dev/mapper/z14.0'
                    whole_disk: 0
                    DTL: 6100055
                    create_txg: 342952041
                    com.delphix:vdev_zap_leaf: 6100053
            children[3]:
                type: 'mirror'
                id: 3
                guid: 5898261406907484203
                metaslab_array: 260385
                metaslab_shift: 34
                ashift: 12
                asize: 3000552128512
                is_log: 0
                create_txg: 342952048
                com.delphix:vdev_zap_top: 260792
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 17322213120823058356
                    path: '/dev/mapper/z12.8'
                    whole_disk: 0
                    DTL: 5124768
                    create_txg: 342952048
                    com.delphix:vdev_zap_leaf: 5124424
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 7505559382419031062
                    path: '/dev/mapper/z8.2'
                    devid: 'dm-uuid-CRYPT-LUKS1-d353aef34afa4f8a9f604c233dd3feae-z8.2'
                    whole_disk: 0
                    DTL: 7174672
                    create_txg: 342952048
                    com.delphix:vdev_zap_leaf: 7174796
            children[4]:
                type: 'mirror'
                id: 4
                guid: 16229305079822685651
                metaslab_array: 85023
                metaslab_shift: 36
                ashift: 12
                asize: 10000824532992
                is_log: 0
                create_txg: 356648774
                com.delphix:vdev_zap_top: 85300
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 16663140545990328976
                    path: '/dev/mapper/z10.2'
                    devid: 'dm-uuid-CRYPT-LUKS1-f5cc909895b94592870402de317ff4a8-d10.0'
                    whole_disk: 0
                    DTL: 94579
                    create_txg: 356648774
                    com.delphix:vdev_zap_leaf: 85304
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16893451224641453851
                    path: '/dev/mapper/z10.3'
                    devid: 'dm-uuid-CRYPT-LUKS1-999355cf19c443beaff84ba5b30f82c7-d10.1'
                    whole_disk: 0
                    DTL: 94578
                    create_txg: 356648774
                    com.delphix:vdev_zap_leaf: 85308
            children[5]:
                type: 'mirror'
                id: 5
                guid: 4319439574465811368
                metaslab_array: 117706
                metaslab_shift: 34
                ashift: 12
                asize: 10000824532992
                is_log: 0
                create_txg: 358806971
                com.delphix:vdev_zap_top: 117703
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 4473711824735588132
                    path: '/dev/mapper/z10.4'
                    devid: 'dm-uuid-CRYPT-LUKS1-a2c194b0e0cb45da88e2b363ade7f274-d10.2'
                    whole_disk: 0
                    DTL: 4017
                    create_txg: 358806971
                    com.delphix:vdev_zap_leaf: 117704
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 11559068356103030222
                    path: '/dev/mapper/z10.6'
                    devid: 'dm-uuid-CRYPT-LUKS1-daba9c630d4a41a79dd416eb50357bea-d10.3'
                    whole_disk: 0
                    DTL: 4016
                    create_txg: 358806971
                    com.delphix:vdev_zap_leaf: 117705
            children[6]:
                type: 'mirror'
                id: 6
                guid: 2479621637901868764
                metaslab_array: 331711
                metaslab_shift: 34
                ashift: 12
                asize: 10000824532992
                is_log: 0
                create_txg: 359781354
                com.delphix:vdev_zap_top: 332160
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 6722608118199084570
                    path: '/dev/mapper/z10.5'
                    devid: 'dm-uuid-CRYPT-LUKS1-531546f7f1104507bf8756c7323a4be8-d10.4'
                    whole_disk: 0
                    DTL: 654007
                    create_txg: 359781354
                    com.delphix:vdev_zap_leaf: 331593
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 12670541344625969211
                    path: '/dev/mapper/z10.1'
                    devid: 'dm-uuid-CRYPT-LUKS1-0ba5ea88895d416cae69a011852c1afd-d10.5'
                    whole_disk: 0
                    DTL: 654004
                    create_txg: 359781354
                    com.delphix:vdev_zap_leaf: 331594
            children[7]:
                type: 'mirror'
                id: 7
                guid: 2296358563542467893
                metaslab_array: 3388903
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 363157204
                com.delphix:vdev_zap_top: 3389322
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 6283371758920790449
                    path: '/dev/mapper/z12.0'
                    devid: 'dm-uuid-CRYPT-LUKS2-bb711a67822b4a508e522664de603c12-z12.0'
                    whole_disk: 0
                    DTL: 3411984
                    create_txg: 363157204
                    com.delphix:vdev_zap_leaf: 3389323
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 4263933087880235999
                    path: '/dev/mapper/z12.2'
                    devid: 'dm-uuid-CRYPT-LUKS2-9aadb9076c4a469bb795f15ae210a421-z12.2'
                    whole_disk: 0
                    DTL: 3411983
                    create_txg: 363157204
                    com.delphix:vdev_zap_leaf: 3389324
            children[8]:
                type: 'mirror'
                id: 8
                guid: 5328519653678850623
                metaslab_array: 3388157
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 363157208
                com.delphix:vdev_zap_top: 3388279
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 7274434324538727781
                    path: '/dev/mapper/z12.1'
                    devid: 'dm-uuid-CRYPT-LUKS2-5f80c3664ef84441b924042630b8b5e2-z12.1'
                    whole_disk: 0
                    DTL: 3411986
                    create_txg: 363157208
                    com.delphix:vdev_zap_leaf: 3388280
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 13649946414862804989
                    path: '/dev/mapper/z12.3'
                    devid: 'dm-uuid-CRYPT-LUKS2-43faab20d5ee4a0d9bc806515ccd67ad-z12.3'
                    whole_disk: 0
                    DTL: 3411985
                    create_txg: 363157208
                    com.delphix:vdev_zap_leaf: 3388281
            children[9]:
                type: 'mirror'
                id: 9
                guid: 11144534952233533298
                metaslab_array: 7474010
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 371518314
                com.delphix:vdev_zap_top: 7474463
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 13495647563734244283
                    path: '/dev/mapper/z12.5'
                    devid: 'dm-uuid-CRYPT-LUKS2-7328ac33263b4b35addda5a1934f1cea-z12.5'
                    whole_disk: 0
                    DTL: 7412238
                    create_txg: 371518314
                    com.delphix:vdev_zap_leaf: 7474464
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 8601802118682992312
                    path: '/dev/mapper/z12.6'
                    devid: 'dm-uuid-CRYPT-LUKS2-329d228eec274842b0aba3e029044aba-z12.6'
                    whole_disk: 0
                    DTL: 7412237
                    create_txg: 371518314
                    com.delphix:vdev_zap_leaf: 7474342
            children[10]:
                type: 'mirror'
                id: 10
                guid: 5574456015292304252
                metaslab_array: 7613777
                metaslab_shift: 34
                ashift: 12
                asize: 12000117129216
                is_log: 0
                create_txg: 371857357
                com.delphix:vdev_zap_top: 7613774
                children[0]:
                    type: 'disk'
                    id: 0
                    guid: 11633448460096827575
                    path: '/dev/mapper/z12.4'
                    whole_disk: 0
                    DTL: 7414491
                    create_txg: 371857357
                    com.delphix:vdev_zap_leaf: 7613775
                children[1]:
                    type: 'disk'
                    id: 1
                    guid: 16104453434150649249
                    path: '/dev/mapper/z12.7'
                    whole_disk: 0
                    DTL: 7414490
                    create_txg: 371857357
                    com.delphix:vdev_zap_leaf: 7613776
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
space map refcount mismatch: expected 5484 != actual 5436
```

In the output above, one can see the replacement of z4.2 with z14.3. I previously replaced another disk (now z14.0) in the same way, which is how I know that using the --type luks1 option resolves the IO errors.

Here are the details for z4.2 (sdy is the backing device for z4.2)

y@arnold ~ % sudo blockdev --report /dev/mapper/z4.2
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512   512               0   4000751378944   /dev/mapper/z4.2
y@arnold ~ % sudo blockdev --report /dev/sdy        
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512   512               0   4000753476096   /dev/sdy
y@arnold ~ % sudo cryptsetup luksDump /dev/sdy
LUKS header information for /dev/sdy

Version:        1
Cipher name:    aes
Cipher mode:    xts-plain64
Hash spec:      sha512
Payload offset: 4096
MK bits:        256
MK digest:      xxxx
MK salt:        xxxx
MK iterations:  24750
UUID:           3b6a0f4b-fd99-47e6-916c-4d3419ff8757

Key Slot 0: ENABLED
    Iterations:             99533
    Salt:                   xxxx 
    Key material offset:    8
    AF stripes:             4000
Key Slot 1: DISABLED
Key Slot 2: DISABLED
Key Slot 3: DISABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
y@arnold ~ % sudo dmsetup table  | grep z4.2 
z4.2: 0 7813967537 crypt aes-xts-plain64 0000000000000000000000000000000000000000000000000000000000000000 0 65:128 4096 1 allow_discards

codyps commented 1 year ago

From the zdb -C output in the previous comment, one can see that some luks2 devices are in the pool (and have not been observed to emit io errors).

Taking z12.0 as an example device, here's some extra info about it:

y@arnold ~ % sudo blockdev --report /dev/mapper/z12.0
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  12000121847808   /dev/mapper/z12.0
y@arnold ~ % sudo blockdev --report /dev/sdr
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0  12000138625024   /dev/sdr
y@arnold ~ % sudo cryptsetup luksDump /dev/sdr
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]
UUID:           bb711a67-822b-4a50-8e52-2664de603c12
Label:          (no label)
Subsystem:      (no subsystem)
Flags:          (no flags)

Data segments:
  0: crypt
    offset: 16777216 [bytes]
    length: (whole device)
    cipher: aes-xts-plain64
    sector: 512 [bytes]

Keyslots:
  0: luks2
    Key:        512 bits
    Priority:   normal
    Cipher:     aes-xts-plain64
    Cipher key: 512 bits
    PBKDF:      argon2i
    Time cost:  6
    Memory:     1048576
    Threads:    4
    Salt:      xxxx 
    AF stripes: 4000
    AF hash:    sha256
    Area offset:32768 [bytes]
    Area length:258048 [bytes]
    Digest ID:  0
Tokens:
Digests:
  0: pbkdf2
    Hash:       sha256
    Iterations: 141241
    Salt:       xxxx
    Digest:     xxxx
y@arnold ~ % sudo dmsetup table  | grep z12.0
z12.0: 0 23437737984 crypt aes-xts-plain64 :64:logon:cryptsetup:bb711a67-822b-4a50-8e52-2664de603c12-d0 0 65:16 32768 1 allow_discards

Note that while z12.0/sdr (the device that works fine) is using luks2, it has a sector size of 512B (not 4096B like the devices that emit errors).
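A quick way to compare all of the pool's mapper devices at once (a sketch, assuming the z* naming used above; blockdev --getss reports the logical sector size, and dmsetup only prints a sector_size: field when it differs from 512):

for d in /dev/mapper/z*; do
    printf '%s: %sB logical sectors\n' "$d" "$(sudo blockdev --getss "$d")"
    sudo dmsetup table "$(basename "$d")" | grep -o 'sector_size:[0-9]*'
done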

rincebrain commented 1 year ago

I did notice that, yes.

While trying to reproduce this, I discovered I can convince ZFS without much work to try replacing a 512n device with a 4kn device on an ashift 9 vdev, which, uh, Does Not End Well At All.

But that doesn't seem to be what happened to you, here.

I'm now wondering if somehow it stashed something on the vdev that isn't 4k aligned, and because it was a 512n device that went fine, but now trying to mirror it 1:1 is going bonkers.

...o-oh. I had a bad idea, actually. I wonder if the partition isn't 4k aligned, and in trying to replicate the partition table it's resulting in non-4k aligned accesses...let me go read those IO errors you pasted again.

e: well, LUKS, so not exactly a partition, but like, the leading offset...anyway.

edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?

codyps commented 1 year ago

edit 2: are you seeing any errors not from ZFS in your syslog from the disk itself?

Watching dmesg -w | grep -v audit: (to remove all the noisy audit messages on my system) shows only the zio errors occurring.

Here's another set (I detached z14.3 and re-ran luksFormat with luks2 to get this output):

y@arnold ~ % sudo dmesg -w | grep -v 'audit:'
[12188887.373045] sctp: Hash tables configured (bind 1024/1024)
[12188914.429696] kauditd_printk_skb: 10 callbacks suppressed
[12188919.981251] kauditd_printk_skb: 35 callbacks suppressed
[12191505.790358] kauditd_printk_skb: 126 callbacks suppressed
[12195028.411209] kauditd_printk_skb: 22 callbacks suppressed
[12198268.934797] kauditd_printk_skb: 27 callbacks suppressed
[12198402.341190] kauditd_printk_skb: 27 callbacks suppressed
[12198494.544891] kauditd_printk_skb: 27 callbacks suppressed
[12212089.961177] kauditd_printk_skb: 30 callbacks suppressed
[12212632.706882] kauditd_printk_skb: 6 callbacks suppressed
[12213959.402093] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1782776037376 size=20480 flags=1808aa
[12213965.097787] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1784842436608 size=4096 flags=1808aa
[12213968.581657] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1778357694464 size=12288 flags=1808aa
[12213979.887437] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=1779879501824 size=4096 flags=1808aa
[12214226.435283] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3849360289792 size=20480 flags=1808aa
[12214227.015763] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3850519625728 size=32768 flags=1808aa
[12214756.511994] zio pool=tank vdev=/dev/mapper/z14.3 error=5 type=2 offset=3808536850432 size=4096 flags=1808aa

So: no errors from the device, the sd driver, etc. about unaligned writes (or any other error of any kind). And the offsets listed in the zio messages are aligned to 4096, and all the sizes logged are also multiples of 4096.
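A quick way to verify that across the whole log rather than by eye (a sketch; zio.log is a placeholder for a file holding the captured messages):

# print any zio line whose offset or size is NOT a multiple of 4096
grep -o 'offset=[0-9]* size=[0-9]*' zio.log |
    awk -F'[= ]' '{ if ($2 % 4096 || $4 % 4096) print "misaligned:", $0 }'
# no output means every logged offset and size is 4K-aligned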

As far as offsets in the luks data, the luksDump output above includes the offset used for the data segment as 16777216, which is divisible by 4096.

The dmsetup table output (again, included in previous comments) shows a start of 32768, which is correct if the units for it are 512B sectors (which it might be; the documentation here is unclear, and I haven't been able to work out what the Linux kernel intends).

All of the luks formatting is done on the entire disk (in other words, running cryptsetup luksFormat /dev/sdv), without any partitions. So the offset luks provides is the only offset into the physical disk before it is handed off to zfs.
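As a sanity check on those two numbers (as far as I can tell, device-mapper table offset fields are in 512-byte units regardless of the crypt sector size):

echo $(( 32768 * 512 ))      # 16777216 - matches the luksDump data segment offset
echo $(( 16777216 % 4096 ))  # 0 - so the payload start is 4K-aligned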

rincebrain commented 1 year ago

An interesting data point would be to see if the first 100 or so errors you get when doing the replace with a LUKS2 header are the same every time, as that might tell us more about whether it's deterministic or something very strange...
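Something like the following would capture that (a sketch; attempt1.txt/attempt2.txt are placeholder names, and it assumes the errors land in the kernel ring buffer as above):

# after a replace attempt, save the first 100 error offsets
sudo dmesg | grep 'zio pool=tank vdev=/dev/mapper/z14.3' |
    grep -o 'offset=[0-9]* size=[0-9]*' | head -n 100 > attempt1.txt

# repeat into attempt2.txt after the next attempt, then compare
diff attempt1.txt attempt2.txt && echo "identical error pattern"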

calypso2k commented 11 months ago

@jmesmon thank you for this thread, I was really confused about whether my problem was with XHCI, the disk, ZFS, LUKS, or all of them together.

It seems, though, that the LUKS device sector size doesn't matter: I use disks with a 4K physical/logical sector size, so my LUKS device has a 4K sector size even with --type luks1:

blockdev --report /dev/sdb
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw 1048544  4096 16384               0  16000900661248   /dev/sdb

blockdev --report /dev/mapper/crypt_disk-a 
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw 1048544  4096  4096               0  16000632229888   /dev/mapper/crypt_disk-a

I use -o ashift=12, but the pattern is simple: LUKS1 header = no problem; LUKS2 header = constant stream of zio errors like:

[ 5816.334320] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591015936 size=32768 flags=40080c80
[ 5816.334337] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=8192 flags=40080c80
[ 5816.334353] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=8192 flags=40080c80
[ 5816.335251] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=1506734080 size=4096 flags=188881
[ 5816.335556] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591028224 size=4096 flags=188881
[ 5816.335731] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=183667879936 size=4096 flags=188881
[ 5816.342210] zio pool=vault vdev=/dev/mapper/crypt_disk-a error=5 type=2 offset=88591106048 size=49152 flags=40080c80

The problem, when I use a LUKS2 header, occurs when writing to the device continuously at maximum speed (i.e. copying a large set of data). Funnily enough, it is less likely to occur when writing at lower speed, but eventually it happens there as well.
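For reference, a rough sketch of the kind of sustained write load described here (path, file names, and sizes are arbitrary placeholders):

# single sustained writer at full speed
dd if=/dev/urandom of=/vault/stress.bin bs=1M count=50000 status=progress conv=fsync

# or a few parallel writers to keep the queue saturated
for i in 1 2 3 4; do
    dd if=/dev/urandom of=/vault/stress.$i bs=1M count=10000 &
done
wait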

I really have no idea if this is a problem with ZFS or LUKS. Any ideas?

thulle commented 10 months ago

The problem, when I use a LUKS2 header, occurs when writing to the device continuously at maximum speed (i.e. copying a large set of data). Funnily enough, it is less likely to occur when writing at lower speed, but eventually it happens there as well.

How full is your pool?

I hit this issue a good while back and thought it might be my SSD (WD SN850) randomly disconnecting, since I could find some reports about that on the WD forums. It did seem to be related to how full the pool/drive was, though; with enough free space I couldn't trigger it. Switching to another (and larger) drive solved the issue, until now that I've filled that one too.

The best way for me to trigger it also hasn't been writing at full (sequential) speed, but decompressing and compiling Chromium, which is ~906,000 files over 19 GB.

Currently, on a 3200 GiB partition, the breakpoint where it starts occurring seems to be somewhere around 90% allocated, and it came to mind that maybe this is where ZFS changes allocation method? Slightly over 200 GB is reserved for zvols though, so the free space for filesystems is ~100 GB. Unclear whether this affects it.
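For correlating that, a quick sketch of watching allocation and fragmentation while reproducing (CAP and FRAG are standard zpool list columns):

# pool-wide capacity and fragmentation
zpool list -o name,size,alloc,free,cap,frag,health
# per-vdev breakdown
zpool list -v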

# blockdev --report /dev/mapper/root 
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256  4096  4096               0   3435957059584   /dev/mapper/root

# blockdev --report /dev/nvme0n1p3
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256  4096  4096       270534656   3435973836800   /dev/nvme0n1p3

The drive itself is formatted to 4K sectors, so misalignment shouldn't be possible (?)

# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 3,73 TiB, 4096805658624 bytes, 1000196694 sectors
Disk model: KINGSTON SKC3000D4096G                  
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 081BA126-DA19-42BD-AEAB-CF1A6583CA0B

Device            Start       End   Sectors  Size Type
/dev/nvme0n1p1      256    262399    262144    1G EFI System
/dev/nvme0n1p2   262400  33816831  33554432  128G Linux swap
/dev/nvme0n1p3 33816832 872677631 838860800  3,1T Linux filesystem

# cryptsetup luksDump /dev/nvme0n1p3
LUKS header information
Version:        2
Epoch:          3
Metadata area:  16384 [bytes]
Keyslots area:  16744448 [bytes]

calypso2k commented 10 months ago

How full is your pool?

Almost empty when triggering the bug. It's a new system (brand new disks). I admit it's a bit unusual - Gentoo arm64 with an Asahi kernel on a Mac Mini M1 ;) But otherwise it is rock stable, zfs-2.1.13.

Also I don't see any problems with drives in dmesg.

Currently, with a LUKS1 header, I'm seeing 4 TB of usage: a couple of zvols and 10,697,081 files in the filesystem - no zio errors whatsoever.

thulle commented 9 months ago

After hitting the recent bug #15533 I saw the same zio error=5 type=2 as in this bug, so I searched for this issue to check the similarities. While searching I happened upon this pull request:

#15414

Interestingly, the triggers identified there are:

  • a write-heavy load, such that aggregation past 512K happened quite frequently
  • a pool approaching 90% fragmentation, such that gang blocks were being produced (this is significant only insofar as gang blocks are backed by small memory allocations, which exacerbate the problem)

which seem to match the triggers identified in this issue too. The PR has been closed in favour of an upcoming new take, though.

@robn I can't find any new PR, but I'd be interested in testing if what you got fixes this issue too.

robn commented 9 months ago

I should be posting a significant rework of vdev_disk in a few days (it was written for a client, and is just finishing testing). That at least will fix the problem described in #15414, and I suspect this one too. But I don't really recommend waiting when a revert will sort it out for now; it's not a given that my patch will be right, or be accepted.

It's not totally clear to me that this is a result of misaligned aggregation, but you might try drastically lowering zfs_vdev_aggregation_limit (make it, say, 131072; if that "fixes" it, try raising it; if not, try lowering it further). This will reduce throughput considerably, but things might still work.

(more in https://github.com/openzfs/zfs/issues/15533#issuecomment-1825326626).
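For anyone wanting to try that, a sketch of changing the tunable on a live system (the sysfs path is the standard location for ZFS module parameters; the modprobe.d file name is just an example):

# check the current value, then lower it at runtime
cat /sys/module/zfs/parameters/zfs_vdev_aggregation_limit
echo 131072 | sudo tee /sys/module/zfs/parameters/zfs_vdev_aggregation_limit

# make it persistent across reboots
echo 'options zfs zfs_vdev_aggregation_limit=131072' | sudo tee /etc/modprobe.d/zfs-aggregation.conf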

codyps commented 9 months ago

PR with possible fix from robn: https://github.com/openzfs/zfs/pull/15588 (linking for my reference)

robn commented 4 months ago

FYI, 2.2.4 just shipped, with #15588 and follow-up patches included. If you are still having this problem, you might try setting zfs_vdev_disk_classic=0 in your zfs module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
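A sketch of one way to set that (the parameter is read when the module loads, so a module reload or reboot is needed; the modprobe.d file name is just an example):

# select the new BIO submission code instead of the classic path
echo 'options zfs zfs_vdev_disk_classic=0' | sudo tee /etc/modprobe.d/zfs-vdev-disk.conf

# if the root pool is imported from the initramfs, regenerate it first
# (e.g. mkinitcpio -P on Arch, update-initramfs -u on Debian/Ubuntu), then reboot

# afterwards, confirm the value in effect
cat /sys/module/zfs/parameters/zfs_vdev_disk_classic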

iakat commented 2 months ago

On NixOS with kernel 6.8.10-asahi, zfs 2.2.4, MacBook Air M2: zfs_vdev_disk_classic=0 and zfs_vdev_disk_classic=1 both result in several hundred zio error=5 type=2 errors with a LUKS2 header while trying to install.

LUKS1 results in no errors. FYI @robn

robn commented 2 weeks ago

Please see here for a debugging patch that I hope will reveal more info about what's going on: https://github.com/openzfs/zfs/issues/15646#issuecomment-2283206150

(If possible, I would prefer to keep the discussion going in #15646, so it's all in one place.)