openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

CKSUM and WRITE errors with 2.2.1 stable, when vdevs are atop LUKS #15533

Closed: Rudd-O closed this issue 5 months ago

Rudd-O commented 10 months ago

I build and regularly test ZFS from the master branch. A few days ago I built and tested the commit specified in the headline of this issue, deploying it to three machines.

On two of them (the ones that had mirrored pools), a data corruption issue arose where many WRITE errors (hundreds) would accumulate when deleting snapshots, but no CKSUM errors took place, nor was there evidence that hardware was the issue. I tried a scrub, and that just made the problem worse.

Initially I assumed I had gotten extremely unlucky and hardware was dying, because two drives of one mirror leg were experiencing the issue but none of the drives of the other leg were -- so I decided it was best to be safe and attach a third mirror drive to the first leg (that was $200, oof). Since I had no more drive bays, I popped the new drive into a USB port (USB 2.0!) and attached it to the first leg.

During the resilvering process, the third drive also began experiencing WRITE errors, and the first CKSUM errors appeared.

    NAME                                                                                                STATE     READ WRITE CKSUM
    chest                                                                                               DEGRADED     0     0     0
      mirror-0                                                                                          DEGRADED     0   308     0
        dm-uuid-CRYPT-LUKS2-ad4ac4d72da84b6a866caeff621301f4-luks-ad4ac4d7-2da8-4b6a-866c-aeff621301f4  DEGRADED     0   363     2  too many errors
        dm-uuid-CRYPT-LUKS2-f0670720ae6440dab1618965f9e01718-luks-f0670720-ae64-40da-b161-8965f9e01718  DEGRADED     0   369     0  too many errors
        dm-uuid-CRYPT-LUKS2-01776eeb5259431f971aa6a12a9bd1fb-luks-01776eeb-5259-431f-971a-a6a12a9bd1fb  DEGRADED     0   423     0  too many errors
      mirror-3                                                                                          ONLINE       0     0     0
        dm-uuid-CRYPT-LUKS2-602229e893a34cc7aa889f19deedbeb1-luks-602229e8-93a3-4cc7-aa88-9f19deedbeb1  ONLINE       0     0     0
        dm-uuid-CRYPT-LUKS2-12c9127aa687463ab335b2e49adbacc7-luks-12c9127a-a687-463a-b335-b2e49adbacc7  ONLINE       0     0     0
    logs    
      dm-uuid-CRYPT-LUKS2-9210c7657b8a460ba031aab973ca37c5-luks-9210c765-7b8a-460b-a031-aab973ca37c5    ONLINE       0     0     0
    cache
      dm-uuid-CRYPT-LUKS2-06a5560306c24ad58d3feb3a03cb0a20-luks-06a55603-06c2-4ad5-8d3f-eb3a03cb0a20    ONLINE       0     0     0

I tried different kernels (6.4, 6.5 from Fedora) to no avail. The error was present either way. zpool clear was followed by a few errors whenever disks were written to, and hundreds of errors whenever snapshots were deleted (I have zfs-auto-snapshot running in the background).

Then, my backup machine began experiencing the same WRITE errors. I can't have this backup die on me, especially now that I have actual data corruption on the big data file server.

At this point I concluded there must be some serious issue with the code, and decided to downgrade all machines to a known-good build. After downgrading the most severely affected machine (whose logs are above) to my build of e47e9bbe86f2e8fe5da0fc7c3a9014e1f8c132a9, everything appears nominal and the resilvering is progressing without issues. Deleting snapshots also is no longer causing issues.

Nonetheless, I have forever lost what appears to be "who knows what" metadata, and of course lost four days trying to resilver unsuccessfully:

Every 2.0s: zpool status -v chest                                penny.dragonfear: Thu Nov 16 15:01:22 2023

  pool: chest
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Nov 16 14:51:20 2023
        486G / 6.70T scanned at 826M/s, 13.4G / 6.66T issued at 22.8M/s
        13.4G resilvered, 0.20% done, 3 days 12:43:30 to go
config:

        NAME                                                                                                STATE     READ WRITE CKSUM
        chest                                                                                               ONLINE   0     0     0
          mirror-0                                                                                          ONLINE   0     0     0
            dm-uuid-CRYPT-LUKS2-ad4ac4d72da84b6a866caeff621301f4-luks-ad4ac4d7-2da8-4b6a-866c-aeff621301f4  ONLINE   0     0     0  (resilvering)
            dm-uuid-CRYPT-LUKS2-f0670720ae6440dab1618965f9e01718-luks-f0670720-ae64-40da-b161-8965f9e01718  ONLINE   0     0     0  (resilvering)
            dm-uuid-CRYPT-LUKS2-01776eeb5259431f971aa6a12a9bd1fb-luks-01776eeb-5259-431f-971a-a6a12a9bd1fb  ONLINE   0     0     0  (resilvering)
          mirror-3                                                                                          ONLINE   0     0     0
            dm-uuid-CRYPT-LUKS2-602229e893a34cc7aa889f19deedbeb1-luks-602229e8-93a3-4cc7-aa88-9f19deedbeb1  ONLINE   0     0     0
            dm-uuid-CRYPT-LUKS2-12c9127aa687463ab335b2e49adbacc7-luks-12c9127a-a687-463a-b335-b2e49adbacc7  ONLINE   0     0     0
        logs
          dm-uuid-CRYPT-LUKS2-9210c7657b8a460ba031aab973ca37c5-luks-9210c765-7b8a-460b-a031-aab973ca37c5    ONLINE   0     0     0
        cache
          dm-uuid-CRYPT-LUKS2-06a5560306c24ad58d3feb3a03cb0a20-luks-06a55603-06c2-4ad5-8d3f-eb3a03cb0a20    ONLINE   0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x16>
        <metadata>:<0x11d>
        <metadata>:<0x34>
        <metadata>:<0x2838>
        <metadata>:<0x3c>
        <metadata>:<0x44>
        <metadata>:<0x656>
        <metadata>:<0x862>
        <metadata>:<0x594>
        <metadata>:<0x3cf>
        <metadata>:<0x2df>
        <metadata>:<0x1f5>

In conclusion, something added between e47e9bbe86f2e8fe5da0fc7c3a9014e1f8c132a9..786641dcf9a7e35f26a1b4778fc710c7ec0321bf is causing this issue.

sempervictus commented 9 months ago

Looks like we're hitting this too, across the board, as all of our ZFS pools are on dm-crypt volumes. Unfortunately we caught this in send/recv with catastrophic results for the destination pools. Rolled everything back to 2.1.14, but I think it merits having heavy load testing atop various common VDEV types in the CI stack, since some of these bugs are not reproducible with fast tests: we see this hit at ~250G of writes. Our recv targets are all flash (usually Intel) with the CloudFlare dm-crypt sync-IO mechanisms enabled to avoid aggregation by the crypto block layer - cryptsetup --perf-no_read_workqueue --perf-no_write_workqueue --allow-discards --persistent ...
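For reference, a minimal sketch of applying those flags to an already-open LUKS mapping (the mapping name is hypothetical, and a cryptsetup version that supports the perf flags is assumed):

    # Re-apply the sync-IO options to an existing mapping and store them
    # persistently in the LUKS2 header so they survive reboots.
    cryptsetup refresh \
        --perf-no_read_workqueue \
        --perf-no_write_workqueue \
        --allow-discards \
        --persistent \
        luks-example    # hypothetical /dev/mapper name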

Tsuroerusu commented 9 months ago

> Looks like we're hitting this too, across the board, as all of our ZFS pools are on dm-crypt volumes. Unfortunately we caught this in send/recv with catastrophic results for the destination pools. Rolled everything back to 2.1.14, but I think it merits having heavy load testing atop various common VDEV types in the CI stack, since some of these bugs are not reproducible with fast tests: we see this hit at ~250G of writes. Our recv targets are all flash (usually Intel) with the CloudFlare dm-crypt sync-IO mechanisms enabled to avoid aggregation by the crypto block layer - cryptsetup --perf-no_read_workqueue --perf-no_write_workqueue --allow-discards --persistent ...

As was mentioned in your bug report (wherein you mentioned that you are on 2.2.1), you should definitely update to 2.2.2, which reverts a change introduced in 2.2.1 that triggered an underlying problem. About 24 hours after upgrading to 2.2.1, I ran "zpool status" and noticed 24,000 write errors and my pool in a degraded state with one faulted device. I then shut down my system and lived on my laptop for a week until 2.2.2 came out, because my pool would have failed entirely had I not done so. After installing 2.2.2, I did a "zpool clear" and then ran a scrub on my pool. Fortunately, it seems that I have not lost any data, and my machine now works normally.
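For anyone recovering the same way, a minimal sketch of that sequence (the pool name here is hypothetical):

    # After upgrading to 2.2.2: clear the error counters, then verify the pool
    zpool clear tank        # hypothetical pool name
    zpool scrub tank
    zpool status -v tank    # confirm no new READ/WRITE/CKSUM errors appear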

Rudd-O commented 9 months ago

Boy am I glad I tested this directly from master — but I am sad that the bug snuck through to release.

Tsuroerusu commented 9 months ago

> Boy am I glad I tested this directly from master — but I am sad that the bug snuck through to release.

I hear you on that, especially since 2.2.1 was a recommended update because of the issues related to block cloning, so a bunch of people, like me, upgraded immediately.

Just a thought for the record: might it not be a good idea to make running ZFS on top of LUKS something that is automatically tested as part of the ZFS test suite, if possible? If that had been the case, then this issue could have been caught, given that it is trivial to reproduce. Considering the issues around native encryption (especially the send/receive issues and keys being erased), a good number of people are probably more comfortable with LUKS for the encryption part.

Rudd-O commented 9 months ago

Years ago, I offered to wire zfs-fedora-installer to test the whole process down to LUKS decryption, boot, and poweroff, but I never got around to doing so. My fault.

Not that this would necessarily help, since (sadly) I actually ran this test, and it revealed no issues. It seems like you need a very specific config (4K sector size in LUKS, plus mirroring), and zfs-fedora-installer doesn't do that.

LUKS is nice because no pool metadata is leaked with it, unlike ZFS encryption.

RinCat commented 9 months ago

> (4K sector size in LUKS, plus mirroring)

You don't need mirroring for this; I had a single-disk system that still had this issue. But a 4K sector size is needed.

However, LUKS also works on a file, so adding a simple test may be able to catch this without a full VM installation.

Tsuroerusu commented 9 months ago

> (4K sector size in LUKS, plus mirroring)

> You don't need mirroring for this; I had a single-disk system that still had this issue. But a 4K sector size is needed.

> However, LUKS also works on a file, so adding a simple test may be able to catch this without a full VM installation.

Given this, it seems like even just doing two simple LUKS tests, with 512-byte and 4K sectors respectively, would have caught this rather disastrous problem, which was literally killing/faulting users' pools. I'm not saying this to point fingers or blame anybody; it just strikes me that until native encryption is sufficiently robust to be the obvious choice (i.e. a situation where "Why even use LUKS?" is a relevant question), LUKS+ZFS is something that ought to be tested for, as it is not hard to see why somebody would want to use it.
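A rough sketch of what such a test might look like, using a file-backed loop device; the file paths, mapping name, pool name, and workload step are all hypothetical placeholders:

    # Create a file-backed LUKS2 device with 4K sectors and build a pool on it
    # (repeat with --sector-size 512 for the second variant)
    truncate -s 2G /var/tmp/luks-test.img
    dd if=/dev/urandom of=/var/tmp/luks-test.key bs=32 count=1
    LOOPDEV=$(losetup -f --show /var/tmp/luks-test.img)
    cryptsetup luksFormat --type luks2 --sector-size 4096 --batch-mode \
        --key-file /var/tmp/luks-test.key "$LOOPDEV"
    cryptsetup open --key-file /var/tmp/luks-test.key "$LOOPDEV" lukstest
    zpool create -o ashift=12 lukspool /dev/mapper/lukstest

    # ... heavy write load, snapshot creation and deletion, then a scrub ...
    zpool scrub lukspool
    zpool status -v lukspool    # a passing test expects zero WRITE/CKSUM errors

    # Cleanup
    zpool destroy lukspool
    cryptsetup close lukstest
    losetup -d "$LOOPDEV"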

AllKind commented 9 months ago

I'm pretty sure the devs would welcome it, if someone would write a test for the test suite.

behlendorf commented 9 months ago

If someone would like to write some tests for the test suite that would be welcome.

Tsuroerusu commented 9 months ago

> If someone would like to write some tests for the test suite that would be welcome.

Good to hear that there's openness to something like this even though LUKS is not, specifically, a ZFS matter. :-) Oh, and thanks for getting 2.2.2 out so quickly, Brian!

Rudd-O commented 9 months ago

Don't think of LUKS as a "not a ZFS matter" thing. Think of LUKS as a particular type of disk drive that makes ZFS exercise code paths which are valid and should be tested.

awused commented 9 months ago

Do we know if this also affects similar setups with GELI or is it specific to LUKS?

behlendorf commented 9 months ago

This issue is specific to how ZFS submits IO to the block layer on Linux. LUKS just happens to do a good job of exposing it. The root cause is understood and the long term fix is being worked on in https://github.com/openzfs/zfs/pull/15588.

delan commented 9 months ago

Still getting this kind of failure on 2.2.2 when receiving snapshots: #15646

gene-hightower commented 8 months ago

I am also getting these spurious errors when resilvering, causing a resilver loop that seems to go on forever. LUKS partitions, zfs-2.2.2.

Rudd-O commented 8 months ago

~No fix yet? Fedora 39 and 38 have shipped kernel 6.9 and the version I know to be stable is not actually compatible with that kernel.~

I am testing master in production hardware. So far two machines, both LUKS (one mirrored), no issues.

Rudd-O commented 8 months ago

The revert of bd7a02c251d8c119937e847d5161b512913667e6 was not included in master. Just tested — master continues to corrupt data under the circumstances described in this issue, but only on mirrored LUKS devices (as before).

Now testing a custom master-derived branch with the revert applied.
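For anyone who wants to try the same thing, a rough sketch of producing such a test build from a local checkout (the commit hash is the one quoted above; the build steps follow the usual in-tree autotools flow and may differ on your distribution):

    # Build master with the suspect commit reverted, on a throwaway branch
    git clone https://github.com/openzfs/zfs.git && cd zfs
    git checkout -b master-plus-revert origin/master
    git revert --no-edit bd7a02c251d8c119937e847d5161b512913667e6
    sh autogen.sh && ./configure && make -s -j"$(nproc)"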

Rudd-O commented 8 months ago

No errors so far with master + revert. I will have to fully run a scrub and a bunch of snapshots before I can decide this is good. If good, I would recommend reverting the commit in question ASAP on master.

Rudd-O commented 8 months ago

OK. master plus revert does not cause any issues anymore.

RichardBelzer commented 8 months ago

So if you're on LUKS, is it best to stay on 2.1.x? In other words, if you are on LUKS, you can't switch to 2.2.x unless:

  1. https://github.com/openzfs/zfs/commit/bd7a02c251d8c119937e847d5161b512913667e6 gets reverted OR
  2. https://github.com/openzfs/zfs/pull/15588 gets pulled into a 2.2.x release

Anyone know if that's correct? Trying to figure out the best guidance to provide to people when asked, given what we know today.

amotin commented 8 months ago

@RichardBelzer https://github.com/openzfs/zfs/commit/bd7a02c251d8c119937e847d5161b512913667e6 is reverted from 2.2.2. Master indeed should get https://github.com/openzfs/zfs/pull/15588 instead.

robn commented 4 months ago

FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting zfs_vdev_disk_classic=0 in your zfs module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
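For reference, a sketch of the two usual ways to work with that module parameter (standard modprobe.d and sysfs paths; changing it generally requires reloading the zfs module):

    # Persistent: takes effect the next time the zfs module is loaded
    echo "options zfs zfs_vdev_disk_classic=0" >> /etc/modprobe.d/zfs.conf

    # Check the value currently in use
    cat /sys/module/zfs/parameters/zfs_vdev_disk_classic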

gene-hightower commented 3 months ago

I still encountered spurious WRITE errors adding a mirror on a LUKS partitioned device both with zfs_vdev_disk_classic set to 0 and to 1.

iakat commented 2 months ago

On kernel 6.8.10-asahi (NixOS), ZFS 2.2.4, MacBook Air M2: zfs_vdev_disk_classic=0 and zfs_vdev_disk_classic=1 both result in several hundred zio error=5 type=2 errors with a LUKS2 header while trying to install.

LUKS1 results in no errors.

ryantrinkle commented 2 months ago

I encountered this on Linux 6.1 with ZFS 2.2.4 after replacing a disk in a mirror. I tried a ton of different ZFS versions and the different kernel module parameters from this issue, and they did not help. Then, I noticed that LUKS on the new disk had defaulted to 4k sectors rather than 512:

[screenshot showing the new LUKS volume's 4096-byte sector size]

After reformatting the LUKS volume with 512 byte sectors, resilvering completed without errors and a scrub looks to be going well also, so this seems to have resolved it for me. This pool does have ashift=12, for what it's worth.
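For reference, a hedged sketch of checking the LUKS sector size and recreating the volume with 512-byte sectors (device, mapping, pool, and vdev names are hypothetical; luksFormat destroys the volume's contents, so the mirror member must be resilvered from scratch afterwards):

    # Inspect the sector size the existing LUKS2 volume was formatted with
    cryptsetup luksDump /dev/sdX2 | grep -i sector

    # Recreate the LUKS volume with 512-byte sectors (DESTROYS its contents)
    cryptsetup luksFormat --type luks2 --sector-size 512 /dev/sdX2
    cryptsetup open /dev/sdX2 luks-newdisk
    zpool replace tank old-vdev /dev/mapper/luks-newdisk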

someplaceguy commented 2 months ago

For what it's worth, I'm using ZFS 2.2.4 and Linux 6.6, and I have more than a dozen ZFS pools with ashift=12 across several machines, the vast majority of which use LUKS2 (with 4K LUKS sectors, despite 512-byte sectors in the underlying physical device), and I've never encountered these errors (I'm only subscribed to this issue because I find them concerning). I run scrubs weekly.

sempervictus commented 2 months ago

> I encountered this on Linux 6.1 with ZFS 2.2.4 after replacing a disk in a mirror. I tried a ton of different ZFS versions and the different kernel module parameters from this issue, and they did not help. Then, I noticed that LUKS on the new disk had defaulted to 4k sectors rather than 512:
>
> [screenshot showing the new LUKS volume's 4096-byte sector size]
>
> After reformatting the LUKS volume with 512 byte sectors, resilvering completed without errors and a scrub looks to be going well also, so this seems to have resolved it for me. This pool does have ashift=12, for what it's worth.

That's an odd case, and any fs on that might not be too happy. Thanks for the report.

ryantrinkle commented 2 months ago

@sempervictus Agreed. I think at some point LUKS must have changed how it decides default sector size, or perhaps there is some difference between my disks that affects that. Both of my disks advertise 512 byte sectors, but they are SSDs from different manufacturers (Samsung 990 Pro and Sabrent Rocket 4.0 Plus).

blind-oracle commented 2 months ago

@ryantrinkle SSDs under the hood use a 4K-16K NAND page size, so 512 is just a catch-all default. Using 4K with them is very much OK, and LUKS probably tries to be smart here (if it's an SSD, set 4K sectors).
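A quick way to see what a drive actually advertises versus the 512-byte default (the device name is hypothetical):

    # Logical and physical sector sizes reported by the kernel for the drive
    lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC /dev/nvme0n1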

robn commented 1 month ago

Please see here for a debugging patch that I hope will reveal more info about what's going on: https://github.com/openzfs/zfs/issues/15646#issuecomment-2283206150

(If possible, I would prefer to keep discussion going in #15646, so it's all in one place.)