Rudd-O closed this issue 5 months ago.
Looks like we're hitting this too, across the board, as all of our ZFS pools are on dm-crypt volumes. Unfortunately we caught this in send/recv, with catastrophic results for the destination pools. We rolled everything back to 2.1.14; but I think it merits heavy load testing atop various common VDEV types in the CI stack, since some of these bugs are not reproducible with fast tests; we saw this hit at ~250G of writes.
Our recv targets are all flash (usually Intel) with the CloudFlare dm-crypt sync-IO mechanisms enabled to avoid aggregation by the crypto block layer: `cryptsetup --perf-no_read_workqueue --perf-no_write_workqueue --allow-discards --persistent ...`
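For completeness, here is roughly how those flags get applied when opening a volume. The device path and mapper name below are placeholders, not our actual setup:

```shell
# Open a LUKS2 volume with dm-crypt read/write workqueues disabled
# (CloudFlare's sync-IO recommendation) and discards allowed.
# --persistent stores the flags in the LUKS2 header so a plain
# "cryptsetup open" keeps them on subsequent activations.
# /dev/nvme0n1p2 and cryptpool are hypothetical names.
cryptsetup open \
  --perf-no_read_workqueue \
  --perf-no_write_workqueue \
  --allow-discards \
  --persistent \
  /dev/nvme0n1p2 cryptpool
```

The pool then gets created on `/dev/mapper/cryptpool` as usual.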
As was mentioned in your bug report (wherein you mentioned that you are on 2.2.1), you should definitely update to 2.2.2, which reverts a change introduced in 2.2.1 that triggered an underlying problem. About 24 hours after upgrading to 2.2.1, I ran `zpool status` and noticed 24,000 write errors and my pool in a degraded state with one faulted device. I then shut down my system and lived on my laptop for a week until 2.2.2 came out, because my pool would have failed entirely had I not done so. After installing 2.2.2, I ran `zpool clear` and then a scrub on my pool. Fortunately, it seems that I have not lost any data, and now my machine works normally.
Boy am I glad I tested this directly from `master`, but I am sad that the bug snuck through to release.
I hear you on that, especially since 2.2.1 was a recommended update because of the issues related to block cloning, so a bunch of people, like me, upgraded immediately.
Just a thought for the record: might it not be a good idea to make running ZFS on top of LUKS something that is automatically tested as part of the ZFS test suite, if possible? If that had been the case, this issue could have been caught, given that it is trivial to reproduce. Considering the issues around native encryption (especially the send/receive issues and keys being erased), a good number of people are probably more comfortable with LUKS for the encryption part.
Years ago, I offered to wire zfs-fedora-installer to test the whole process, down to LUKS decryption, boot, and poweroff, but I never got around to doing so. My fault.
Not that this would necessarily help, since (sadly) I actually ran this test, and it revealed no issues. It seems like you need a very specific config (4K sector size in LUKS, plus mirroring), and zfs-fedora-installer doesn't do that.
LUKS is nice because no pool metadata is leaked with it, unlike ZFS encryption.
(4K sector size in LUKS, plus mirroring)
You don't need mirroring for this; I had a single-disk system that still had this issue. But the 4K sector size is needed.
However, LUKS also works on a file, so adding a simple test may be able to catch this without a full VM installation.
Given this, it seems like even just doing two simple LUKS tests, with 512-byte and 4K sectors respectively, would have caught this rather disastrous problem that was literally killing/faulting users' pools. I am not saying this to point fingers or blame anybody; it just strikes me that until native encryption is sufficiently robust to be the obvious choice (i.e. a situation where "Why even use LUKS?" is a relevant question), LUKS+ZFS is something that ought to be tested, as it is not hard to see why somebody would want to use it.
I'm pretty sure the devs would welcome it, if someone would write a test for the test suite.
If someone would like to write some tests for the test suite that would be welcome.
Good to hear that there's openness to something like this even though LUKS is not, specifically, a ZFS matter. :-) Oh, and thanks for getting 2.2.2 out so quickly, Brian!
Don't think of LUKS as a "not a ZFS matter" thing. Think of LUKS as a particular type of disk drive that makes ZFS exercise code paths which are valid and should be tested.
Do we know if this also affects similar setups with GELI or is it specific to LUKS?
This issue is specific to how ZFS submits IO to the block layer on Linux. LUKS just happens to do a good job of exposing it. The root cause is understood and the long term fix is being worked on in https://github.com/openzfs/zfs/pull/15588.
Still getting this kind of failure on 2.2.2 when receiving snapshots: #15646
I am also getting these spurious errors when resilvering; causing a resilvering loop that seems to go on forever. LUKS partitions, zfs-2.2.2.
~~No fix yet? Fedora 39 and 38 have shipped kernel 6.9 and the version I know to be stable is not actually compatible with that kernel.~~
I am testing `master` on production hardware. So far two machines, both LUKS (one mirrored), no issues.
The revert of bd7a02c251d8c119937e847d5161b512913667e6 was not included in `master`. Just tested — `master` continues to corrupt data under the filed circumstances, but only on mirrored LUKS devices (as before).
Now testing a custom `master`-derived branch with the revert applied.
No errors so far with `master` + revert. I will have to fully run a scrub and a bunch of snapshots before I can decide this is good. If good, I would recommend reverting the commit in question ASAP on `master`.
OK. `master` plus revert does not cause any issues anymore.
So if you're on LUKS, is it best to stay on 2.1.x? In other words, if you are on LUKS, you can't switch to 2.2.x unless:
Anyone know if that's correct? Trying to figure out the best guidance to provide to people when asked, given what we know today.
@RichardBelzer https://github.com/openzfs/zfs/commit/bd7a02c251d8c119937e847d5161b512913667e6 is reverted from 2.2.2. Master indeed should get https://github.com/openzfs/zfs/pull/15588 instead.
FYI, 2.2.4 just shipped, with #15588 and followup patches included. If you are still having this problem, you might try setting `zfs_vdev_disk_classic=0` in your `zfs` module parameters and seeing if that helps. If you do try this, please report back with your results, as our hope is to make this the default in the future.
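If it helps anyone, this is roughly how to check and set that parameter on a typical Linux setup (paths are the standard module-parameter locations; adjust for your distro):

```shell
# Read the current value from the running zfs module:
cat /sys/module/zfs/parameters/zfs_vdev_disk_classic

# Make it persistent across reboots via modprobe configuration
# (regenerate the initramfs afterwards if zfs loads from it):
echo "options zfs zfs_vdev_disk_classic=0" | \
  sudo tee /etc/modprobe.d/zfs.conf
```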
I still encountered spurious WRITE errors when adding a mirror on a LUKS-partitioned device, both with `zfs_vdev_disk_classic` set to 0 and to 1.
On 6.8.10-asahi NixOS, ZFS 2.2.4, MacBook Air M2: `zfs_vdev_disk_classic=0` and `zfs_vdev_disk_classic=1` both result in several hundred zio error=5 type=2 with a LUKS2 header while trying to install.
LUKS1 results in no errors.
I encountered this on Linux 6.1 with ZFS 2.2.4 after replacing a disk in a mirror. I tried a ton of different ZFS versions and the different kernel module parameters from this issue, and they did not help. Then, I noticed that LUKS on the new disk had defaulted to 4k sectors rather than 512:
After reformatting the LUKS volume with 512 byte sectors, resilvering completed without errors and a scrub looks to be going well also, so this seems to have resolved it for me. This pool does have ashift=12, for what it's worth.
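For anyone else hitting this, checking the sector size recorded in the LUKS2 header is straightforward. The device name below is hypothetical:

```shell
# Show the per-segment sector size from the LUKS2 header:
cryptsetup luksDump /dev/nvme0n1p2 | grep -i sector

# Recreate the volume with 512-byte sectors (this DESTROYS all data
# on the device, so only do it after evacuating the mirror member):
cryptsetup luksFormat --type luks2 --sector-size 512 /dev/nvme0n1p2
```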
For what it's worth, I'm using ZFS 2.2.4 and Linux 6.6 and I have more than a dozen ZFS pools with ashift=12 across several machines, the vast majority of which using LUKS2 (with 4K LUKS sectors, despite 512-byte sectors in the underlying physical device) and I've never encountered these errors (I'm only subscribed to this issue because I find them concerning). I run scrubs weekly.
That's an odd case, and any filesystem on that might not be too happy. Thanks for the report.
@sempervictus Agreed. I think at some point LUKS must have changed how it decides default sector size, or perhaps there is some difference between my disks that affects that. Both of my disks advertise 512 byte sectors, but they are SSDs from different manufacturers (Samsung 990 Pro and Sabrent Rocket 4.0 Plus).
@ryantrinkle SSDs under the hood use a 4K-16K NAND page size, so 512 is just a catch-all default. Using 4K with them is very much OK, and LUKS probably tries to be smart here (if SSD, set 4K).
Please see here for a debugging patch that I hope will reveal more info about what's going on: https://github.com/openzfs/zfs/issues/15646#issuecomment-2283206150
(if possible, I would prefer to keep discussion going in #15646, so its all in one place).
I build and regularly test ZFS from the master branch. A few days ago I built and tested the commit specified in the title of this issue, deploying it to three machines.
On two of them (the ones that had mirrored pools), a data corruption issue arose where many WRITE errors (hundreds) would accumulate when deleting snapshots, but no CKSUM errors took place, nor was there evidence that hardware was the issue. I tried a scrub, and that just made the problem worse.
Initially I assumed I had gotten extremely unlucky and hardware was dying, because two mirror drives of one leg were experiencing the issue but none of the drives of the other leg were -- so I decided it was best to be safe and attach a third mirror drive to the first leg (that was $200, oof). Since I had no more drive bays, I popped the new drive into a USB port (USB 2.0!) and attached it to the first leg.
During the resilvering process, the third drive also began experiencing WRITE errors, and the first CKSUM errors.
I tried different kernels (6.4, 6.5 from Fedora) to no avail. The error was present either way. `zpool clear` was followed by a few errors whenever disks were written to, and hundreds of errors whenever snapshots were deleted (I have zfs-auto-snapshot running in the background).
Then, my backup machine began experiencing the same WRITE errors. I can't have this backup die on me, especially not that I have actual data corruption on the big data file server.
At this point I concluded there must be some serious issue with the code, and decided to downgrade all machines to a known-good build. After downgrading the most severely affected machine (whose logs are above) to my build of e47e9bbe86f2e8fe5da0fc7c3a9014e1f8c132a9, everything appears nominal and the resilvering is progressing without issues. Deleting snapshots also is no longer causing issues.
Nonetheless, I have forever lost what appears to be "who knows what" metadata, and of course four days trying to resilver unsuccessfully:
In conclusion, something added between e47e9bbe86f2e8fe5da0fc7c3a9014e1f8c132a9..786641dcf9a7e35f26a1b4778fc710c7ec0321bf is causing this issue.