openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Setting dedup_table_quota to 1 (or zfs ddtprune -p 100) seems to cause corruption #16713

Open dberlin opened 3 weeks ago

dberlin commented 3 weeks ago

Unintentionally discovered on two Fedora 40 systems, both running commit e5d1f6816758f3ab1939608a9a45c18ec36bef41.

Both have zpools that had dedup=blake3 turned on and dedup_table_quota set to 16G at the start. One machine has a single-disk zpool; the other has a mirrored zpool.

Both running kernel 6.11.x.
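For context, the setup on both machines was roughly along these lines (pool and device names below are placeholders, not the real ones, and I may not have the exact original invocations):

    # single-disk pool on one machine (the other uses a mirror instead)
    zpool create tank /dev/sdb
    # enable dedup with the blake3 checksum on the datasets of interest
    zfs set dedup=blake3 tank
    # cap the on-disk dedup table via the pool property
    zpool set dedup_table_quota=16G tank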

I was messing around with the new dedup to see how well it worked, and after experimenting for a while (i.e. weeks), I decided to turn it off for now on both of these machines. Looking at the stats, they had some amount of dedup success, but nothing huge (which is what one would expect given what the machines are). Turning it off, without doing anything else, did not cause any corruption that I can tell.

However, I then tried to see if I could prune basically all entries to get the dedup table as small as possible - I'm aware you can't remove the dedup table entirely without recreating the pool, so I figured I would just prune it as much as I could.

I ran the following commands on both machines:

    zpool set dedup=off
    zpool ddtprune -p 100
    zpool set dedup_table_quota=1

(As an aside, I'll note that ddtprune -d 0 is invalid, but ddtprune -p 100 is not, which is at least amusing if nothing else.)

This pruned a lot of entries, but not all of them. I expected it would prune basically all entries not actually in use, and at least staring at the zpool status -D stats, it appeared to have done that; the numbers looked reasonable at a glance.
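For reference, the stats I was looking at before and after the prune were along these lines (pool name is a placeholder):

    # dedup table summary and histogram as reported by the pool
    zpool status -D tank
    # more detailed, read-only view of the DDT objects themselves
    zdb -DD tank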

I then continued about my merry way for the evening.

However, after rebooting, both machines refused to boot, with I/O errors loading various libraries (no ZFS panics/asserts/etc. reported in the logs). After booting from a live CD, building the exact same version of ZFS, and loading it, a scrub shows checksum failures (on the mirrored pool, the checksum failures were identical on both disks).
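Roughly what I did from the live environment (pool name is a placeholder):

    # from the live CD, after building and loading the matching ZFS version,
    # force-import the pool since it was last used by a different hostid
    zpool import -f tank
    # run a scrub, then list checksum errors and the affected files
    zpool scrub tank
    zpool status -v tank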

The scrub-identified corrupt files were all newer files that appear to have been dedup'd, and then accessed after the commands above were run.

These machines are snapshotted every hour - the same files in snapshots created after the commands above were run are also corrupt. Snapshots from before that time do not contain corrupt files, even where the same files exist.

The snapshots are also replicated to a backup storage machine running ZFS 2.2.x (with no dedup) - a scrub is running there (it has a much, much larger storage pool, so it will take another day), but so far no errors have appeared on that machine. As mentioned, dedup is off on it, but it does have DDT entries, which I presume come from receiving streams from the machines that had dedup on.
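The replication is ordinary incremental snapshot send/receive; a hypothetical sketch of the kind of command involved (dataset names and host are placeholders, and the real setup may go through a wrapper tool):

    # hourly incremental replication of everything between two snapshots
    zfs send -I tank/data@prev tank/data@now | ssh backup zfs recv -u backuppool/data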

So I may have both a corrupt and a not-corrupt copy of the same snapshot, if that's helpful (I'm seeing what I can do).

I didn't expect any of this (and am still recovering the two machines), so I don't yet have a totally useful reproduction recipe, but I figured I would flag it. Happy to try to gather whatever data is helpful.

dberlin commented 3 weeks ago

Three updates:

  1. The scrub on the larger backup storage machine (running 2.2.x) finished. It detected no corruption. So even snapshots that show as corrupted on the 2.3.x machines with dedup are not seen as corrupt once transferred to the older ZFS machine.

  2. New corruption is still showing up in a few (but not all) files that had dedup'd blocks, so either accessing them is corrupting them or some other background process is. Zero files without dedup'd blocks have been corrupted.

  3. ZFS has also now panicked a few times with the same message on both machines - 'zfs: adding existent segment to range tree ...'. This happens when I try to replace some of the corrupt files. I have worked around it by setting the zfs_recover parameter for now (sketch below).
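For the record, the workaround in item 3 is just the standard zfs_recover module parameter, which (as I understand it) makes ZFS log this class of error instead of panicking; roughly:

    # flip it at runtime for the already-loaded module
    echo 1 | sudo tee /sys/module/zfs/parameters/zfs_recover
    # to persist across reboots, add this line to a modprobe.d config file:
    #   options zfs zfs_recover=1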