RFC: Compressor Versioning

sempervictus commented 2 years ago

Describe the feature would like to see added to OpenZFS

ZFS includes compression algorithm code from a variety of sources to perform inline block-level compression and decompression. Much like ZFS itself, these upstream codebases grow and evolve - adding features, fixing bugs (which could be block-level data bugs for us), and improving performance. Despite most if not all of these algorithms being backward compatible, the compressor output changes (regardless of how much) which alters the on-disk checksums of blocks leading to an inability to ever upgrade our compression code (at least the compressor side) or rely on deduplication and compressed+persistent l2arc when pools are imported on another OS using its own compressor implementation. For a filesystem intended to meet needs well into the future, this is a critical hindrance and potential blocker to adoption. In order to address this problem, i am proposing that we start carrying versioned compressors which write their upstream commit version into each block they compress. This would permit ZFS to RMW data with a new compressor on-access, and treat unversioned blocks as "version 0" - or "always needing an upgrade." This would also permit logic in the RMW "repair" code to understand which versions need a rewrite, and which don't so as to differentiate actual checksum errors from compressed data mismatches.

How will this feature improve OpenZFS?

This approach would permit ZFS to keep up to date with upstream compression algorithms.

Additional context

We have already lost work due to this problem in the ZSTD 1.4.8-9 upgrade PR LZ4 implementation could use a refresh as well just due to the code being out of date

sempervictus commented 2 years ago

ping @behlendorf @ahrens @rincebrain @BrainSlayer

sempervictus commented 2 years ago

Data Changes

Any block marked as being compressed (ARC, SLOG, etc) is marked with the compressor version
The version marking can be a metadata field or injected into a predetermined location in the block's checksum for inline optimization of the compressor version check
Blocks lacking marking are treated as having "version 0"

Operational Changes

Blocks read from disk are compared to the current version of the compressor
If a block is discovered to have been written with an old compressor version, it is RMW'd with the new one (or we fall back to the originating compressor if we include versioned ones)
- This creates a bunch of concerns about flooding pools, handling of snaps made with prior versions and guarantees around their consistency (do we define that as the original sum or the re-compressed one?), etc
- Since the operation is DMU-level and effectively "fixing" the block like a scrub would if it were rebuilding from others due to a sum mismatch, i propose we address the concerns above by ensuring that the contents of the block before and after re-compression are the same, and modifying "in-place" (freeing the old ones after writing the new ones) to avoid flooding (its not much of an upgrade if the new compressor produces bigger blocks).
This can be done in the course of normal IOs or a "special operation" like a scrub/resilver - might require carrying the current+last version of the compressor code
- Versions not carrying compressor code capable of reproducing original sums would have to impor the pool RO or with a flag explicitly starting the "special operation" for recompression
This would also allow us to mark certain versions as dangerous (block them from being used at runtime) if the routines are found to have a fatal memory/data/execution flaw which could be exploited by data inputs up to our block size - think inverse zip bombs or some other crafty data construct thought up by HAL9000's meaner cousin next time folks go to war. Depending on how we view our security model, we can consider what is already on-disk as "safe for decompression because we wrote it" but not what's being written to disk (for compression) since that comes from "evil consumers" if the bug is known to impact the compressor; or treat both as untrusted as the situation demands.

Code Structure Changes

If we want to provide flexible application of compressor (and maybe decompressor) versions, we may need to carry them versioned as well (at least one version back)

BrainSlayer commented 2 years ago

the compressed block already contains the zstd version code. so the request is pointless. storing a upstream commit version like a git hash is not required. the stable version code including major and minor version is fairly enough (and way smaller than storing a full git hash)

gmelikov commented 2 years ago

Maybe I'm totally wrong, but I'd like to think about restricting to recompress data and trying to compare with previous checksum. So, we just won't need to version compressors (while older compressor can decompress newer block, but this case usually make compressors totally incompatible).

IiRC we may try to recompress now in (need to recheck):

for compressed ARC
encryption (l2arc)

So I propose to start with the question - why do we really need recompression to be idempotent?

NB, compression result may already store not only compessed data https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/lz4.c#L67-L74

gmelikov commented 2 years ago

And wanted to highlight PR https://github.com/openzfs/zfs/pull/9416#issuecomment-964436998 , where zio_compress() will give different results between before/after version and it looks acceptable even now.

gmelikov commented 2 years ago

So, usage of zio_compress_data() https://github.com/openzfs/zfs/search?q=zio_compress_data%28 :

ZIO:

zio.c in zio_write_compress() - it's ZIO, if I understand everything correctly - it's used only for NEW writes.

ARC:

arc.c in arc_hdr_authenticate() - only for UNcompressed ARC, Best effort if comment is true: https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L1774-L1788 https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L1791
arc.c in l2arc_apply_transforms() - if if (psize >= size) then remain uncompressed https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L9298-L9309
arc.c in l2arc_log_blk_commit() - compress buffer, not data https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L10796-L10798

So, it looks like for me that now we don't need to have idempotent zio_compress_data() results across different versions of code to work. Maybe result is not 100% optimal, but it should work.

Will be glad if someone can review my thoughts here.

sempervictus commented 2 years ago

@BrainSlayer: thanks for the heads up on ZSTD. Do the other compressors store the same information in their outputs? Could be a good template for how everything can keep that data. I'm curious what your thoughts are on how to check that. I'm thinking that if some known offset from start of block hash contains the version bits, then the sum mismatch could be checked for mismatching on the compressor version during the checksum verification call. @gmelikov: that's an interesting one as i was under the impression that any mismatch between ARC/L2ARC/SLOG/VDEV data would be a problem based on discussion in the now closed ZSTD update PR. My understanding of the problem is that we do need idempotent compression as blocks read from a vdev and put into ARC by a different compressor version would produce two different blocks and "end the world." @behlendorf had a couple of comments on this which i can try to dig up, or hopefully the maestro himself will show us the sheets and explain how this symphony plays. In terms of the ARC compression piece, is the source to the zio_compress_data call not equivalent to the data which would be written to a VDEV?

BrainSlayer commented 2 years ago

of course not. but the storing such a information for zlib/lz4 would turn the block incompatible. zstd was the first compressor which introduced versioning in zfs. for zlib, lz4 etc. this is not required. they will not change since the algorithm is fixed.

sempervictus commented 2 years ago

@BrainSlayer - i think thats the crux of the problem. In #12805 there is a comment about the newer compressor producing slightly different outputs than the old one, but since compressors normally don't version their outputs, its another situation where we can't implement the current code because blocks vary between them. Could you please elaborate a bit on the would turn the block incompatible piece? I think that the more pitfalls and problems we can get down on paper the better we'll be able to reason about solution vectors.

BrainSlayer commented 2 years ago

if not neccessary we should avoid format changes which disallow older versions of zfs to read the filesystem. upgrading the block format with a version prefix does fundamentally change the format and breaks backward compatibility.

sempervictus commented 2 years ago

@BrainSlayer: Agreed, i'd normally not even think about data level changes like this, but the problem itself is one of data changing in such a way as to make pools incompatible or in the very least break dedup. Feature flags also look a bit sketchy to me since your repo by now would have had to have 3 new ones added for 1.4.8,.9, and 1.5.0 and another one for the new LZ4 code; plus thats a bit of a headache in terms of "which blocks got compressed with what" for other compressor/decompressor algos. Where else could we stick attributes relating to this in a generic manner for all compressor types such that we dont totally bork import for older versions, and how could we make that efficient enough when doing the check on every block read? Could we use the block pointer or some other already-being-accessed metadata field to avoid too many extra memory/data ops?

sempervictus commented 2 years ago

So to reinforce the need for this, i took a look at CVEs against lz4, and found a few which may or may not be relevant to our use, but do show that we must be able to upgrade our compression algorithms or be banned from use by security policy when we are provably impacted:

CVE ID	Description
CVE-2021-3520	There's a flaw in lz4. An attacker who submits a crafted file to an application linked with lz4 may be able to trigger an integer overflow, leading to calling of memmove() on a negative size argument, causing an out-of-bounds write and/or a crash. The greatest impact of this flaw is to availability, with some potential impact to confidentiality and integrity as well.
CVE-2019-17543	LZ4 before 1.9.2 has a heap-based buffer overflow in LZ4_write32 (related to LZ4_compress_destSize), affecting applications that call LZ4_compress_fast with a large input. (This issue can also lead to data corruption.) NOTE: the vendor states "only a few specific / uncommon usages of the API are at risk."
CVE-2014-4715	Yann Collet LZ4 before r119, when used on certain 32-bit platforms that allocate memory beyond 0x80000000, does not properly detect integer overflows, which allows context-dependent attackers to cause a denial of service (memory corruption) or possibly have unspecified other impact via a crafted Literal Run, a different vulnerability than CVE-2014-4611.
CVE-2014-4611	Integer overflow in the LZ4 algorithm implementation, as used in Yann Collet LZ4 before r118 and in the lz4_uncompress function in lib/lz4/lz4_decompress.c in the Linux kernel before 3.15.2, on 32-bit platforms might allow context-dependent attackers to cause a denial of service (memory corruption) or possibly have unspecified other impact via a crafted Literal Run that would be improperly handled by programs not complying with an API limitation, a different vulnerability than CVE-2014-4715.

There are more for gzip and zstd

Given that there's a potential security angle to this, could one of the maintainers please weigh in on the subject of "how do we patch compressor vulns today" and whether (and how) we actually know our compressor implementations to be safe?

If we can't upgrade, we at least should carry provably exploitable PoCs for all documented vulns against the algo/its implementations which are then wrapped in tests to prove that something impacting a differently sized/organized/etc implementation of a compressor does not in fact impact ZFS.

BrainSlayer commented 2 years ago

first you should check if any of these cve's belongs to the code. i just CVE-2014-4715 and this is already mitigated. the other cve's do not belong to the part of the code which is used. consider that the lz4 library is much bigger than the code which is used in zfs. zfs just uses a small subset of it

sempervictus commented 2 years ago

Thank you sir for the clarification :). My point is more to illustrate that the code we are taking from various upstreams may in itself be flawed from a security perspective (or become so later under various compiler conditions/architectures/etc) and must be modular in its implementation - we need to be able to upgrade our LZ4, ZSTD, GZIP, etc implementations or otherwise change them if some part is found to have a vulnerability sometime down the line.

BrainSlayer commented 2 years ago

but upgrading means here that we need to merge upstream fixes. lz4 in zfs is not some upstream code. its a handcraftet implementation based on upstream code. so as far as the compressed bitstream doesnt change, any new fix or patch can be applied and this is the case for lz4 as i mentioned already. the gzip implemenation uses the linux and freebsd kernel api. no maintainance required. since there is no local implementation for gzip in the zfs source itself

sempervictus commented 2 years ago

There's a PR for the LZ4 decompressor code from upstream, but the compressor is currently being left as-is AFAIK. Any idea on lzjb? I cant see zle changing... For ZSTD though this still poses a problem: we need a way to upgrade as they do in upstream, or we would break dedup and such on every new version or be stuck with a potential security issue in an older one (1.3.8 had one apparently, but i see none for 1.4.x). I would still say that even for the in-house implementation of lz4, forcing a static compressor implementation for all time to come seems foolhardy.

BrainSlayer commented 2 years ago

why would we break dedup. blocks written in the past can be still used and new blocks get a new checksum once they are created. so dedup will still work. and if the new block with same content doesnt match a older block which has the same content, its no big deal. its not deduped in that case, but for future writes dedup still works. and of course you can simply run a progreamm to reread and rewrite all blocks for dedup refreshing. (i have script which does full deduplication on a fs by simple cp src to src.tmp, mv src.tmp src for each file on the filesystem. such a procedure can be also done on block level by zfs once a upgrade is required and bitstream chances are expected for a new compressor version

sempervictus commented 2 years ago

@BrainSlayer - if we lose a disk from a vdev and resilver it after upgrading compressors, or perform some other operation on a file which write the same file back with no changes, won't a new ZSTD compressor change the on-disk bits and therefore no longer match the dedup sum? I think it is a big deal in that we have datasets with 30X dedup on them (law firms) and constantly adding "the same" data - their storage capacity would diminish rapidly and cumulatively with each new version of a compressor. There's also the uncompressed ARC thing from the ZSTD 1.4.x PR

My intent in filing this issue is to get the maintainers attention on the problem domain - we as users can say "its no big deal" if something doesnt work under some cases, but for the world at-large, this is an enterprise software suite which must behave as advertised or the commercial entities selling it might find themselves in breach of contract/on the wrong side of their customers' ire.

BrainSlayer commented 2 years ago

this is why you have to read and rewrite all files un upgrade. so the problem solves it by itself. i mean. its just a solution. the compressor version is stored in the block itself. of course we can also add a separate compressor for each individual version to zfs the code. and if you write "each new version". there arent that many new versions and we only have to consider new versions where the bitstream changes. but as i said. easy fix is running a script which reads and rewrites all files. problem solved

sempervictus commented 2 years ago

@BrainSlayer - i think that's what i'm trying to get at: the mechanism by which such a rewrite can happen in the background. Something like a resilver at the data layer vs the vdev layer. The script idea might be something zed could do, though i am woefully behind on that piece of zfs

IvanVolosyuk commented 2 years ago

I think there is a conceptual problem with a rewrite of the data. Older version of ZFS should be able to read/write the data the same as newer version, otherwise we supposed to guard the data by feature flags when we losing interoperability.

The newer version of ZFS can support new format with newer version of compressor and it will use it only if the feature flags are enabled, but it should also be able to read/write older format of the pool with the feature flags disabled and for that it needs an old compressor.

I'm all for the upgrade of compressors, but it looks pretty difficult to pull off because of previous design choices.

sempervictus commented 2 years ago

@IvanVolosyuk - agreed, i'm not sure they were so much design choices as implementation pathways back then (though this whole thing does seem very academic from the get-go). I'm hoping folks can weigh in with design and architectural proposals which have consideration for how the functionality of upgrading compressors, ciphers, and other DMU-layer constructors would work in the intended "end-state" and how we could get from where we are today with these rather brittle/hard-wired implementations to the proposed design goals. @ahrens - IIRC there is an effort out there to have object stores be used for DMU backing... Given the costs associated with IO and storage in public cloud, and the volumes of data for which such technology would be used in commercial implementations; wouldn't the ability to increase compression ratio or resolve future security bugs in the compressors be somewhat required for the long-term viability of the commercial endeavor? How do resilvers/scrubs(/vdev ops? i'm guessing buckets can be striped organized in various parity/draid schemes) and the "recompression" discussion elements factor into those cost (and therefore viability) models? I can't venture any sort of educated guess at how much data is out there on ZFS or what it's worth, but i can pretty confidently say that its "a lot" by anyone's comprehension in both measures. It seems to me that when (not if) we come to a point down the road where a security problem (even a DoS) is identified in one of the compressor (or cipher) implementations, we will want to have all of the infrastructure in-place and well-tested to be able to drop in a replacement and ship a "patched" version of ZFS pronto.

aluo-x commented 2 years ago

Does anyone know how btrfs dealt with the issue of updating the zstd library?

I see the kernel updated zstd to 1.4.10 from 1.3.1 earlier in 2021. And I was under the impression that btrfs checksums were post-compression.

btrfs did add an option to recompress all existing files using the defragment -czstd command. Although the wiki indicates that older kernels without zstd would not be able to read file systems compressed this way, breaking compatibility.

sempervictus commented 2 years ago

Interesting question, since they're part of the GPL codebase with access to every change hitting the kernel's compressors and ciphers. I think that having a defragment function is handy in general - ZFS fragments with no recourse, especially if it can be used to do touch-up repair/rewrite of blocks on updated compressors and ciphers.

IvanVolosyuk commented 2 years ago

So, in the light of the above upgrading the compressor is possible in case of CVEs. It can have negative effects on 'uncompressed arc + L2ARC' and DDT. Send+receive should be able to re-compress the data though.

sempervictus commented 2 years ago

Well, "possible" is questionable here - if you disregard the feature guarantees of the FS as described (commercial guarantees in some cases), sure. Thats what i'm hoping to preempt: a point where someone will have to make a commit which breaks stuff previously written to disk in order to escape some nightmare scenario of a reproducible overflow in ring0 using data from ring3 with enough control to provide RCE to the data-submitter back in ring3 (or pick a bad thing that can happen in a system which wont accept change without breaking).

sempervictus commented 2 years ago

So looks like we're handling CVE-2021-3520 via #12947 which is on the decompressor side... but its only a matter of time/fuzzing cycles till a compressor is impacted.

sempervictus commented 2 years ago

@IvanVolosyuk - send receive may not be an option for backup datasets holding snapshots/copies of giant stores. Thinking like 10y worth of snapshots at a law firm. This is the case for which i think we need a scrub-type healing process to recompress the data without send/recv

IvanVolosyuk commented 2 years ago

@sempervictus That's sound like it would require bpw

bghira commented 2 years ago

uncompressed ARC shouldn't really be a consideration here, imo.

gmelikov commented 2 years ago

Wanted to add an interesting case, mentioned by @BrainSlayer in #11979 :

at least zstd has a logic if allocation fails, the compression for the block is disabled, so compression can give different results on memory starvation in stable code branches.

IvanVolosyuk commented 2 years ago

at least zstd has a logic if allocation fails, the compression for the block is disabled, so compression can give different results on memory starvation in stable code branches.

Should this just leave the block uncompressed? It seems zio_write_compress() removes the compressed flag if the compressor fails to produce smaller output.

gmelikov commented 2 years ago

Should this just leave the block uncompressed?

If it's initial compression of a block - yes.

If it's a recompress to get original checksum - then we'll get different result and can't compute original checksum of compressed block (see https://github.com/openzfs/zfs/issues/12840#issuecomment-991894048)

robszy commented 2 years ago

I think that this comment https://github.com/openzfs/zfs/issues/12705#issuecomment-954476993 is way out from issues with compression algo upgrades.

We need to go away from bit to bit exact results of compression ( checksumming in zfs gone to far in its design for whatever reasons ) and then we could use only one implementation of algo which is the latest :). If we go in path of versioning we will have every version of every compressor to stay compatible with every zfspool on earth as @Ornias1993 stated in replay to @behlendorf comment: https://github.com/openzfs/zfs/pull/11367#issuecomment-752232210

PrivatePuffin commented 2 years ago

Yes the whole solution to this is simply to stop enforcing post-compression checksums to be the same. And also yes, that indeed means cutting/redesigning some crappy legacy "features".

However @sempervictus you seems to repeat yourself and your requests around compression about once every 6 months. You also act surprised about what @BrainSlayer said about ZSTD, but we explained that to you about 4 times before already.

robszy commented 2 years ago

Also from this checksumming curse we have pulls like https://github.com/openzfs/zfs/pull/12805 which is nothing more than cherry picking easy parts which is good to have updated but it is world upside down the way it should be.

Updating compression should be easy as in https://github.com/openzfs/zfs/pull/11367 which sadly wasn't merged because of "some crappy legacy features"

So most important thing is to lift zfs constraint of byte for byte equality and then merge https://github.com/openzfs/zfs/pull/11367 :)

Other way in 2100 everyone will still use zstd 1.4.5 compressor in zfs

sempervictus commented 2 years ago

Personally, as a user i'd love to see a modernization of the design to avoid post-compression sums being used for dedup if it doesnt maul the rest of the functionality. In the commercial space though, such changes may result in the same problem that changing compressors creates - old datasets would stop deduplicating new writes (since their sums would still be for compressed blocks) and outgrow their commercially-planned storage capacity/growth rate. Moreover, its not readily apparent to me how one would make that change on an existing zpool since existing sums would need to be recomputed off of the decompressed block contents and still need something like resilver/scrub/bprw to modify the existing blocks with said updated sums. This would also break RO imports of newer pools on older ZFS versions as sums on all blocks would be "wrong." Then there's scrubs themselves to consider - if working off of the decompressed sum, how does that play with blocks which are compressed and encrypted?

@Ornias1993: I am explicitly beating the dead horse into a homogeneous paste because this problem is only getting worse (upstream compressor versions are climbing, we're stuck) and so far there's no clear solution to accommodate the listed concerns without compromising commercial requirements for back-compat/platform stability. I do not recall any of your responses presenting a functional solution to the backwards compatibility problems around compressor versioning with dedup ; my apologies if i'm not remembering a relevant comment (these things stretch back a while and very little of my time/focus these days is in the FOSS world).

robszy commented 2 years ago

On dedup problem @Ornias1993 comented in 11367: https://github.com/openzfs/zfs/pull/11367#issuecomment-753447821 - if you want stable chcksums don't use unstable compressor which is something perfectly to understand in enterprise env.

Back to problem at hand:

About first bullet of ARC in @gmelikov in pr 11367 @behlendorf said there will be io errors when we don't authenticate https://github.com/openzfs/zfs/pull/11367#issuecomment-758268676 so it not seams "Best effort" but I don't understand why we authenticate something that is in ram (uncopressed arc) already so it should be authenticated long time ago ? Can someone shed a light what are we doing there for real because for sure we compressing so decryption and authentication and decompression was long time ago ? There was question about that https://github.com/openzfs/zfs/discussions/11385 but not much of an answer.

About recompress case there is a comment https://github.com/openzfs/zfs/pull/11367#issuecomment-753517958 but can't we just change sum in arc header so it will match on reread from l2cache ?

In bullet l2arc_apply_transforms what happens when psize < size because we will compress in that case ? This will be compression without encryption case ?

sempervictus commented 2 years ago

The "dont use unstable compressor" notion is circular logic: compressor code is updated separately from ZFS upstream, sometimes to fix security flaws (which requires application of patches in enterprise environments). There is no such thing as stable code that will never change, only dead code that is unmaintained. Its the whole crux of this issue - compressor versions change, and ZFS has no way to keep up with the changes. The ZSTD example is the best one so far - that codebase moves fast and only a few folks like @BrainSlayer are seeing the benefits of those changes today.

The uncompressed ARC case seems to be both a vestigial feature and a logical blocker when using the compressed sums, but the logic in the authentication function looks like the spot where compressed data is handled could be turned into a decompression call to get the raw sums out for the block. That does create the problem of repeat decompression passes on the same data in the IO pipe every time those sums need to be checked against the raw data (as well as the blocks no longer matching the old compressed block sums). It also looks like the handling of compressed block sizes is always looking for compressed buffers being bigger than the raw ones and able to select the raw buffer even when compression is enabled for the block - such that "compression is enabled" but the block itself doesn't meet the compression constraints and ends up storing the raw sum anyway. Those areas of code should be able to grab the sum of the raw buffer and attach it to the compressed block...

Which leads me to wonder if the solution for keeping back-compat and using raw sums moving forward isn't just to store both in the block? Then the "is this block compressed" checks would boil down to xoring the two sums and for encrypted data we'd have the the raw-send sums available. If this were added as a pool feature it would still be unable to dedup old data when the compressor changes, but for new data written after the feature is enabled but before a compressor version change, the function should still work as both old-compressor-generated blocks and new ones would have the raw sum for comparison. If a decompressor ever produces data mismatching the raw sum then the inter-version decompressor changes are breaking changes anyway and require edge-case handling of some sort. I think this also handles the inter-OS-gzip problem since a check on the raw sums won't care whether the native OS or ZFS compose the final buffer on-disk. However, this approach still doesn't help that law firm with 10y retention requirements of name-a-critical-industry's records which has to update their ZFS deployment due to some as of yet not public vuln in the gzip compressor they've been using for decades. Those folks would still face potential data doubling since there would be a new "base layer" of records using the raw sum and all subsequent dedup would have to happen against that layer of data because their compressed sums wouldn't match what is in the old compressed but not newly-informed-of-raw-sums blocks. How much computational and memory overhead would be required to maintain both of the sums throughout the IO pipeline?

robszy commented 2 years ago

You answered yourself:

If this were added as a pool feature it would still be unable to dedup old data when the compressor changes

So way out is to rewrite those files as @BrainSlayer suggested and problem is then solved in my opinion without adding tons of code and complexitiy

I'm waiting for answers to my questions....

sempervictus commented 2 years ago

@robszy: rewriting files isnt an option if you have ZVOLs instead of ZPL entries. Also actually touching files in the VFS counts as access for compliance purposes (file integrity monitoring stacks such as OSSEC will flag modification and trip alerts) so the rewrite needs to happen underneath, somewhere in the filesystem innards so as not alter userspace-visible attributes.

As for your questions:

i do not know why the ARC data is being authenticated, but my experience in working with AEADs in other stuff leads me to believe that you need to perform authentication on-access as best-practice to ensure that nothing modified your ciphertext buffer before you try to do things with it (say a bit flip, or dirty condition of some sort).
The sum of the L1ARC header in the uncompressed ARC case cannot match the compressed sum in the L2ARC header (or be forcibly updated to such) because the actual buffer contents are different. What's worse is that if the persistent VDEV was written with an older compressor than the L2ARC (or ZSTD chose to compress less due to time/memory constraint), we end up with 3 sums - the original on-disk compressed one, the uncompressed L1ARC, and the just-compressed L2ARC.
I havent dug into the difference between uint64_t psize = HDR_GET_PSIZE(hdr); and uint64_t size = arc_hdr_size(hdr); to have an answer

EDIT to clarify the compliance problem a bit: regulatory frameworks requiring retention of data for X number of years after modification would reset the clock on retention requirement if the user were to rewrite the files as the modification time would shift to whenever they did it (even though the logical contents do not change). Regulatory bodies are often not fully staffed by reasonable people, so getting exceptions can range from costly to impossible. Mythical block pointer rewrites would be great to solve this (delivered by unicorns in a time machine), send/recv may be the only real way currently; but something like a "value-added scrub" which would decompress the blocks and amend their metadata by adding the uncompressed sum would avoid the VFS-level modifications while addressing the extension of attributes need underneath.

robszy commented 2 years ago

Ok so I'm waiting for reply to questions you don't know answers for and for more elaborated answer to 2 :)....

sempervictus commented 2 years ago

I'll just assume that English isn't your first language and you're not actually intending to be rude, but i am somewhat at an impasse as to why you're awaiting answers from me at all. The issue is posted to describe an architectural problem and attempt to crowd-source a solution which meets the constraints presented. How is your contribution to the discussion furthering that goal?

robszy commented 2 years ago

Of course not. I'm waiting for others more experienced in zfs people :)

PrivatePuffin commented 2 years ago

rewriting files isnt an option if you have ZVOLs instead of ZPL entries

Guess he was refering to blocks.

PrivatePuffin commented 2 years ago

2. What's worse is that if the persistent VDEV was written with an older compressor than the L2ARC (or ZSTD chose to compress less due to time/memory constraint), we end up with 3 sums - the original on-disk compressed one, the uncompressed L1ARC, and the just-compressed L2ARC.

Which should be relatively easy to fix, by nuking the content in (L2)ARC. This is not really the blocker for version updates afaik

robszy commented 2 years ago

Rewriting data is in my opinion only option in dedup case because changing chcecksums to match decompressed data will surely break compatibility with older zfs versions if we only changing checksums in bps. Maybe there are other ways to setup data structures only for dedup which consider only uncompressed checksums but it will also break compatibility

Leaving some kind of long term support version of lz4 or zstd is also wrong because nobody will add security patches to lts versions and nobody wants security flaws in zfs i think.

robszy commented 2 years ago

But other problems with hdr_authenticate ( maybe we can decompress only after authentication of non 0 level block ?) and recompress (nuke from acr/change checksum ) cases are doable so in my opinion we should do it, not tell people that there will be some issues.

RJVB commented 2 years ago

I haven't read through the whole thing, but how about -o compression=lz4-i (or lz4@i where i could be a simple counter meaning "the ith version of lz4 that was included in ZFS"? Same problem that there is no foreseen place to store the version info, in addition to having to find a way to include those different versions in ZFS?

That former issue could be trivial to address by adding a new compressor name (rather than replacing an existing compressor with a newer version that wouldn't be recognised by the older version).

IvanVolosyuk commented 2 years ago

Summary of what are the issues we have:

Bit-identical compression is required for dedup to be effective. But the compression is already different between Linux and FreeBSD #12919 and QAT #7896. It is not possible to change any of them without loosing effectiveness of dedup. What we are talking about is changing compression between versions of ZFS. This can add another reason for people to be stuck at ancient version of ZFS.
Re-compression. This seem to only happen when using uncompressed ARC and compressed L2ARC. This makes blocks compressed with different (older?) compressor effectively not cacheable in L2ARC or even worse actually.
Adding compressor version number to the block output. It doesn't achieve much. It helps only if we add multiple version of compressors into the code base. Pool should be usable without (1) and (2) problems on older version of ZFS, thus compressor from older version have to be included into new version as well. New compressors thus have to be protected by feature flags to not be used for pools compatible with older version of ZFS.
Bug fixes / security fixes have to be backported and can still break (1) and (2) but in those cases the affected blocks may be already effectively corrupted and unreadable.
Hypothetical large scale change of ZFS where checksums are computed before compression will not fly with the existing encryption support. Ability to scrub / operate the pool without loading encryption keys is not compatible with this approach.

Deadlock. The only way forward is (3) as it seems. Keep multiple compressors in the codebase, protect them by feature flags. Ugly, but is the only feasible approach unless we are willing to sacrifice (1) and (2).

@ahrens Is this the correct summary?

openzfs / zfs