Open sempervictus opened 2 years ago
ping @behlendorf @ahrens @rincebrain @BrainSlayer
The compressed block already contains the zstd version code, so the request is pointless. Storing an upstream commit version like a git hash is not required; the stable version code, including major and minor version, is quite enough (and way smaller than storing a full git hash).
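For reference, here is a rough, simplified sketch of that per-block header. The layout below is illustrative only, not the authoritative definition, which lives with the zstd glue code in the ZFS tree (include/sys/zstd/zstd.h):

```c
#include <stdint.h>

/*
 * Illustrative sketch of the small header ZFS prepends to every
 * zstd-compressed block; see the zstd header definition in the ZFS source
 * (include/sys/zstd/zstd.h) for the authoritative layout.
 */
typedef struct zstd_block_header_sketch {
	uint32_t c_len;		/* compressed length of the payload below */
	uint32_t version : 24;	/* zstd library version, e.g. 10405 for 1.4.5 */
	uint32_t level : 8;	/* compression level the block was written with */
	char data[];		/* compressed payload */
} zstd_block_header_sketch_t;
```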
Maybe I'm totally wrong, but I'd like to think about restricting where we recompress data and try to compare against a previous checksum. Then we simply wouldn't need to version compressors (as long as an older decompressor can read a newer block; the case where it can't usually means the compressors are totally incompatible anyway).
IIRC we may try to recompress now in (need to recheck):
So I propose to start with the question - why do we really need recompression to be idempotent?
NB: the compression result may already store not only compressed data https://github.com/openzfs/zfs/blob/60ffc1c460e4cdf3c3ca12c8840fd0675a98ed0d/module/zfs/lz4.c#L67-L74
And I wanted to highlight PR https://github.com/openzfs/zfs/pull/9416#issuecomment-964436998, where zio_compress() gives different results before/after the change, and that already looks acceptable.
So, usage of zio_compress_data() https://github.com/openzfs/zfs/search?q=zio_compress_data%28 :
- ZIO:
- ARC:
  - Best effort, if this comment in the code is accurate: https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L1774-L1788
  - https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L1791 - if (psize >= size), the block remains uncompressed (see the sketch below)
  - https://github.com/openzfs/zfs/blob/376027331d7edaf0bb7bd146f2f087108b7eeac5/module/zfs/arc.c#L9298-L9309

So it looks to me like we don't currently need idempotent zio_compress_data() results across different versions of the code for things to work. Maybe the result is not 100% optimal, but it should work.
I'd be glad if someone could review my thoughts here.
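To make the "remains uncompressed" bullet above concrete, here is a minimal standalone sketch of that decision. The helper names are made up, and the real code in zio_write_compress() additionally rounds psize up to the pool's allocation size before deciding whether compression is worth keeping:

```c
/*
 * Minimal sketch (not ZFS code) of the "compress, but fall back to raw"
 * decision made when a block is written.  compress_placeholder(),
 * store_raw() and store_compressed() are invented helpers.
 */
#include <stddef.h>

size_t compress_placeholder(const void *src, void *dst, size_t lsize);
void store_raw(const void *buf, size_t lsize);
void store_compressed(const void *buf, size_t lsize, size_t psize);

void
write_block_sketch(const void *src, void *dst, size_t lsize)
{
	size_t psize = compress_placeholder(src, dst, lsize);

	if (psize == 0 || psize >= lsize) {
		/*
		 * The compressor failed (e.g. an allocation failure) or did
		 * not shrink the data: store the logical data unmodified and
		 * mark the block as uncompressed.
		 */
		store_raw(src, lsize);
	} else {
		/* Compression helped: record both lsize and psize. */
		store_compressed(dst, lsize, psize);
	}
}
```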
@BrainSlayer: thanks for the heads up on ZSTD. Do the other compressors store the same information in their outputs? That could be a good template for how everything keeps that data. I'm curious what your thoughts are on how to check it. I'm thinking that if some known offset from the start of the block contains the version bits, then a checksum mismatch could be checked for a compressor-version mismatch during the checksum verification call.
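Purely to illustrate that idea (every name below is invented; nothing like this exists in the tree today), such a check might look like the following, with the caveat noted in the comment that a differing version by itself doesn't prove the data is intact:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block prefix holding the compressor version bits. */
typedef struct blk_compress_tag {
	uint32_t version;	/* version of the compressor that wrote the block */
} blk_compress_tag_t;

/*
 * Hypothetical helper for the checksum-verification path: when the stored
 * checksum does not match, check whether the block was written by a
 * different compressor version, i.e. whether this is version skew rather
 * than (necessarily) corruption.  A real implementation would still need a
 * second way to prove the data is intact, e.g. a checksum of the
 * decompressed contents.
 */
static bool
cksum_mismatch_is_version_skew(const void *block, uint32_t current_version)
{
	const blk_compress_tag_t *tag = block;	/* version bits at a known offset */

	return (tag->version != current_version);
}
```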
@gmelikov: that's an interesting one, as I was under the impression that any mismatch between ARC/L2ARC/SLOG/VDEV data would be a problem, based on discussion in the now-closed ZSTD update PR. My understanding of the problem is that we do need idempotent compression, since a block read from a vdev and put into ARC by a different compressor version would produce two different compressed blocks and "end the world." @behlendorf had a couple of comments on this which I can try to dig up, or hopefully the maestro himself will show us the sheets and explain how this symphony plays.
In terms of the ARC compression piece, is the source to the zio_compress_data call not equivalent to the data which would be written to a VDEV?
Of course not. But storing such information for zlib/lz4 would make the block incompatible. zstd was the first compressor to introduce versioning in ZFS; for zlib, lz4, etc. it is not required. They will not change, since the algorithm is fixed.
@BrainSlayer - I think that's the crux of the problem. In #12805 there is a comment about the newer compressor producing slightly different output than the old one, but since compressors normally don't version their outputs, it's another situation where we can't take the updated code because blocks would differ between the two.
Could you please elaborate a bit on the "would turn the block incompatible" piece? I think that the more pitfalls and problems we can get down on paper, the better we'll be able to reason about solution vectors.
If not necessary, we should avoid format changes which prevent older versions of ZFS from reading the filesystem. Upgrading the block format with a version prefix fundamentally changes the format and breaks backward compatibility.
@BrainSlayer: Agreed, I'd normally not even think about data-level changes like this, but the problem itself is one of data changing in such a way as to make pools incompatible, or at the very least break dedup. Feature flags also look a bit sketchy to me, since your repo by now would have had to add three new ones for 1.4.8, 1.4.9, and 1.5.0 and another for the new LZ4 code; plus that's a bit of a headache in terms of "which blocks got compressed with what" for other compressor/decompressor algos. Where else could we stick attributes relating to this, in a generic manner for all compressor types, such that we don't totally bork import for older versions, and how could we make that check efficient enough to run on every block read? Could we use the block pointer or some other already-being-accessed metadata field to avoid too many extra memory/data ops?
So, to reinforce the need for this, I took a look at CVEs against lz4 and found a few which may or may not be relevant to our use, but which do show that we must be able to upgrade our compression algorithms or be banned from use by security policy when we are provably impacted:
CVE ID | Description |
---|---|
CVE-2021-3520 | There's a flaw in lz4. An attacker who submits a crafted file to an application linked with lz4 may be able to trigger an integer overflow, leading to calling of memmove() on a negative size argument, causing an out-of-bounds write and/or a crash. The greatest impact of this flaw is to availability, with some potential impact to confidentiality and integrity as well. |
CVE-2019-17543 | LZ4 before 1.9.2 has a heap-based buffer overflow in LZ4_write32 (related to LZ4_compress_destSize), affecting applications that call LZ4_compress_fast with a large input. (This issue can also lead to data corruption.) NOTE: the vendor states "only a few specific / uncommon usages of the API are at risk." |
CVE-2014-4715 | Yann Collet LZ4 before r119, when used on certain 32-bit platforms that allocate memory beyond 0x80000000, does not properly detect integer overflows, which allows context-dependent attackers to cause a denial of service (memory corruption) or possibly have unspecified other impact via a crafted Literal Run, a different vulnerability than CVE-2014-4611. |
CVE-2014-4611 | Integer overflow in the LZ4 algorithm implementation, as used in Yann Collet LZ4 before r118 and in the lz4_uncompress function in lib/lz4/lz4_decompress.c in the Linux kernel before 3.15.2, on 32-bit platforms might allow context-dependent attackers to cause a denial of service (memory corruption) or possibly have unspecified other impact via a crafted Literal Run that would be improperly handled by programs not complying with an API limitation, a different vulnerability than CVE-2014-4715. |
There are more for gzip and zstd
Given that there's a potential security angle to this, could one of the maintainers please weigh in on the subject of "how do we patch compressor vulns today" and whether (and how) we actually know our compressor implementations to be safe?
If we can't upgrade, we should at least carry provably exploitable PoCs for all documented vulns against the algorithm and its implementations, wrapped in tests, to prove that something impacting a differently sized/organized/etc. implementation of a compressor does not in fact impact ZFS.
First you should check whether any of these CVEs applies to the code. I just checked CVE-2014-4715 and it is already mitigated. The other CVEs do not apply to the part of the code which is used. Consider that the lz4 library is much bigger than the code used in ZFS; ZFS just uses a small subset of it.
Thank you sir for the clarification :). My point is more to illustrate that the code we are taking from various upstreams may in itself be flawed from a security perspective (or become so later under various compiler conditions/architectures/etc) and must be modular in its implementation - we need to be able to upgrade our LZ4, ZSTD, GZIP, etc implementations or otherwise change them if some part is found to have a vulnerability sometime down the line.
But upgrading here means that we need to merge upstream fixes. lz4 in ZFS is not some upstream code; it's a handcrafted implementation based on upstream code. So as long as the compressed bitstream doesn't change, any new fix or patch can be applied, and that is the case for lz4 as I mentioned already. The gzip implementation uses the Linux and FreeBSD kernel API; no maintenance is required, since there is no local implementation of gzip in the ZFS source itself.
There's a PR for the LZ4 decompressor code from upstream, but the compressor is currently being left as-is AFAIK. Any idea on lzjb? I can't see zle changing... For ZSTD, though, this still poses a problem: we need a way to upgrade as they do upstream, or we would break dedup and such on every new version, or be stuck with a potential security issue in an older one (1.3.8 had one apparently, but I see none for 1.4.x). I would still say that even for the in-house implementation of lz4, forcing a static compressor implementation for all time to come seems foolhardy.
Why would we break dedup? Blocks written in the past can still be used, and new blocks get a new checksum once they are created, so dedup will still work. And if a new block doesn't match an older block which has the same content, it's no big deal: it's not deduped in that case, but for future writes dedup still works. And of course you can simply run a program to reread and rewrite all blocks to refresh dedup. (I have a script which does full deduplication on a fs by simply doing cp src src.tmp; mv src.tmp src for each file on the filesystem. Such a procedure could also be done at block level by ZFS once an upgrade is required and bitstream changes are expected for a new compressor version.)
@BrainSlayer - if we lose a disk from a vdev and resilver it after upgrading compressors, or perform some other operation on a file which writes the same file back with no changes, won't a new ZSTD compressor change the on-disk bits and therefore no longer match the dedup sum? I think it is a big deal in that we have datasets with 30X dedup on them (law firms) constantly adding "the same" data - their storage capacity would diminish rapidly and cumulatively with each new version of a compressor. There's also the uncompressed-ARC thing from the ZSTD 1.4.x PR.
My intent in filing this issue is to get the maintainers' attention on the problem domain - we as users can say "it's no big deal" if something doesn't work in some cases, but to the world at large this is an enterprise software suite which must behave as advertised, or the commercial entities selling it might find themselves in breach of contract/on the wrong side of their customers' ire.
This is why you have to read and rewrite all files on upgrade, so the problem solves itself. I mean, it's just one solution; the compressor version is stored in the block itself. Of course we can also add a separate compressor for each individual version to the ZFS code. And as for "each new version": there aren't that many new versions, and we only have to consider new versions where the bitstream changes. But as I said, the easy fix is running a script which reads and rewrites all files. Problem solved.
@BrainSlayer - I think that's what I'm trying to get at: the mechanism by which such a rewrite can happen in the background. Something like a resilver at the data layer instead of the vdev layer. The script idea might be something zed could do, though I am woefully behind on that piece of ZFS.
I think there is a conceptual problem with rewriting the data. Older versions of ZFS should be able to read/write the data the same as newer versions; otherwise we are supposed to guard the data with feature flags when we lose interoperability.
A newer version of ZFS can support a new format with a newer version of the compressor and use it only if the feature flag is enabled, but it should also be able to read/write the older pool format with the feature flag disabled, and for that it needs the old compressor.
I'm all for the upgrade of compressors, but it looks pretty difficult to pull off because of previous design choices.
@IvanVolosyuk - agreed, though I'm not sure they were so much design choices as implementation pathways back then (this whole thing does seem very academic from the get-go). I'm hoping folks can weigh in with design and architectural proposals which consider how the functionality of upgrading compressors, ciphers, and other DMU-layer constructors would work in the intended "end state", and how we could get from where we are today with these rather brittle/hard-wired implementations to the proposed design goals.

@ahrens - IIRC there is an effort out there to have object stores be used for DMU backing... Given the costs associated with IO and storage in public cloud, and the volumes of data for which such technology would be used in commercial implementations, wouldn't the ability to increase compression ratio or resolve future security bugs in the compressors be somewhat required for the long-term viability of the commercial endeavor? How do resilvers/scrubs (/vdev ops? I'm guessing buckets can be striped/organized in various parity/draid schemes) and the "recompression" discussion elements factor into those cost (and therefore viability) models? I can't venture any sort of educated guess at how much data is out there on ZFS or what it's worth, but I can pretty confidently say that it's "a lot" by anyone's comprehension, on both measures. It seems to me that when (not if) we come to a point down the road where a security problem (even a DoS) is identified in one of the compressor (or cipher) implementations, we will want to have all of the infrastructure in place and well tested to drop in a replacement and ship a "patched" version of ZFS pronto.
Does anyone know how btrfs dealt with the issue of updating the zstd library?
I see the kernel updated zstd to 1.4.10 from 1.3.1 earlier in 2021. And I was under the impression that btrfs checksums were post-compression.
btrfs did add an option to recompress all existing files using the defragment -czstd command. Although the wiki indicates that older kernels without zstd would not be able to read file systems compressed this way, breaking compatibility.
Interesting question, since they're part of the GPL codebase with access to every change hitting the kernel's compressors and ciphers. I think that having a defragment function is handy in general (ZFS fragments with no recourse), especially if it can be used to do touch-up repair/rewrite of blocks on updated compressors and ciphers.
So, in light of the above, upgrading the compressor is possible in case of CVEs. It can have negative effects on "uncompressed ARC + L2ARC" and the DDT. Send+receive should be able to re-compress the data, though.
Well, "possible" is questionable here - if you disregard the feature guarantees of the FS as described (commercial guarantees in some cases), sure. Thats what i'm hoping to preempt: a point where someone will have to make a commit which breaks stuff previously written to disk in order to escape some nightmare scenario of a reproducible overflow in ring0 using data from ring3 with enough control to provide RCE to the data-submitter back in ring3 (or pick a bad thing that can happen in a system which wont accept change without breaking).
So it looks like we're handling CVE-2021-3520 via #12947, which is on the decompressor side... but it's only a matter of time/fuzzing cycles until a compressor is impacted.
@IvanVolosyuk - send/receive may not be an option for backup datasets holding snapshots/copies of giant stores - think 10 years' worth of snapshots at a law firm. This is the case for which I think we need a scrub-type healing process to recompress the data without send/recv.
@sempervictus That sounds like it would require bpw
uncompressed ARC shouldn't really be a consideration here, imo.
Wanted to add an interesting case, mentioned by @BrainSlayer in #11979 - at least zstd has logic where "if allocation fails, the compression for the block is disabled", so compression can give different results under memory starvation even within stable code branches.
Should this just leave the block uncompressed? It seems zio_write_compress() removes the compressed flag if the compressor fails to produce smaller output.
Should this just leave the block uncompressed?
If it's the initial compression of a block - yes.
If it's a recompress to get the original checksum - then we'll get a different result and can't reproduce the original checksum of the compressed block (see https://github.com/openzfs/zfs/issues/12840#issuecomment-991894048).
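To spell that out, here is a purely conceptual sketch (all names invented) of why recompress-and-compare stops working once the compressor output changes at all:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Invented placeholders for the running compressor and checksum function. */
size_t compress_current(const void *src, void *dst, size_t lsize);
uint64_t checksum_of(const void *buf, size_t len);

/*
 * Try to reproduce the checksum stored for an on-disk block by recompressing
 * its logical contents.  If the running compressor emits even one byte
 * differently from the build that originally wrote the block, the checksum
 * comparison fails although the data itself is perfectly fine.
 */
bool
recompress_matches_stored_cksum(const void *logical, size_t lsize,
    void *scratch, uint64_t stored_cksum)
{
	size_t psize = compress_current(logical, scratch, lsize);

	if (psize == 0)
		return (false);	/* compressor bailed out, cannot verify */

	return (checksum_of(scratch, psize) == stored_cksum);
}
```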
I think that this comment https://github.com/openzfs/zfs/issues/12705#issuecomment-954476993 is a long way from the real issues with compression algo upgrades.
We need to move away from bit-for-bit exact results of compression (checksumming in ZFS has gone too far in its design, for whatever reasons), and then we could use only one implementation of an algo, the latest :). If we go down the path of versioning, we will have to keep every version of every compressor to stay compatible with every ZFS pool on earth, as @Ornias1993 stated in reply to @behlendorf's comment: https://github.com/openzfs/zfs/pull/11367#issuecomment-752232210
Yes, the whole solution to this is simply to stop enforcing that post-compression checksums stay the same. And also yes, that indeed means cutting/redesigning some crappy legacy "features".
However @sempervictus, you seem to repeat yourself and your requests around compression about once every 6 months. You also act surprised about what @BrainSlayer said about ZSTD, but we explained that to you about 4 times before already.
Also, from this checksumming curse we get pulls like https://github.com/openzfs/zfs/pull/12805, which is nothing more than cherry-picking the easy parts - good to have those updated, but it's the world turned upside down compared to the way it should be.
Updating compression should be as easy as in https://github.com/openzfs/zfs/pull/11367, which sadly wasn't merged because of "some crappy legacy features".
So the most important thing is to lift the ZFS constraint of byte-for-byte equality and then merge https://github.com/openzfs/zfs/pull/11367 :)
Otherwise, in 2100 everyone will still be using the zstd 1.4.5 compressor in ZFS.
Personally, as a user I'd love to see a modernization of the design to avoid post-compression sums being used for dedup, if it doesn't maul the rest of the functionality. In the commercial space, though, such a change may cause the same problem that changing compressors creates - old datasets would stop deduplicating new writes (since their sums would still be for compressed blocks) and outgrow their commercially planned storage capacity/growth rate. Moreover, it's not readily apparent to me how one would make that change on an existing zpool, since existing sums would need to be recomputed from the decompressed block contents, and that still needs something like resilver/scrub/bprw to modify the existing blocks with the updated sums. This would also break RO imports of newer pools on older ZFS versions, as the sums on all blocks would be "wrong." Then there are scrubs themselves to consider - if working off the decompressed sum, how does that play with blocks which are compressed and encrypted?
@Ornias1993: I am explicitly beating the dead horse into a homogeneous paste because this problem is only getting worse (upstream compressor versions are climbing, we're stuck) and so far there's no clear solution that accommodates the listed concerns without compromising commercial requirements for back-compat/platform stability. I do not recall any of your responses presenting a functional solution to the backwards-compatibility problems around compressor versioning with dedup; my apologies if I'm not remembering a relevant comment (these things stretch back a while, and very little of my time/focus these days is in the FOSS world).
On the dedup problem, @Ornias1993 commented in 11367: https://github.com/openzfs/zfs/pull/11367#issuecomment-753447821 - if you want stable checksums, don't use an unstable compressor, which is perfectly understandable in an enterprise env.

Back to the problem at hand:

1. About the first ARC bullet from @gmelikov: in PR 11367 @behlendorf said there will be IO errors when we don't authenticate (https://github.com/openzfs/zfs/pull/11367#issuecomment-758268676), so it doesn't seem like "best effort". But I don't understand why we authenticate something that is already in RAM (uncompressed ARC) - it should have been authenticated long ago? Can someone shed some light on what we are really doing there, because we are clearly compressing, so decryption, authentication, and decompression happened long ago? There was a question about that in https://github.com/openzfs/zfs/discussions/11385 but not much of an answer.
2. About the recompress case there is a comment https://github.com/openzfs/zfs/pull/11367#issuecomment-753517958, but can't we just change the sum in the ARC header so it will match on re-read from L2ARC?
3. In the l2arc_apply_transforms bullet, what happens when psize < size, since we will compress in that case? Will this be the compression-without-encryption case?
The "dont use unstable compressor" notion is circular logic: compressor code is updated separately from ZFS upstream, sometimes to fix security flaws (which requires application of patches in enterprise environments). There is no such thing as stable code that will never change, only dead code that is unmaintained. Its the whole crux of this issue - compressor versions change, and ZFS has no way to keep up with the changes. The ZSTD example is the best one so far - that codebase moves fast and only a few folks like @BrainSlayer are seeing the benefits of those changes today.
The uncompressed-ARC case seems to be both a vestigial feature and a logical blocker when using the compressed sums, but in the authentication function, the spot where compressed data is handled looks like it could be turned into a decompression call to get the raw sums out for the block. That does create the problem of repeated decompression passes over the same data in the IO pipe every time those sums need to be checked against the raw data (as well as the blocks no longer matching the old compressed-block sums). It also looks like the handling of compressed block sizes always checks for compressed buffers being bigger than the raw ones, and can select the raw buffer even when compression is enabled for the block - such that "compression is enabled" but the block itself doesn't meet the compression constraints and ends up storing the raw sum anyway. Those areas of code should be able to grab the sum of the raw buffer and attach it to the compressed block...
Which leads me to wonder if the solution for keeping back-compat while using raw sums going forward isn't just to store both in the block. Then the "is this block compressed" check would boil down to XORing the two sums, and for encrypted data we'd have the raw-send sums available. If this were added as a pool feature, it would still be unable to dedup old data when the compressor changes; but for new data written after the feature is enabled and before a compressor version change, the function should still work, as both old-compressor-generated blocks and new ones would have the raw sum for comparison. If a decompressor ever produces data mismatching the raw sum, then the inter-version decompressor changes are breaking changes anyway and require edge-case handling of some sort. I think this also handles the inter-OS gzip problem, since a check on the raw sums won't care whether the native OS or ZFS composed the final buffer on disk. However, this approach still doesn't help that law firm with 10-year retention requirements of name-a-critical-industry's records which has to update their ZFS deployment due to some not-yet-public vuln in the gzip compressor they've been using for decades. Those folks would still face potential data doubling, since there would be a new "base layer" of records using the raw sum, and all subsequent dedup would have to happen against that layer of data, because their compressed sums wouldn't match what is in the old compressed-but-not-newly-informed-of-raw-sums blocks. How much computational and memory overhead would be required to maintain both sums throughout the IO pipeline?
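A rough illustration of the "store both sums" idea follows; everything here is hypothetical, including where the extra bytes would actually live, which is exactly the open question:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical dual-checksum record for a block: one sum over the on-disk
 * (compressed) bytes for scrub/resilver, one over the logical (raw) bytes
 * for dedup and cross-compressor-version comparison.
 */
typedef struct dual_cksum {
	uint64_t physical[4];	/* checksum of the compressed, on-disk data */
	uint64_t logical[4];	/* checksum of the decompressed data */
} dual_cksum_t;

/* Dedup match on logical contents, regardless of compressor version. */
static bool
dedup_match(const dual_cksum_t *a, const dual_cksum_t *b)
{
	return (a->logical[0] == b->logical[0] &&
	    a->logical[1] == b->logical[1] &&
	    a->logical[2] == b->logical[2] &&
	    a->logical[3] == b->logical[3]);
}
```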
You answered yourself:
If this were added as a pool feature it would still be unable to dedup old data when the compressor changes
So the way out is to rewrite those files, as @BrainSlayer suggested, and the problem is then solved, in my opinion, without adding tons of code and complexity.
I'm waiting for answers to my questions....
@robszy: rewriting files isn't an option if you have ZVOLs instead of ZPL entries. Also, actually touching files in the VFS counts as access for compliance purposes (file integrity monitoring stacks such as OSSEC will flag the modification and trip alerts), so the rewrite needs to happen underneath, somewhere in the filesystem innards, so as not to alter userspace-visible attributes.
As for your questions:

- would need to look at `uint64_t psize = HDR_GET_PSIZE(hdr);` and `uint64_t size = arc_hdr_size(hdr);` to have an answer

EDIT to clarify the compliance problem a bit: regulatory frameworks requiring retention of data for X number of years after modification would reset the retention clock if the user were to rewrite the files, since the modification time would shift to whenever they did it (even though the logical contents do not change). Regulatory bodies are often not fully staffed by reasonable people, so getting exceptions can range from costly to impossible. Mythical block pointer rewrites would be great for solving this (delivered by unicorns in a time machine); send/recv may be the only real way currently; but something like a "value-added scrub" which would decompress the blocks and amend their metadata by adding the uncompressed sum would avoid the VFS-level modifications while addressing the need to extend attributes underneath.
Ok, so I'm waiting for replies to the questions you don't know the answers to, and for a more elaborate answer to 2 :)....
I'll just assume that English isn't your first language and you're not actually intending to be rude, but I am somewhat at an impasse as to why you're awaiting answers from me at all. The issue is posted to describe an architectural problem and attempt to crowd-source a solution which meets the constraints presented. How is your contribution to the discussion furthering that goal?
Of course not. I'm waiting for other people more experienced in ZFS :)
rewriting files isnt an option if you have ZVOLs instead of ZPL entries
Guess he was referring to blocks.
2. What's worse is that if the persistent VDEV was written with an older compressor than the L2ARC (or ZSTD chose to compress less due to time/memory constraint), we end up with 3 sums - the original on-disk compressed one, the uncompressed L1ARC, and the just-compressed L2ARC.
Which should be relatively easy to fix by nuking the content in (L2)ARC. This is not really a blocker for version updates AFAIK.
Rewriting data is, in my opinion, the only option in the dedup case, because changing checksums to match decompressed data will surely break compatibility with older ZFS versions if we only change the checksums in BPs. Maybe there are other ways to set up data structures just for dedup which consider only uncompressed checksums, but that will also break compatibility.
Leaving some kind of long-term-support version of lz4 or zstd is also wrong, because nobody will add security patches to LTS versions, and nobody wants security flaws in ZFS, I think.
But the other problems with hdr_authenticate (maybe we can decompress only after authentication of a non-level-0 block?) and the recompress case (nuke from ARC / change the checksum) are doable, so in my opinion we should do it, not tell people that there will be some issues.
I haven't read through the whole thing, but how about -o compression=lz4-i (or lz4@i), where i could be a simple counter meaning "the i-th version of lz4 that was included in ZFS"? Same problem that there is no foreseen place to store the version info, in addition to having to find a way to include those different versions in ZFS?
That former issue could be trivial to address by adding a new compressor name (rather than replacing an existing compressor with a newer version that wouldn't be recognised by the older version).
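As a hypothetical sketch of how that could surface in the compression table (struct layout simplified, and none of the lz4_*_v2 names exist; the new entry would presumably also need its own enum value and feature flag):

```c
#include <stddef.h>

/* Invented callback prototypes standing in for the real compressor hooks. */
size_t lz4_compress_v1(const void *src, void *dst, size_t len, int level);
int lz4_decompress_v1(const void *src, void *dst, size_t s_len, size_t d_len);
size_t lz4_compress_v2(const void *src, void *dst, size_t len, int level);
int lz4_decompress_v2(const void *src, void *dst, size_t s_len, size_t d_len);

typedef struct compressor_entry {
	const char *name;	/* value accepted by -o compression= */
	size_t (*compress)(const void *, void *, size_t, int);
	int (*decompress)(const void *, void *, size_t, size_t);
} compressor_entry_t;

/*
 * Old blocks keep decompressing through the original entry; new writes can
 * opt into the versioned one once its feature flag is enabled.
 */
static const compressor_entry_t compressor_table_sketch[] = {
	{ "lz4",	lz4_compress_v1, lz4_decompress_v1 },
	{ "lz4-2",	lz4_compress_v2, lz4_decompress_v2 },
};
```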
Summary of the issues we have:
Deadlock. The only way forward seems to be (3): keep multiple compressors in the codebase and protect them with feature flags. Ugly, but it is the only feasible approach unless we are willing to sacrifice (1) and (2).
@ahrens Is this the correct summary?
Describe the feature you would like to see added to OpenZFS
ZFS includes compression algorithm code from a variety of sources to perform inline block-level compression and decompression. Much like ZFS itself, these upstream codebases grow and evolve - adding features, fixing bugs (which could be block-level data bugs for us), and improving performance. Despite most if not all of these algorithms being backward compatible, the compressor output changes (regardless of how much), which alters the on-disk checksums of blocks, leading to an inability to ever upgrade our compression code (at least the compressor side) or to rely on deduplication and compressed+persistent L2ARC when pools are imported on another OS using its own compressor implementation. For a filesystem intended to meet needs well into the future, this is a critical hindrance and potential blocker to adoption. In order to address this problem, I am proposing that we start carrying versioned compressors which write their upstream commit version into each block they compress. This would permit ZFS to RMW data with a new compressor on access, and to treat unversioned blocks as "version 0" - or "always needing an upgrade." This would also permit logic in the RMW "repair" code to understand which versions need a rewrite and which don't, so as to differentiate actual checksum errors from compressed-data mismatches.
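A sketch of what the proposed on-access path could look like follows; all names below are invented for illustration, and "version 0" stands for blocks written before any versioning existed:

```c
#include <stddef.h>
#include <stdint.h>

#define	COMPRESSOR_VERSION_CURRENT	2	/* invented: bumped per upstream update */

typedef struct block_ref block_ref_t;		/* opaque handle to an on-disk block */

/* Invented helpers: read the stored version, decompress, and rewrite. */
uint32_t block_compressor_version(const block_ref_t *br);	/* 0 = unversioned */
int decompress_block(const block_ref_t *br, void *buf, size_t *lsize);
int rewrite_block(block_ref_t *br, const void *buf, size_t lsize);

/*
 * Hypothetical read-side repair: a block written by an older (or
 * unversioned) compressor is decompressed as usual, then rewritten with the
 * current compressor, so stored versions and checksums converge on the new
 * code over time instead of blocking the upgrade.
 */
int
read_and_upgrade(block_ref_t *br, void *buf, size_t *lsize)
{
	int err = decompress_block(br, buf, lsize);

	if (err != 0)
		return (err);
	if (block_compressor_version(br) != COMPRESSOR_VERSION_CURRENT)
		err = rewrite_block(br, buf, *lsize);	/* RMW "repair" on access */
	return (err);
}
```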
How will this feature improve OpenZFS?
This approach would permit ZFS to keep up to date with upstream compression algorithms.
Additional context
- We have already lost work due to this problem in the ZSTD 1.4.8-9 upgrade PR
- The LZ4 implementation could use a refresh as well, just due to the code being out of date