openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

[Feature Request] Implement Media Type-Based Compression Exclusion #16407

Open RubenKelevra opened 3 months ago

RubenKelevra commented 3 months ago

Description:

The zstd command-line tool has recently introduced a feature that automatically excludes already-compressed file types from further compression. Since ZFS needs to compress files in real time and often stores a mix of compressed and uncompressed file types in the same subvolume, it would be beneficial to implement a similar feature within ZFS. This would involve determining the IANA media type (previously known as MIME type) of a file and storing it in ZFS's file metadata. That way, when ZFS stores blocks for a file, it can look up the metadata to check whether the media type is excluded from compression based on the subvolume settings.

Unlike zstd, which relies on file extensions to identify media types—a method that may not be consistent with Unix/Linux file naming conventions—ZFS could use its file metadata to store the media type. This metadata would be determined when the first block of a file is written, and updated if the first block is modified.

If the media type cannot be determined, ZFS could either default to compression or infer the type from the file extension. The zstd source code provides a list of common media types to exclude.

A new configuration option for subvolumes could control this feature. Users could enable it (default), disable it, or provide a custom list of media types. If users want to add to the default list instead of replacing it, they can set a custom list starting with a keyword like default_media_types, followed by the additional types they want to exclude.
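As a rough sketch of how the detection and exclusion check could fit together (the helper names are hypothetical, and a hard-coded exclusion list stands in for the per-subvolume property):

/* Sketch only: hypothetical helpers, not existing ZFS code. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map the magic bytes of a file's first block to an IANA media type, or NULL. */
static const char *
zfs_media_type_sniff(const uint8_t *buf, size_t len)
{
	if (len >= 4 && memcmp(buf, "\x50\x4b\x03\x04", 4) == 0)
		return ("application/zip");
	if (len >= 3 && memcmp(buf, "\xff\xd8\xff", 3) == 0)
		return ("image/jpeg");
	if (len >= 8 && memcmp(buf, "\x89PNG\r\n\x1a\n", 8) == 0)
		return ("image/png");
	if (len >= 2 && memcmp(buf, "\x1f\x8b", 2) == 0)
		return ("application/gzip");
	return (NULL);	/* unknown: caller falls back to the default policy */
}

/* Check the per-subvolume exclusion list (hard-coded here for brevity). */
static int
zfs_media_type_excluded(const char *type)
{
	static const char *excluded[] = {
		"application/zip", "application/gzip", "image/jpeg",
		"image/png", "video/mp4", NULL
	};
	for (int i = 0; excluded[i] != NULL; i++) {
		if (strcmp(type, excluded[i]) == 0)
			return (1);
	}
	return (0);
}

/* Decide whether a file's blocks should be compressed. */
static int
zfs_should_compress(const uint8_t *first_block, size_t len)
{
	const char *type = zfs_media_type_sniff(first_block, len);
	return (type == NULL || !zfs_media_type_excluded(type));
}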

Benefits:

Implementing this feature would significantly reduce the compression overhead for subvolumes containing mixed media types. By automatically skipping non-compressible blocks, the average latency for writing files on compressed volumes would decrease, allowing either lower average latency or higher compression settings at the same latency. Additionally, this would reduce wasted CPU time, further improving system performance.

This enhancement would also address the issue highlighted in this bug report.

Future Potential:

With this feature, ZFS could run statistics per media type on a subvolume. Based on this analysis, ZFS could also dynamically exclude certain media types from compression after a threshold of unsuccessful compression attempts is reached. This would allow ZFS to adaptively reduce CPU usage over time by skipping the compression of non-compressible media types on the fly.

Moreover, zstd supports dictionaries, which improve both compression and decompression speed as well as the compression ratio when storing small files. Since ZFS hands each 128 K block to zstd as an independent unit, this would be highly beneficial.

Having media type detection would enable the creation of media-specific dictionaries, either built by the user or shipped by ZFS as generic per-media-type dictionaries. This would further enhance performance and compression efficiency by removing the need for ZFS to embed a dictionary in each block.
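For reference, this is roughly what compressing a record against a preset dictionary looks like with the zstd library API (ZSTD_compress_usingDict); a sketch only, where the dictionary bytes are assumed to come from training or from a shipped per-media-type dictionary as suggested above:

#include <zstd.h>

/*
 * Compress one record (e.g. a 128 KiB block) with a preset dictionary.
 * Returns the compressed size, or 0 on error.  dst_cap should be at least
 * ZSTD_compressBound(src_size).
 */
static size_t
compress_record_with_dict(const void *src, size_t src_size,
    void *dst, size_t dst_cap,
    const void *dict, size_t dict_size, int level)
{
	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	if (cctx == NULL)
		return (0);
	size_t n = ZSTD_compress_usingDict(cctx, dst, dst_cap,
	    src, src_size, dict, dict_size, level);
	ZSTD_freeCCtx(cctx);
	return (ZSTD_isError(n) ? 0 : n);
}

Decompression would use ZSTD_decompress_usingDict with the same dictionary, which is exactly why the dictionary would have to be available (or shipped) on the read side as well.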

satmandu commented 3 months ago

There is a discussion of early abort/early bail-out for zstd compression in zfs here: https://www.reddit.com/r/zfs/comments/134pkfg/is_lz4_the_only_zfs_compression_option_that/

FYI that references this PR too: https://github.com/openzfs/zfs/pull/13517
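As a rough illustration of the early-abort idea discussed there (not the actual code from #13517): run a cheap compressor against a slightly undersized output buffer first, and only spend the expensive zstd level on blocks where the probe succeeds. The helper name and the 12.5% threshold below are illustrative.

#include <stdlib.h>
#include <lz4.h>

/*
 * Probe a block with LZ4 before running an expensive zstd level.  The
 * destination buffer is deliberately ~12.5% smaller than the input, so
 * LZ4_compress_default() returns 0 for blocks it cannot shrink by at least
 * that much, which is what makes incompressible data cheap to reject.
 */
static int
block_seems_compressible(const char *src, int src_size)
{
	int probe_cap = src_size - (src_size / 8);
	char *probe = malloc(probe_cap);
	if (probe == NULL)
		return (1);	/* cannot probe: fall through to compressing */
	int n = LZ4_compress_default(src, probe, src_size, probe_cap);
	free(probe);
	return (n != 0);	/* 0 means the output did not fit: early abort */
}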

rincebrain commented 3 months ago

You might also find the discussion around this in #5928 interesting.

I don't love the concept of doing different things based on the surrounding data, because then you get into weird problems with idempotence, or writing things in a different order producing different results, or or or. It's hard to reason about.

If someone wants to prototype it and show huge wins, I'm not going to try to shout it down in the face of massive gains. But while dictionaries are really very useful for very predictable data types, I wasn't impressed when I last experimented with adding this, and you'd need a feature flag to add support, so you'd break read-only compat with it.

RubenKelevra commented 3 months ago

There is a discussion of early abort/early bail-out for zstd compression in zfs here: https://www.reddit.com/r/zfs/comments/134pkfg/is_lz4_the_only_zfs_compression_option_that/

FYI that references this PR too: #13517

Okay, but running lz4 or zstd-1 on a block in addition to the high-level compression will increase the CPU demand for all compressible blocks. It would also be worse for non-compressible blocks than simply skipping them based on media types.

I mean, we normally just know which blocks will compress and which won't, based on the file type. There are hardly any surprises - except maybe tar, iso, or other formats that package content.

RubenKelevra commented 3 months ago

You might also find the discussion around this in #5928 interesting.

Maybe read the whole discussion and who took part... and who wrote the superseding PR ;)

@rincebrain wrote:

If someone wants to prototype it and show huge wins, I'm not going to try to shout it down in the face of massive gains. But while dictionaries are really very useful for very predictable data types, I wasn't impressed when I last experimented with adding this

I think Facebook is doing exactly this in everyday production. Maybe @felixhandte can chime in here on whether that would be a good idea. :)

@rincebrain wrote:

I don't love the concept of doing different things based on the surrounding data, because then you get into weird problems with idempotence, or writing things in a different order producing different results, or or or. It's hard to reason about. [...] and you'd need a feature flag to add support, so you'd break read-only compat with it.

I can't follow your argument here. When a file is written, the first bytes contain (for most file types) the magic bytes; those could be parsed, and the file metadata would be updated with the media type. If no match is found, the media type would be left empty and the default would apply.

The default would be to detect the file type based on the filename - which is always available when a block is stored. If it matches the exclusion list for the subvolume, the file is not compressed; otherwise it is.
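A sketch of that extension-based fallback (the helper name and suffix list are illustrative, standing in for the per-subvolume exclusion list):

#include <string.h>
#include <strings.h>

/*
 * Hypothetical fallback: if no media type could be sniffed from the first
 * block, guess from the file name extension (case-insensitive).  Returns 1
 * if the name matches an excluded type.
 */
static int
zfs_extension_excluded(const char *name)
{
	static const char *suffixes[] = {
		".zip", ".gz", ".xz", ".zst", ".jpg", ".png", ".mp4", NULL
	};
	const char *dot = strrchr(name, '.');
	if (dot == NULL)
		return (0);	/* no extension: default to compressing */
	for (int i = 0; suffixes[i] != NULL; i++) {
		if (strcasecmp(dot, suffixes[i]) == 0)
			return (1);
	}
	return (0);
}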

I'm not sure it would break reading compatibility, as we already support compressed and uncompressed blocks in the same files. It's just a couple of new metadata fields, which the current version might actually just ignore if they are there. I'm not sure.

If a different version read it and ignored the metadata fields, it would just default to compressing every new block written, which wouldn't break any assumptions.

rincebrain commented 3 months ago

It breaks reading compatibility to use a dictionary because the current code never uses any explicit dictionary-handling codepath. So if you hand it one, it's going to try decompressing it without the dictionary, and fail, I believe. (You might be able to modify it not to, since if you don't hand it a dictionary explicitly it synthesizes one from the input and embeds that; but since I think most of your gains come from not having to embed the dictionary with each individual compressed record, that seems kind of counterproductive.)
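For illustration (assuming a trained dictionary in dict), this is the failure mode being described: a frame produced with ZSTD_compress_usingDict cannot be decoded by the plain ZSTD_decompress call the current code relies on; the exact error can differ for raw-content dictionaries.

#include <stdio.h>
#include <zstd.h>

static void
show_dict_read_incompat(const void *src, size_t src_size,
    const void *dict, size_t dict_size)
{
	static char comp[256 * 1024], plain[256 * 1024];

	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	size_t csize = ZSTD_compress_usingDict(cctx, comp, sizeof (comp),
	    src, src_size, dict, dict_size, 3);
	ZSTD_freeCCtx(cctx);
	if (ZSTD_isError(csize))
		return;

	/* The plain, no-dictionary read path, as used today: */
	size_t r = ZSTD_decompress(plain, sizeof (plain), comp, csize);
	if (ZSTD_isError(r))
		printf("plain decompress failed: %s\n", ZSTD_getErrorName(r));
}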

I agree just tracking the stats doesn't change matters. Technically changing which compression algorithm you use based on it would change the behavior but not in a way that breaks anything other than possibly people's expectations. But explicitly using a dictionary would, because you need to have the dictionary to decompress it, and there's no explicit handling of them now, so nothing is going to use it going forward.

My point is that this would mean you would get reports like torrents compressing differently based on whether they wrote the first record first. So I would like to see compelling, specific examples of this being a real-world win to argue in favor of it, personally, because when I tried using bespoke dictionaries on ZFS modified to use preset dictionaries, the yields were small to nonexistent. But perhaps I tested it badly, or something.

I'm certainly not in charge of anything, but that's why I played with dictionaries and didn't pursue it further, and why I would argue hard in favor of wanting specific examples of it being a big gain on ZFS specifically.

(And the reason that PR worked, I believe, is that the CPU and especially memory requirements scale nonlinearly as you increase zstd levels, so LZ4/zstd-1 winds up being a drop in the bucket compared to them, even on always-compressible data on something as slow as a single-core SPARC.)

RubenKelevra commented 3 months ago

It breaks reading compatibility to use a dictionary [...]

Using dictionaries is not in the scope of this feature request.

My point is that this would mean you would get reports like torrents compressing differently based on whether they wrote the first record first.

Hardly. Using the media type is only important for files that are stored locally by the user without file extensions, as most files shared on the internet do have file extensions for compatibility with Windows.

And as we have the filename available before the first block is stored, the fallback via the file extension does work, even if the first block is not written first.

So the decision is most likely stable, except for some weird edge cases where someone shares, e.g., an mp4 file with a .txt extension. But the worst-case scenario is some wasted CPU cycles on compression effort that turns out to be fruitless.

So the worst case scenario is what we do right now, just all the time.

rincebrain commented 3 months ago

I don't really think it is. I think the worst case scenario is wasting a bunch of disk space because it guessed poorly, and making it even harder to explain to people why things behave differently.

The reason I might argue this makes more sense on other platforms is that they can make a compression choice that's "good enough" out of the box and then rewrite it later based on more information, while ZFS's architecture doesn't let you do that.

If you think this is a significant performance gain or CPU savings, please, by all means, write it and show the improvements in performance. But I don't immediately see it as producing significant gains, from my own experiments, and it seems to make the system much harder to reason about for most people when the behavior varies in ways they don't expect. (And how are you proposing handling things like tar archives - excluding them from the check by filename? By fingerprint?)

RubenKelevra commented 3 months ago

I don't really think it is. I think the worst case scenario is wasting a bunch of disk space because it guessed poorly, and making it even harder to explain to people why things behave differently.

Please outline a scenario where this would happen. Thanks

And how are you proposing handling things like tar archives - excluding them from the check by filename? By fingerprint?

Tar files are not on the exclusion list, so I never proposed handling them.

Please read the list:

https://github.com/facebook/zstd/blob/fdfb2aff39dc498372d8c9e5f2330b692fea9794/programs/fileio.c#L1895

But I don't immediately see it as producing significant gains, from my own experiments, and it seems to make the system much harder to reason about for most people when the behavior varies in ways they don't expect.

That's the second time you've told us that your experiments contradict the numbers published by zstd. Would you mind linking your experiments, so we're on the same page about what you've tested and how it relates to my feature request, in hard numbers?

rincebrain commented 3 months ago

No thanks.

I was explaining why I don't find the idea compelling, which is why I won't write it myself, and what I would want to see in order not to argue against it in a hypothetical PR; your replies are demanding that I prove you wrong.

RubenKelevra commented 3 months ago

I appreciate your input, however, I find it difficult to follow your train of thought regarding the potential complexities and issues. Your mention of past experiments proving potential problems is intriguing, but it would be really helpful to see specific examples or scenarios where this approach caused issues. Without concrete details, it's challenging to understand why this seemingly straightforward optimization could be problematic.

Could you please provide more context or specific cases from your experiments?

snajpa commented 2 months ago

Try taking a shot at implementing such an idea :-) Nobody else is obliged to and probably will not, b/c people are busy with their own feature backlogs (and bug backlogs, oh the bug backlogs...)

I don't mean to be offensive, it's just that what you're proposing seems complex, with a lot of potential for breakage in code that already barely works - and the benefits of such a change are not at all clear (my solution would be to tell the user that the compression property is per dataset, so organize your data accordingly). I'd guess whoever ends up trying to implement this would probably also need to help improve the code a bit (esp. whatever is dirty-dbuf-related).

RubenKelevra commented 2 months ago

Hey @snajpa,

thanks for the open communication.

Well, I think the main benefit is being able to just throw data at a dataset with compression enabled and have ZFS figure this out on its own, instead of wasting resources on every block only to end up with no benefit.

I'm currently using 17 datasets, most of which exist just to turn compression on or off for a certain set of files, and I'm tired of it.

I think we should use automation to make this just work on its own, instead of babysitting ZFS.

I'm currently working on some kind of compression benchmark to get an idea of the benefits.

gmelikov commented 2 months ago

Not to oppose this feature request, but to highlight a "workaround": lz4's bail-out is so efficient on incompressible data that lz4 is already nearly memcpy-like in speed in this case:

$ ./lzbench -t16,16 -elz4/zstd ~/Downloads/silesia.zip
lzbench 1.8 (64-bit Linux)  AMD Ryzen 7 5800U with Radeon Graphics
Assembled by P.Skibinski

Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  16746 MB/s 17025 MB/s    67633896 100.00 /home/gmelikov/Downloads/silesia.zip
lz4 1.9.3               15155 MB/s 15780 MB/s    67899128 100.39 /home/gmelikov/Downloads/silesia.zip
zstd 1.5.0 -1            1777 MB/s 10058 MB/s    67597066  99.95 /home/gmelikov/Downloads/silesia.zip
zstd 1.5.0 -2            1841 MB/s 10013 MB/s    67597045  99.95 /home/gmelikov/Downloads/silesia.zip

And the data will be written uncompressed, so we won't pay any CPU on reads later.

RubenKelevra commented 2 months ago

Yeah, I know that idea; that's what btrfs is using. But for some reason it's quite slow in practice: you still notice a difference between turning compression on and off on btrfs with incompressible data.

Apart from that, it's still a lot of compute you're wasting if all your cores are at 100% just throwing stuff against the wall to see what sticks, only to throw away the work you've done and redo it a second time 🥲

gmelikov commented 2 months ago

@RubenKelevra iirc btrfs implemented only zstd compression, so they don't have a compressor with a proper bail-out. And that's why btrfs's default and only compression algorithm (iirc) will be more sluggish than zfs's compression=on=lz4.

RubenKelevra commented 2 months ago

@gmelikov btrfs has zstd, lzo and zlib.

Btrfs uses a pretty advanced heuristic analysis to detect incompressible data. It first analyzes the data and only then runs the actual compression.

See here for details:

https://btrfs.readthedocs.io/en/latest/Compression.html
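The btrfs documentation describes the heuristic as sampling the data and looking at byte-set size, repeated patterns, and Shannon entropy before deciding whether to compress. A simplified sketch of just the entropy part (not btrfs code, and with an arbitrary threshold) looks roughly like this:

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Estimate the Shannon entropy of a buffer and treat anything close to
 * 8 bits/byte as incompressible.  Link with -lm.
 */
static int
looks_compressible(const uint8_t *buf, size_t len)
{
	size_t counts[256] = { 0 };
	for (size_t i = 0; i < len; i++)
		counts[buf[i]]++;

	double entropy = 0.0;
	for (int b = 0; b < 256; b++) {
		if (counts[b] == 0)
			continue;
		double p = (double)counts[b] / (double)len;
		entropy -= p * log2(p);
	}
	return (entropy < 7.5);	/* arbitrary cutoff for the sketch */
}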

LZ4, btw, makes no sense to use if you already have support for zstd. Zstd is faster and better on all metrics if you compare it to lz4 - meaning that if you invest the same amount of CPU cycles by switching to zstd's fast mode, it will compress better than lz4.

gmelikov commented 2 months ago

Ah, my memory failed me about it being the only algorithm, thanks for the info. It would be interesting to know the raw speed of btrfs's heuristics on incompressible data.

Zstd is faster

But it's not faster; it's slower nearly every time in all aspects, and that's why it has a better compression ratio https://github.com/facebook/zstd?tab=readme-ov-file#benchmarks, and it doesn't have a cheap bail-out. IMHO it's a huge pity that they didn't implement lz4 in btrfs after all. Here are some of my other benchmarks, for compressible data in this case:

./lzbench -t16,16 -elz4/zstd ~/Downloads/silesia/silesia.tar
lzbench 1.8 (64-bit Linux)  AMD Ryzen 7 7840U w/ Radeon 780M Graphics
Assembled by P.Skibinski

Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  20147 MB/s 19718 MB/s   211957760 100.00 /home/gmelikov/Downloads/silesia/silesia.tar
lz4 1.9.3                 931 MB/s  5752 MB/s   100881413  47.60 /home/gmelikov/Downloads/silesia/silesia.tar
zstd 1.5.0 -1             602 MB/s  2079 MB/s    73500241  34.68 /home/gmelikov/Downloads/silesia/silesia.tar

RubenKelevra commented 2 months ago

But it's not faster; it's slower nearly every time in all aspects, and that's why it has a better compression ratio https://github.com/facebook/zstd?tab=readme-ov-file#benchmarks, and it doesn't have a cheap bail-out. IMHO it's a huge pity that they didn't implement lz4 in btrfs after all.

Look at that table again:

zstd 1.5.6 --fast=3   ratio: 2.239   compression speed: 650 MB/s
lz4 1.9.4             ratio: 2.101   compression speed: 700 MB/s

Which is basically the same speed but better compression ratio on zstd's part.

Same goes for a simple test on my system:

$ time zstd --fast=5 Downloads/cachyos-kde-linux-240704.iso -o Downloads/cachyos-kde-linux-240704.iso.zst5
Downloads/cachyos-kde-linux-240704.iso : 99.07%   (  2.26 GiB =>   2.24 GiB, Downloads/cachyos-kde-linux-240704.iso.zst5) 

________________________________________________________
Executed in    1.54 secs 
$ time lz4 Downloads/cachyos-kde-linux-240704.iso Downloads/cachyos-kde-linux-240704.iso.lz4
Compressed 2426961920 bytes into 2408273574 bytes ==> 99.23%                   

________________________________________________________
Executed in    1.51 secs 

Here are some of my other benchmarks, for compressible data in this case:

You're comparing lz4 against zstd in the normal profile; if you want it to be faster, just set the --fast=X flag.

Btw, here's zstd -1, not far off from lz4 and --fast=5 anyway - so I really don't see the point in using lz4, which offers the same compression speed but less flexibility and worse compression.

time zstd -1 Downloads/cachyos-kde-linux-240704.iso -o Downloads/cachyos-kde-linux-240704.iso.zst                                      
Downloads/cachyos-kde-linux-240704.iso : 98.97%   (  2.26 GiB =>   2.24 GiB, Downloads/cachyos-kde-linux-240704.iso.zst2) 

________________________________________________________
Executed in    1.57 secs 

gmelikov commented 2 months ago

Compressor name       Ratio  Compression  Decompress.
zstd 1.5.6 -1         2.887  510 MB/s     1580 MB/s
zstd 1.5.6 --fast=3   2.239  650 MB/s     2000 MB/s
lz4 1.9.4             2.101  700 MB/s     4000 MB/s

lz4 is at least 5% faster even in this example. That may look like "not so much of a difference", yes, BUT we should not forget about decompression, where lz4 is at least twice as fast.

And we should not forget about LZ4's fast bail-out in the context of this feature request, where lz4 runs at nearly memcpy speed, while zstd doesn't have such speed, at least for now.

Look, my point is that in ZFS, LZ4 gives you a nearly universal way to not even have to think about disabling compression=on.

ZSTD may be good too, and you can try to make an LZ4 out of it (and it will roughly be that - they both have the same author, after all), but it won't be as universal, and you'd have to think about disabling it for incompressible data; that's where you may want some stand-alone heuristics, yes.

But we can't say that zstd can totally replace lz4, no. They just have different targets.

And I want to add that compression in ZFS is async, because we write to the ZIL without compression and compress the data on TXG write. Decompression, however, is synchronous for us (the prefetcher may help with that on non-random reads, but a random read is a sync read, unfortunately).