mhx / dwarfs

A fast high compression read-only file system for Linux, Windows and macOS

Consider supporting zstd dictionaries #180

Closed: schmorp closed this issue 10 months ago

schmorp commented 10 months ago

Zstd supports dictionaries that can improve compression spectacularly with many similar files - a bit like fuzzy deduplication. The idea is to train a common dictionary from many similar files, which can then be used to reduce the size of individually compressed files. The dictionary can be comparatively small and stored as part of the metadata, yet improve compression of all blocks. dwarfs could either support specifying a static dictionary (or potentially multiple ones), or it could train one itself - the former is probably more useful for repeated applications.
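For illustration, with the stock zstd command-line tool the workflow looks roughly like this (file names are placeholders):

```sh
# Train a shared dictionary from a corpus of similar files
# (the default dictionary size is about 110 KiB; tunable via --maxdict).
zstd --train corpus/*.json -o common.dict

# Compress each file individually, reusing the shared dictionary.
zstd -D common.dict corpus/file1.json -o file1.json.zst

# Decompression needs the same dictionary.
zstd -d -D common.dict file1.json.zst -o file1.json
```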

mhx commented 10 months ago

Hi and thanks for your suggestion!

I'm aware of Zstd dictionaries and familiar with their use cases, but I doubt they'll be of much use for DwarFS.

As you mentioned, you'll typically use dictionaries for cases where you want to compress small blobs that are quite similar. However, and please correct me if I'm wrong, you get a benefit from dictionaries only if you compress the files/blobs individually. If you compress all files as a single stream, compression will likely be better than when using a dictionary with individually compressed files.

The only stage at which dictionaries could be used at all in DwarFS would be block compression. However, blocks are typically much larger than is useful for the application of dictionaries. Please feel free to play around with this; there's a script for extracting the blocks from a DwarFS image, so you can train a dictionary and compress the blocks individually.
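A rough sketch of such an experiment, assuming the extracted blocks sit in a directory as individual files (all names are placeholders):

```sh
# Train a dictionary on a subset of the extracted blocks.
zstd --train blocks/block.00* -o blocks.dict

# Compress every block once without and once with the dictionary.
for b in blocks/block.*; do
  zstd -q -14 "$b" -o "$b.plain.zst"
  zstd -q -14 -D blocks.dict "$b" -o "$b.dict.zst"
done

# Compare the total compressed sizes.
du -cb blocks/*.plain.zst | tail -n1
du -cb blocks/*.dict.zst | tail -n1
```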

mhx commented 10 months ago

So, to avoid any confusion: DwarFS never compresses files individually. Each individual file will always run through an initial de-duplication stage (the "segmenter") before its remaining unique chunks are added to a file system "block" (typically around 4-32 MiB). By grouping similar files close together, their chunks will likely end up in the same block, which further helps with compression. Each of these blocks will then be compressed using e.g. Zstd. So in the context of DwarFS, Zstd will rarely work on data smaller than a few MiB, which is why I doubt that dictionaries will lead to any improvement in compression ratio and/or speed.
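For reference, both the block size and the block compression are chosen when the image is built; a sketch with placeholder paths (exact option syntax may differ between versions):

```sh
# -S takes the block size as a power of two, so -S 26 selects 64 MiB blocks;
# -C selects the block compressor, e.g. zstd at level 19.
mkdwarfs -i /path/to/input -o image.dwarfs -S 26 -C zstd:level=19
```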

But as I said, I'm happy to be convinced otherwise if you can demonstrate that dictionaries are indeed beneficial. I do remember running some experiments a while back, and in those cases the use of dictionaries actually resulted in a much worse compression ratio.

schmorp commented 10 months ago

Thanks for considering this and for explaining in detail. I was already aware of how dwarfs works internally; my expectation, or hope, was that dictionaries would provide small but consistent per-block savings (which can add up over many blocks), and creating dictionaries is very cheap (CPU-wise).

Now, since you invested some effort and provided a script to enable experiments, it's only fair that I use it to strengthen the case one way or the other. I came to similar conclusions as you.

I ran three tests. The first used blocks taken from a dwarfs image containing compressed DNS tcpdump traces. The 64MiB blocks each had an xz-compressed size of 6-8MB, and recompressing them with zstd -14 gave blocks of about 12MB.

I then trained a dictionary on a subset of them (100 blocks out of ~30000) and compressed about 61GB of blocks with and without it, and got:

11779876766 bytes without dictionary
11774898524 bytes with dictionary

Scaling the savings to the full block set linearly, I would expect around 300MB of savings overall (out of 300GB, so about 0.1%).

These savings are pretty insignificant, even though dictionaries are almost free. Also, when compression ratio is the primary goal, xz does much better already. On the other hand, if only compression ratio mattered, then zstd also shouldn't be supported.

As a contrasting test, I then tried less compressible data - mostly small PNG files. From an existing 224GB squashfs image, I created a -l9 dwarfs image of a 52GB subset of the files (resulting in a 52GB dwarfs image), and used two dictionaries: one trained on the blocks, one trained on the source files.
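The two dictionaries were trained along these lines (paths are placeholders):

```sh
# One dictionary trained on the extracted filesystem blocks ...
zstd --train blocks/block.* -o blocks.dict

# ... and one trained directly on the original source files.
zstd --train files/*.png -o files.dict
```

Compression then works as before, with -D pointing at the respective dictionary.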

About 10% of the blocks were incompressible, but I treated them the same as xz-compressed blocks, except I didn't have to uncompress them, which puts the zstd numbers at a slight disadvantage.

The results were:

50578129924 xz blocks
50844396391 zstd -14
50841376128 zstd -14 + block dictionary
50844298795 zstd -14 + individual file dictionary

So, yes, I see definite but almost insignificant savings. Given how small they are, I also wouldn't bother implementing dictionary support - while dictionaries are cheap, devising file formats and writing code is not.

Again, thanks for your script allowing me to do more experiments, and thanks for considering the idea.

mhx commented 10 months ago

Thanks for the detailed response and the additional tests you ran! It's always good to have another pair of eyes on these kinds of things.