project-machine / puzzlefs


I suggest doing speed/size comparison with other dedup. solutions #114

Open · safinaskar opened this issue 9 months ago

safinaskar commented 9 months ago

I suggest comparing your solution with other deduplication solutions, such as borg, casync, desync, and rdedup. Compare not only the size of the deduplicated and compressed data, but also the speed of creating the deduplicated image and the speed of reading the data back. Here is my own benchmark, which compares many solutions: https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 . I recommend reading the whole discussion; it contains many ideas. My conclusion is this: existing solutions are all inefficient in some way. Either a tool is not parallel, or it is written in Go instead of C/C++/Rust and thus slow, or it has some other problem. Based on my experience, I conclude that one can easily create a deduplicating solution that beats all the others in terms of both size and speed.

Here is a summary of my critique of other solutions: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl .

I didn't study puzzlefs closely, but I already see one inefficiency: puzzlefs uses sha256, which on many machines (including mine) is slower than blake2 and blake3.
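
As a quick sanity check of this claim, a micro-benchmark along these lines is enough (a minimal sketch assuming the sha2 and blake3 crates; on machines without SHA extensions, blake3 typically wins by a large factor):

```rust
// Toy throughput comparison of SHA-256 vs BLAKE3 on 1 GiB of data.
// Assumes the sha2 and blake3 crates from crates.io.
use sha2::{Digest, Sha256};
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data = vec![0u8; 1 << 30]; // 1 GiB

    let t = Instant::now();
    black_box(Sha256::digest(&data)); // black_box keeps the hash from being optimized away
    println!("sha256: {:?}", t.elapsed());

    let t = Instant::now();
    black_box(blake3::hash(&data));
    println!("blake3: {:?}", t.elapsed());
}
```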

I don't plan to add puzzlefs to my comparison https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 ; I have already deleted the test data and test programs.

You may say that all this (i.e. speed) is not important. I disagree. I had one particular use case: "docker, but for VMs". I compared existing solutions, and all of them were unsatisfactory. So I developed my own, azwyon (in Rust): https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 . Azwyon can store and extract 10 GiB virtual machines in several seconds on my machine. That is satisfactory for me, and the result is unachievable with other solutions. (Keep in mind that azwyon uses fixed-size chunking.) Similarly, it is possible that someone will evaluate puzzlefs, conclude that it is slow, and reject it.

Please understand me correctly: I am not saying that puzzlefs is slow! I didn't test it, so I don't know. I just want to share my experience. My message is this: "Many other solutions are slower than they could be, so it is possible that puzzlefs is slow too. I suggest properly testing it and, if it actually turns out to be slow, fixing it using ideas from these links."

ariel-miculas commented 9 months ago

Hey @safinaskar, thanks for your suggestion. Did you document the steps for reproducing your benchmark results? How did you determine the values for Size, Total ctime, Total dtime, Last ctime, and Last dtime?

safinaskar commented 9 months ago

Did you document the steps for reproducing your benchmark results?

Yes, everything is in that borg bug report. First you need to create the files 00, 01, etc. Here are the instructions: https://github.com/borgbackup/borg/issues/7674#issue-1772919615 . But, of course, it is okay to use some other realistic data instead. Keep in mind that I use nearly 10 similar snapshots of 10 GiB of data. Or I can simply send them to you (I just realized that I can very easily re-generate them).

Then run runner/benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1 (from https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 ). Keep in mind that runner uses sudo a lot, so run it on a throw-away host or carefully review the code before running it.

More details on reproducing are in that bug report ( https://github.com/borgbackup/borg/issues/7674 ). If you have any questions, ask and I will happily answer.

How did you determine the values for

In my terminology, "compression" (a.k.a. "store") is all the steps needed to place a source file into the "repo". It may include chunking, deduplication, and the actual compression. For example, borg2 create or casync make is "compression". "Decompression" (a.k.a. "extract") is, again, all the steps needed to fully extract one file from the "repo" (usually this is just the actual decompression). For example, borg2 extract or casync extract is "decompression". So:

Typical output from runner/benchmark looks like this:

{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2851383597,"compression":[18318222,11317438,21237556,45904208,44393790,40996476,48499568,45409089,46290323,44191118,46500109,34804936],"decompression":[8965775,8972195,9445975,35787014,36940309,36681542,37723561,37497282,38120893,37698965,37893902,38152120],"total_compression":447862839,"total_decompression":363879538}

"size" is size, "total_compression" is "total ctime" (in microseconds), last element of "compression" is "last ctime" (in microseconds)

I don't know whether puzzlefs can add a file to an existing image. If this is not possible (i.e. if the only way to create a puzzlefs image is to create everything at once), then "last ctime" makes no sense for puzzlefs. Also, if separate files are processed in parallel when creating a puzzlefs image, then "total ctime" doesn't have a direct puzzlefs counterpart either. (But comparing puzzlefs image creation time with the "total ctime" of the other deduppers is still worthwhile; it is possible that even in this situation the other deduppers will still be faster at creating a repo.)

Also note that my whole benchmark is biased toward my particular original problem. I was searching for a dedupper for my "docker, but for VMs", which I call Jocker. Jocker always stores and extracts snapshots one by one, and the blobs themselves are snapshots of a single VM. This means that the offset of a particular file inside a snapshot doesn't change, so CDC has no advantage over fixed-size chunking. No surprise, then, that my dedupper azwyon, with its fixed-size chunking, "won".
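
To illustrate why this scheme is hard to beat: the core of a fixed-size-chunking dedupper fits in a few lines. A toy sketch in the spirit of azwyon (not its actual code), assuming the blake3 crate:

```rust
// Split a snapshot into fixed 4 MiB chunks, key each chunk by its hash,
// and store every unique chunk once. A stored snapshot is then just the
// ordered list of chunk hashes.
use std::collections::HashMap;

const CHUNK: usize = 4 << 20; // 4 MiB

fn store(snapshot: &[u8], repo: &mut HashMap<[u8; 32], Vec<u8>>) -> Vec<[u8; 32]> {
    snapshot
        .chunks(CHUNK)
        .map(|c| {
            let h = *blake3::hash(c).as_bytes();
            repo.entry(h).or_insert_with(|| c.to_vec());
            h
        })
        .collect()
}

fn extract(index: &[[u8; 32]], repo: &HashMap<[u8; 32], Vec<u8>>) -> Vec<u8> {
    index.iter().flat_map(|h| repo[h].iter().copied()).collect()
}

fn main() {
    let mut repo = HashMap::new();
    let snapshot = vec![0u8; 12 << 20]; // three identical zero chunks
    let index = store(&snapshot, &mut repo);
    assert_eq!(extract(&index, &repo), snapshot);
    println!("{} unique chunks for {} chunks stored", repo.len(), index.len());
}
```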

When benchmarking, I followed the instructions for stabilizing benchmark results from https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux and https://llvm.org/docs/Benchmarking.html .

Some other random notes:

ariel-miculas commented 9 months ago

Thanks for the detailed reply!

A quick note: I did try to use rayon to parallelize the puzzlefs build, but the single-threaded version was faster. I didn't save the test results, but I still have the branch: https://github.com/project-machine/puzzlefs/issues/84

safinaskar commented 9 months ago

I suggest reading the current version of azwyon: https://paste.gg/p/anonymous/3c43cb97f37440839d19392ca7f8dff8 . Assume public domain. It is nearly 150 lines of Rust. It differs from previous versions: I switched to pariter instead of my homegrown replacement.

Here are my comments on your code:

As far as I understand, pariter is not based on the rayon thread pool; it uses its own. Despite this, I still highly recommend pariter. I tried to add pariter-like functionality to rayon ( https://github.com/rayon-rs/rayon/pull/1071 ), but the maintainers rejected my patch. Don't use that patch as-is, it is buggy.
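
For illustration, the order-preserving parallel pipeline looks roughly like this (a sketch assuming the pariter and zstd crates; as far as I can tell, `parallel_map` runs the closure on a thread pool but yields results in input order, with bounded buffering):

```rust
use pariter::IteratorExt as _;

fn main() {
    // Stand-in for chunks read from an input file.
    let chunks: Vec<Vec<u8>> = (0u8..16).map(|i| vec![i; 4 << 20]).collect();

    chunks
        .into_iter()
        .parallel_map(|chunk| zstd::encode_all(&chunk[..], 3).unwrap())
        .for_each(|compressed| {
            // A real dedupper would write `compressed` to the repo here,
            // sequentially and in the original input order.
            println!("{} bytes", compressed.len());
        });
}
```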

Also: when writing azwyon, I studied the borg source code in an attempt to understand why it is so fast, and I borrowed some ideas. For example, borg memorizes the hashes of sequences of zero bytes. Also, borg detects holes in input files (I didn't implement this). And borg drops the input file from the OS cache (sometimes this boosts performance). See https://github.com/borgbackup/borg/issues/7674 for details and other ideas.
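
The zero-run trick, as I understand it, is just a cache in front of the hash function. A minimal sketch (the all-zeros scan is much cheaper than rehashing the same zeroed buffer again and again; blake3 is assumed as the hash):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ChunkHasher {
    // chunk length -> hash of a zero chunk of that length
    zero_hashes: HashMap<usize, [u8; 32]>,
}

impl ChunkHasher {
    fn hash(&mut self, chunk: &[u8]) -> [u8; 32] {
        if chunk.iter().all(|&b| b == 0) {
            // Zero chunks of the same length share one cached hash.
            *self
                .zero_hashes
                .entry(chunk.len())
                .or_insert_with(|| *blake3::hash(chunk).as_bytes())
        } else {
            *blake3::hash(chunk).as_bytes()
        }
    }
}

fn main() {
    let mut h = ChunkHasher::default();
    let zeros = vec![0u8; 4 << 20];
    assert_eq!(h.hash(&zeros), h.hash(&zeros)); // second call is a cache hit
}
```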

safinaskar commented 9 months ago

Also: as far as I understand, a "slow puzzlefs" already exists: it is borg. Borg has the command borg2 mount, which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than borg2 mount. The same goes for casync mount.

safinaskar commented 9 months ago

The rayon authors have just confirmed that rayon::spawn never blocks: https://github.com/rayon-rs/rayon/issues/1095#issuecomment-1749288765 . So, yes, your loop seems to read the whole file into memory, which is wrong.
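
The usual fix, if you stay with plain threads, is a bounded channel: the reader blocks once the queue is full instead of buffering the whole file. A minimal sketch using only std (pariter does the equivalent internally):

```rust
use std::io::Read;
use std::sync::mpsc::sync_channel;
use std::thread;

const CHUNK: usize = 1 << 20; // 1 MiB
const QUEUE_DEPTH: usize = 8; // at most 8 chunks in flight

fn main() -> std::io::Result<()> {
    let (tx, rx) = sync_channel::<Vec<u8>>(QUEUE_DEPTH);

    let worker = thread::spawn(move || {
        for chunk in rx {
            // hash/compress `chunk` here
            drop(chunk);
        }
    });

    let mut input = std::fs::File::open("/dev/zero")?; // stand-in input
    for _ in 0..1024 {
        let mut buf = vec![0u8; CHUNK];
        input.read_exact(&mut buf)?;
        tx.send(buf).expect("worker died"); // blocks once the queue is full
    }
    drop(tx); // close the channel so the worker's loop ends
    worker.join().unwrap();
    Ok(())
}
```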

ariel-miculas commented 9 months ago

Also: as far as I understand, a "slow puzzlefs" already exists: it is borg. Borg has the command borg2 mount, which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than borg2 mount. The same goes for casync mount.

I've looked at both casync and borg; here are some differences:

safinaskar commented 5 months ago

@ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced filesystem, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication, and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems that don't need buffer_head: https://lore.kernel.org/lkml/20230424054926.26927-1-hch@lst.de/ . EROFS is explicitly designed for containers, and it is already in the mainline kernel.

If you try to submit your fs to the kernel, it is quite possible that the kernel developers will simply say to you: "We already have EROFS, why is there a need for puzzlefs?"

So it may be a good idea not to spend your time on puzzlefs anymore. I am not trying to insult you; I am just trying to save your time. I think EROFS is what you need for your original production task.

But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for competition from EROFS. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and .tar.zst.

I am not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince the kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to the core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.

safinaskar commented 5 months ago

@ariel-miculas , here is an article on how hard it is to get your filesystem into the kernel: https://lwn.net/Articles/933616/ .

Notable phrase:

If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter

ariel-miculas commented 5 months ago

@ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced filesystem, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication, and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems that don't need buffer_head: https://lore.kernel.org/lkml/20230424054926.26927-1-hch@lst.de/ . EROFS is explicitly designed for containers, and it is already in the mainline kernel.

If you try to submit your fs to the kernel, it is quite possible that the kernel developers will simply say to you: "We already have EROFS, why is there a need for puzzlefs?"

Yes, there is that possibility. Also, be aware that I've already submitted two versions of PuzzleFS to the LKML: here's the first one and here's the second one. The latest version is available on GitHub; I have not sent it to the mailing list because I expect some changes to the filesystem abstractions, since there were disagreements on the proposed APIs. You can read the entire thread here. If you want to take a look at the latest PuzzleFS version, I've compiled a list of resources on the PuzzleFS page of the Rust for Linux project, where you will find the latest PuzzleFS branch as well as other resources, such as the LWN article on PuzzleFS and my presentation at Open Source Summit Europe.

So it may be a good idea not to spend your time on puzzlefs anymore. I am not trying to insult you; I am just trying to save your time. I think EROFS is what you need for your original production task.

There are other container solutions which use EROFS, such as composefs and nydus. EROFS doesn't solve the problem we are trying to solve, i.e. deduplicating data with a content-defined chunking algorithm so that multiple versions of the same image can be stored as efficiently as possible. EROFS does deduplicate, but only within a single image, whereas we want to deduplicate across multiple versions of the same image. With content-defined chunking, we essentially store only the deltas between multiple versions of the same image instead of storing the versions separately, which is how archive formats (e.g. tar, squashfs, erofs) do it. We want to redesign the container image format so that a more efficient representation based on variable-sized chunks is possible. You can see the results in my PuzzleFS presentation, in the Results slide; I've also added them below.

Steps:

| Type | Total size (MB) | Average layer size (MB) | Unified size (MB) | Saved (MB) / 766 MB |
|------|-----------------|-------------------------|-------------------|---------------------|
| OCI (uncompressed) | 766 | 77 | 766 | 0 (0%) |
| PuzzleFS (uncompressed) | 748 | 74 | 130 | 635 (83%) |
| OCI (compressed) | 282 | 28 | 282 | 484 (63%) |
| PuzzleFS (compressed) | 298 | 30 | 53 | 713 (93%) |
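
To make the content-defined part above concrete: a cut point is declared wherever a rolling hash of the last few dozen bytes matches a mask, so chunk boundaries move with the content and an insertion only disturbs nearby chunks. Here is a toy gear-style chunker sketching the general FastCDC idea (not PuzzleFS's actual implementation):

```rust
// Gear rolling hash: each left shift ages out old bytes, so a cut
// decision depends only on the most recent ~64 bytes and chunking
// resynchronizes shortly after an insertion.
fn boundaries(data: &[u8], mask: u64) -> Vec<usize> {
    const MIN: usize = 16 * 1024; // avoid degenerate tiny chunks
    const MAX: usize = 256 * 1024; // force a cut eventually

    // Deterministic stand-in for FastCDC's random gear table.
    let mut gear = [0u64; 256];
    let mut state = 0x9E3779B97F4A7C15u64;
    for g in gear.iter_mut() {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
        *g = state;
    }

    let (mut cuts, mut hash, mut len) = (Vec::new(), 0u64, 0usize);
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[b as usize]);
        len += 1;
        if (len >= MIN && hash & mask == 0) || len >= MAX {
            cuts.push(i + 1);
            hash = 0;
            len = 0;
        }
    }
    if len > 0 {
        cuts.push(data.len());
    }
    cuts
}

fn main() {
    // 16-bit mask: expected average chunk size is roughly MIN + 2^16 bytes.
    let data: Vec<u8> = (0..1_000_000u32).map(|i| i.wrapping_mul(2654435761) as u8).collect();
    println!("{} chunks", boundaries(&data, (1 << 16) - 1).len());
}
```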

But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for competition from EROFS. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and .tar.zst.

I'm aware it will be difficult to convince the kernel maintainers to merge PuzzleFS; they've already complained that they get too many container filesystems submitted to them. Time will tell whether they consider PuzzleFS worthy of upstreaming.

I am not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince the kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to the core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.

As I've mentioned above, I've already submitted multiple versions of the PuzzleFS driver and presented it at multiple conferences, so at least I have raised awareness of the PuzzleFS project. Other people have written articles about it as well, such as Phoronix and LWN. And Wedson Almeida is working on his own filesystem in Rust, tarfs, so I'm not the only one trying to upstream a new filesystem.

@ariel-miculas , here is an article on how hard it is to get your filesystem into the kernel: https://lwn.net/Articles/933616/ .

Notable phrase:

If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter

I think it's worth exploring what a new filesystem implemented in Rust can achieve, since kernel maintainers sometimes argue that they want to see new things implemented in Rust rather than existing implementations duplicated in both C and Rust. Of course, it's understandable that they don't want multiple filesystems solving the same container image problem, so pushback is to be expected.

@safinaskar thanks for presenting your concerns, we definitely have to make a strong case for PuzzleFS if we ever want it to be included in the Linux kernel.