project-machine / puzzlefs


I suggest doing speed/size comparison with other dedup. solutions #114

Open · safinaskar opened this issue 9 months ago

safinaskar commented 9 months ago

I suggest comparing your solution with other deduplication solutions, such as borg, casync, desync, and rdedup. Compare not only the size of the deduplicated and compressed data, but also the speed of creating the deduplicated image and the speed of reading the data back. Here is my own benchmark, which compares many solutions: https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 . I recommend reading the whole discussion; it contains many ideas. My conclusion is this: existing solutions are all inefficient in some way. Either a tool is not parallel, or it is written in Go instead of C/C++/Rust and thus slow, or it has some other problem. Based on my experience, I conclude that one can easily create a deduplicating solution that beats all the others in terms of both size and speed.

Here is a summary of my critique of other solutions: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl .

I didn't study puzzlefs closely, but I already see one inefficiency: puzzlefs uses sha256, which on many machines (including mine) is slower than blake2 and blake3.
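
As a quick sanity check of this claim, a micro-benchmark along these lines is enough (a minimal sketch assuming the sha2 and blake3 crates; on machines without SHA extensions, blake3 typically wins by a large factor):

```rust
// Toy throughput comparison of SHA-256 vs BLAKE3 on 1 GiB of data.
// Assumes the sha2 and blake3 crates from crates.io.
use sha2::{Digest, Sha256};
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data = vec![0u8; 1 << 30]; // 1 GiB

    let t = Instant::now();
    black_box(Sha256::digest(&data)); // black_box keeps the hash from being optimized away
    println!("sha256: {:?}", t.elapsed());

    let t = Instant::now();
    black_box(blake3::hash(&data));
    println!("blake3: {:?}", t.elapsed());
}
```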

I don't plan to add puzzlefs to my comparison https://github.com/borgbackup/borg/issues/7674#issuecomment-1656787394 ; I have already deleted the test data and test programs.

You may say that all this (i.e. speed) is not important. I disagree. I had one particular use case: "docker, but for VMs". I compared existing solutions, and all of them were unsatisfactory. So I developed my own, azwyon (in Rust): https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 . Azwyon can store and extract 10 GiB virtual machines in several seconds on my machine. That is satisfactory for me, and the result is unachievable with other solutions. (Keep in mind that azwyon uses fixed-size chunking.) Similarly, it is possible that someone will evaluate puzzlefs, conclude that it is slow, and reject it.

Please understand me correctly: I am not saying that puzzlefs is slow! I didn't test it, so I don't know. I just want to share my experience. My message is this: "Many other solutions are slower than they could be, so it is possible that puzzlefs is slow too. I suggest properly testing it and, if it actually turns out to be slow, fixing it using ideas from these links."

ariel-miculas commented 9 months ago

Hey @safinaskar, thanks for your suggestion. Did you document the steps for reproducing your benchmark results? How did you determine the values for Size, Total ctime, Total dtime, Last ctime, and Last dtime?

safinaskar commented 9 months ago

Did you document the steps for reproducing your benchmark results?

Yes, everything is in that borg bug report. First you need to create the files 00, 01, etc. Here are the instructions: https://github.com/borgbackup/borg/issues/7674#issue-1772919615 . But, of course, it is okay to use some other realistic data instead. Keep in mind that I use nearly 10 similar snapshots of 10 GiB of data. Or I can simply send them to you (I just realized that I can very easily re-generate them).

Then run runner/benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1 (from https://github.com/borgbackup/borg/issues/7674#issuecomment-1654175985 ). Keep in mind that runner uses sudo a lot, so run it on a throw-away host or carefully review the code before running it.

More details on reproducing are in that bug report ( https://github.com/borgbackup/borg/issues/7674 ). If you have any questions, ask and I will happily answer.

How did you determine the values for

In my terminology, "compression" (a.k.a. "store") is all the steps needed to place a source file into the "repo". It may include chunking, deduplication, and the actual compression. For example, borg2 create or casync make is "compression". "Decompression" (a.k.a. "extract") is, again, all the steps needed to fully extract one file from the "repo" (usually this is just the actual decompression). For example, borg2 extract or casync extract is "decompression". So:

Typical output from runner/benchmark looks like this:

{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2851383597,"compression":[18318222,11317438,21237556,45904208,44393790,40996476,48499568,45409089,46290323,44191118,46500109,34804936],"decompression":[8965775,8972195,9445975,35787014,36940309,36681542,37723561,37497282,38120893,37698965,37893902,38152120],"total_compression":447862839,"total_decompression":363879538}

"size" is size, "total_compression" is "total ctime" (in microseconds), last element of "compression" is "last ctime" (in microseconds)

I don't know whether puzzlefs can add a file to an existing image. If this is not possible (i.e. if the only way to create a puzzlefs image is to create everything at once), then "last ctime" makes no sense for puzzlefs. Also, if separate files are processed in parallel when creating a puzzlefs image, then "total ctime" doesn't have a direct puzzlefs counterpart either. (But comparing puzzlefs image creation time with the "total ctime" of the other deduppers is still worthwhile; it is possible that even in this situation the other deduppers will still be faster at creating a repo.)

Also note that my whole benchmark is biased toward my particular original problem. I was searching for a dedupper for my "docker, but for VMs", which I call Jocker. Jocker always stores and extracts snapshots one by one, and the blobs themselves are snapshots of a single VM. This means that the offset of a particular file inside a snapshot doesn't change, so CDC has no advantage over fixed-size chunking. No surprise, then, that my dedupper azwyon, with its fixed-size chunking, "won".
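
To illustrate why this scheme is hard to beat: the core of a fixed-size-chunking dedupper fits in a few lines. A toy sketch in the spirit of azwyon (not its actual code), assuming the blake3 crate:

```rust
// Split a snapshot into fixed 4 MiB chunks, key each chunk by its hash,
// and store every unique chunk once. A stored snapshot is then just the
// ordered list of chunk hashes.
use std::collections::HashMap;

const CHUNK: usize = 4 << 20; // 4 MiB

fn store(snapshot: &[u8], repo: &mut HashMap<[u8; 32], Vec<u8>>) -> Vec<[u8; 32]> {
    snapshot
        .chunks(CHUNK)
        .map(|c| {
            let h = *blake3::hash(c).as_bytes();
            repo.entry(h).or_insert_with(|| c.to_vec());
            h
        })
        .collect()
}

fn extract(index: &[[u8; 32]], repo: &HashMap<[u8; 32], Vec<u8>>) -> Vec<u8> {
    index.iter().flat_map(|h| repo[h].iter().copied()).collect()
}

fn main() {
    let mut repo = HashMap::new();
    let snapshot = vec![0u8; 12 << 20]; // three identical zero chunks
    let index = store(&snapshot, &mut repo);
    assert_eq!(extract(&index, &repo), snapshot);
    println!("{} unique chunks for {} chunks stored", repo.len(), index.len());
}
```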

When benchmarking, I followed the instructions for stabilizing benchmark results from https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux and https://llvm.org/docs/Benchmarking.html .

Some other random notes:

ariel-miculas commented 9 months ago

Thanks for the detailed reply!

A quick note: I did try to use rayon to parallelize the puzzlefs build, but the single-threaded version was faster. I didn't save the test results, but I still have the branch: https://github.com/project-machine/puzzlefs/issues/84

safinaskar commented 9 months ago

I suggest reading the current version of azwyon: https://paste.gg/p/anonymous/3c43cb97f37440839d19392ca7f8dff8 . Assume public domain. It is nearly 150 lines of Rust. It differs from previous versions: I switched to pariter instead of my homegrown replacement.

Here are my comments on your code:

As far as I understand, pariter is not based on the rayon thread pool; it uses its own. Despite this, I still highly recommend pariter. I tried to add pariter-like functionality to rayon ( https://github.com/rayon-rs/rayon/pull/1071 ), but the maintainers rejected my patch. Don't use that patch as-is, it is buggy.
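
For illustration, the order-preserving parallel pipeline looks roughly like this (a sketch assuming the pariter and zstd crates; as far as I can tell, `parallel_map` runs the closure on a thread pool but yields results in input order, with bounded buffering):

```rust
use pariter::IteratorExt as _;

fn main() {
    // Stand-in for chunks read from an input file.
    let chunks: Vec<Vec<u8>> = (0u8..16).map(|i| vec![i; 4 << 20]).collect();

    chunks
        .into_iter()
        .parallel_map(|chunk| zstd::encode_all(&chunk[..], 3).unwrap())
        .for_each(|compressed| {
            // A real dedupper would write `compressed` to the repo here,
            // sequentially and in the original input order.
            println!("{} bytes", compressed.len());
        });
}
```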

Also: when writing azwyon, I studied the borg source code in an attempt to understand why it is so fast, and I borrowed some ideas. For example, borg memorizes the hashes of sequences of zero bytes. Also, borg detects holes in input files (I didn't implement this). And borg drops the input file from the OS cache (sometimes this boosts performance). See https://github.com/borgbackup/borg/issues/7674 for details and other ideas.
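
The zero-run trick, as I understand it, is just a cache in front of the hash function. A minimal sketch (the all-zeros scan is much cheaper than rehashing the same zeroed buffer again and again; blake3 is assumed as the hash):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ChunkHasher {
    // chunk length -> hash of a zero chunk of that length
    zero_hashes: HashMap<usize, [u8; 32]>,
}

impl ChunkHasher {
    fn hash(&mut self, chunk: &[u8]) -> [u8; 32] {
        if chunk.iter().all(|&b| b == 0) {
            // Zero chunks of the same length share one cached hash.
            *self
                .zero_hashes
                .entry(chunk.len())
                .or_insert_with(|| *blake3::hash(chunk).as_bytes())
        } else {
            *blake3::hash(chunk).as_bytes()
        }
    }
}

fn main() {
    let mut h = ChunkHasher::default();
    let zeros = vec![0u8; 4 << 20];
    assert_eq!(h.hash(&zeros), h.hash(&zeros)); // second call is a cache hit
}
```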

safinaskar commented 9 months ago

Also: as far as I understand, a "slow puzzlefs" already exists: it is borg. Borg has the command borg2 mount, which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than borg2 mount. The same goes for casync mount.

safinaskar commented 9 months ago

The rayon authors have just confirmed that rayon::spawn never blocks: https://github.com/rayon-rs/rayon/issues/1095#issuecomment-1749288765 . So, yes, your loop seems to read the whole file into memory, which is wrong.
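
The usual fix, if you stay with plain threads, is a bounded channel: the reader blocks once the queue is full instead of buffering the whole file. A minimal sketch using only std (pariter does the equivalent internally):

```rust
use std::io::Read;
use std::sync::mpsc::sync_channel;
use std::thread;

const CHUNK: usize = 1 << 20; // 1 MiB
const QUEUE_DEPTH: usize = 8; // at most 8 chunks in flight

fn main() -> std::io::Result<()> {
    let (tx, rx) = sync_channel::<Vec<u8>>(QUEUE_DEPTH);

    let worker = thread::spawn(move || {
        for chunk in rx {
            // hash/compress `chunk` here
            drop(chunk);
        }
    });

    let mut input = std::fs::File::open("/dev/zero")?; // stand-in input
    for _ in 0..1024 {
        let mut buf = vec![0u8; CHUNK];
        input.read_exact(&mut buf)?;
        tx.send(buf).expect("worker died"); // blocks once the queue is full
    }
    drop(tx); // close the channel so the worker's loop ends
    worker.join().unwrap();
    Ok(())
}
```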

ariel-miculas commented 9 months ago

Also: as far as I understand, a "slow puzzlefs" already exists: it is borg. Borg has the command borg2 mount, which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than borg2 mount. The same goes for casync mount.

I've looked at both casync and borg; here are some differences:

safinaskar commented 5 months ago

@ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced filesystem, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication, and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems that don't need buffer_head: https://lore.kernel.org/lkml/20230424054926.26927-1-hch@lst.de/ . EROFS is explicitly designed for containers, and it is already in the mainline kernel.

If you try to submit your fs to the kernel, it is quite possible that the kernel developers will simply say to you: "We already have EROFS, why is there a need for puzzlefs?"

So it may be a good idea not to spend your time on puzzlefs anymore. I am not trying to insult you; I am just trying to save your time. I think EROFS is what you need for your original production task.

But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for competition from EROFS. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and .tar.zst.

I am not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince the kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to the core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.

safinaskar commented 5 months ago

@ariel-miculas , here is an article on how hard it is to get your filesystem into the kernel: https://lwn.net/Articles/933616/ .

Notable phrase:

If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter

ariel-miculas commented 5 months ago

@ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced filesystem, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication, and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems that don't need buffer_head: https://lore.kernel.org/lkml/20230424054926.26927-1-hch@lst.de/ . EROFS is explicitly designed for containers, and it is already in the mainline kernel.

If you try to submit your fs to the kernel, it is quite possible that the kernel developers will simply say to you: "We already have EROFS, why is there a need for puzzlefs?"

Yes, there is that possibility. Also, be aware that I've already submitted two versions of PuzzleFS to the LKML: here's the first one and here's the second one. The latest version is available on GitHub; I have not sent it to the mailing list because I expect some changes to the filesystem abstractions, since there were disagreements on the proposed APIs. You can read the entire thread here. If you want to take a look at the latest PuzzleFS version, I've compiled a list of resources on the PuzzleFS page of the Rust for Linux project, where you will find the latest PuzzleFS branch as well as other resources, such as the LWN article on PuzzleFS and my presentation at Open Source Summit Europe.

So it may be a good idea not to spend your time on puzzlefs anymore. I am not trying to insult you; I am just trying to save your time. I think EROFS is what you need for your original production task.

There are other container solutions which use EROFS, such as composefs and nydus. EROFS doesn't solve the problem we are trying to solve, i.e. deduplicating data with a content-defined chunking algorithm so that multiple versions of the same image can be stored as efficiently as possible. EROFS does deduplicate, but only within a single image, whereas we want to deduplicate across multiple versions of the same image. With content-defined chunking, we essentially store only the deltas between multiple versions of the same image instead of storing the versions separately, which is how archive formats (e.g. tar, squashfs, erofs) do it. We want to redesign the container image format so that a more efficient representation based on variable-sized chunks is possible. You can see the results in my PuzzleFS presentation, in the Results slide; I've also added them below.

Steps:

| Type | Total size (MB) | Average layer size (MB) | Unified size (MB) | Saved (MB) / 766 MB |
|------|-----------------|-------------------------|-------------------|---------------------|
| OCI (uncompressed) | 766 | 77 | 766 | 0 (0%) |
| PuzzleFS (uncompressed) | 748 | 74 | 130 | 635 (83%) |
| OCI (compressed) | 282 | 28 | 282 | 484 (63%) |
| PuzzleFS (compressed) | 298 | 30 | 53 | 713 (93%) |
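
To make the content-defined part above concrete: a cut point is declared wherever a rolling hash of the last few dozen bytes matches a mask, so chunk boundaries move with the content and an insertion only disturbs nearby chunks. Here is a toy gear-style chunker sketching the general FastCDC idea (not PuzzleFS's actual implementation):

```rust
// Gear rolling hash: each left shift ages out old bytes, so a cut
// decision depends only on the most recent ~64 bytes and chunking
// resynchronizes shortly after an insertion.
fn boundaries(data: &[u8], mask: u64) -> Vec<usize> {
    const MIN: usize = 16 * 1024; // avoid degenerate tiny chunks
    const MAX: usize = 256 * 1024; // force a cut eventually

    // Deterministic stand-in for FastCDC's random gear table.
    let mut gear = [0u64; 256];
    let mut state = 0x9E3779B97F4A7C15u64;
    for g in gear.iter_mut() {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
        *g = state;
    }

    let (mut cuts, mut hash, mut len) = (Vec::new(), 0u64, 0usize);
    for (i, &b) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(gear[b as usize]);
        len += 1;
        if (len >= MIN && hash & mask == 0) || len >= MAX {
            cuts.push(i + 1);
            hash = 0;
            len = 0;
        }
    }
    if len > 0 {
        cuts.push(data.len());
    }
    cuts
}

fn main() {
    // 16-bit mask: expected average chunk size is roughly MIN + 2^16 bytes.
    let data: Vec<u8> = (0..1_000_000u32).map(|i| i.wrapping_mul(2654435761) as u8).collect();
    println!("{} chunks", boundaries(&data, (1 << 16) - 1).len());
}
```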

But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for competition from EROFS. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and .tar.zst.

I'm aware it will be difficult to convince the kernel maintainers to merge PuzzleFS; they've already complained that they get too many container filesystems submitted to them. Time will tell whether they consider PuzzleFS worthy of upstreaming.

I am not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince the kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to the core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.

As I've mentioned above, I've already submitted multiple versions of the PuzzleFS driver and presented it at multiple conferences, so at least I have raised awareness of the PuzzleFS project. Other people have written articles about it as well, such as Phoronix and LWN. And Wedson Almeida is working on his own filesystem in Rust, tarfs, so I'm not the only one trying to upstream a new filesystem.

@ariel-miculas , here is an article on how hard it is to get your filesystem into the kernel: https://lwn.net/Articles/933616/ .

Notable phrase:

If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter

I think it's worth exploring what a new filesystem implemented in Rust can achieve, since kernel maintainers sometimes argue that they want to see new things implemented in Rust rather than existing implementations duplicated in both C and Rust. Of course, it's understandable that they don't want multiple filesystems solving the same container image problem, so pushback is to be expected.

@safinaskar thanks for presenting your concerns, we definitely have to make a strong case for PuzzleFS if we ever want it to be included in the Linux kernel.