mhx / dwarfs

A fast high compression read-only file system for Linux, Windows and macOS
GNU General Public License v3.0

[Feature request] Allow providing dwarfs with a dedup library #208

Closed exeter-matthew-wakeling closed 5 months ago

exeter-matthew-wakeling commented 5 months ago

This is really neat.

Feature request: Add a "library" option to give mkdwarfs a list of files that should be loaded into the dedup mechanism first, but not stored, allowing the image to be even smaller if the contents of the file can be retrieved from that library instead. Bonus points if you can specify a dwarfs image as a library and have it sensibly use the files contained in it.

Then you have the basis for a deduplicating incremental backup system. Currently, I use a system I wrote that takes a single file and a list of library files and produces a compressed, deduplicated file from which the original can be re-created using the library. That works well if you use tar to create the single file, but it's a little unwieldy when it comes to decompressing and restoring everything. The bonus of a proper mountable filesystem is that retrieving single files is a doddle.
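To make the "library" idea concrete, here is a minimal sketch of deduplicating one file against a set of library files. It is not how mkdwarfs or the system described above works internally; the fixed block size, the function names (`build_library_index`, `encode`, `decode`) and the entry format are all assumptions made up for illustration. Fixed-size blocks only match data that sits at the same block alignment, so a real tool would more likely use content-defined chunking.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, chosen only for this sketch


def blocks(data: bytes, size: int = BLOCK_SIZE):
    """Yield consecutive fixed-size blocks of data (the last may be shorter)."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


def build_library_index(library_files: dict) -> dict:
    """Map block digest -> (library file name, offset) for every library block."""
    index = {}
    for name, data in library_files.items():
        for i, block in enumerate(blocks(data)):
            digest = hashlib.sha256(block).digest()
            # Keep the first location seen for each distinct block.
            index.setdefault(digest, (name, i * BLOCK_SIZE))
    return index


def encode(data: bytes, index: dict) -> list:
    """Encode data as ('ref', name, offset, length) or ('raw', bytes) entries."""
    entries = []
    for block in blocks(data):
        hit = index.get(hashlib.sha256(block).digest())
        if hit is not None:
            # Block already exists in the library: store a reference only.
            entries.append(("ref", hit[0], hit[1], len(block)))
        else:
            # Block is new: it has to be stored in full.
            entries.append(("raw", block))
    return entries


def decode(entries: list, library_files: dict) -> bytes:
    """Rebuild the original data from the entries plus the library files."""
    parts = []
    for entry in entries:
        if entry[0] == "ref":
            _, name, offset, length = entry
            parts.append(library_files[name][offset:offset + length])
        else:
            parts.append(entry[1])
    return b"".join(parts)
```

Only the `raw` entries cost real space; everything else is a pointer into the library, which is also why the result is useless if the library goes missing.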

My use case is that I have students, and I have given them coursework, which involves them logging in to a Linux machine and hacking away. I want to store regular snapshots of their work so that I can keep a backup for their sake but also so I can see a progression of development to try to work out if they are cheating (yes, I have had to deal with this), but I don't want to store 100 copies of the same fairly large files. Yes, I could achieve the same thing using ZFS snapshots, but that'd require snapshotting the entire filesystem, which is more than I want to do, and it requires root.

mhx commented 5 months ago

> This is really neat.

Thanks!

The "deduplicating incremental backup system" use case is quite high on my todo list (but it has been there for a while now). One of the first issues (#18) has a comment that summarizes the idea in two sentences. It pretty much boils down to

> Bonus points if you can specify a dwarfs image as a library and have it sensibly use the files contained in it.

only that everything will be contained in a single DwarFS image; i.e. you'll be appending the incremental data to an existing image.
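As a rough illustration of the "append to a single image" idea, the toy model below keeps one append-only block list and one block map per snapshot. This is a sketch under assumptions, not the DwarFS on-disk format or its planned design: the class `AppendOnlyImage`, its methods and the fixed block size are hypothetical.

```python
import hashlib


class AppendOnlyImage:
    """Toy single 'image': blocks are only ever appended, never rewritten."""

    def __init__(self):
        self.blocks = []      # stored block payloads, in append order
        self.by_digest = {}   # block digest -> position in self.blocks
        self.snapshots = []   # one dict per snapshot: path -> block positions

    def add_snapshot(self, files: dict, block_size: int = 4096) -> int:
        """Append a snapshot, storing only blocks not already in the image."""
        snapshot = {}
        for path, data in files.items():
            positions = []
            for off in range(0, len(data), block_size):
                block = data[off:off + block_size]
                digest = hashlib.sha256(block).digest()
                if digest not in self.by_digest:
                    # New data: append it; existing blocks stay untouched.
                    self.by_digest[digest] = len(self.blocks)
                    self.blocks.append(block)
                positions.append(self.by_digest[digest])
            snapshot[path] = positions
        self.snapshots.append(snapshot)
        return len(self.snapshots) - 1

    def read(self, snapshot_id: int, path: str) -> bytes:
        """Read a file as it existed in the given snapshot."""
        return b"".join(self.blocks[i] for i in self.snapshots[snapshot_id][path])
```

Because earlier blocks and earlier snapshot maps are never modified, every older snapshot stays readable after new data is appended.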

There's still no firm timeline, though.

mhx commented 5 months ago

As mentioned on HN, borg might work for your use case in the meantime.

exeter-matthew-wakeling commented 5 months ago

> only that everything will be contained in a single DwarFS image

That's a good idea, because it reduces the chances that the "library" will accidentally go missing. The only request I would make would be the ability to still mount or extract the older version of the filesystem. Edit: I just read that comment, and I see that's already what you're planning. Great.

mhx commented 5 months ago

Closing this as it's covered by #18.