restic / restic

Fast, secure, efficient backup program
https://restic.net
BSD 2-Clause "Simplified" License

Question: Are identical files always "perfectly" deduplicated (a chunking design-related question) #2401

Closed. jim-collier closed this issue 5 years ago.

jim-collier commented 5 years ago

I'm knee-deep in the evaluation and testing phase of backup product selection. (And ditching Crashplan after ten+ years, considering their recent string of disastrous business decisions.)

I have about 7 TB of mostly photos and videos. Due to my workflow, there's a high degree of redundancy (~2x).

I was testing Duplicacy but, although excellent in many ways, I realized it's not going to work for me. Restic is next up, and I'm wondering if it suffers from the same design "flaw"?

My daily workflow - which involves moving folders and individual files around quite a bit - often breaks Duplicacy's ability to deduplicate. That's because of the way it chunks data. Chunks include parts of more than one file, even if the files are large - based on some unchangeable scanning order like alphabetical, inode, or something. As a result, it doesn't deduplicate on a strictly file-by-file basis (at least for large files such as digital photos and videos), which is what I need.

I've scoured Restic's documentation, and the github issues, but I can't find any reference that addresses this.

So, for larger files (e.g. photos), does Restic limit chunks to containing single files? More than one chunk per file is fine, but more than one file per chunk is not. (I understand the need to combine smaller files, e.g. <1 MB, to reduce API charges on many backends like the one I use, Backblaze B2. That's all good.)

Thanks in advance!

alphapapa commented 5 years ago

You're probably looking for this: https://restic.readthedocs.io/en/v0.3.0/Design/. Specifically, maybe this: https://restic.readthedocs.io/en/v0.3.0/Design/#backups-and-deduplication

I found it by googling "restic pack", but the different terminology between backup systems can make things hard to find.

cfbao commented 5 years ago

One blob only contains data from one file. Multiple blobs may be combined into one pack file. Dedupe is done at the blob level.
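A rough sketch of what "dedupe at the blob level" means in practice (illustrative types only, not restic's actual code or repository format): a blob is keyed by its content hash, so a second file containing the same bytes only adds a reference, never a second stored copy.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Illustrative types only; restic's real repository format is richer.
type BlobID [sha256.Size]byte

type Repo struct {
	blobs map[BlobID][]byte // one entry per unique blob
}

// SaveBlob stores data only if its content hash is unknown.
// The second return value reports whether the blob was actually new.
func (r *Repo) SaveBlob(data []byte) (BlobID, bool) {
	id := sha256.Sum256(data)
	if _, ok := r.blobs[id]; ok {
		return id, false // duplicate content: nothing stored again
	}
	r.blobs[id] = append([]byte(nil), data...)
	return id, true
}

func main() {
	repo := &Repo{blobs: map[BlobID][]byte{}}

	// Two files with identical content, e.g. a photo and its copy.
	fileA := []byte("identical image bytes")
	fileB := []byte("identical image bytes")

	_, newA := repo.SaveBlob(fileA)
	_, newB := repo.SaveBlob(fileB)
	fmt.Println(newA, newB) // true false -> the copy is deduplicated
}
```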

Also, I think this sort of question is better suited for the forum.

jim-collier commented 5 years ago

You're probably looking for this: https://restic.readthedocs.io/en/v0.3.0/Design/. Specifically, maybe this: https://restic.readthedocs.io/en/v0.3.0/Design/#backups-and-deduplication

I found it by googling "restic pack", but the different terminology between backup systems can make things hard to find.

@alphapapa great, thanks! I did actually read that section. Very carefully, multiple times. I still don't see how it clearly states anything about one-file-per-chunk (or more accurately, blob). But your answer is good enough for me to continue with evaluation.

Just for completeness, here's the section in question:

For creating a backup, restic scans the source directory for all files, sub-directories and other entries. The data from each file is split into variable length Blobs cut at offsets defined by a sliding window of 64 byte. The implementation uses Rabin Fingerprints for implementing this Content Defined Chunking (CDC). An irreducible polynomial is selected at random and saved in the file config when a repository is initialized, so that watermark attacks are much harder.

Files smaller than 512 KiB are not split, Blobs are of 512 KiB to 8 MiB in size. The implementation aims for 1 MiB Blob size on average.

For modified files, only modified Blobs have to be saved in a subsequent backup. This even works if bytes are inserted or removed at arbitrary positions within the file.
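To make the quoted design text a bit more concrete, here's a toy content-defined chunker. It uses a simple rolling sum over a 64-byte window instead of restic's randomized Rabin fingerprint, and the size limits and cut threshold are made up, so it's only a sketch of the idea: chunk boundaries are decided by the bytes of one file at a time, never by which file happens to be scanned next.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// chunk splits one file's data at content-defined boundaries.
// Toy version: a plain additive rolling sum over a 64-byte window stands in
// for restic's randomized Rabin fingerprint, and the size limits are arbitrary.
func chunk(data []byte) [][]byte {
	const (
		window  = 64
		minSize = 1024
		maxSize = 8192
		mask    = 0x7FF // cut when the low 11 bits of the window sum are zero
	)
	var chunks [][]byte
	start := 0
	var sum uint32
	for i := range data {
		sum += uint32(data[i])
		if i-start >= window {
			sum -= uint32(data[i-window]) // slide the window
		}
		size := i - start + 1
		if (size >= minSize && sum&mask == 0) || size >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start, sum = i+1, 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	// Deterministic pseudo-random "file" content.
	file := make([]byte, 1<<20)
	x := uint32(1)
	for i := range file {
		x = x*1664525 + 1013904223
		file[i] = byte(x >> 24)
	}

	// The same bytes always produce the same chunks (and hashes),
	// regardless of the file's name, location, or scan order.
	chunks := chunk(file)
	fmt.Println("chunks:", len(chunks))
	for _, c := range chunks[:3] {
		h := sha256.Sum256(c)
		fmt.Printf("%6d bytes  %x...\n", len(c), h[:8])
	}
}
```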

@cfbao perfect, thanks! And agreed, the forum would have been a better place. Ironically, I did do a "search for text on page" for "forum". Unfortunately I think I must have done so from restic.readthedocs.io, rather than the main site, which required more than one brain fart to achieve.

alphapapa commented 5 years ago

I still don't see how it clearly states anything about one-file-per-chunk (or more accurately, blob).

You might be thinking too much in terms of other backup software.

As you know, the point of deduplication is to reduce storage, because a range of data may be present in more than one file. So I think the description "one-file-per-chunk" or "one-file-per-blob" isn't accurate, because each blob may be present in multiple files. The point is that each unique blob is only stored once. In other words, it doesn't matter how many files a blob comes from. Each file (e.g. when restored) is composed of one or more blobs, which may themselves be present in one or more files.

If you're familiar with git, it's similar in that both are ultimately storing blobs addressed by content-derived hashes, and those blobs' association with particular filenames is a separate concern.
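One way to picture it (invented types here, not restic's real snapshot/index format): a snapshot maps each file path to an ordered list of blob IDs, and the repository maps each blob ID to its bytes exactly once. Moving or renaming a file only changes the path-to-ID mapping; the blobs it points at are already stored.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// Invented types for illustration; restic's real snapshot/index format differs.
type BlobID [sha256.Size]byte

type Repo struct{ blobs map[BlobID][]byte }

// Snapshot: file path -> ordered list of blob IDs that make up the file.
type Snapshot map[string][]BlobID

// put stores a blob once and returns its content-derived ID.
func (r *Repo) put(data []byte) BlobID {
	id := sha256.Sum256(data)
	if _, ok := r.blobs[id]; !ok {
		r.blobs[id] = data
	}
	return id
}

// restore reassembles a file from its blob list.
func (r *Repo) restore(ids []BlobID) []byte {
	var buf bytes.Buffer
	for _, id := range ids {
		buf.Write(r.blobs[id])
	}
	return buf.Bytes()
}

func main() {
	repo := &Repo{blobs: map[BlobID][]byte{}}
	photo := []byte("...raw image bytes...")

	// The same content referenced from two different paths (e.g. after a move):
	snapBefore := Snapshot{"2019/trip/img_001.jpg": {repo.put(photo)}}
	snapAfter := Snapshot{"archive/best-of/img_001.jpg": {repo.put(photo)}}

	fmt.Println("unique blobs stored:", len(repo.blobs)) // 1, not 2
	fmt.Println(bytes.Equal(
		repo.restore(snapBefore["2019/trip/img_001.jpg"]),
		repo.restore(snapAfter["archive/best-of/img_001.jpg"]))) // true
}
```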

Your actual concern seems to be:

My daily workflow - which involves moving folders and individual files around quite a bit - often breaks Duplicacy's ability to deduplicate. That's because of the way it chunks data. Chunks include parts of more than one file, even if the files are large - based on some unchangeable scanning order like alphabetical, inode, or something.

I think that concern is addressed by the section you quoted. Restic does the right thing.

jim-collier commented 5 years ago

You might be thinking too much in terms of other backup software.

@alphapapa I just didn't express my concern very well. Yes, I understand the concept of blobs (and whatever other vocabularies call the same idea) and of deduplicated data. (I've actually written what I think is the first/only offline ZFS deduplication utility... uploading to GitHub as soon as I solve one last bug.) The problem with Duplicacy is that it combines parts of multiple files into variable-sized chunks. If the hash for that chunk (before encryption and compression) has been stored before, it only stores a reference, not the data again. That's great if your files stay in the same place and in the same context next to each other, and also tend not to change willy-nilly. Otherwise, deduplication can easily drop by 20% or more (according to others' observations).
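To make that concrete, here's a generic sketch of the "chunk the concatenated stream" approach (fixed-size chunks for simplicity; this is not Duplicacy's actual code): because a chunk can straddle two files, merely reordering the same files changes some chunk hashes, so only part of the data deduplicates even though no file changed.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// chunkStream cuts a concatenation of files into fixed 4 KiB chunks and
// returns the set of chunk hashes. A deliberately crude stand-in for
// stream-level chunking; not any particular tool's implementation.
func chunkStream(files ...[]byte) map[[sha256.Size]byte]bool {
	stream := bytes.Join(files, nil)
	hashes := map[[sha256.Size]byte]bool{}
	for i := 0; i < len(stream); i += 4096 {
		end := i + 4096
		if end > len(stream) {
			end = len(stream)
		}
		hashes[sha256.Sum256(stream[i:end])] = true
	}
	return hashes
}

func main() {
	a := bytes.Repeat([]byte("A"), 10000)
	b := bytes.Repeat([]byte("B"), 10000)

	first := chunkStream(a, b)  // scan order: a then b
	second := chunkStream(b, a) // identical files, different order

	shared := 0
	for h := range second {
		if first[h] {
			shared++
		}
	}
	// Chunks straddling the file boundary (and the shifted tail) differ,
	// so dedup is only partial even though no file content changed.
	fmt.Printf("%d of %d chunks deduplicated after reordering\n", shared, len(second))
}
```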

Many FLOSS backup products, Restic included, seem to do such "chunk/blob-based" deduplication, which, aside from Duplicacy's drawback, seems like the obvious smart approach. But integrated "backup-product-and-cloud-storage" solutions, like Crashplan, seem to do it differently. (Presumably because they don't have to worry about per-object and/or per-API-call charges, and can maintain an arbitrarily large/complex database on the server.) They back up newest files first (which is not possible with Duplicacy without destroying deduplication completely), and the deduplication seems flawless (if insanely memory intensive).

Anyway, that's just background. You answered the question, thanks!

fd0 commented 5 years ago

The problem with Duplicacy is that it combines parts of multiple files into variable-sized chunks. If the hash for that chunk (before encryption and compression) has been stored before, it only stores a reference, not the data again.

Huh, that sounds odd, and restic works differently. I've also written a blog article about how restic works under the hood; there are many examples at the end that play around with saving files containing duplicate data.

What may be a problem, though, is that restic's memory usage scales with the number of files. So if you have a large number of small files, restic may not be very well suited, but for photos (which tend to be rather large; mine are usually 8-30 MiB per file) it should work.

I'm closing this issue since it has been answered. We're using the issue tracker exclusively for tracking bugs and feature requests; the forum would have been better for this (please feel free to add further comments though) :)

jim-collier commented 5 years ago

Yes, apologies again for trying to find the forum and not being able to. (Because I didn't realize I was on the docs site!) And thanks for the link to the blog post; good info. I have one more question related to (or at least dependent on) not chunking partial files together; I'll ask it on the forum. Thanks.

BTW, while the overwhelming majority of my storage (currently 7 TB and growing about 1.5x per year) is consumed by photos and videos, the overwhelming majority by count is small files. So I'll just have to test Restic. Crashplan has similar deduping memory scaling, and can struggle. One nice bonus of Duplicacy is the simplicity of its deduplication (independent of the not-so-great chunking algorithm). It stores each chunk as a file on the back-end (including cloud storage), with the content hash (in base64, probably) as the filename. Rather than keeping an in-memory db of hashes (or literally any db), it just queries the backend for the existence of a file with the name of the current chunk hash it's working on. Yeah, that involves a lot of tradeoffs, but it also has a beautiful simplicity.
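Roughly, the idea is something like this (a local-directory sketch of the "hash as object name" scheme, using hex rather than base64 and not Duplicacy's real layout; against a cloud backend like B2 the stat would be a HEAD/list API call instead):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// storeChunk writes a chunk under its content hash, skipping the write if an
// object with that name already exists. A sketch of the "hash as object name"
// scheme against a local directory; a cloud backend would use a HEAD/list call.
func storeChunk(dir string, chunk []byte) (stored bool, err error) {
	sum := sha256.Sum256(chunk)
	path := filepath.Join(dir, hex.EncodeToString(sum[:]))

	if _, err := os.Stat(path); err == nil {
		return false, nil // already stored: only a reference is needed
	} else if !errors.Is(err, fs.ErrNotExist) {
		return false, err
	}
	return true, os.WriteFile(path, chunk, 0o600)
}

func main() {
	dir, err := os.MkdirTemp("", "chunks")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	first, _ := storeChunk(dir, []byte("some chunk data"))
	second, _ := storeChunk(dir, []byte("some chunk data"))
	fmt.Println(first, second) // true false: the duplicate is never re-uploaded
}
```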

alphapapa commented 5 years ago

Many FLOSS backup products, Restic included, seem to do such "chunk/blob-based" deduplication, which, aside from Duplicacy's drawback, seems like the obvious smart approach. But integrated "backup-product-and-cloud-storage" solutions, like Crashplan, seem to do it differently. (Presumably because they don't have to worry about per-object and/or per-API-call charges, and can maintain an arbitrarily large/complex database on the server.) They back up newest files first (which is not possible with Duplicacy without destroying deduplication completely), and the deduplication seems flawless (if insanely memory intensive).

I can't speak for CrashPlan's internals, but my understanding is that it does essentially the same thing as Restic: file content is chunked, hashed, and de-duplicated, and unique chunks are stored in repositories, whether local or remote. Having used CrashPlan, that's one of the things that attracted me to Restic: it's almost like a FLOSS version of CrashPlan, minus the inotify watcher (I hope it will get support for one someday, but even without it, backup performance is very good).

BTW, you might find this helpful: https://github.com/alphapapa/restic-runner If performance is ever a problem, it helps to split data into sets and back them up separately, and this script makes that easy and flexible.

jim-collier commented 5 years ago

BTW, you might find this helpful: https://github.com/alphapapa/restic-runner If performance is ever a problem, it helps to split data into sets and back them up separately, and this script makes that easy and flexible.

I can't "thumbs-up" this hard enough.