gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

Request: Make use of deduplication during restore #27

Open thrnz opened 7 years ago

thrnz commented 7 years ago

For a test scenario, I backed up a folder containing 5 copies of the same 200MB file. When backing up the folder, the deduplication worked as expected - the backup covered all 1GB worth of files while only about 200MB of chunks needed to be transferred. However, when it came to running a test restore, each file was restored separately, presumably re-downloading the exact same set of chunks for each copy.

Would it be at all possible to rework the way files are restored in order to reduce the amount of data transferred? I imagine this would come at the cost of some extra processor overhead, since the restore would have to work out which files share which chunks before, or while, it runs. Perhaps chunk downloads could be cached and only discarded once they are known to be no longer needed, along the lines of the sketch below. For the sake of efficiency, any such processing could be done in parallel with the chunk downloads.
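
Something like this is what I have in mind - a purely hypothetical Go sketch (none of these types or functions exist in Duplicacy) of a reference-counted chunk cache, where each chunk is registered once per file that needs it before the restore starts and is evicted as soon as its last use has been written out:

```go
package chunkcache

import "sync"

// ChunkCache is a hypothetical reference-counted cache for restore:
// every chunk is registered once per file that needs it before the
// restore starts, and the cached copy is dropped as soon as the last
// reference is released.
type ChunkCache struct {
	mu   sync.Mutex
	refs map[string]int    // chunk hash -> remaining uses
	data map[string][]byte // chunk hash -> cached contents
}

func New() *ChunkCache {
	return &ChunkCache{refs: map[string]int{}, data: map[string][]byte{}}
}

// Register records one upcoming use of a chunk (called while planning the restore).
func (c *ChunkCache) Register(hash string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs[hash]++
}

// Get returns the cached chunk if a previous download stored it.
func (c *ChunkCache) Get(hash string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	data, ok := c.data[hash]
	return data, ok
}

// Put keeps a freshly downloaded chunk only if another file still needs it.
func (c *ChunkCache) Put(hash string, data []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refs[hash] > 1 {
		c.data[hash] = data
	}
}

// Release drops one reference and evicts the chunk once nothing else needs it.
func (c *ChunkCache) Release(hash string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs[hash]--
	if c.refs[hash] <= 0 {
		delete(c.data, hash)
		delete(c.refs, hash)
	}
}
```

The planning pass that calls Register is where the extra processor overhead would come from, but it only needs the chunk lists from the snapshot, not the chunk data itself.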

gilbertchen commented 7 years ago

Duplicacy has an on-disk cache, but that is for snapshot files only, not for file chunks. The main reason for that is to minimize the disk usage. The cache would become too large if every duplicate chunk were to be saved.

The workaround is to use the copy command to copy the snapshot to a local file storage and then restore from that local storage. In fact, our cache is implemented as a local file storage, so if caching were enabled for file chunks the effect would be the same.

That being said, I still think your test case is a bit artificial -- it is rare to have the same set of files saved in different places in the repository (symlinks should be used instead). What duplicacy does support during the restore process is inter-revision deduplication. That is, if the repository is at revision A and you want to restore to revision B, then only chunks that do not exist in revision A will be downloaded.

thrnz commented 7 years ago

The test case I gave was just an extreme example. A more real-world example would be restoring several VMs created from copies of the same disk - for instance in VMware Player, where snapshots aren't supported - so the virtual disks are mostly similar.

I've been comparing Duplicacy to Duplicati 2.0, and this was one of the things that Duplicati does better. While real-world applications are limited, it could still be of benefit in certain fringe cases.

Another potential improvement that would help in a lot more cases would be to download more than one chunk at a time when restoring, as sketched below. That could have a pretty big impact on restore speeds.
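
Roughly what I mean, as a standalone Go sketch - downloadChunk is just a stand-in for whatever actually fetches a chunk from the storage backend, and the worker count is arbitrary:

```go
package main

import (
	"fmt"
	"sync"
)

// downloadChunk is a placeholder for the real storage-backend fetch.
func downloadChunk(hash string) []byte {
	return []byte("data for " + hash)
}

// downloadAll fetches chunks with a fixed number of concurrent workers
// instead of one at a time.
func downloadAll(hashes []string, workers int) map[string][]byte {
	jobs := make(chan string)
	results := make(map[string][]byte, len(hashes))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for hash := range jobs {
				data := downloadChunk(hash)
				mu.Lock()
				results[hash] = data
				mu.Unlock()
			}
		}()
	}

	for _, hash := range hashes {
		jobs <- hash
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	chunks := []string{"aaa", "bbb", "ccc", "ddd"}
	for hash, data := range downloadAll(chunks, 4) {
		fmt.Printf("%s: %d bytes\n", hash, len(data))
	}
}
```

Even a small fixed pool of workers like this should hide most of the per-request latency when restoring from remote storages.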

gilbertchen commented 7 years ago

In that case, you should provide a base file for each file to be restored -- just copy over any file you have on your local machine that is similar to the one being restored. Duplicacy will perform an in-place update, skipping identical chunks and only downloading chunks not found in the base file.
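
Conceptually the comparison works like this - a simplified, hypothetical Go sketch (the fixed-size chunking and helper names are for illustration only; Duplicacy's actual chunking is content-defined and its internals differ):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashChunks splits data into fixed-size pieces and hashes each one.
// This is only an approximation: real chunking is variable-size.
func hashChunks(data []byte, size int) []string {
	var hashes []string
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		hashes = append(hashes, hex.EncodeToString(sum[:]))
	}
	return hashes
}

// chunksToFetch compares the chunk hashes of the base file already on disk
// with the chunk hashes recorded for the file in the snapshot, and returns
// only those that are missing locally and must be downloaded.
func chunksToFetch(baseFile []byte, snapshotHashes []string, chunkSize int) []string {
	local := make(map[string]bool)
	for _, h := range hashChunks(baseFile, chunkSize) {
		local[h] = true
	}
	var missing []string
	for _, h := range snapshotHashes {
		if !local[h] {
			missing = append(missing, h)
		}
	}
	return missing
}

func main() {
	base := []byte("mostly identical virtual disk contents ...")
	snapshot := hashChunks(base, 16) // pretend the snapshot matches the base exactly
	fmt.Println("chunks to download:", len(chunksToFetch(base, snapshot, 16)))
}
```

The more similar the base file is to the file being restored, the fewer chunks end up in the missing list and have to be fetched from the storage.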