gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

Comparison with Arq? #113

Open valentine opened 7 years ago

valentine commented 7 years ago

Hello,

Thanks for making the CLI version open source, and for the very detailed feature comparison.

Would it be possible for the readme to include a comparison against the technical aspects of Arq?

Thanks!

gilbertchen commented 7 years ago

I commented on the design of Arq here:

Like Duplicacy, Arq naturally supports deduplication by saving chunks using their hashes as file names (but this only applies to large files; more on this later). Unfortunately, the names of the chunk files contain the computer's UUID as a prefix, which limits deduplication to files residing on the same computer: two computers with the same set of files will end up with two distinct sets of chunks in the storage because of this UUID prefix. Therefore, Arq does not support cross-computer deduplication and is only suited to backing up a single computer.
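
To make the point concrete, here is a minimal Go sketch (not Arq's or Duplicacy's actual code; the SHA-256 naming and path layout are assumptions for illustration) showing why a per-computer prefix in the chunk name defeats cross-computer deduplication:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkName returns the storage path for a chunk. With a per-computer
// prefix, identical chunks from two machines get different names; with
// a bare hash, they map to the same object and are deduplicated.
func chunkName(prefix string, chunk []byte) string {
	sum := sha256.Sum256(chunk)
	name := hex.EncodeToString(sum[:])
	if prefix != "" {
		return prefix + "/" + name
	}
	return name
}

func main() {
	data := []byte("same file content on two computers")
	fmt.Println(chunkName("uuid-of-computer-A", data)) // two distinct objects,
	fmt.Println(chunkName("uuid-of-computer-B", data)) // so no cross-computer dedup
	fmt.Println(chunkName("", data))                   // one shared object,
	fmt.Println(chunkName("", data))                   // so the second upload can be skipped
}
```

With bare-hash naming, the second computer can skip the upload because the object already exists in the storage; with the prefixed naming, the same content is stored twice.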

Another issue I can see is the handling of small files (<64KB):

A packset is a set of "packs". When Arq is backing up a folder, it combines small files into a single larger packfile; when the packfile reaches 10MB, it is stored at the destination. Also, when Arq finishes backing up a folder, it stores its unsaved packfiles regardless of their sizes.

I don't see how deduplication can work for smaller files when they are packed into packfiles with a hard limit of 10MB (no rolling checksum?) and their hashes are stored in a separate index file (I suspect this is why Charles noticed little deduplication). In contrast, Duplicacy treats small and large files the same way, by packing them together (into an imaginary huge tar file) and then splitting the stream into chunks using the variable-sized chunking algorithm. This guarantees that moving a directory full of small files to a different place (or to a different computer) will not change most of the chunks. Modifying or removing a small file may invalidate a number of existing chunks, but that number is kept under control by the variable-sized chunking algorithm.
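
For reference, here is a rough Go sketch of variable-sized (content-defined) chunking; the gear-style rolling hash, mask, and size limits are illustrative assumptions, not Duplicacy's actual implementation or parameters (its default average chunk size is 4MB):

```go
package main

import (
	"fmt"
	"math/rand"
)

// gear is a per-byte random table used by the rolling hash
// (a simplified, FastCDC-style chunker for illustration only).
var gear [256]uint64

func init() {
	r := rand.New(rand.NewSource(1))
	for i := range gear {
		gear[i] = r.Uint64()
	}
}

// split cuts data into variable-sized chunks. A boundary is declared
// whenever the low bits of the rolling hash are zero, so boundaries
// depend only on nearby content: inserting or removing bytes early in
// the stream does not shift the boundaries found later.
func split(data []byte, mask uint64, minSize, maxSize int) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint64
	for i, b := range data {
		h = (h << 1) + gear[b]
		size := i - start + 1
		if (size >= minSize && h&mask == 0) || size >= maxSize {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	data := make([]byte, 1<<20)
	rand.New(rand.NewSource(2)).Read(data)
	// mask of 16 one-bits gives ~64KB average chunks here, just for the demo.
	chunks := split(data, (1<<16)-1, 16<<10, 256<<10)
	fmt.Println("chunks:", len(chunks))
}
```

Because each boundary depends only on the bytes near it, an edit shifts at most the few chunks around the modified region; the chunker re-synchronizes at the next boundary and the remaining chunks come out byte-for-byte identical, so they deduplicate.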

The choice of 64KB looks somewhat problematic to me -- it may not be large enough (the default chunk size in Duplicacy is 4MB). Uploading 64KB files over an average residential internet connection (mine is 1MB/s up) may still be too slow. In addition, if there are many small directories, then since each directory has its own packfile and index file, you will have many small files to upload, which will significantly degrade performance.
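
A back-of-the-envelope estimate of why many small uploads hurt (the 100ms per-request overhead is an assumed figure for illustration, not a measurement):

```go
package main

import "fmt"

// A rough model of upload time for one object:
//   time = per-request overhead + payload / bandwidth
// The overhead figure (100ms for a request to a cloud storage API)
// is an assumption, not a measured value.
func uploadTime(sizeBytes, bandwidthBytesPerSec, overheadSec float64) float64 {
	return overheadSec + sizeBytes/bandwidthBytesPerSec
}

func main() {
	const bandwidth = 1 << 20 // 1 MB/s upstream, as mentioned above
	const overhead = 0.1      // assumed 100ms per request

	small := uploadTime(64<<10, bandwidth, overhead) // 64KB packfile
	large := uploadTime(4<<20, bandwidth, overhead)  // 4MB chunk

	fmt.Printf("64KB object: %.2fs (%.0f%% overhead)\n", small, 100*overhead/small)
	fmt.Printf("4MB object:  %.2fs (%.1f%% overhead)\n", large, 100*overhead/large)
}
```

Under these assumptions, the fixed per-request cost is most of the total time for a 64KB object but is largely amortized away for a 4MB chunk.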