gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

How is the chunk boundary calculated? #248

Open fracai opened 6 years ago

fracai commented 6 years ago

The design guide mentions finding chunk boundaries using the rolling hash function, and later refers to whether or not the end of a file will trigger a new chunk.

I can see where buzhash is used in chunkmaker, but I'm not familiar enough with what it's checking against to understand what this does. I would have thought the rolling hash would be compared against previously stored chunks, but it looks like the hash is compared against a mask instead.
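For anyone following along, here is a minimal, self-contained sketch of what a mask-based boundary test looks like. This is deliberately not duplicacy's actual buzhash code: the polynomial rolling hash, the window size, and the chunk-size constants are illustrative assumptions, and the real chunk maker additionally enforces minimum and maximum chunk sizes.

```go
// Illustrative sketch only (not duplicacy's buzhash implementation): a rolling
// hash is maintained over a sliding window, and a chunk boundary is declared
// whenever hash & mask == 0, where mask = averageChunkSize - 1 and
// averageChunkSize is a power of two. The boundary therefore depends only on
// the bytes currently in the window, not on any previously stored chunks.
// A simple polynomial rolling hash stands in for buzhash here, and the
// constants are made up for the demo; minimum and maximum chunk sizes are
// omitted.
package main

import "fmt"

const (
	averageChunkSize = 1 << 16 // 64 KiB, just to keep the demo small
	boundaryMask     = averageChunkSize - 1
	windowSize       = 64
	prime            = 16777619
)

// findBoundaries returns the offsets at which chunks would end.
func findBoundaries(data []byte) []int {
	// Precompute prime^windowSize so the byte leaving the window can be
	// removed from the hash in O(1).
	var pow uint64 = 1
	for i := 0; i < windowSize; i++ {
		pow *= prime
	}

	var boundaries []int
	var hash uint64
	for i, b := range data {
		hash = hash*prime + uint64(b)
		if i >= windowSize {
			hash -= uint64(data[i-windowSize]) * pow
		}
		// With a well-mixed hash this test fires roughly once every
		// averageChunkSize bytes, giving variable-size, content-defined chunks.
		if hash&boundaryMask == 0 {
			boundaries = append(boundaries, i+1)
		}
	}
	return boundaries
}

func main() {
	// Deterministic pseudo-random input from a tiny xorshift generator,
	// so the example has no external dependencies.
	data := make([]byte, 1<<20)
	state := uint64(0x9E3779B97F4A7C15)
	for i := range data {
		state ^= state << 13
		state ^= state >> 7
		state ^= state << 17
		data[i] = byte(state)
	}
	fmt.Println(findBoundaries(data))
}
```

The point is that a boundary is a purely local property of the byte stream, so identical runs of bytes produce identical boundaries regardless of which chunks already exist in storage.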

I tried some practical examples to get some insight, but they haven't helped much. Specifically, I made a script to copy in several files, add some more, remove some, rename, etc. Copying everything in and renaming by appending a suffix adds minimal new data. Removing a file from the middle or beginning of the file list and then renaming everything leads to around three additional chunks (default chunk size, files around 1.5 - 4 MB). Presumably the removed file has left a gap in the hash stream and the algorithm takes a few chunks before a previously used boundary is found.

Ultimately I'm looking for tips on tuning the chunk size parameters. The biggest source of new and changed files is going to be images. These will be sorted into dated folders and then moved to my photo library. It's the move to the library that has me thinking about how the hash stream changes and how large the chunks should be in order to reduce new uploads.

Thanks for any detail you can provide on the boundary search.

gilbertchen commented 6 years ago

the removed file has left a gap in the hash stream and the algorithm takes a few chunks before a previously used boundary is found.

This is correct. So when you have many removed files scattered around the directory tree, deduplication will be inefficient. There is a discussion about this issue here: https://duplicacy.com/issue?id=5740615169998848. Reducing the chunk size would definitely help in this case.

However, if your photo folders are added and moved by date, I suspect this will not be an issue.

fracai commented 6 years ago

Thanks, that discussion is very helpful, specifically the large file optimization that inserts artificial boundaries for files that are larger than the average chunk size. I was originally wondering whether every file should start a new chunk, but only breaking for larger files sounds like a more balanced solution.

It looks like this could be added by having the nextReader function that is passed in from backupmanager.Backup return fileReader.currentFile, entry.Size, true instead of just the io.Reader and the boolean. chunkmaker would then call endOfChunk and startNewChunk around duplicacy_chunkmaker.go:218 if the fileSize is greater than maker.hashMask.

Is it safe to use hashMask here? I could store the averageChunkSize as well, but it looks like that value is effectively already stored in the hashMask. I wouldn't want to rely on that if it'd be clearer to just save one more field.
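As a rough sketch of the idea (hypothetical, self-contained Go, not the actual patch; planBoundaries and the file struct are made-up stand-ins for the nextReader / endOfChunk / startNewChunk plumbing mentioned above):

```go
// Hypothetical sketch of the proposal, not duplicacy's code: when the chunk
// maker switches to a new file, it forces a chunk boundary for files larger
// than some threshold tied to the average chunk size, so a big file's
// content-defined boundaries no longer depend on whatever preceded it in the
// hash stream. In the real code this would roughly correspond to calling
// endOfChunk and startNewChunk when nextReader hands over a large file.
package main

import "fmt"

const averageChunkSize = 4 << 20 // assumed default average chunk size (4M)

type file struct {
	name string
	size int64
}

// planBoundaries is an illustrative stand-in for the chunk maker's decision:
// force an artificial boundary before any file bigger than the threshold,
// and let smaller files be packed into the ongoing chunk stream as usual.
func planBoundaries(files []file, threshold int64) {
	for _, f := range files {
		if f.size > threshold {
			fmt.Printf("%-12s %12d bytes: force chunk boundary\n", f.name, f.size)
		} else {
			fmt.Printf("%-12s %12d bytes: pack into current chunk\n", f.name, f.size)
		}
	}
}

func main() {
	files := []file{
		{"notes.txt", 12 << 10},
		{"photo.jpg", 3 << 20},
		{"video.mov", 700 << 20},
	}
	// Threshold as proposed: anything larger than the average chunk size;
	// a multiple of it would work the same way.
	planBoundaries(files, averageChunkSize)
}
```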

Also, it looks like the chunk boundary check basically comes down to the chunk hash ANDed with the averageChunkSize mask equaling 0. Is there anything "magic" about that, or is it just as good as any other heuristic for picking a deterministic boundary?
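For what it's worth, assuming the rolling hash output is roughly uniform and the mask is the average chunk size minus one (a power of two), there seems to be nothing magic about the test beyond being cheap and deterministic; any condition a uniform hash satisfies with the same per-byte probability would give the same chunk-size distribution:

$$\Pr[\,h \,\&\, (\text{avg} - 1) = 0\,] = \frac{1}{\text{avg}}, \qquad \mathbb{E}[\text{bytes between boundaries}] \approx \text{avg}$$

where avg is the configured average chunk size.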

Generally, I agree; I wouldn't expect the photos to be too much of an issue, but the file-size boundary does sound like a good measure.

fracai commented 6 years ago

I have a branch ready, but after some further testing I wonder if it's worth it. I noticed that there's special handling if the min and max chunk sizes are configured to be the same. It looks like using these settings already inserts a boundary for each file. Compared to my branch, this results in fewer chunks, much better deduplication, and fewer bytes transferred overall. I suspect that with many small files, splitting only for bigger files would be more efficient, but I haven't tested that yet.

gilbertchen commented 6 years ago

Your implementation looks right, although I would increase the large file threshold to maybe 2 * maker.averageChunkSize.

Setting the max and min chunk sizes to the same value will disable the variable-size chunking algorithm and switch to the fixed-size chunking algorithm, which leads to two problems: it can't handle byte shifts caused by deletions or insertions, and it can't combine small files into chunks, so it isn't suitable for backing up to cloud storages with high latency. For virtual machine files or database files, though, the fixed-size chunking algorithm should work well.
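To make the byte-shift problem concrete, here is a small self-contained demonstration (illustrative Go, not duplicacy code): with boundaries at fixed offsets, a single byte inserted at the front shifts every later chunk and changes every chunk hash, whereas content-defined boundaries would resynchronize after a few chunks.

```go
// Demonstration of why fixed-size chunking is brittle under insertions:
// boundaries sit at fixed byte offsets, so inserting one byte at the front
// shifts every subsequent chunk and changes every chunk hash.
package main

import (
	"crypto/sha256"
	"fmt"
)

const chunkSize = 1 << 10 // 1 KiB chunks, just for the demo

// chunkHashes splits data at fixed offsets and hashes each chunk.
func chunkHashes(data []byte) []string {
	var hashes []string
	for off := 0; off < len(data); off += chunkSize {
		end := off + chunkSize
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[off:end])
		hashes = append(hashes, fmt.Sprintf("%x", sum[:4]))
	}
	return hashes
}

func main() {
	original := make([]byte, 8*chunkSize)
	for i := range original {
		original[i] = byte(i % 251)
	}
	shifted := append([]byte{0xFF}, original...) // insert one byte at offset 0

	before := chunkHashes(original)
	after := chunkHashes(shifted)

	same := 0
	for i := range before {
		if i < len(after) && before[i] == after[i] {
			same++
		}
	}
	// Prints 0 of 8: every fixed-size chunk changed after a 1-byte insertion.
	fmt.Printf("chunks unchanged after a 1-byte insertion: %d of %d\n", same, len(before))
}
```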