gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

Inefficient Storage after Moving or Renaming Files? #334

Open jonreeves opened 6 years ago

jonreeves commented 6 years ago

I think this may be related to issue #248 .

I noticed that backups of my active folders were growing in size significantly where I expected them to change very little. The best example I can give is my "Photos" folder/snapshot...

I regularly download my SD cards to a temporarily named project folder under a "Backlog" subfolder. It can be days or months before I work on these images, but usually this will involve renaming ALL the files and separating them into subfolders by Scene/Location or Purpose (print, portfolio, etc...). The project folder gets renamed too, and the whole thing is then moved out of "Backlog" to "Catalogued". None of the content of any of the files has physically changed during all this, so the file hashes should be the same.

The renaming and moving alone appears to be enough to make the backup double in size. Any image edits in theory shouldn't have an impact, as typically the original file is untouched... The good thing about modern photo processing is that edits are non-destructive to the original image file. Instead, an XML sidecar file is saved alongside the image file with metadata about the edits applied.

I've yet to test the impact of making image edits, but I suspect it may make things worse, because each file gets another file added after it, and it seemed like file order made a difference to how the rolling hash was calculated.

Image1.raw
Image2.raw
...

Becomes...

Image1.raw
Image1.xmp
Image2.raw
Image2.xmp

This should be perfect for an incremental backup system, but it seems like duplicacy struggles under these circumstances. I figured the -hash option might help, but it didn't seem to.

Am I doing something wrong or missing an option?

Is this a bug, or just a design decision?

Is there any possible way to improve this?

Although the above example may sound unique, I find this happens in almost all my everyday folders: design projects, website coding projects... files and folders just get reorganized often.

I'm guessing the only way to reclaim this space would be to prune the older snapshots where the files were named differently?

TowerBR commented 6 years ago

I think the Evernote database optimization simply changes so much of the database that there simply is not much of the old data that can be reused

I think the same: since everything is new, everything has to be sent again, and there is nothing else that can be done. But I wonder about the impact of this on really big databases. I'm very curious about how Vertical Backup handles it.

The total number of chunks does not radically differ between branches (the graph exaggerates by being relative)

I agree; in fact it is not because they are relative, but because the graph scale is small (2610-2710, a range of only 100).

Any attempt to sort the files before processing can also be catastrophic, because you never know how the files are manipulated in a way that drastically affects how they are sorted.

Exactly. We must think about the "cause" of the problem, and I do not think the order is the main cause. I think it is more the inability to "look ahead" in the file queue.

I have not yet studied the Duplicacy code like you have, so I do not know if what I'm going to say is correct and/or has already been implemented, but think of the file queue:

If file 3 has a small modification, I think what Duplicacy does today is to change the subsequent chunks until it finds an old chunk where the boundary fits, generating new chunks 6, 7 and so on:

But if it could "see" that chunk 5 could be used, it would only generate chunk 6, to replace 4:

Is the current operation like this?

fracai commented 6 years ago

I think you have the current operation correct; at least when using "-hash". Without "-hash", File4 wouldn't be chunked at all unless it has changed. I think you're right though, Duplicacy can't "see" ahead to know that a chunk is reusable. I'm pretty sure it doesn't load existing chunks at all and just calculates what the chunk will be and then tests to see if it's there in the storage or not. It could indeed load all available chunks and then look for those while it's calculating chunks. I don't think that scales very well though. With larger repositories that'd be a large list of hashes to compare with.

bundle-OR-chunk and incrementally building up those small-file-bundles actually sounds like a really good combination though.

tophee commented 6 years ago

Really interesting discussion. Thanks to everyone contributing to it.

Trying to put all this into perspective, may I ask to what extent changes in how chunks are produced in some future or alternative version of duplicacy will require a newly initialized storage (i.e. starting an entirely new backup)?

I understand that the chunksize parameters cannot be changed once the storage is initialized, but if those parameters remain the same ( -c 1M in my case), will I be able to apply, for example file-boundary chunking with the same storage?

My hunch is that it should be possible, but chances are that it will lead to a very large upload, because so many chunks will have changed. But it would still be my preferred path because it would allow a gradual transition to a more efficient chunking algorithm as the old chunks gradually get pruned out. The only cost would be a temporary (though relatively long, depending on the prune settings) increase in storage use.

Is it possible? Or can I expect that I will soon be re-uploading all my stuff once again?

@kairisku re

Any other significant takeaways that should be considered?

Isn't it rather surprising that the file_boundary branch didn't perform any better than the official version? My tentative interpretation is that this is due to the fact that @TowerBR 's tests were not about moving files around but mostly about changing existing files (right?). Also, I believe the tests were run with a version before your most recent tweaks/bugfixes. But even so, it looks like honouring file boundaries only helps under specific conditions (or doesn't help under specific conditions).

kairisku commented 6 years ago

@TowerBR I have not yet studied the Duplicacy code like you do, so I do not know if what I'm going to say is correct and / or has already been implemented

There is no lookahead. The files are simply processed in order (with -hash all files are processed, without it only new or modified files are processed), and the stream of data is then blindly cut into chunks. It is just a matter of chance whether the newly formed chunks are identical to previous ones.

@tophee Trying to put all this into perspective, may I ask to what extent changes in how chunks are produced in some future or alternative version of duplicacy will require a newly initialized storage (i.e. starting an entirely new backup)?

Any significant change to the chunking algorithm will change how the chunks are formed and thus will require a significant upload. The file boundary change will probably have a very large impact for repositories consisting of files that are smaller than the chunksize, but for repositories with very large files there are fewer file boundaries affecting the chunking. In the best case only the chunks around the file boundaries are affected, while the hash-based boundaries within the files should be stable. You can always test the branch with a dry-run to see what would happen.

@tophee Isn't it rather surprising that the file_boundary branch didn't perform any better than the official version? My tentative interpretation is that this is due to the fact that @TowerBR 's tests were not about moving files around but mostly about changing existing files (right?).

The file boundary branch was intended for improving file rename/reorganize scenarios, so no big surprise there.

tophee commented 6 years ago

@kairisku It is just a matter of chance whether the newly formed chunks are identical to previous ones.

I was starting to suspect this but thought I must be misunderstanding something. @gilbertchen I think this is quite serious (or is it only serious in theory?). Doesn't this mean that duplicacy's cross-source deduplication is only a matter of chance? If so, I suppose you will eventually accept a pull request from @kairisku, at least for that branch? (Though it looks like the smaller hash window branch also has its benefits.)

@kairisku The file boundary branch was intended for improving file rename/reorganize scenarios, so no big surprise there.

Okay, I understand that the aim was to improve those scenarios, but isn't it reasonable to expect it to also perform better in "ordinary use" without moving/renaming? What is the reason for the file-boundaries version consistently using more storage space than both other versions? (Okay, except for the database scenario, where it was better than the original.)

kairisku commented 6 years ago

@tophee:

Doesn't this mean that duplicacy's cross-source deduplication is only a matter of chance?

You might think so, but hopefully not. The theory behind splitting chunks using a hash function is to consistently find boundaries where the preceding data looks a certain way. If you have similar or identical files being backed up from different sources, the chunk boundaries should fall at the same positions, resulting in identical chunks that can be deduplicated. Since the boundaries are selected within certain size limits, there is a chance of misalignment. Backups also need to be reasonably fast, so the algorithm cannot try a lot of different boundaries to perhaps find something that aligns with previous data, as that would require far too many resources. The current implementation does a reasonably good job, but there are extreme situations where it falls short.

Okay, I understand that the aim was to improve those scenarios, but isn't it reasonable to expect it to also perform better in "ordinary use" without moving/renaming?

It should perform better, or at least as well, but that is just my opinion. Chunks that are formed from the beginning of files should in theory have a slightly better chance of being deduplicated compared to chunks that are cut from random positions within a file. Consider editing some large file in a way that only changes the data at the end of the file. The beginning is then unchanged, and so is probably the following file, whose beginning should not be bundled together with the end of the previous file. I see no scenario where pasting together the end of one file and the beginning of another into the same chunk would be better than having the files in different chunks, except for the fact that many smaller chunks are not optimal for cloud storages.

What is the reason for the file-boundaries version consistently using more storage space than both other versions?

I suspect @TowerBR tested with a slightly buggy version of the branch that did some extra splitting of the last chunk of a file. A bugfix was committed just after he started his long and detailed test.

tophee commented 6 years ago

It should perform better, or at least as well, but that is just my opinion.

Though I'm not sure whether this is a matter of opinion, I agree. Your reasoning makes complete sense, which is basically why I thought the results were surprising. But if those were due to the little bug which is now fixed, what is keeping you from submitting a PR? Or do you first want to test a version that combines the smaller hash window with respecting file boundaries?

I would love to see these features in the official version!

TowerBR commented 6 years ago

I played a little with chunk settings and did a new test, but I'm really not sure what conclusions to draw (lol): link

kairisku commented 6 years ago

My file_boundaries branch just got a commit for handling large and small files in two separate passes. I concluded this was the simplest change to implement a bundle-OR-chunk algorithm. All files larger than the minimum chunk size are processed first (with file boundary chunking), and in the second pass all the smaller files are bundled and chunked. I hope this will keep catastrophic cascades to a minimum. I have performed almost no tests at all, so there is a chance this commit breaks something in a spectacular way (e.g. if some internal logic goes totally haywire because files are now processed in a slightly different order).
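In pseudo-Go, the split is roughly this (the fileEntry type, the constants and the print statements are placeholders of mine, not the actual code in the branch):

```go
// Rough sketch of the two-pass "bundle-OR-chunk" split: large files are
// chunked individually (respecting file boundaries), small files are
// bundled together in a second pass. Illustrative only.
package main

import "fmt"

type fileEntry struct {
	path string
	size int64
}

// As discussed in this thread, the minimum chunk size is 1/4 of the
// configured average (here an assumed 1M average).
const minChunkSize = (1 << 20) / 4

func backupTwoPass(files []fileEntry) {
	var large, small []fileEntry
	for _, f := range files {
		if f.size >= minChunkSize {
			large = append(large, f)
		} else {
			small = append(small, f)
		}
	}

	// Pass 1: large files. Each one starts at the beginning of a fresh
	// chunk, so its chunks depend only on its own content and survive
	// renames/moves unchanged.
	for _, f := range large {
		fmt.Println("pass 1, chunk individually:", f.path)
	}

	// Pass 2: small files. These are concatenated and cut by the rolling
	// hash, so several of them can share one chunk (the "bundle" case).
	for _, f := range small {
		fmt.Println("pass 2, bundle:", f.path)
	}
}

func main() {
	backupTwoPass([]fileEntry{
		{"Image1.raw", 25 << 20},
		{"Image1.xmp", 12 << 10},
		{"Image2.raw", 24 << 20},
		{"Image2.xmp", 11 << 10},
	})
}
```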

I will in time submit a pull request, but I would like the patch to have a significant positive benefit. The biggest problem is compatibility with chunks in existing storages. Because incremental backups only process new and modified files, the file boundary and processing order changes should actually not cause any problems. But if anyone is doing backups with the -hash option, then you will probably see a significant number of new chunks being uploaded. If the chunking is significantly affected, it might be necessary to put the new algorithms behind some storage parameter, so that initializing a new storage is required.

TowerBR commented 6 years ago

Interesting! What would be a good setup (repository file type, chunk parameters, etc.) to use for a test? ;-)

tophee commented 6 years ago

@kairisku

The biggest problem is compatibility with chunks in existing storages. Because incremental backups only process new and modified files, the file boundary and processing order changes should actually not cause any problems.

What happens if I move files which have been backed up with the old (i.e. current official) algorithm? That will most likely trigger a larger upload, right? But with the benefit that subsequent moves or renames will not have any effects, i.e. no new uploads, right?

kairisku commented 6 years ago

@TowerBR Interesting! What would be a good setup (repository file type, chunk parameters, etc.) to use for a test? ;-)

@jonreeves had a good setup, i.e. first adding small files beside larger ones, doing a backup which uploaded only the small files, and then moving all files to another folder and backing up again (which will back up both the large and the small files). With the latest commit there should be improvements in that scenario.

@tophee What happens if I move files which have been backed up with the old (i.e. current official) algorithm? That will most likely trigger a larger upload, right? But with the benefit that subsequent moves or renames will not have any effects, i.e. no new uploads, right?

You are almost correct! The old files will have a different chunk layout around the file boundaries, so by moving them you will (with my branch) get new chunks around the boundaries (and if unlucky, some further spillover). After that, subsequent moves/renames should have no impact for files larger than the minimum chunk size. There is still the case of small files that have been bundled into the same chunk, and if you move only some of those files, they will get bundled differently and that will result in new chunks.

TowerBR commented 6 years ago

I updated my "test 09" with another job, after Gilbert's explanations here, setting the average chunk size to the size of most files, and the result was exactly as expected, according to his explanation:

[chart06: histogram of chunk sizes]

Which led me to think: for repositories where most files have a similar size, would setting the average chunk size close to the size of most files have an effect similar to the "bundle-OR-chunk" approach described above?

kairisku commented 6 years ago

As Gilbert says: for the official duplicacy implementation, the size of the files has no effect on the chunk sizes, because all the files are bundled into one consecutive stream of data that is chunked using the hash function.

With my file boundaries branch this indeed changes. If you have a repository where most files have a similar size, and that size falls within the min/max chunk size limits, every file will be in its own chunk. The only deduplication that can happen in such a scenario is if there are identical files to begin with. If files are larger than the maximum allowed chunk size, there is a possibility for identical chunks to be found within otherwise non-identical files. With smaller chunks the probability increases (but probably not by much). There is no magic here: for any significant deduplication to take place there has to be some redundancy in the data to begin with, such as copies of the same files, slightly edited files that have common parts with other files, virtual machine images containing common OS files, etc.

TowerBR commented 6 years ago

the size of the files has no effect on the chunk sizes,

Sure, but by setting the chunk size close to the size of most of the files, the number and size of the chunks change, and there is a better chance that a change in a small file will not affect many chunks.

tophee commented 6 years ago

I've read up a bit on rolling hashes and content-based slicing because I wanted to understand the principle by which chunks are created when it is not by file boundary, and what I've learned is that you basically wait until the first X digits of the hash of your rolling window are all zero. Once that happens, you apply a cut and then move on to the next chunk. X is a constant that you choose depending on how big you want the average chunk size to be. The more zeros you want, the more unlikely it is that you will actually get them (i.e. it will take longer until you get them) and hence the bigger your chunks will turn out on average.

So this means that when we're setting the average chunk size, we're not actually telling duplicacy to make the chunks a particular size; we're telling it what the cut-off criterion should be, i.e. how many zeros to watch out for, and that then leads to a distribution of chunk sizes whose average "happens" to be roughly what we requested in the setting (as @TowerBR's chart above nicely illustrates).
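To make that concrete, here is a toy chunker built around exactly that criterion. The rolling hash, window size and limits below are made up by me for illustration; they are not duplicacy's actual ones:

```go
// Toy content-defined chunker: cut wherever the low bits of a rolling hash
// over a small window are all zero. Requiring k zero bits gives a cut about
// once every 2^k bytes, so 2^k is the *expected* chunk size even though
// individual chunks vary.
package main

import "fmt"

const (
	windowSize = 48
	prime      = 16777619           // arbitrary odd multiplier for the toy hash
	avgBits    = 13                 // target average chunk size = 2^13 = 8 KiB
	mask       = 1<<avgBits - 1     // low bits that must be zero to cut
	minSize    = (1 << avgBits) / 4 // duplicacy-style limits: average/4 ...
	maxSize    = (1 << avgBits) * 4 // ... and 4x the average
)

func chunkSizes(data []byte) []int {
	pw := uint32(1) // prime^(windowSize-1), used to drop the oldest byte
	for i := 0; i < windowSize-1; i++ {
		pw *= prime
	}
	var sizes []int
	var hash uint32
	start := 0
	for i := range data {
		if i >= windowSize {
			hash -= uint32(data[i-windowSize]) * pw
		}
		hash = hash*prime + uint32(data[i])
		size := i - start + 1
		if (hash&mask == 0 && size >= minSize) || size >= maxSize {
			sizes = append(sizes, size)
			start = i + 1
		}
	}
	if start < len(data) {
		sizes = append(sizes, len(data)-start)
	}
	return sizes
}

func main() {
	// Pseudo-random test data; any input with enough entropy behaves similarly.
	data := make([]byte, 1<<20)
	x := uint32(2463534242)
	for i := range data {
		x ^= x << 13
		x ^= x >> 17
		x ^= x << 5
		data[i] = byte(x)
	}
	sizes := chunkSizes(data)
	total := 0
	for _, s := range sizes {
		total += s
	}
	fmt.Printf("%d chunks, average %d bytes (target %d)\n",
		len(sizes), total/len(sizes), 1<<avgBits)
}
```

Running it on random data shows the individual chunk sizes scattering around the target rather than hitting it exactly, which is the point I was trying to make.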

I'm not sure if anyone else finds this interesting or non-trivial, but it helped me understand why the average chunk size is not necessarily met exactly (but only on average), or - if I remember the term correctly from maths class some decades ago - it's the "expected value".

I also conclude from this that the file name of each chunk is the hash value of the entire chunk and therefore has nothing to do with the rolling hash (unless the chunk happened to be the same size as the rolling hash window, which is currently possible in duplicacy because the rolling hash window is set to the minimum chunk size, but which is impossible in Kai's hash window branch, right?). And this, in turn, explains why we cannot quite foresee how well deduplication will work (i.e. what the chances are that a particular chunk will be used multiple times): the creation of the chunks has nothing to do with their "deduplicability". Chunking is one thing (based on rolling hashes) and deduplication is another (based on the whole-chunk hashes that serve as file names), right?

Anyway, sorry for thinking out loud. What I actually wanted to ask is this: am I correct in assuming that what the file boundary branch basically does is add a second cut criterion, making duplicacy not just watch out for those zeros but also check each new byte in the rolling hash window for whether it's an end-of-file character (or is it a string?), and if it is: snip! - no matter what the rolling hash is. Right?

TowerBR commented 6 years ago

what the file boundary branch basically does is add a second cut criterion... also check ... whether it's an end-of-file

By my understanding, yes, but let's wait for Kai, "father of the child"...

kairisku commented 6 years ago

Yes, @tophee describes the use of the rolling hash absolutely right. Although the name of a chunk is based on a totally different (more secure) hash from the rolling hash, and the chunk hash covers all the data of the whole chunk, while the rolling hash typically covers a much shorter stretch of bytes.

And the file boundary branch indeed adds a second cut criterion that can cut the chunk regardless of the rolling hash, but only if it would result in a chunk of allowable size.
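Schematically, the cut decision then looks something like this (simplified, with my own parameter names; the real code in the branch is structured differently):

```go
// Simplified view of the cut decision with the extra file-boundary criterion.
func shouldCut(hash, mask uint32, size, minSize, maxSize int, atFileEnd bool) bool {
	// Normal content-defined cut: the low bits of the rolling hash are zero
	// and the chunk is already large enough.
	if hash&mask == 0 && size >= minSize {
		return true
	}
	// Never let a chunk grow beyond the maximum size.
	if size >= maxSize {
		return true
	}
	// Second criterion: cut at the end of a file whenever that still gives
	// an allowable chunk size, so the next file starts in a fresh chunk.
	// Files smaller than minSize do not trigger this and get bundled.
	if atFileEnd && size >= minSize {
		return true
	}
	return false
}
```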

I am not totally clear on what @TowerBR tried to show with his histograms of chunk sizes. If the testing was made with the file boundary branch, then yes, the distribution of source file sizes will affect the sizes of the resulting chunks, because the chunks are cut at the file boundaries. If testing with the original duplicacy implementation, then the chunk size should on average be just the average chunk size as specified when initializing the storage.

For a random distribution of file sizes, I actually expect the file boundary branch to produce larger chunks on average, because the min/max limits of the chunk sizes are 1/4 and 4x the average desired chunk size, and a random distribution from 0.25x to 4x should average 2.125x. But in reality file sizes are not randomly distributed; they follow some funny long-tailed distribution whose name I no longer remember (Zipf?), and therefore the average will be lower. Specific data sets have their own size distributions (photos, office documents, databases, etc.).
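(In units of the configured average chunk size $c$, the 2.125 figure is just the mean of a uniform distribution over the allowed range:)

$$
\mathbb{E}[S] = \frac{0.25\,c + 4\,c}{2} = 2.125\,c
$$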

TowerBR commented 6 years ago

If testing with the original duplicacy implementation, then the chunk size should on average be just the average chunk size as specified when initializing the storage.

The test was done with the original implementation.

I did the test to see how the sizes of the chunks would be distributed "around" the average size, and whether that distribution would be affected by minimum and maximum sizes with a ratio wider than 0.25x to 4x.

What I'm thinking is that if most chunks have sizes close to most files (by setting the average chunk size close to the size of most files), then we'll have a better chance that changes to individual files cause only minor chunk changes.

kairisku commented 6 years ago

My file_boundaries branch now identifies identical small files based on their content hash which avoids problems with new chunks being formed because small files might get packed differently within a chunk. With these latest changes all rename/reorganization operations should result in zero uploaded file chunks (only updated metadata).

Adding the same file multiple times (e.g. in different directories or with different names) should also be properly detected and only one copy will be uploaded (something which the original implementation did not explicitly check for, and due to bundling could cause all copies to end up in different chunks).

The latest changes have only been lightly tested, so do not use for critical production data! I am a bit uncertain about how e.g. pruning might possibly be affected, because it is now possible to have a snapshot where different files point to the same parts of a chunk when the content is identical. It might at least affect size statistics.

tophee commented 6 years ago

Wow, this is fantastic. Can't wait for this to become part of the official version (though I somewhat fear having to re-upload more than a TB of data again).

Though I'm not sure I quite understand the latest changes and what they mean with regard to your bundle-OR-chunk approach.

My file_boundaries branch now identifies identical small files based on their content hash which avoids problems with new chunks being formed because small files might get packed differently within a chunk.

So, by small files you mean those smaller than the min chunk size, right? (Because those are the ones that will be bundled.) If so, are you saying that you are no longer doing any bundling? Or maybe I should just ask: what do you mean by "identifies identical small files"? Where else would files be identified by their content hash but in their file name? Which implies that they are saved under that name rather than bundled. (??)

it is now possible to have a snapshot where different files point to the same parts of a chunk when the content is identical.

I believe this is already possible now in the official branch.

Edit: Took a look at the code and see that you are still packaging the small files (as I thought you would), but I don't understand where you are keeping the index of the packaged small file hashes.

Edit2: Oh, now I see that you are talking about "different files point to the same parts of a chunk". I missed the "parts" part before. So yes, that may indeed be new. In fact, this now appears to me as quite a substantial change in the way duplicacy works... Or maybe not, because what (I suppose) counts for pruning is that the chunk is referenced; prune doesn't care which part of the chunk, right?

kairisku commented 6 years ago

By small files I mean files smaller than the minimum chunksize. Files larger than that are always aligned to begin at the beginning of a new chunk, so the chunking takes care of deduplication for identical files there.

But since the small files are bundled several files into a chunk, there are always some files that start somewhere in the middle of a chunk. If any such file is renamed or moved, it will most likely be bundled differently, which means it would end up in a chunk that looks different from all old chunks.

E.g. say you have files A, B and C that get bundled into chunk X. Rename B to B2 and run a new backup: original duplicacy will only process B2 and put it at the beginning of a new chunk Y (because it is different from X).

There are actually at least three different hashes in duplicacy: the rolling hash used to break large files into chunks, the hash of a whole chunk (which also determines the name of the chunk), and then there is a hash for each source file (a confident fingerprint of the content of the file).

What my latest commit does is entirely based on this last hash, i.e. in the above scenario it will immediately see that B2 is actually identical to B and just record in the snapshot that B2 uses the same part of the same chunk that B previously used. As a slight drawback, should you delete files A and C and prune them from the storage, no chunks will be removed, because the chunk that B2 references must still be kept although parts of it are no longer needed. Because the metadata for A and C has been lost, the unused parts of the chunk will essentially be junk and cannot easily be reused even if A or C should reappear. I do not think this is too much of a problem.

Original duplicacy also has the problem that if you make multiple copies of a file, like D1, D2 and D3, and then run a backup, it will pack D1+D2+D3 into a new chunk. With my branch, if D1 is already in the previous snapshot, nothing will be uploaded at all. If D1 is really a new file, it will be bundled into a chunk, but then D2 and D3 will just point to the same chunk fragment where D1 is located.
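The idea is basically a lookup by content hash before bundling; a rough sketch (illustrative types only; the real branch uses duplicacy's own data structures and file hash, not SHA-256):

```go
// Sketch only: look up a small file's content hash in the previous snapshot
// before bundling it, and reuse the existing chunk fragment if found.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// chunkRef points at a fragment of an existing chunk: which chunk the file
// lives in, where it starts, and how long it is.
type chunkRef struct {
	chunkHash string
	offset    int
	length    int
}

// previousFiles maps a file content hash to the fragment that already holds
// that content, built from the last snapshot's metadata.
type previousFiles map[string]chunkRef

func hashFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

// planSmallFile returns the existing fragment if the file's content is
// already stored; otherwise the caller bundles the file into a new chunk.
func planSmallFile(path string, prev previousFiles) (reused bool, ref chunkRef, err error) {
	h, err := hashFile(path)
	if err != nil {
		return false, chunkRef{}, err
	}
	if ref, ok := prev[h]; ok {
		return true, ref, nil // identical content already stored: upload nothing
	}
	return false, chunkRef{}, nil
}

func main() {
	prev := previousFiles{} // normally populated from the previous snapshot
	reused, ref, err := planSmallFile("Image1.xmp", prev)
	fmt.Println(reused, ref, err)
}
```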

Although this might sound fantastic, the latest fixes really only concern scenarios where you have small files being renamed or copied. Depending on your data, this might be very significant, or have no impact at all.

Did these explanations clarify things, or does something still need a better explanation?

tophee commented 6 years ago

Yes, this clarifies a lot. Thanks for the detailed explanation.

However, what I still don't understand is where you are storing the information about those small files (i.e. their hashes)? Or is it already stored in the snapshot files and you are just using it to compare with apparently new files?

BTW: By "fantastic" I meant that it is a great improvement because it makes duplicacy behave more reasonably and efficiently. Even if it may not have any effect in some scenarios, it still gives me a good feeling because it improves the code.

thrnz commented 6 years ago

Nice work. Depending on the data being backed up, these changes would have a significant positive benefit on deduplication.

I'd be interested to see if there is any tradeoff performance/memory wise when there are a large number of files involved. Both with the 'second pass' for chunking smaller files, and with checking to see if a small file's hash already exists.

I might have a go at setting up and running some tests later to compare.

kairisku commented 6 years ago

@tophee However, what I still don't understand is where you are storing the information about those small files (i.e. their hashes)? Or is it already stored in the snapshot files and you are just using it to compare with apparently new files?

Every file hash is already present in the snapshot (it is e.g. used when validating a snapshot using check -files). So for new small files found when starting a new backup there is a small overhead of calculating their hashes before deciding if they really need to be processed or not. Normally the hash is calculated while packing the file into chunks but at that stage it would be more complicated to skip the file (or someone more familiar with the internal data structures could make an optimization here).

@thrnz I'd be interested to see if there is any tradeoff performance/memory wise when there are a large number of files involved. Both with the 'second pass' for chunking smaller files, and with checking to see if a small file's hash already exists.

Yes, please do some tests! The two-pass chunking should not really have any performance impact, because duplicacy first builds a list of all files that need to be processed and then just goes through the list in order. With my change it just goes through the in-memory list twice, processing the large files in the first pass and the small files in the second pass. I/O might be slightly affected in the sense that the throughput during the second pass can be a bit lower when processing lots of small files (if the reading of the source files is the bottleneck).

The extra hashing of all new small files before starting the whole chunking process can on the other hand have a measurable impact. I didn't have the guts to try any fancy multithreading (not that familiar with Go), so there will probably be an I/O bottleneck when duplicacy in a single thread reads and hashes all the new small files while trying to decide if they really need to be processed or not. Improvements are gladly welcome!

kairisku commented 6 years ago

No I have not tried that, nor do I understand how that would make any difference. Chunks are immutable, once a chunk is created it will not be changed or touched in any way. During an incremental backup, only new and modified files are processed while any old unmodified files are skipped (unless you use the -hash option), so all new and modified files will end up creating new chunks. I see no benefit from changing the order in which files are bundled.

I am not familiar with the worst case behaviour regarding git repositories you are referring to.

Moving and copying of small files are already handled in my branch which causes them to never be included in new chunks if the identical data is already present in an existing chunk. I really should get around to making a PR of my improvements.. :)