gilbertchen / duplicacy

A new generation cloud backup tool
https://duplicacy.com

Inefficient Storage after Moving or Renaming Files? #334

Open jonreeves opened 6 years ago

jonreeves commented 6 years ago

I think this may be related to issue #248 .

I noticed that backups of my active folders were growing in size significantly where I expected them to change very little. The best example I can give is my "Photos" folder/snapshot...

I regularly download my SD cards to a temporarily named project folder under a "Backlog" subfolder. It can be days or months before I work on these images, but usually this will involve renaming ALL the files and separating them into subfolders by Scene/Location or Purpose (print, portfolio, etc...). The project folder gets renamed too, and the whole thing is then moved out of "Backlog" to "Catalogued". None of the content of any of the files has physically changed during all this, so the file hashes should be the same.

The renaming and moving alone appears to be enough to make the backup double in size. Any image edits in theory shouldn't have an impact, as typically the original file is untouched... The good thing about modern photo processing is that edits are non-destructive to the original image file. Instead an XML sidecar file is saved alongside the image file with metadata about the edits applied.

I've yet to test the impact of making image edits, but I suspect it may make things worse because each file gets another file added after it, and it seemed like file order made a difference to how the rolling hash was calculated.

Image1.raw
Image2.raw
...

Becomes...

Image1.raw
Image1.xmp
Image2.raw
Image2.xmp

This should be perfect for an incremental backup system, but it seems like duplicacy struggles under these circumstances. I figured the -hash option might help, but it didn't seem to.

Am I doing something wrong or missing an option?

Is this a bug, or just a design decision?

Is there any possible way to improve this?

Although the above example may sound unique, I find this happens in almost all my everyday folders: design projects, website coding projects... files and folders just get reorganized often.

I'm guessing the only way to reclaim this space would be to prune the older snapshots where the files were named differently?

gilbertchen commented 6 years ago

The new files like Image.xmp and Image2.xmp usually do not affect existing chunks for Image1.raw and Image2.raw. Rather, new files are bundled together and then split into new chunks. However, when you use the -hash option, then Duplicacy will repack all files in order and then split them. This could create tons of new chunks and make previously created chunks obsolete, especially if the average file size isn't much larger than the average chunk size. The same happens when you selectively move a few files from one folder to another.

If you suspect this is the case, then reducing the chunk size is the first option. A chunk size of 1M (instead of the default 4M) has been shown to be able to significantly improve the deduplication efficiency: https://duplicacy.com/issue?id=5740615169998848 and https://duplicacy.com/issue?id=5747610597982208. Unfortunately, you can't change the chunk size after you initialize the storage. You will need to start a new one.
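
For example, initializing a fresh storage with a 1M average chunk size would look something like this (the repository id and storage URL here are just placeholders):

duplicacy init -c 1M repository_id new_storage_url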

The technique mentioned in #248 (which @fracai actually implemented on his branch) is another possibility. By introducing artificial chunk boundaries at large files, it can be guaranteed that moving large files to a new location won't lead to any new chunks. However, this technique could potentially create many chunks that are too small, and I'm unsure whether it would be better than just reducing the average chunk size.

It is possible to retrospectively check the effect of different chunk sizes on the deduplication efficiency. First, create a duplicate of your original repository in a disposable directory, pointing to the same storage:

mkdir /tmp/repository
cd /tmp/repository
duplicacy init repository_id storage_url

Add two storages (ideally local disks for speed) with different chunk sizes:

duplicacy add -c 1M test1 test1 local_storage1
duplicacy add -c 4M test2 test2 local_storage2

Then check out each revision and back up to both local storages:

duplicacy restore -overwrite -delete -r 1 
duplicacy backup -storage test1
duplicacy backup -storage test2
duplicacy restore -overwrite -delete -r 2
duplicacy backup -storage test1
duplicacy backup -storage test2
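
If there are many revisions, the same restore-and-backup pair could simply be wrapped in a small shell loop, for example:

for rev in 1 2; do   # list whichever revisions you want to compare
    duplicacy restore -overwrite -delete -r $rev
    duplicacy backup -storage test1
    duplicacy backup -storage test2
done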

Finally check the storage efficiency using the check command:

duplicacy check -tabular -storage test1
duplicacy check -tabular -storage test2
jonreeves commented 6 years ago

However, when you use the -hash option, then Duplicacy will repack all files in order and then split them.

Ah, I hadn't realized that. I see why that wouldn't help the matter now.

A chunk size of 1M (instead of the default 4M) has been shown to be able to significantly improve the deduplication efficiency.

I tried a 1M chunk size, and it did help, but unfortunately it slowed down uploads to cloud destinations. I'm guessing I'll have to choose between space or speed as a priority.

As a side note, when using 4M chunks I saw an increase of about 12.5% per project when renaming and reorganizing files. I performed the exact same process with a 1M Chunk size and the increase shrunk to 4.6%. From what you say, the growth is likely to be dependent on how jumbled up the order of the files become, but I did notice that the proportions between 4M vs 1M stayed about the same (for my file types / folder structure).

I tested this across 5 different Photo projects/folders and the results were surprisingly similar. I did notice a weird situation where the size would grow even if I was just moving the folder from one parent to another. Initially I wasn't able to replicate it with a simple folder of dummy files, but eventually I was able to isolate the behavior in three steps. I'm not entirely sure why it happens though:

Scenario: I have a Snapshot called Photos that points to my test 'Photos' folder, which has the following structure:

Photos
    |- Backlog
    |    |- 2018-01 - Holiday
    |- Catalogued

The steps I take are as follows:

  1. Init my Photos Folder
  2. Backup my Photos Folder
  3. Reorganize the files in "2018-01 - Holiday" into subfolders
  4. Backup my Photos Folder
  5. Move the "2018-01 - Holiday" folder from "Backlog" to "Catalogued"
  6. Backup my Photos Folder

What happens: Well, after reorganizing, there is a big increase in size (that is already discussed earlier), but then there is another increase when just relocating the folder from the "Backlog" folder to the "Catalogued" folder. This was more unexpected.

This is the resulting Check.

Storage set to samba://../../destination
Listing all chunks
All chunks referenced by snapshot Photos at revision 1 exist
All chunks referenced by snapshot Photos at revision 2 exist
All chunks referenced by snapshot Photos at revision 3 exist

   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |    bytes | new |    bytes |
 Photos |   1 | @ 2018-01-25 23:05 -hash |    42 | 1,085M |    223 | 1,081M |   18 | 105,940K | 223 |   1,081M |
 Photos |   2 | @ 2018-01-25 23:05       |    42 | 1,085M |    223 | 1,086M |    4 |   8,905K |  18 | 110,411K |
 Photos |   3 | @ 2018-01-25 23:06       |    42 | 1,085M |    223 | 1,081M |    5 |   7,247K |   5 |   7,247K |
 Photos | all |                          |       |        |    246 | 1,196M |  246 |   1,196M |     |          |

I have the entire log if it helps, but figured I'd start with this.

I'm just kind of curious why in this circumstance the size would increase when moving between two folders.

gilbertchen commented 6 years ago

As a side note, when using 4M chunks I saw an increase of about 12.5% per project when renaming and reorganizing files. I performed the exact same process with a 1M Chunk size and the increase shrunk to 4.6%.

This information is very helpful. What is the average size of your photo files? I think the technique implemented by @fracai in his branch will definitely reduce the overhead further.

Which cloud storage are you using? I ran a test to upload a 1GB random file to Wasabi with 16 threads and the differences between 4M and 1M are minimal.

This is the result for the 4M test:

Uploaded chunk 201 size 2014833, 37.64MB/s 00:00:01 99.2%
Uploaded chunk 202 size 7967204, 36.57MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Files: 1 total, 1048,576K bytes; 1 new, 1048,576K bytes
File chunks: 202 total, 1048,576K bytes; 202 new, 1048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 3 total, 15K bytes; 3 new, 15K bytes, 14K bytes uploaded
All chunks: 205 total, 1,024M bytes; 205 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:00:28

This is the result from the 1M test:

Uploaded chunk 881 size 374202, 36.49MB/s 00:00:01 99.7%
Uploaded chunk 880 size 2277351, 36.57MB/s 00:00:01 100.0%
Uploaded 1G (1073741824)
Backup for /home/gchen/repository at revision 1 completed
Files: 1 total, 1048,576K bytes; 1 new, 1048,576K bytes
File chunks: 882 total, 1048,576K bytes; 882 new, 1048,576K bytes, 1,028M bytes uploaded
Metadata chunks: 3 total, 64K bytes; 3 new, 64K bytes, 55K bytes uploaded
All chunks: 885 total, 1,024M bytes; 885 new, 1,024M bytes, 1,028M bytes uploaded
Total running time: 00:00:29

Well, after reorganizing, there is a big increase in size (that is already discussed earlier), but then there is another increase when just relocating the folder from the "Backlog" folder to the "Catalogued" folder. This was more unexpected.

This is actually normal. You can think of the pack-and-split approach as arranging files in order, packing them into a big tar ball, and finally splitting the tar ball into chunks. Reorganizing files changes the positions of many files in the tar ball, so a lot of new chunks will be created. Moving files from the "Backlog" folder to the "Catalogued" folder basically moves a sequence of files as a whole, so new chunks will be created at both ends of the sequence (as well as at the gap it left behind), but not inside the sequence (the variable-size chunking algorithm is able to split chunks at previous boundaries as much as it can).

fracai commented 6 years ago

I suppose I should clear the dust off my branch and make a pull request. The current state is that it forces a break if the file is larger than the configured average chunk size. @gilbertchen suggested changing this to break on double the chunk size. My plan is to add a new argument to specify the size that forces the break. By default it could be set to double the average chunk. This would be different than just setting the maximum size as many small files could still be packed in to a large chunk. I think this would be useful for testing and optimizing, but ultimately it'd probably be better for most users if the results of this testing were used to inform the defaults, or create different profiles for different filesets (a photo library, videos, documents, etc.).

jonreeves commented 6 years ago

@gilbertchen

What is the average size of your photo files?

In this example 42MB for the ARW files and 6-11MB for the JPG files. Equal numbers of each.

Which cloud storage are you using? I ran a test to upload a 1GB random file to Wasabi with 16 threads and the differences between 4M and 1M are minimal.

I'm using B2 at the moment and I noted it was about 50% slower with 1M instead of 4M. I'll retest on Monday when I'm back on my fast line, but I suspect it's something that could be combated with more threads, given my latency to the B2 storage servers is about 200ms (despite upload being 200-300Mbit).

The other worry I had with using 1M chunks in the cloud was the cost of API hits in the future. I'm already a little worried about it given I hope to back up around 10TB, so going from 4M to 1M is likely to make this worse. Wasabi may end up making more sense for me if this is the case (no API charge).
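
Just as a rough back-of-the-envelope (ignoring deduplication and metadata chunks), 10TB split into average-sized chunks works out to:

echo $((10 * 1024 * 1024 / 4))   # ~2.6 million chunks at a 4M average
echo $((10 * 1024 * 1024 / 1))   # ~10.5 million chunks at a 1M average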

Moving files from the "Backlog" folder to the "Catalogued" folder basically moves a sequence of files as a whole, so new chunks will be created at both ends of the sequence (as well as at the gap it left behind), but not inside the sequence (the variable-size chunking algorithm is able to split chunks at previous boundaries as much as it can).

I think I follow. The main reason it seemed odd to me was that there were no other files in the whole repository in this example, so I didn't think moving from "Backlog" would actually make a difference. In fact, if I start the process at Step 4, the move doesn't result in any growth.

This got me thinking about whether the same problem would exist if the order of the files was instead dictated by the modified date across the whole snapshot, as opposed to the filename within a subfolder. I guess this would be problematic for a number of reasons; the first that springs to mind is needing to scan all files in the whole repository before doing anything, and keeping that in memory. It would probably also introduce more complexity when files actually change, or for deduping across devices.

@fracai

I think this would be useful for testing and optimizing, but ultimately it'd probably be better for most users if the results of this testing were used to inform the defaults, or create different profiles for different filesets (a photo library, videos, documents, etc.).

Like most people, I have a real mixed collection of data I want to backup. For me, the vast majority in terms of raw MB is Photos (3MB-50MB) and Videos (100MB-4GB each), but I also have VMs (avg 20GB each), Docker Containers (avg 300MB each) and on the other end of the spectrum Node projects (thousands of small files < 50KB).

If I have to run different chunk sizes (to different storages) I will, but it would definitely be good to know what's optimal for my needs.

Your point about different profiles for different filesets is interesting, I can see that being really useful, even at a snapshot level.

kairisku commented 6 years ago

I think inserting chunk boundaries at file boundaries would be very beneficial regarding deduplication. Consider a folder where some randomly chosen files are edited or added every day. File boundaries are very natural break points for changed data and should thus be utilized.

Of course storage performance is dependent on large enough chunks and this has to be weighed against file boundary chunking. In my opinion, any file boundary appearing after the minimum chunk size has been gathered should terminate the chunk, because then the minimum chunk size can be set by the user according to taste and storage requirements. For a random distribution of file sizes the real average chunk size should be approximately 2.1 times the configured average chunk size (sum of 1..n file sizes modulo [0.25 .. 4.0] x storage avg chunk size).

Furthermore I think the rolling hash function should use a significantly smaller window size. The rolling hash is calculated over the minimum chunk size (call it M), which means any change in data within the last M bytes of the chunk (which is on average half of the chunk) would cause the chunk boundary in the stream to move to a new position, and that in turn could lead to a domino-effect spilling over several chunks if subsequent potential chunk boundaries are not within the min/max size limits. Calculating the rolling hash over a fairly short range of bytes (say 256 bytes) would provide much more stable chunk boundaries because small data changes would be very unlikely to affect the chunk boundary.

gilbertchen commented 6 years ago

In my opinion, any file boundary appearing after the minimum chunk size has been gathered should terminate the chunk.

I agree this is a good idea.

Furthermore I think the rolling hash function should use a significantly smaller window size.

Duplicacy uses the minimum chunk size as the rolling hash window not only because it has one fewer parameter to configure, but also to avoid the worst-case scenario where you have too many boundaries (with a 256 byte window, for instance). I think a much larger window makes the rolling hash tend to be more random (although I don't have data to support this claim).

kairisku commented 6 years ago

Duplicacy uses the minimum chunk size as the rolling hash window not only because it has one fewer parameter to configure, but also to avoid the worst-case scenario where you have too many boundaries (with a 256 byte window, for instance). I think a much larger window makes the rolling hash tend to be more random (although I don't have data to support this claim).

Too much randomness is not necessarily your friend. If the data stream has some reoccurring patterns, it could be beneficial to be able to find them using a smaller hashing window. The minimum chunk size will still limit the maximum number of chunks you can get.

I would have thought that a 256 byte window would provide enough entropy to avoid degenerate cases of too many chunks, and I tried to run some test data using different window sizes. The results were not really what I expected, leading me to believe I perhaps messed up the hashing code somehow.

My test repository is a windows share with lots of rather small files (office documents, some executables) and storage initialized with 1M average chunk sizes:

Default window size (minimum chunksize, i.e. 256 kB):
Files: 26401 total, 7,528M bytes; 26401 new, 7,528M bytes
File chunks: 5943 total, 7,528M bytes; 5363 new, 6,845M bytes, 4,627M bytes uploaded

Window size                 File chunks
  256 kB                        5943
  128 kB                        6048
   32 kB                        6173
    4 kB                        6808
 1096 B                         5927
 1024 B                         8998
  512 B                        10298
  320 B                         5973
  256 B                        12092
  255 B                         5794
  192 B                         6287
  128 B                        13247
   64 B                         7227
   56 B                         5795

It strikes me as very odd that smaller window sizes that are even powers of two give significantly more chunks, while other window sizes perform almost identically to the original window size.

The modified chunkmaker code can be seen in my hash_window branch.

TowerBR commented 6 years ago

If I have to run different chunk sizes (to different storages) I will, but it would definitely be good to know what's optimal for my needs.

+1!

jonreeves commented 6 years ago

@kairisku

It strikes me as very odd that smaller window sizes that are even powers of two give significantly more chunks, while other window sizes perform almost identical to the original window size.

That is interesting. What was your file chunk count for the original window size?

kairisku commented 6 years ago

@jonreeves: The original window size is the minimum chunk size, i.e. 256 kB in my test. That gave 5943 chunks.

kairisku commented 6 years ago

[kairisku]: In my opinion, any file boundary appearing after the minimum chunk size has been gathered should terminate the chunk.

[gilbertchen]: I agree this is a good idea.

@gilbertchen: My branch file_boundaries implements this. I had to increase the bufferCapacity to maximumChunkSize in order to detect file boundaries occurring in the whole range between minimumChunkSize and maximumChunkSize.

For repositories with very small files (average file size well below the average chunk size) this chunking can give a slight increase in number of chunks (in my extreme testing I got ~20% more chunks), but for file sizes averaging at or above the average chunk size there should not be any measurable difference in number of chunks.

I have not tested how effective this file boundary chunking is when editing or reordering files. @jonreeves, would you be able to test this variant?

jonreeves commented 6 years ago

@kairisku definitely. I'll be back with my data set on Saturday and can perform the same renames/moves as before and report back.

jonreeves commented 6 years ago

@kairisku I tested out your branch.

I have a source Folder that contains Images and have made duplicates of it in 8 different states to indicate 8 potential changes over time to the Location and Contents of the Folder.

Directory Structure: My source has a simple two-subdirectory structure. The intent is to illustrate a working folder where things change frequently (the Backlog folder) and a Catalog folder where files are moved when archived.

Source
    |- .duplicacy/
    |- Backlog/
    |- Catalog/

Stages / Snapshot Revisions: I took a 1GB collection of Photos and created 7 versions of the Project Folder in different stages, so that I could copy each into the Source folder (removing the previous stage) and run the next Backup/Snapshot.

None of the files change in this example; only new files are added and existing ones renamed or moved. Because of this I ensured the Modified Date of the files remained the same between stages where appropriate. Below is a list of the Stages, which effectively represent the Revisions in the Storage, with an explanation of what happened to the folder/files in each Stage:

  1. Stage 1 - Initial Backup containing only Source directory structure
  2. Stage 2 - Download SD Card to Project Folder inside Backlog/
  3. Stage 3 - Batch Rename Files inside Project Folder
  4. Stage 4 - Rename the Project Folder
  5. Stage 5 - Process desired RAW Files: Produces small .xmp files named the same as images
  6. Stage 6 - Render Outputs: New JPGs added to a subfolder called outputs
  7. Stage 7 - Reorganize: Place original Images and new .xmp files into named subfolders
  8. Stage 8 - Catalog: Move the Project Folder to a different path catalog/

Results: I ran both the official build and the File Boundaries build with the same init flags. Here are the results showing how each stage's actions impacted the storage used:

# Duplicacy 2.0.10
#    -e -min 1M -c 4M -max 16M

   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |   bytes | new |   bytes |
 Photos |   1 | @ 2018-02-10 23:03 -hash |       |        |      3 |    384 |    3 |     384 |   3 |     384 |
 Photos |   2 | @ 2018-02-10 23:03       |    32 | 1,048M |    224 | 1,044M |    4 |  4,735K | 224 |  1,044M |
 Photos |   3 | @ 2018-02-10 23:04       |    32 | 1,048M |    224 | 1,044M |    1 |      3K |   6 | 16,747K |
 Photos |   4 | @ 2018-02-10 23:04       |    32 | 1,048M |    224 | 1,044M |    1 |      3K |   1 |      3K |
 Photos |   5 | @ 2018-02-10 23:04       |    37 | 1,048M |    225 | 1,044M |    3 |     19K |   4 |     30K |
 Photos |   6 | @ 2018-02-10 23:04       |    42 | 1,085M |    234 | 1,081M |    3 |     21K |  12 | 38,309K |
 Photos |   7 | @ 2018-02-10 23:04       |    42 | 1,085M |    234 | 1,081M |    3 |     21K |  18 | 88,541K |
 Photos |   8 | @ 2018-02-10 23:04       |    42 | 1,085M |    234 | 1,081M |    6 | 18,054K |   6 | 18,054K |
 Photos | all |                          |       |        |    274 | 1,202M |  274 |  1,202M |     |         |
# Duplicacy 2.0.10 (Kairisku File Boundaries)
#    -e -min 1M -c 4M -max 16M

   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |   bytes | new |    bytes |
 Photos |   1 | @ 2018-02-10 23:29 -hash |       |        |      3 |    378 |    3 |     378 |   3 |      378 |
 Photos |   2 | @ 2018-02-10 23:29       |    32 | 1,048M |    222 | 1,044M |    4 |  2,917K | 222 |   1,044M |
 Photos |   3 | @ 2018-02-10 23:29       |    32 | 1,048M |    222 | 1,044M |    1 |      3K |   6 |  10,160K |
 Photos |   4 | @ 2018-02-10 23:30       |    32 | 1,048M |    222 | 1,044M |    1 |      3K |   1 |       3K |
 Photos |   5 | @ 2018-02-10 23:30       |    37 | 1,048M |    223 | 1,044M |    3 |     19K |   4 |      30K |
 Photos |   6 | @ 2018-02-10 23:30       |    42 | 1,085M |    229 | 1,081M |    3 |     20K |   9 |  38,308K |
 Photos |   7 | @ 2018-02-10 23:30       |    42 | 1,085M |    226 | 1,081M |    3 |     20K |  18 | 130,081K |
 Photos |   8 | @ 2018-02-10 23:30       |    42 | 1,085M |    225 | 1,081M |    6 | 21,811K |   6 |  21,811K |
 Photos | all |                          |       |        |    269 | 1,240M |  269 |  1,240M |     |          |

Conclusions: It would seem that the File Boundaries build added an additional 38MB to the total over the official build. At Stage 3 - Batch Rename Files it actually saved some space (10,160K vs 16,747K), but later on Stage 7 - Reorganize Files into Subfolders uses more (130,081K vs 88,541K), and finally Stage 8 - Catalog also uses slightly more (21,811K vs 18,054K):

                                  Official      File Boundaries
----------------------------------------------------------------
                                 |       bytes |         bytes |
----------------------------------------------------------------
Stage 1 - Initial Backup         |         384 |           378 |
Stage 2 - Download SD Card       |      1,044M |        1,044M |
Stage 3 - Batch Rename Files     |     16,747K |       10,160K |
Stage 4 - Rename Project Folder  |          3K |            3K |
Stage 5 - Process RAW Files      |         30K |           30K |
Stage 6 - Render Outputs         |     38,309K |       38,308K |
Stage 7 - Reorganize             |     88,541K |      130,081K |
Stage 8 - Catalog                |     18,054K |       21,811K |

Thoughts: I think I mentioned this in another thread, but it's worth bringing up again. I suspect I just need to rethink how I back up files that actively change a lot vs ones that are for archiving. I just need to work out a good strategy for doing this that doesn't involve me physically moving my folders about to meet the needs of the backup software.

I guess the intended way would be to make Backlog and Catalog separate sources and allow them to back up at different frequencies, and rely on pruning to eventually reclaim the space lost to the moves/renames done in Backlog.

TowerBR commented 6 years ago

@jonreeves , @kairisku , maybe you're interested in some tests I'm doing: link

They can give you some ideas...

kairisku commented 6 years ago

@jonreeves I suspect you have an unfortunate interaction between the sizes of your files and the chunk sizes leading to an extreme situation. After moving the files around they appear as completely new files (regardless of their timestamp), causing the RAW, xmp and JPG files to appear together, and if the file boundaries do not fall within the chunk size limits you get completely new chunks with data from different files.

You could check how the file and chunk boundaries are aligned by extracting the file information of the snapshot with something like this:

duplicacy cat | awk '/"files"/,/"id"/'

and then look at the second number in the content -tag which should be zero if the file starts at the beginning of a chunk, i.e.

"content": "1830:0:1830:692736",

is an example of a file that starts at offset zero of chunk 1830. Since the number of files in your test is quite small, you could also show us all the content rows with

duplicacy cat | grep content
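
or, to quickly count how many files start exactly at a chunk boundary (second field equal to zero), something along these lines should do:

duplicacy cat | grep -c '"content": "[0-9]*:0:'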

I would recommend using smaller chunks, e.g. average 1M chunks as that would decrease the probability of different files being bundled in the same chunk after reorganizing.

kairisku commented 6 years ago

@TowerBR maybe you're interested in some tests I'm doing: link

They can give you some ideas...

Yes, I have looked at your tests regarding the optimization of the Evernote database causing massive data uploads. I cannot really say how much of the database is actually modified during the optimization. If there are many changes everywhere, then nothing much can be done about it. But I think the variable chunking algorithm in duplicacy has room for improvement (as I said in an earlier comment in this issue).

By default duplicacy uses a hashing window equal to the minimum chunk size, so for your 1M average chunk size that means it hashes 256 kB at a time. If a single byte of that 256 kB has changed, the hash changes and the variable boundary gets moved to a new location which can give a cascade effect of many new chunks.

Could you try my hash_window branch which uses a smaller window size of 255 bytes for the hash? It should keep the chunk boundaries more stable leading to less changed data in your case (provided the Evernote optimization actually leaves portions of the data unchanged).

TowerBR commented 6 years ago

Could you try my hash_window branch which uses a smaller window size of 255 bytes for the hash?

Ok! I could run a test with "official" Duplicacy vs. yours, both of them with 1M fixed chunks. Is it a good setup?

kairisku commented 6 years ago

Ok! I could run a test with "official" Duplicacy vs. yours, both of them with 1M fixed chunks. Is it a good setup?

Not fixed, use variable size chunks with 1M average size.. :)

TowerBR commented 6 years ago

I asked because the performance of Duplicacy with fixed chunks was better than with variable chunks for this case (basically a large SQLite file), and this is the setup recommended by Gilbert for the case.

Then the setup would be: "official" Duplicacy with 1M fixed vs. your branch with 1M variable? Or both 1M variable?

kairisku commented 6 years ago

I am hoping that my branch improves the variable chunking to be at least as efficient as using fixed chunks, leading to variable chunking being the best configuration for all cases. So your primary comparison should be official Duplicacy with 1M fixed chunks vs my branch with 1M variable chunks. Feel free to do any other comparison you find interesting (the mailbox-editing scenarios could also be interesting).

TowerBR commented 6 years ago

Please, give me a little help, since I have no experience compiling programs in Go: I downloaded and installed Go, I cloned your main repository to \Go\src\github.com\duplicacy\ and tried:

git checkout hash_window
go build

But it complained about a number of dependencies (Google, Minio, etc). Do I have to download them all, or is there any way to compile by accessing them online?

jonreeves commented 6 years ago

@TowerBR if I remember correctly the project uses dep for dependency management, so I think you need to install that too, then run dep ensure to fetch all the libraries: https://golang.github.io/dep/

You can then enter the duplicacy/ folder and run go build.
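
Roughly something like this, assuming a Unix-like shell and that the sources are under $GOPATH/src/github.com/gilbertchen/duplicacy (adjust the paths for Windows):

go get -u github.com/golang/dep/cmd/dep   # one way to install dep (see the dep site above for other options)
cd $GOPATH/src/github.com/gilbertchen/duplicacy
dep ensure
cd duplicacy
go build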

jonreeves commented 6 years ago

and then look at the second number in the content -tag which should be zero if the file starts at the beginning of a chunk...

@kairisku I re-ran the tests again, but with a cat after each Stage. Your binary produced the same :0 count as the official one. At Stage 8 only one file started at the beginning of a chunk - the first file in the first folder.

I suspect you have an unfortunate interaction between the sizes of your files and the chunk sizes leading to an extreme situation. After moving the files around they appear as completely new files

I think you're right. Each JPG is 12MB, each ARW 41MB, and each XMP 12KB. There is an MP4 that is 270MB, but those vary from file to file because video size is dictated by duration.

What seems odd to me, is that these JPG and ARW files don't change, so the File Hash should always be the same. You'd expect that if a new/moved file is discovered on a subsequent backup run, and it has the 'exact' same File Hash, that you could effectively just relink it to the existing Chunks and boundaries.

I realize this likely isn't possible because of the current chunking approach, and obviously you can't know the File Hash until you read the file completely, by which time the "new" file is already partway into a new set of chunks unnecessarily.

I semi-hoped the -hash option would compute a complete File Hash before committing new files to chunks (staging them on disk or in memory), but that doesn't appear to be how it works.

TowerBR commented 6 years ago

@jonreeves , thanks for the help!

I moved a few steps ahead...

The dep ensure command is returning:

C:\Go\src\github.com\duplicacy>dep ensure
Warning: the following project(s) have [[constraint]] stanzas in Gopkg.toml:

  ✗  github.com/gilbertchen/azure-sdk-for-go

However, these projects are not direct dependencies of the current project:
they are not imported in any .go files, nor are they in the 'required' list in
Gopkg.toml. Dep only applies [[constraint]] rules to direct dependencies, so
these rules will have no effect.

Either import/require packages from these projects so that they become direct
dependencies, or convert each [[constraint]] to an [[override]] to enforce rules
on these projects, if they happen to be transitive dependencies,

grouped write of manifest, lock and vendor: link error: cannot rename C:\Users\Admin\AppData\Local\Temp\dep155995302\vendor to C:\Go\src\github.com\duplicacy\vendor: rename C:\Users\Admin\AppData\Local\Temp\dep155995302\vendor C:\Go\src\github.com\duplicacy\vendor: Access is denied.

I already tried to run with my user and as admin.

The first error seems to be no problem, but I didn't understand the last one (access denied, logged in as admin?).

kairisku commented 6 years ago

@TowerBR: I do not know the intricacies of Go and its dependencies. I just initially followed the installation instructions in the Duplicacy wiki, i.e.

 go get -u github.com/gilbertchen/duplicacy/...

which fetched (took a long while) all the needed sources for me (and it took me a while to realize the three dots at the end of the command should actually be there!). After that I created my fork and could use the already fetched dependencies.

Note: after checking out my branch you need to edit the import in duplicacy/duplicacy_main.go to reference "github.com/kairisku/duplicacy/src" instead of gilbertchen's sources, as otherwise go would just use the original sources and not my branch.
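
On Linux, a hypothetical one-liner for that edit would be something like:

# hypothetical sed invocation for the import swap described above (GNU sed, run from the repository root)
sed -i 's|github.com/gilbertchen/duplicacy/src|github.com/kairisku/duplicacy/src|' duplicacy/duplicacy_main.go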

@jonreeves: the above note applies to you too, since it seems you are not getting any file boundary chunks. Otherwise your conclusions seem correct. Duplicacy does not use file hashes at all to identify previously seen files that may have changed names or locations, but rather concatenates the contents of all files into a long data stream that is cut into chunks according to artificial boundaries based on a hash function.

My file_boundaries branch also considers file boundaries as potential chunk separators (within the specified chunk size limits), which should result in two large enough files not being bundled into the same chunk. Of course, when a file boundary is found before the minimum chunk size has been collected, the file has to be bundled with one or more following files to reach a large enough chunk. With the default chunk size limits I think there is a 1/16 chance of bundling still happening.

My hash_window branch on the other hand uses a significantly smaller hash window which should result in the hash-based chunk boundaries being more stable when data within a file is changed (i.e. SQLite databases or virtual machine images). The hash-based chunk boundaries are created when the hash of the last n bytes give a specific pattern, which means any changes within the last n bytes of a chunk will cause the hash to change and the chunk boundary to move to a different location which in turn causes one or several subsequent chunks to also change. My branch changes n from the large value of minchunksize (1M if using the default 4M chunks) to 255, which means a data change within a chunk has a very low probability of affecting the chunk boundary.

Both of my branches could be applied simultaneously (because each addresses different problems with the current chunking approach). I am also trying to test the changes myself, but do not actually have the very distinct issues @jonreeves and @TowerBR have been investigating, so I appreciate the testing assistance.

jonreeves commented 6 years ago

@kairisku thanks for the guidance, I didn't think to check the import statement. I did note that the output binary was different in size, so I assumed it compiled something different from the official build, but... that was probably because the official src has also changed since the official binary was made in Nov 2017. This probably explains the difference in results shown above too (something related to the commits since Nov).

I followed your suggestions and re-built the binary and re-ran the tests. A big improvement for sure this time:

# Duplicacy 2.0.10 (Kairisku File Boundaries)
#    -e -min 1M -c 4M -max 16M
   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |   bytes | new |   bytes |
 Photos |   1 | @ 2018-02-13 08:19 -hash |       |        |      3 |    378 |    3 |     378 |   3 |     378 |
 Photos |   2 | @ 2018-02-13 08:19       |    32 | 1,048M |    173 | 1,044M |    3 |     15K | 173 |  1,044M |
 Photos |   3 | @ 2018-02-13 08:19       |    32 | 1,048M |    171 | 1,044M |    1 |      3K |   6 | 14,449K |
 Photos |   4 | @ 2018-02-13 08:19       |    32 | 1,048M |    171 | 1,044M |    1 |      3K |   1 |      3K |
 Photos |   5 | @ 2018-02-13 08:19       |    37 | 1,048M |    172 | 1,044M |    3 |     16K |   4 |     27K |
 Photos |   6 | @ 2018-02-13 08:20       |    42 | 1,085M |    179 | 1,081M |    3 |     17K |  10 | 38,305K |
 Photos |   7 | @ 2018-02-13 08:20       |    42 | 1,085M |    180 | 1,081M |    3 |     17K |   8 | 24,921K |
 Photos |   8 | @ 2018-02-13 08:20       |    42 | 1,085M |    178 | 1,081M |    5 | 15,115K |   5 | 15,115K |
 Photos | all |                          |       |        |    210 | 1,135M |  210 |  1,135M |     |         |
                                    Official        FB        Official        FB
--------------------------------------------------------------------------------------
                                 |      bytes |      bytes |     chunks |     chunks |
--------------------------------------------------------------------------------------
Stage 1 - Initial Backup         |        384 |        384 |          3 |          3 |
Stage 2 - Download SD Card       |     1,044M |     1,044M |        221 |        173 |
Stage 3 - Batch Rename Files     |    16,747K |    14,449K |        221 |        171 |
Stage 4 - Rename Project Folder  |         3K |         3K |        221 |        171 |
Stage 5 - Process RAW Files      |        30K |        27K |        222 |        172 |
Stage 6 - Render Outputs         |    38,309K |    38,305K |        227 |        179 |
Stage 7 - Reorganize             |    88,541K |    24,921K |        226 |        180 |
Stage 8 - Catalog                |    18,054K |    15,115K |        225 |        178 |
--------------------------------------------------------------------------------------
All Revisions                    |     1,259M |     1,135M |        267 |        210 |

This time, every file has :0 in its content key, so it appears to be working as expected.

I'll try out the hash_window branch a bit later, and see what happens with that approach.

Updated: Sorry, there was a copy/paste error in the bottom table. The correct numbers are now reflected.

TheBestPessimist commented 6 years ago

@jonreeves could you please also update the table with the number of chunks in each revision? That would also be useful, since I am interested in the space saved <-> number of API calls (Google Drive is slow after all, and its rate limiting is annoying!)

jonreeves commented 6 years ago

@TheBestPessimist good point, I'm also keeping an eye on this.

Updated.

kairisku commented 6 years ago

@jonreeves nice to see an improvement, but still a bit surprising to see that much data uploaded after stages 3, 7 and 8. If every file indeed has :0: in its content key after those stages, the chunks should be from the beginning of each file and consequently be identical between different stages. Where does it go wrong? Can you compare the content, hash, path and size keys for some file between renamed revisions?

Since you are not making changes to the actual content of the files (or are you?), I do not think the hash_window branch will make any difference to you.

jonreeves commented 6 years ago

@kairisku I'm not changing any existing files, but let me double check the file sizes, file dates and file hashes manually at each stage to make sure I'm not making a mistake somewhere along the process. I did check this before, but want to be certain because I'm running it by script now instead of by hand.

I will look through the logs to compare what you suggested also to see when/if they change.

At 5 the XMP files are introduced (so... new files added). These are named identically to the ARW files, so they end up shuffling the sort order (but that doesn't appear to impact the size adversely, yet).

At 6 new JPG files are Added. Again it doesn't appear to affect the existing chunks.

At 3, 7 and 8 no files change or get added - only renames or moves of already backed-up files. This was why I was surprised too. I noted this in the second post in this issue, where I distilled the process to just 3 revisions to illustrate the problem... see above.

I'm away from the machine with the test files on it, so I'll have to pick it up again in a few hours. I'll get back to you with my findings.

TowerBR commented 6 years ago

(and it took me a while to realize the three dots at the end of the command should actually be there!)

I can't believe it was just that ...

Everything is fine now, and I'm going to run the tests.

I will use the repository with the mbox files, as I think it has a wider application, since it involves SQLite files and text files. I will use 3 jobs, all with 1M variable chunks: official Duplicacy, your hash_window branch, and also the file_boundaries branch.

jonreeves commented 6 years ago

So I took the output of the cat that I recorded at each stage and parsed it to check and compare the hash, size and time of the files at each stage.

As far as Duplicacy is reporting, everything is unmodified. Even so, space is being taken up because of the rename/move.

jonreeves commented 6 years ago

I ran the hash_window branch, the updated table is below:

# Duplicacy 2.0.10 (Kairisku Hash Window)
#    -e -min 1M -c 4M -max 16M
   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |   bytes | new |    bytes |
 Photos |   1 | @ 2018-02-13 21:26 -hash |       |        |      3 |    378 |    3 |     378 |   3 |      378 |
 Photos |   2 | @ 2018-02-13 21:26       |    32 | 1,048M |    217 | 1,044M |    4 |  5,381K | 217 |   1,044M |
 Photos |   3 | @ 2018-02-13 21:26       |    32 | 1,048M |    217 | 1,044M |    1 |      3K |   6 |  13,352K |
 Photos |   4 | @ 2018-02-13 21:26       |    32 | 1,048M |    217 | 1,044M |    1 |      3K |   1 |       3K |
 Photos |   5 | @ 2018-02-13 21:27       |    37 | 1,048M |    218 | 1,044M |    3 |     19K |   4 |      30K |
 Photos |   6 | @ 2018-02-13 21:27       |    42 | 1,085M |    226 | 1,081M |    3 |     20K |  11 |  38,309K |
 Photos |   7 | @ 2018-02-13 21:27       |    42 | 1,085M |    225 | 1,081M |    3 |     20K |  15 | 105,256K |
 Photos |   8 | @ 2018-02-13 21:27       |    42 | 1,085M |    226 | 1,081M |    7 | 15,056K |   7 |  15,056K |
 Photos | all |                          |       |        |    264 | 1,212M |  264 |  1,212M |     |          |
                                    Official        FB           HW        Official        FB           HW
----------------------------------------------------------------------------------------------------------------
                                 |      bytes |      bytes |      bytes |     chunks |     chunks |     chunks |
----------------------------------------------------------------------------------------------------------------
Stage 1 - Initial Backup         |        384 |        384 |        378 |          3 |          3 |          3 |
Stage 2 - Download SD Card       |     1,044M |     1,044M |     1,044M |        221 |        173 |        217 |
Stage 3 - Batch Rename Files     |    16,747K |    14,449K |    13,352K |        221 |        171 |        217 |
Stage 4 - Rename Project Folder  |         3K |         3K |         3K |        221 |        171 |        217 |
Stage 5 - Process RAW Files      |        30K |        27K |        30K |        222 |        172 |        218 |
Stage 6 - Render Outputs         |    38,309K |    38,305K |    38,309K |        227 |        179 |        226 |
Stage 7 - Reorganize             |    88,541K |    24,921K |   105,256K |        226 |        180 |        225 |
Stage 8 - Catalog                |    18,054K |    15,115K |    15,056K |        225 |        178 |        226 |
----------------------------------------------------------------------------------------------------------------
All Revisions                    |     1,259M |     1,135M |     1,212M |        267 |        210 |        264 |

There are some savings on Stage 3 and Stage 8, but Stage 7 increased in size. I suspect this could be because of other commits since the Nov 2017 release, as noted previously.

kairisku commented 6 years ago

@jonreeves I just committed a fix to the file_boundaries branch.

I found out that for the last file processed in a backup run it would accidentally split the last chunk into smaller fragments instead of keeping it together. This seems to explain why there was data uploaded after reorganizing files. In my limited tests I have now achieved 0 file chunks uploaded after renaming or reorganizing files. Do you have time to run your tests one more time with the latest fixes?

As expected, the hash_window branch did not really affect your scenario much, but I am eagerly waiting for results from @TowerBR when file contents are changed.

jonreeves commented 6 years ago

@kairisku ah very interesting. I'm looking forward to trying it out. This was my biggest issue with Duplicacy compared to restic etc...

I'll give it a go and report back this evening. Cheers.

TowerBR commented 6 years ago

but I am eagerly waiting for results from @TowerBR when file contents are changed.

The tests are running, and I decided to run both: with the Mbox files and with the Evernote folder, each one with 3 jobs. I'll post the results here in 2 or 3 days.

jonreeves commented 6 years ago

@kairisku unfortunately the results weren't so good on my side.

# Duplicacy 2.0.10 (Kairisku Hash Window - 5355dd834a01030e8c5c204c9861b14dc8fafadd)
#    -e -min 1M -c 4M -max 16M
   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |   bytes | new |    bytes |
 Photos |   1 | @ 2018-02-14 18:19 -hash |       |        |      3 |    378 |    3 |     378 |   3 |      378 |
 Photos |   2 | @ 2018-02-14 18:19       |    32 | 1,048M |    216 | 1,044M |    4 |  6,790K | 216 |   1,044M |
 Photos |   3 | @ 2018-02-14 18:19       |    32 | 1,048M |    216 | 1,044M |    1 |      3K |   6 |  27,051K |
 Photos |   4 | @ 2018-02-14 18:20       |    32 | 1,048M |    216 | 1,044M |    1 |      3K |   1 |       3K |
 Photos |   5 | @ 2018-02-14 18:20       |    37 | 1,048M |    217 | 1,044M |    3 |     19K |   4 |      30K |
 Photos |   6 | @ 2018-02-14 18:20       |    42 | 1,085M |    224 | 1,081M |    3 |     20K |  10 |  38,307K |
 Photos |   7 | @ 2018-02-14 18:20       |    42 | 1,085M |    224 | 1,081M |    3 |     20K |  18 | 106,816K |
 Photos |   8 | @ 2018-02-14 18:20       |    42 | 1,085M |    225 | 1,081M |    7 | 23,466K |   7 |  23,466K |
 Photos | all |                          |       |        |    265 | 1,235M |  265 |  1,235M |     |          |

The usage jumped back up compared to the previous build 1,235M vs 1,135M.

kairisku commented 6 years ago

@jonreeves Based on the number of chunks (and the performance in general) it seems you are not running my branch. Perhaps the import in duplicacy_main.go is wrong again?

jonreeves commented 6 years ago

@kairisku urgh, I'm sorry - I'm glad one of us is paying attention. The way Go references imports absolutely is really strange. I'll need to read up on how to handle that for forks in general; I'm assuming there is a better way than remembering to edit the code.

# Duplicacy 2.0.10 (Kairisku Hash Window - 5355dd834a01030e8c5c204c9861b14dc8fafadd)
#    -e -min 1M -c 4M -max 16M
   snap | rev |                          | files |  bytes | chunks |  bytes | uniq |  bytes | new |   bytes |
 Photos |   1 | @ 2018-02-15 09:08 -hash |       |        |      3 |    378 |    3 |    378 |   3 |     378 |
 Photos |   2 | @ 2018-02-15 09:08       |    32 | 1,048M |    161 | 1,044M |    3 |    15K | 161 |  1,044M |
 Photos |   3 | @ 2018-02-15 09:08       |    32 | 1,048M |    162 | 1,044M |    1 |     3K |   5 |  6,410K |
 Photos |   4 | @ 2018-02-15 09:08       |    32 | 1,048M |    162 | 1,044M |    1 |     3K |   1 |      3K |
 Photos |   5 | @ 2018-02-15 09:08       |    37 | 1,048M |    163 | 1,044M |    3 |    15K |   4 |     26K |
 Photos |   6 | @ 2018-02-15 09:08       |    42 | 1,085M |    168 | 1,081M |    3 |    16K |   8 | 38,303K |
 Photos |   7 | @ 2018-02-15 09:08       |    42 | 1,085M |    166 | 1,081M |    3 |    16K |   8 | 34,215K |
 Photos |   8 | @ 2018-02-15 09:09       |    42 | 1,085M |    166 | 1,081M |    4 | 8,020K |   4 |  8,020K |
 Photos | all |                          |       |        |    194 | 1,129M |  194 | 1,129M |     |         |

Definitely an improvement on 3 and 8 where the files are just being moved or renamed. 7 increased a little though where the order of the files is being changed by putting them in subfolders.

Thanks again for your help on this.

jonreeves commented 6 years ago

I've included a file listing for each stage, to clarify the changes happening, in case that helps.

Stage 1 - Initial Backup

Source
    |- .duplicacy/
    |- Backlog/
    |- Catalog/

Stage 2 - Download SD Card

Source
    |- .duplicacy/
    |- Backlog/
    |   |
    |   |- 2018-01 - DCIM/
    |       |- C0001.MP4
    |       |- C0001M01.XML
    |       |- DSC05272.ARW
    |       |- DSC05272.JPG
    |       |- DSC05273.ARW
    |       |- DSC05273.JPG
    |       |- DSC05274.ARW
    |       |- DSC05274.JPG
    |       |- DSC05275.ARW
    |       |- DSC05275.JPG
    |       |- DSC05276.ARW
    |       |- DSC05276.JPG
    |       |- DSC05277.ARW
    |       |- DSC05277.JPG
    |       |- DSC05278.ARW
    |       |- DSC05278.JPG
    |       |- DSC05279.ARW
    |       |- DSC05279.JPG
    |       |- DSC05280.ARW
    |       |- DSC05280.JPG
    |       |- DSC05281.ARW
    |       |- DSC05281.JPG
    |       |- DSC05282.ARW
    |       |- DSC05282.JPG
    |       |- DSC05283.ARW
    |       |- DSC05283.JPG
    |       |- DSC05284.ARW
    |       |- DSC05284.JPG
    |       |- DSC05285.ARW
    |       |- DSC05285.JPG
    |       |- DSC05286.ARW
    |       |- DSC05286.JPG
    |- Catalog/

Stage 3 - Batch Rename Files

Source
    |- .duplicacy/
    |- Backlog/
    |   |
    |   |- 2018-01 - DCIM/
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001.MP4
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001M01.XML
    |- Catalog/

Stage 4 - Rename Project Folder

Source
    |- .duplicacy/
    |- Backlog/
    |   |
    |   |- Example Holiday Photo Folder with a Long Name - Jan 2018 (A7RM2)/
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001.MP4
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001M01.XML
    |- Catalog/

Stage 5 - Process RAW Files

Source
    |- .duplicacy/
    |- Backlog/
    |   |
    |   |- Example Holiday Photo Folder with a Long Name - Jan 2018 (A7RM2)/
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001.MP4
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001M01.XML
    |- Catalog/

Stage 6 - Render Outputs

Source
    |- .duplicacy/
    |- Backlog/
    |   |
    |   |- Example Holiday Photo Folder with a Long Name - Jan 2018 (A7RM2)/
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.xmp
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.ARW
    |       |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.JPG
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001.MP4
    |       |- Example Holiday Photos (Jan 2017) - 0016 - C0001M01.XML
    |       |
    |       |- output/
    |           |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.jpg
    |           |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.jpg
    |           |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.jpg
    |           |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.jpg
    |           |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.jpg
    |- Catalog/

Stage 7 - Reorganize

Source
    |- .duplicacy/
    |- Backlog/
    |   |
    |   |- Example Holiday Photo Folder with a Long Name - Jan 2018 (A7RM2)/
    |       |- display/
    |       |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.xmp
    |       |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.xmp
    |       |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.xmp
    |       |   |- Example Holiday Photos (Jan 2017) - 0016 - C0001.MP4
    |       |   |- Example Holiday Photos (Jan 2017) - 0016 - C0001M01.XML
    |       |
    |       |- output/
    |       |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.jpg
    |       |   |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.jpg
    |       |   |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.jpg
    |       |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.jpg
    |       |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.jpg
    |       |
    |       |- revisit/
    |       |   |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.JPG
    |       |   |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.ARW
    |       |   |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.JPG
    |       |
    |       |- trash/
    |           |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.ARW
    |           |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.JPG
    |           |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.ARW
    |           |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.JPG
    |           |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.ARW
    |           |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.JPG
    |           |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.ARW
    |           |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.JPG
    |           |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.xmp
    |           |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.ARW
    |           |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.JPG
    |           |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.xmp
    |           |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.ARW
    |           |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.JPG
    |- Catalog/

Stage 8 - Catalog

Source
    |- .duplicacy/
    |- Backlog/
    |- Catalog/
        |
        |- Example Holiday Photo Folder with a Long Name - Jan 2018 (A7RM2)/
            |- display/
            |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.xmp
            |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.xmp
            |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.xmp
            |   |- Example Holiday Photos (Jan 2017) - 0016 - C0001.MP4
            |   |- Example Holiday Photos (Jan 2017) - 0016 - C0001M01.XML
            |
            |- output/
            |   |- Example Holiday Photos (Jan 2017) - 0008 - DSC05279.jpg
            |   |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.jpg
            |   |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.jpg
            |   |- Example Holiday Photos (Jan 2017) - 0013 - DSC05284.jpg
            |   |- Example Holiday Photos (Jan 2017) - 0014 - DSC05285.jpg
            |
            |- revisit/
            |   |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0001 - DSC05272.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0003 - DSC05274.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0004 - DSC05275.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0007 - DSC05278.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0011 - DSC05282.JPG
            |   |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.ARW
            |   |- Example Holiday Photos (Jan 2017) - 0012 - DSC05283.JPG
            |
            |- trash/
                |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.ARW
                |- Example Holiday Photos (Jan 2017) - 0002 - DSC05273.JPG
                |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.ARW
                |- Example Holiday Photos (Jan 2017) - 0005 - DSC05276.JPG
                |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.ARW
                |- Example Holiday Photos (Jan 2017) - 0006 - DSC05277.JPG
                |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.ARW
                |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.JPG
                |- Example Holiday Photos (Jan 2017) - 0009 - DSC05280.xmp
                |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.ARW
                |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.JPG
                |- Example Holiday Photos (Jan 2017) - 0010 - DSC05281.xmp
                |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.ARW
                |- Example Holiday Photos (Jan 2017) - 0015 - DSC05286.JPG
kairisku commented 6 years ago

@jonreeves Better, but still not perfect. Without very detailed information on how the files actually are chunked, I cannot say why you still get megabytes of uploaded data just by reorganizing files.

It might well be that the minimum chunksize causes the end of one file and the beginning of another file to be bundled in the same chunk, and in that case reorganization can result in new chunks being formed. This can be checked from the content keys of the snapshot data for different revisions. If there are files that do not start from offset zero of a chunk, bundling has happened and reorganization will generate new chunks. In this case, lowering the minimum chunksize will decrease the probability of bundling happening (you could e.g. keep the 4M average chunksize but change the minimum chunksize to 64k).

If all files after all stages start at offset zero of their chunks, there should be no new chunks from just reorganizing files. If that is what happens, something in the code is not working properly and needs to be further debugged.
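
If it helps, here is a rough Go sketch of how such a check could be scripted against a dump of the snapshot file list. It assumes one "path content" pair per line, with the content field formatted as startChunk:startOffset:endChunk:endOffset (the second field being the ":0:" offset discussed here); those assumptions may not match the exact dump format you use.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        scanner := bufio.NewScanner(os.Stdin)
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            if len(fields) < 2 {
                continue
            }
            // Last field is the content key; everything before it is the path.
            path := strings.Join(fields[:len(fields)-1], " ")
            parts := strings.Split(fields[len(fields)-1], ":")
            if len(parts) < 2 {
                continue
            }
            // A non-zero start offset means the file begins mid-chunk,
            // i.e. it was bundled with the tail of the previous file.
            if parts[1] != "0" {
                fmt.Printf("bundled: %s (starts at offset %s)\n", path, parts[1])
            }
        }
        if err := scanner.Err(); err != nil {
            fmt.Fprintln(os.Stderr, "read error:", err)
        }
    }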

jonreeves commented 6 years ago

@kairisku I've gone through the Contents keys at each stage and can see that almost everything starts at :0: until the introduction of the .xmp files, which are small (~12KB). The only other file that doesn't start at :0: is an .xml file (1.15KB) that comes along with the .mp4. I've attached a txt file with the contents listings at each stage: contents.txt

Obviously these small files are below the minChunkSize, so they seem to throw off the boundaries for everything else when things get reshuffled. I just tested with the minChunkSize at 1KB, and the result is perfect. It's probably not ideal to have the chunk size that low in general though (@TowerBR, in case you're interested, the chunk count jumps to 378).

TowerBR commented 6 years ago

the chunk count jumps to 378

Interesting! It seems like the same situation as my test 2, when I used the 128k chunks, and as Gilbert commented, overhead becomes an important factor (even more so if you use cloud storage).

thrnz commented 6 years ago

Assuming my understanding is right (which it may or may not be!), would a potential improvement to kairisku's file_boundaries branch be to have Duplicacy sort the list of files to back up by size, so that all files smaller than minChunkSize are processed first? That way they won't get in the way when larger files are processed.

Or an alternative that may be cheaper processing-wise would be to do two passes when chunking: simply skip over files > minChunkSize on the first pass and then process them on the second pass.
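
Roughly what I have in mind, as a Go sketch (FileInfo and splitBySize are made-up names, not anything from the Duplicacy code base):

    package sketch

    // FileInfo is a stand-in for whatever metadata the backup pass collects.
    type FileInfo struct {
        Path string
        Size int64
    }

    // splitBySize implements the two-pass idea: files below minChunkSize come
    // first (to be bundled together), larger files come second (to be chunked
    // individually), so the small ones never push a large file off its chunk
    // boundary.
    func splitBySize(files []FileInfo, minChunkSize int64) (small, large []FileInfo) {
        for _, f := range files {
            if f.Size < minChunkSize {
                small = append(small, f)
            } else {
                large = append(large, f)
            }
        }
        return small, large
    }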

fracai commented 6 years ago

I considered a similar approach when I was looking at this. I never got anywhere close to implementing anything, but I was concerned it might lead to the chunk stream being reordered and changing even more. Adding or removing a file could lead to a sequence of changed chunks for a while until the stream syncs up with previously stored chunks. Files growing and shrinking would do the same. Reordering files by size seems like it would lead to even more changes to the chunk stream. Plus, ordering the list as the files are found in the file system allows discovering them in real time; ordering by size would require iterating the entire backup set and keeping that information in memory. Likewise, two passes could accentuate real-time changes, though backing up from a snapshot would mitigate that.

I do wonder what the effect would be from ordering in different ways, but my instinct says it would not be worth the downsides.

TowerBR commented 6 years ago

I published the results of the two new tests:

test_07_Thunderbird_kairasku_branches

and

test_08_Evernote_kairasku_branches

There's a lot of data there. ;-)

jonreeves commented 6 years ago

@thrnz I had a similar thought, but actually wondered if you could process files as usual, yet keep the smaller ones off to the side, accumulating and flushing them to chunks while the large files are handled normally. I wasn't sure how this would impact things though.

I actually started to write this up and make a diagram, but didn't finish it. The gist of what I was going to ask was...

Given that it's only ever going to be small files (smaller than the minChunkSize) that have a bad effect on larger files during reshuffles, I wonder if it's possible to still process these small files in order, but skip storing them immediately and instead put them aside in memory, appending each subsequent small file, until that chunk reaches the Min or Avg chunk size.

It shouldn't affect memory usage that much (you're only retaining up to an additional maxChunkSize in memory). Big files would continue to be processed and split as usual, and would take advantage of their content starting at the beginning of a chunk. Small files would exist in chunks together.

Something like:

  1. File > minChunkSize... Start new Chunk and Split across multiple
  2. File < minChunkSize... Process, Hash & Add to a queue for Bundled Chunk
  3. When sum of filesizes in queue > minChunkSize but < maxChunkSize... Output Chunk
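
To make that concrete, a minimal Go sketch of such a queue (bundler, emitChunk and the other names are illustrative only, not Duplicacy internals):

    package sketch

    type queuedFile struct {
        path string
        data []byte
    }

    type bundler struct {
        minChunkSize int
        queue        []queuedFile
        queuedBytes  int
        emitChunk    func(files []queuedFile) // store one bundled chunk
    }

    // addSmallFile queues a file that is below minChunkSize and flushes the
    // queue as a single chunk once enough bytes have accumulated.
    func (b *bundler) addSmallFile(path string, data []byte) {
        b.queue = append(b.queue, queuedFile{path, data})
        b.queuedBytes += len(data)
        if b.queuedBytes >= b.minChunkSize {
            b.flush()
        }
    }

    // flush emits whatever is queued (also called once at the end of a backup).
    func (b *bundler) flush() {
        if len(b.queue) == 0 {
            return
        }
        b.emitChunk(b.queue)
        b.queue = nil
        b.queuedBytes = 0
    }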

I was trying to then articulate what might happen with new, edited or moved files but didn't get there.

@fracai what do you think about this approach?

TowerBR commented 6 years ago

Interesting, I had not thought of that: ratio ...

Perhaps the issue is not the use of fixed or variable chunks, as in the tests I reported, but rather the relationship between the minimum and maximum chunk sizes and how they apply to the repository.

For example, when checking file sizes (~ 11,000 files) in my Thunderbird repository, we have this:

[chart: size distribution of all files]

If we take only the files larger than 10 MB (183 files), we have:

[chart: files larger than 10 MB]

That is, there are more than 10,000 files smaller than 10 MB ...

If we additionally apply the patterns from the filters file, we have:

[chart: files larger than 10 MB, after filtering]

So it might be worth trying an init with a different ratio: -c 1M -min 128k -max 10M

(also posted in Duplicacy forum)
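
If anyone wants to run the same kind of size survey on their own repository, a quick Go sketch like this is enough (the 10 MB threshold just mirrors the numbers above):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: sizesurvey <repository root>")
            os.Exit(1)
        }
        const threshold = 10 << 20 // 10 MB
        var small, large int
        err := filepath.Walk(os.Args[1], func(path string, info os.FileInfo, err error) error {
            if err != nil || info.IsDir() {
                return err
            }
            if info.Size() < threshold {
                small++
            } else {
                large++
            }
            return nil
        })
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("files < 10 MB: %d, files >= 10 MB: %d\n", small, large)
    }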

kairisku commented 6 years ago

Lots of data, and many good ideas here.

@jonreeves You are correct that the tiny .xmp files cause problems after reorganization: each .xmp file ends up bundled with the following large file, and that introduces lots of new chunks. As you proved, tweaking the chunk size limits can work around the difficulties, but it feels quite infeasible to require that level of insight to do so successfully. I think renaming/reorganizing otherwise unmodified files is a very specific special case that would require a dedicated solution. I am thinking Duplicacy should use the full file hashes collected from previous snapshots, and when it finds identical files in subsequent runs it could just shortcut to the previous chunk segments. For files smaller than the buffer size this should be quite simple to implement, but to support larger files the processing would have to be done in two phases: first collect hashes for all files to see if they are identical to files from previous snapshots, and then a second run to actually do the chunking for new/updated files. That could be a lot of work, and it could possibly slow down the whole backup run significantly. I do not know if it is worth the effort.
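
In rough Go terms, the shortcut I have in mind would look something like this (ChunkRef and the hash index are illustrative; Duplicacy does not currently keep such an index):

    package sketch

    // ChunkRef points at a piece of an existing chunk from the previous snapshot.
    type ChunkRef struct {
        ChunkHash   string
        StartOffset int64
        EndOffset   int64
    }

    type previousSnapshot struct {
        // full-file hash -> the chunk segments that file occupied last time
        byHash map[string][]ChunkRef
    }

    // reuseOrChunk returns the old chunk references when an identical file
    // (same content hash) already exists in the previous snapshot; otherwise
    // the caller falls back to the normal bundling/chunking path.
    func (s *previousSnapshot) reuseOrChunk(fileHash string, chunkFile func() []ChunkRef) []ChunkRef {
        if refs, ok := s.byHash[fileHash]; ok {
            return refs // identical file: reuse chunks regardless of its new path or name
        }
        return chunkFile()
    }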

Any attempt to sort the files before processing could also be catastrophic, because you never know whether the files will be manipulated in a way that drastically affects how they are sorted.

The current Duplicacy algorithm can be described as bundle-AND-chunk, because it bundles the files regardless of size and then cuts them into chunks, partly to be convenient to send to cloud storage and partly to perform deduplication. @thrnz and @fracai (and others as well?) have been thinking in a direction that I would call bundle-OR-chunk: small files are bundled to meet the minimum chunk size requirement, while large files are chunked to get deduplicable pieces. This could be implemented with two parallel processing pipelines, simply tossing the smaller files into one pipeline and the larger files into the other, each pipeline spitting out chunks.
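
A Go sketch of that bundle-OR-chunk split, with the two pipelines as goroutines (bundleSmall and chunkLarge are placeholders for the real bundling and chunking code):

    package sketch

    import "sync"

    type fileItem struct {
        path string
        size int64
    }

    // runPipelines routes each incoming file to one of two independent
    // pipelines: small files are bundled together, large files are chunked
    // on their own, and each pipeline produces chunks independently.
    func runPipelines(files <-chan fileItem, minChunkSize int64,
        bundleSmall func(fileItem), chunkLarge func(fileItem)) {

        small := make(chan fileItem, 16)
        large := make(chan fileItem, 16)

        var wg sync.WaitGroup
        wg.Add(2)
        go func() { // small-file pipeline: accumulate until a chunk is full
            defer wg.Done()
            for f := range small {
                bundleSmall(f)
            }
        }()
        go func() { // large-file pipeline: split each file into chunks
            defer wg.Done()
            for f := range large {
                chunkLarge(f)
            }
        }()

        for f := range files {
            if f.size < minChunkSize {
                small <- f
            } else {
                large <- f
            }
        }
        close(small)
        close(large)
        wg.Wait()
    }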

I am still trying to wrap my head around all the data that @TowerBR produced, but a few initial conclusions can be drawn. I think the Evernote database optimization simply changes so much of the database that there is not much of the old data left to be reused (deduplicated). The total number of chunks does not differ radically between branches (the graphs exaggerate by being relative). Certain types of files, such as databases and thick disk images, are probably always best handled with fixed-size chunks (because their data does not shift up or down with random edits); alternatively, a more optimal hash window size needs to be found. Any other significant takeaways that should be considered?