Closed biggestsonicfan closed 2 months ago
Okay, so this wasn't the problem. I'm assuming the above was the result of copy-on-write or something.
My actual issue is that, after attempting to dedupe a very large dataset, I got the following when it was done:
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe
But that's just not true. I deduplicated a different set of data and saw that duperemove would only periodically (seemingly at random, even) dedupe extents. For example:
Kernel processed data (excludes target files): 174.1MB
Comparison of extent info shows a net change in shared extents of: 330.3MB
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Loading only identical files from hashfile.
Simple read and compare of file data found 2 instances of files that might benefit from deduplication.
Showing 2 identical files of length 1.5MB with id 0baae2eb
Start Filename
0.0B "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2784185_p10_j3VBkUFOYaAeWF4r05Xnxpko - 夜這いカーマちゃん.jpeg"
0.0B "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2784185_p22_NAzl35WJlR3ssykD9b2gUmLw - 夜這いカーマちゃん.jpeg"
Showing 2 identical files of length 2.3MB with id 0d575b00
Start Filename
0.0B "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2684524_p13_SgQl17wn28LrOigoNQ7HzX8A - 妖精騎士トリスタン徹底的にわからせた.jpeg"
0.0B "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2684524_p28_Xi7oKsqzIk4XbFd4tcEJ4mDc - 妖精騎士トリスタン徹底的にわからせた.jpeg"
Using 16 threads for dedupe phase
[0x1a391f0] (1/2) Try to dedupe extents with id 0baae2eb
[0x1f75980] (2/2) Try to dedupe extents with id 0d575b00
Files are being reported at offset 0.0B, and duperemove is processing sequences of lists ((1/3), (1/37), etc.), so I don't get a real idea of how much duperemove is deduping in a single run.
I don't have any old logs of duperemove, but I'm fairly sure it didn't used to look like this. I'm starting to think this is related to the threads: duperemove is just listing the net size change in extents after the threads are finished? Is there any way for duperemove to keep track of, and report, a net total of changes after it's finished with all directories?
Forgive my complete ignorance here, because I am obviously missing something in how this should work, but I finally ran duperemove on my largest dataset... and frankly, I am absolutely baffled: after over six million extents and an 11 GiB sqlite file, it decides to only try to dedupe 35 extents, restarts with "Loading only duplicated hashes from hashfile", only to find "0 instances of extents that might benefit from deduplication" twice in a row, followed by 61 deduplication attempts. Is there something I'm missing here for it to just try more extents? I do not understand why it's chunking them like this.
EDIT: After several hours, it's now running a 10,437-attempt chunk... I don't understand at all...
You want to increase the batchsize to allow for bigger chunks.
-B N, --batchsize=N
Run the deduplication phase every N files newly scanned. This greatly reduces memory usage for large datasets, or when you are doing partial extents lookup, but reduces multithreading efficiency.
Because of that small overhead, its value should be selected based on the average file size and blocksize.
The default is a sane value for extents-only lookups, while you can go as low as 1 if you are running duperemove on very large files (like virtual machines etc).
By default, batching is set to 1024.
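For illustration, a minimal sketch of an invocation with a larger batch size. The target path, hashfile location, and the value 100000 are assumptions for the example, not recommendations from this thread; -r (recurse), -d (actually dedupe), --hashfile, and -B are documented duperemove options.

```shell
# Hash more files per batch before each dedupe phase kicks in.
# 100000 is an illustrative value; tune it to your dataset and memory.
duperemove -rd --hashfile=/path/to/hashes.db -B 100000 /path/to/dataset
```

With a larger batch, more duplicate candidates accumulate before each dedupe phase, so each run reports bigger chunks instead of many small ones, at the cost of higher memory usage.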
Oh wow, if that's been it the whole time, that'd be great! I can't believe I missed that documentation. I'm running one now, which may take a long time before it actually attempts anything, lol.
Starting to think this is a non-issue, and I'm just going to close this as not planned. Increasing the batch size seemed to help, but there seems to be more to it on my end.
Perhaps I'm missing something here, as I've rebuilt duperemove from the latest git, but this output seems... very strange to me. The sha256sum result of both files:
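As a quick sanity check of duperemove's candidates (a sketch with placeholder filenames, not the paths from this thread): byte-identical files must produce the same SHA-256 digest.

```shell
# Two byte-identical files (placeholders standing in for the real pair)
printf 'same bytes' > a.jpeg
printf 'same bytes' > b.jpeg
# Identical content means identical digests; differing digests mean
# the files are not actually duplicates.
sha256sum a.jpeg b.jpeg
```

If the digests differ for a pair duperemove reported as identical, that would point to a real bug rather than batching behavior.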