markfasheh / duperemove

Tools for deduping file systems

duperemove attempts to dedupe chunks of extents from database #336

Closed: biggestsonicfan closed this issue 2 months ago

biggestsonicfan commented 6 months ago

Perhaps I'm missing something here, as I've rebuilt duperemove with the latest git, but this output seems... very strange to me:

rob@DESKTOP-5ETERDG:~> /home/rob/gits/duperemove/duperemove -rdhv "/run/media/rob/bfd20/dl/Test"
Increased open file limit from 1024 to 1048576.
Using 128K blocks
Using extent hashing
Gathering file list...
        Files scanned: 2/2 (100.00%)
        Bytes scanned: 2.7GB/2.7GB (100.00%)
        File listing: completed
Hashfile "(null)" written
Loading only identical files from hashfile.
Simple read and compare of file data found 1 instances of files that might benefit from deduplication.
Showing 2 identical files of length 1.3GB with id 6634f67b
Start           Filename
0.0B    "/run/media/rob/bfd20/dl/Test/1.mkv"
0.0B    "/run/media/rob/bfd20/dl/Test/2.mkv"
Using 16 threads for dedupe phase
[0xdd1300] (1/1) Try to dedupe extents with id 6634f67b
[0xdd1300] Add extent for file "/run/media/rob/bfd20/dl/Test/1.mkv" at offset 0.0B (3)
[0xdd1300] Add extent for file "/run/media/rob/bfd20/dl/Test/2.mkv" at offset 0.0B (4)
[0xdd1300] Dedupe 1 extents (id: 6634f67b) with target: (0.0B, 1.3GB), "/run/media/rob/bfd20/dl/Test/1.mkv"
Kernel processed data (excludes target files): 1.3GB
Comparison of extent info shows a net change in shared extents of: 2.7GB
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

sha256sum result of both files:

rob@DESKTOP-5ETERDG:~> sha256sum /run/media/rob/bfd20/dl/Test/1.mkv
77e98880bc56e5b9c82f9ff6cfb42935897efbd41eb09c95c436dfa9abb1e917  /run/media/rob/bfd20/dl/Test/1.mkv
rob@DESKTOP-5ETERDG:~> sha256sum /run/media/rob/bfd20/dl/Test/2.mkv
77e98880bc56e5b9c82f9ff6cfb42935897efbd41eb09c95c436dfa9abb1e917  /run/media/rob/bfd20/dl/Test/2.mkv
biggestsonicfan commented 6 months ago

Okay, so this wasn't the problem. I'm assuming the above was the result of Copy on Write or something.
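One way to sanity-check that guess, assuming the filesystem supports fiemap (e.g. btrfs or XFS) and that the standard filefrag/btrfs-progs tools are available, is to compare the physical extent mappings of the two files; if the physical offsets line up, the data is already shared and there is nothing left to dedupe. A rough sketch:

# Compare extent maps; matching physical_offset ranges indicate shared (reflinked) data.
filefrag -v /run/media/rob/bfd20/dl/Test/1.mkv
filefrag -v /run/media/rob/bfd20/dl/Test/2.mkv

# On btrfs, the "Set shared" column gives the same information per directory.
sudo btrfs filesystem du -s /run/media/rob/bfd20/dl/Test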

My actual issue was attempting to dedupe a very large dataset to get the following when it was done:

Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe

But that's just not true. I deduped a different set of data and saw that duperemove would only periodically (seemingly at random, even) dedupe extents. For example:

Kernel processed data (excludes target files): 174.1MB
Comparison of extent info shows a net change in shared extents of: 330.3MB
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Loading only identical files from hashfile.
Simple read and compare of file data found 2 instances of files that might benefit from deduplication.
Showing 2 identical files of length 1.5MB with id 0baae2eb
Start           Filename
0.0B    "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2784185_p10_j3VBkUFOYaAeWF4r05Xnxpko - 夜這いカーマちゃん.jpeg"
0.0B    "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2784185_p22_NAzl35WJlR3ssykD9b2gUmLw - 夜這いカーマちゃん.jpeg"
Showing 2 identical files of length 2.3MB with id 0d575b00
Start           Filename
0.0B    "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2684524_p13_SgQl17wn28LrOigoNQ7HzX8A - 妖精騎士トリスタン徹底的にわからせた.jpeg"
0.0B    "/run/media/rob/bfd20/dl/pixiv/fanbox/17778544/2684524_p28_Xi7oKsqzIk4XbFd4tcEJ4mDc - 妖精騎士トリスタン徹底的にわからせた.jpeg"
Using 16 threads for dedupe phase
[0x1a391f0] (1/2) Try to dedupe extents with id 0baae2eb
[0x1f75980] (2/2) Try to dedupe extents with id 0d575b00

Files are being reported at offset 0.0B, and duperemove is processing the work as a sequence of small numbered lists ((1/3), (1/37), etc.), so I don't get a real idea of how much it is deduping in a single run.

I don't have any old duperemove logs, but I'm fairly sure the output didn't look like this before?

biggestsonicfan commented 5 months ago

I'm starting to think this is related to the threads?

Duperemove is just listing the net size change in extents after the threads are finished? Is there any way for duperemove to keep track of changes and report a net total after it has finished with all directories?
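In the meantime, a rough workaround (assuming the filesystem is btrfs; the path below is just the example directory from above) is to measure shared space outside duperemove before and after a full run; the change in the "Set shared" column is the net effect of all the per-batch dedupe passes combined:

# Record space accounting before the run.
sudo btrfs filesystem du -s /run/media/rob/bfd20/dl
# ... run duperemove over the directory ...
# Re-run afterwards; the increase in "Set shared" (and drop in "Exclusive")
# is the net total across every batch.
sudo btrfs filesystem du -s /run/media/rob/bfd20/dl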

biggestsonicfan commented 4 months ago

Forgive my complete ignorance here, because I am obviously missing something in how this should work, but I finally ran duperemove on my largest dataset... and frankly, I am absolutely baffled. After scanning over six million extents and producing an 11 GiB sqlite hashfile, it decides to try to dedupe only 35 extents, restarts with "Loading only duplicated hashes from hashfile", finds 0 instances of extents that might benefit from deduplication twice in a row, and then makes 61 deduplication attempts. Is there something I'm missing that would make it just try more extents? I do not understand why it's chunking them like this.

EDIT: After several hours, it's now running a 10,437-attempt chunk... I don't understand at all...

pongo1231 commented 3 months ago

You want to increase the batchsize to allow for bigger chunks.

 -B N, --batchsize=N
        Run the deduplication phase every N files newly scanned. This greatly reduces memory usage for large datasets, or when you are doing partial extents lookup, but reduces multithreading efficiency.

        Because of that small overhead, its value shall be selected based on the average file size and blocksize.

        The default is a sane value for extents-only lookups, while you can go as low as 1 if you are running duperemove on very large files (like virtual machines etc).

        By default, batching is set to 1024.
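
For reference, a run with a larger batch and a persistent hashfile might look like this (the batch size and hashfile path are just example values):

# Only start the dedupe phase after every 100000 newly scanned files
# instead of the default 1024, and keep hashes in an on-disk hashfile.
duperemove -rdhv -B 100000 --hashfile=/var/tmp/dedupe.hash /run/media/rob/bfd20/dl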
biggestsonicfan commented 3 months ago

Oh wow, if that's been it the whole time, that'd be great! I can't believe I missed that documentation. I'm running a pass with a larger batch size now, which may take a long time before it actually attempts anything, lol.

biggestsonicfan commented 2 months ago

I'm starting to think this is a non-issue, so I'm going to close it as not planned. Increasing the batch size seemed to help, but there seems to be more to it on my end.