markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Running for multiple days on highly redundant data #279

Closed MichaelDietzel closed 11 months ago

MichaelDietzel commented 2 years ago

When testing the new version from #275, I now have duperemove running for multiple days on an NVMe SSD with a decently fast CPU and a healthy amount of RAM. I have not seen any progress in the "Using 6 threads to search within extents extents for additional dedupe."-progress bar for about 2 days now. So I ran hashstats on the hashfile and saw that my data has some highly redundant blocks, some even within a single file:

Raw header info for "/root/hashfile":
  version: 3.0  block_size: 4096
  num_files: 10873  num_hashes: 962926
Print top 10 hashes (this may take some time)
Hash, # Blocks, # Files
cb4a9d0113a22e449ef44f046ca6711e, 45135, 6
1806426440e44344dadd4700e80754fe, 41567, 18
c3a0f5852fb36baed3ec587d0e8c2d94, 39793, 7
55417259b7e15c3c560bf42615d6c56a, 38716, 6
b56127af9108311e6de0aec0e79e8cf6, 37423, 6
b4d9205be7e11119995c490a92221b5f, 37205, 6
a6a9935d7483d46c8d7c325666f8fedd, 32266, 8
dc3c043361fe23b6e166ee17797f0f50, 31101, 1
2f39982dd69ab241557cf3ed0f140779, 30829, 6
0ac9d0022ec7f438c7363479c5267230, 29912, 5

The command I am running is duperemove -hdr -b4k --dedupe-options=same,partial --skip-zeroes --hashfile=hashfile /mnt/vmstorage/* on a complete xfs file system.

I only took a short look at the code and only understood some of it, but could the slowdown be related to having really long runs of repeated extents? Does the code traverse the tail of a duplicated block again and again for each extent? One example: there is a file with a duplicated block A consisting of 1k extents, then some random data, and then block A again. Block A consists of the same extent repeated 1k times. Does the scanning of the first extent in the second block A scan across 1000 extents? Then, once that is done, is the scanning restarted at the next extent, scanning across 999 extents, and so on? (Maybe having block A just one time would lead to the same result?) Or maybe I am completely wrong, but you may still want to take a look at it. The file system is about 250 GB in size and the compressed hashfile is about 2 GB, so maybe I could provide it. I guess the hashfile would contain the file names?
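To make the cost I am guessing at concrete, here is a tiny stand-alone C program (purely illustrative, not duperemove code; the block count is just the top hash from the hashstats output above) that counts how many block comparisons such a restart-at-every-extent search would do over a run of identical blocks:

```c
/* Purely illustrative, not duperemove code: count how many block comparisons
 * a naive "extend the match from every starting block" search performs over
 * a run of n identical 4k blocks. */
#include <stdio.h>

int main(void)
{
    /* Block count of the top hash in the hashstats output above. */
    unsigned long long n = 45135;
    unsigned long long comparisons = 0;

    /* Starting from block i, a naive search keeps extending the match until
     * the run ends, touching every remaining block in the run. */
    for (unsigned long long i = 0; i < n; i++)
        comparisons += n - i;

    printf("%llu identical blocks -> %llu comparisons (~n^2/2)\n",
           n, comparisons);
    return 0;
}
```

That already works out to roughly a billion comparisons for a single highly repeated hash, before the other hot hashes in the list are even considered.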

lorddoskias commented 2 years ago

If you actually want to see whether there is any progress, you'd have to run duperemove with the --debug switch, which should print more information as dedupe is going on. But generally the partial search is more or less an O(n^2) search, so it's possible that it takes a long time. In any case, the changes in the fix for #275 simply extend the size of some variables so that we don't get premature failures.

MichaelDietzel commented 2 years ago

Sorry for my long absence. Thanks for your reply. I will try to use the --debug switch once I get the opportunity to run my computer for a few days nonstop again.

Maybe the O(n²) could be fixed by using a suffix trie for searching? But that could be a lot of work. However, there may be a workaround that is hopefully easier to implement (just a wild guess, maybe this is already implemented or I just misinterpret it): when, during the scan for duplicate data, an extent with identical data is found a second time within a block, the search can be stopped at that place. Both occurrences of this data should be merged into the same extent, so there cannot be a longer common, uninterrupted block, and thus further scanning is useless.
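A rough sketch in C of what I have in mind (purely hypothetical: the struct, the already_merged flag and the helper below are made up for illustration and are not how duperemove actually tracks extents):

```c
/* Hypothetical sketch of the early-stop idea, not duperemove's actual code:
 * while extending a candidate duplicate run block by block, bail out as soon
 * as a block that was already merged in this pass shows up, since both copies
 * then share the same extent and the common run cannot grow further. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct block {
    unsigned char hash[16];
    bool already_merged;    /* made-up bookkeeping flag, set after a dedupe */
};

/* Return the number of blocks in the common run starting at a[0]/b[0]. */
static size_t extend_match(const struct block *a, const struct block *b,
                           size_t max_len)
{
    size_t len = 0;

    while (len < max_len) {
        if (a[len].already_merged || b[len].already_merged)
            break;    /* proposed early exit */
        if (memcmp(a[len].hash, b[len].hash, sizeof(a[len].hash)) != 0)
            break;
        len++;
    }
    return len;
}

int main(void)
{
    /* Two runs of three identical blocks; the third block of run b has
     * already been merged earlier in the pass. */
    struct block a[3] = { { {0x11}, false }, { {0x11}, false }, { {0x11}, false } };
    struct block b[3] = { { {0x11}, false }, { {0x11}, false }, { {0x11}, true } };

    /* The match extension stops after two blocks instead of scanning all three. */
    printf("common run length: %zu\n", extend_match(a, b, 3));
    return 0;
}
```

With the two runs identical apart from the already-merged third block of b, the extension stops after two blocks instead of walking the whole run.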

JackSlateur commented 1 year ago

@MichaelDietzel shared-block search is costly depending on the number of blocks to consider per lookup.

Could you check out the latest version, especially the -B option? It should improve your situation by reducing the scope of each lookup and keeping the overall execution time more acceptable.

EDIT: checking a bit more, I see that partial lookup still has some issues. So while batching improves the situation, I am afraid your issue is not fixed at the moment.
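For intuition only (the batch size below is made up, and real batching in duperemove groups files rather than individual hashes), a back-of-the-envelope C snippet showing why limiting the scope of each lookup keeps the cost down, even though it does not fully fix the partial lookup:

```c
/* For intuition only, with made-up numbers: compare the rough pairwise-lookup
 * cost of one big pass over all hashes against processing them in batches.
 * Real batching in duperemove groups files, not individual hashes, so this is
 * only a crude model of why limiting the scope of each lookup helps. */
#include <stdio.h>

int main(void)
{
    unsigned long long n = 962926; /* num_hashes from the hashfile above */
    unsigned long long b = 10000;  /* assumed batch size, for illustration */

    unsigned long long one_pass = n * n / 2;
    /* Leftover hashes (n % b) are ignored to keep the estimate simple. */
    unsigned long long batched = (n / b) * (b * b / 2);

    printf("single pass: ~%llu lookups\n", one_pass);
    printf("batched:     ~%llu lookups (%llu batches of %llu)\n",
           batched, n / b, b);
    return 0;
}
```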

JackSlateur commented 11 months ago

Hello @MichaelDietzel,

Batching has been improved since my last comment. While not being as efficient as I'd like, it is now more usable than before. Could you check the latest code and reopen this if you still face this issue?

Best regards,