Extremely slow for large files

stoecker commented 5 years ago

I have a harddisk with 3 nearly identical 100GB images for virtual machines. I think that most parts of the images are simply empty space.

Running duperemove on these three files takes many days to complete even though it should be pretty easy to say "Take first empty block an reference everything to it". I have been trying different options with and without hashfile and with and without --dedupe-options=same.

Using only two was much faster, but still awfull slow.

After only two files have been deduped I hoped that running the same files again with the third one could speed up the process, but that's not the case.

Please speed up the process a lot for this type of files. It seems the initial checksum generating is fast, but the "what to do steps" take long. It should free multiple 100 GB on this harddisk with all the images I have on it, but I can't even try to run it with more than one file.

How to reproduce (untested): Create a 100GB VM image and clone it 2 times (copy the files, so they are actually suing same space).

I'm using duperemove 0.11.1 on btrfs with openSUSE Factory OS on a 4 core 20GB memory machine.

jfikar commented 5 years ago

I think I'm seeing the same issue. I have HP microserver N40L, it has quite slow CPU (2x1.5GHz). I'm trying to dedupe 6TB of data, same of the files are disk images. The process is very very very slow, takes more than week and it is still not finished.

At first I thought it is just stuck, but with -v parameter I see it slowly progresses. The slowest it is during 'Compare files "a.img" and "b.img"'. The a.img and b.img being images of disks of Windows computers with some free space zeroed out. These files are relatively well compressed by zstd. During the long 'Compare...' there is no I/O activity, but both CPU are 100% busy. The N40L has 16GB RAM.

Is there a way to dedupe the disk faster? Or at least some way to keep the progress? When stopped it seems it starts again from beginning. I know the hashes are kept in hashfile, but that's actually not the slow part. The slow part is the file compare.

markfasheh commented 5 years ago

The file compare (finding dupes) stage is easily the slowest in duperemove right now. In master, I've actually taken it out and duperemove is just calculating extent checksums based on what we get from fiemap. The good part about this is that we have zero fragmentation and the compare step is extremely fast, the downside is that we miss some cases of dedupe. I'm working on adding back the most important cases. I'll leave a note here when it's in a state that can be tested.

markfasheh commented 5 years ago

If you're willing to try the latest master it should be much faster and better on memory as the lengthy extent search has been put behind an option. I'd appreciate any feedback you all might have.

jfikar commented 5 years ago

It is actually much faster, takes only about 4.5 hours instead of weeks on N40L. But somehow does not find anything to dedupe:

Total files:  324397
Total extent hashes: 0
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Using 2 threads to search within extents extents for additional dedupe. This process will take some time, during which Duperemove can safely be ctrl-c'd.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

I've tried default options and also --dedupe-options=partial,same.

alexalouit commented 5 years ago

Same for me, 128acd99fc4ff1c6735083ffd69951ba9d7c997e is very quick (10-15mn vs 1-2 weeks) but dedup nothing

lorddoskias commented 4 years ago

There was a bug in the upstream dedupe code which has now been fixed. Furthermore there've been some optimisations especially for large files so if you can give it a try.

lorddoskias commented 3 years ago

I have re-worked the way reading is done for both the new, extent-based format as well as the old, v2, block-based file format. It now reads files in chunks of 8 mb and does the subsequent work in-memory. This should provide nice speed up for large files if anyone wants to try it.

lorddoskias commented 3 years ago

Closing due to inactivity if the same problem persists please open a new issue.

markfasheh / duperemove

Extremely slow for large files #216