markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Please include a time to finish estimation #159

Open ytrezq opened 8 years ago

ytrezq commented 8 years ago

I ran duperemove on a 298 GB volume containing about 200,000 files. One of them is 170 GB, full of zeros, which I used for finding CVE-2016-2315.

I wanted to reduce its size with --dedupe-options=same. After 11 hours I computed with gdb that it would require at least 35,000,000 seconds to finish (405 days), because it was deduping 4096 bytes of that file every 0.7 seconds.

The strange thing was also that the process pool was using only 3 of its 8 threads, even with the --cpu-threads=5 option (I have a 4-core hyperthreaded processor).

Please include an automatic estimate of the time to finish in duperemove.
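
For illustration, here is a minimal sketch, not duperemove's actual code, of how a remaining-time estimate could be derived from the progress counters the tool already reports (total dedupe requests vs. the request currently being processed). All struct and function names here are hypothetical, and the example numbers only mirror the rate reported in this issue (one 4 KiB request every 0.7 s, roughly 41 million requests for a 170 GB file):

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical progress snapshot; field names are illustrative only. */
struct dedupe_progress {
	unsigned long long total_requests; /* all queued dedupe requests  */
	unsigned long long done_requests;  /* requests finished so far    */
	time_t start_time;                 /* when the dedupe stage began */
};

/* Return the estimated seconds remaining, or -1.0 if unknown yet. */
static double estimate_seconds_left(const struct dedupe_progress *p)
{
	double elapsed = difftime(time(NULL), p->start_time);
	double rate;

	if (p->done_requests == 0 || elapsed <= 0)
		return -1.0;	/* not enough data yet */

	rate = (double)p->done_requests / elapsed;   /* requests per second */
	return (double)(p->total_requests - p->done_requests) / rate;
}

int main(void)
{
	/* Example numbers in the spirit of this report: ~41.5 million
	 * 4 KiB requests for a 170 GB file, ~56,000 done after 11 hours. */
	struct dedupe_progress p = {
		.total_requests = 41500000ULL,
		.done_requests  = 56000ULL,
		.start_time     = time(NULL) - 39600 /* 11 h ago */
	};
	double left = estimate_seconds_left(&p);

	if (left >= 0)
		printf("ETA: %.0f seconds (%.1f days)\n", left, left / 86400.0);
	else
		printf("ETA: unknown\n");
	return 0;
}
```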

markfasheh commented 8 years ago

Hmm, we should be printing out total dedupe requests and which request is being currently processed. To be fair, that number is a bit wonky though because we might break a large request up into multiple ones.

Generally though, I have a feeling that you're hitting this issue:

https://github.com/markfasheh/duperemove/issues/156

Does performance improve when you run with --dedupe-options=nofiemap ?

We need an FAQ entry for this at least so I'll work something up when I get a chance. I don't want to disable fiemap during dedupe because it has the downside of disabling our space savings estimate.

EDIT: FYI, you wanted --io-threads. Perhaps I could be clearer in the man page: --cpu-threads only affects the optional find-dupes stage.
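
For reference, an invocation along the lines of duperemove -dhr --io-threads=8 --dedupe-options=nofiemap /mnt/volume (the mount point and thread count are placeholders, assuming the option names from the man page of that era) would size the I/O thread pool used for hashing and dedupe and skip fiemap during dedupe, at the cost of the space-savings estimate mentioned above.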

ytrezq commented 8 years ago

@markfasheh: I used --io-thread=8. I got 8 threads, but only 3 of them were running, alternating.

Does performance improve when you run with --dedupe-options=nofiemap?

Unfortunately, the duperemove process disliked receiving my SIGKILL, so the filesystem is damaged and can't be fixed because of that btrfsck bug. Of course, this prevents me from running duperemove again safely.

Generally though, I have a feeling that you're hitting this issue: #156

No, because I don't have 2 identical files larger than 10 MB. However, the duperemove process had been reporting that it was deduping only one (the same) file since the beginning. That file was 170 GB and full of zeros (I was able to delete it before shutting down the machine because of duperemove).

Here’s the capture from btrfs-image -w -c 0 -t 8 /dev/dm-7 : https://web.archive.org/web/20161020220914/https://filebin.net/7ni8kfpog1dxw4jc/btrfs-image_capture.xz