markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

feature request: process largest chunks first #292

Open MaxRower opened 1 year ago

MaxRower commented 1 year ago

Is there a reason why the smallest chunks get deduplicated first? Sometimes there is limited time for deduplication, and it would be nice if the largest chunks, which have the biggest impact on free space, could be deduplicated first.
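Conceptually, the request boils down to ordering the dedupe work queue by potential space savings before any extents are submitted. A rough sketch with hypothetical types (duperemove's real data structures will differ):

/* Hypothetical candidate record: one group of duplicate extents.
 * 'len' is the extent length, 'nshared' how many files share it
 * (>= 2 for a real duplicate), so len * (nshared - 1) approximates
 * the space a dedupe of this group would free. */
#include <stdint.h>
#include <stdlib.h>

struct dupe_candidate {
	uint64_t len;
	uint32_t nshared;
	/* ... file refs, offsets ... */
};

static uint64_t potential_savings(const struct dupe_candidate *c)
{
	return c->len * (c->nshared - 1);
}

/* qsort comparator: largest potential savings first. */
static int cmp_savings_desc(const void *a, const void *b)
{
	uint64_t sa = potential_savings((const struct dupe_candidate *)a);
	uint64_t sb = potential_savings((const struct dupe_candidate *)b);

	if (sa > sb) return -1;
	if (sa < sb) return 1;
	return 0;
}

/* Sort the work queue before deduping so that, if the run is cut
 * short, the extents with the biggest impact on free space have
 * already been processed. */
void order_by_impact(struct dupe_candidate *cands, size_t n)
{
	qsort(cands, n, sizeof(*cands), cmp_savings_desc);
}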

GottZ commented 3 months ago

totally agree..

I'm running duperemove on a 16 TB SSD volume right now and it has been stuck on a single 15 GB file for a couple of days:

duperemove -drh --io-threads=8 --cpu-threads=8 -b 256k --dedupe-options=partial --hashfile=/mnt/spark/dupehash --exclude "/mnt/spark/dupehash*" /mnt/spark

[progress screenshot]

This is a RAID 0 made of two 8 TB SSDs [screenshot], resulting in this volume [screenshot]. Apparently running duperemove on it has also increased the recorded read sectors by a lot [screenshot]. Before it started they were pretty equal, at around 30 TB per SSD, so duperemove has read a TON of data to dedupe one file while being only 1/7 of the way through it. This is 20 minutes later: [screenshot]

So I assume duperemove re-reads file metadata over and over after each dedupe operation, even though it could technically chain the operations together in one go (see the sketch below).
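For what it's worth, the kernel's FIDEDUPERANGE ioctl already accepts several destination ranges in a single call, so batching the work for one source extent is possible in principle. A minimal sketch, not duperemove's actual code, and the helper name is made up:

/* Deduplicate one source range against several destination files
 * in a single FIDEDUPERANGE call, instead of one ioctl per target. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FIDEDUPERANGE, struct file_dedupe_range */

int dedupe_batch(int src_fd, off_t src_off, uint64_t len,
		 const int *dest_fds, const off_t *dest_offs, int ndest)
{
	struct file_dedupe_range *arg;
	size_t sz = sizeof(*arg) + ndest * sizeof(struct file_dedupe_range_info);
	int i, ret;

	arg = calloc(1, sz);
	if (!arg)
		return -1;

	arg->src_offset = src_off;
	arg->src_length = len;
	arg->dest_count = ndest;
	for (i = 0; i < ndest; i++) {
		arg->info[i].dest_fd = dest_fds[i];
		arg->info[i].dest_offset = dest_offs[i];
	}

	ret = ioctl(src_fd, FIDEDUPERANGE, arg);

	/* Each destination reports its own status and bytes_deduped. */
	for (i = 0; i < ndest; i++) {
		if (arg->info[i].status < 0)
			fprintf(stderr, "dest %d: %s\n",
				i, strerror(-arg->info[i].status));
	}

	free(arg);
	return ret;
}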

At least filefrag confirms that the file is being chopped into more and more extents: filefrag -v idontknow.img [output screenshot]

this is just one of six million files..

At least the read rate is consistent [graph], and as you can see, the drive can do a lot more than that.

IOPS went up a lot, though: [graph]

Edit, a month later: [screenshot]