I have a 128 GiB raw VM image on XFS with some holes (about 25 GiB) that I want to dedupe to remove duplication inside it.
So I run duperemove on it for the first time:
duperemove -hdrq -b4k --dedupe-options=same,partial --skip-zeroes win.raw
It takes about 24 min and results in some deduplication. I cannot tell how much, sorry: the output filled my console, so I cannot scroll back to the size reported before the run. I should have written the output to a file. I get a lot of error messages even though the file is not opened by any other program:
Dedupe for file "/mnt/vmstorage/win.raw" had status (1) "data changed".
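To avoid losing the output on a rerun, it could be piped through tee and the errors counted afterwards. This is only a minimal sketch: the log file name duperemove.log and the helper count_data_changed are made up for illustration, and it assumes the error lines keep exactly the wording quoted above.

```shell
#!/bin/sh
# Hypothetical log file name; any writable path works.
LOG=duperemove.log

# Intended invocation, left commented out so the snippet does not dedupe anything by itself:
# duperemove -hdrq -b4k --dedupe-options=same,partial --skip-zeroes win.raw 2>&1 | tee "$LOG"

# Count the "data changed" error lines in a saved log file.
# In a basic regular expression the parentheses match literally.
count_data_changed() {
    grep -c 'had status (1) "data changed"' "$1"
}
```

After a logged run, count_data_changed "$LOG" would print the number of failed dedupe requests, which makes the two runs comparable.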
Shortly after that, without any changes to the file, I run duperemove a second time with identical parameters. This time it takes about 60 min and yields additional savings of nearly 3 GiB. Again I get a lot of the same error messages.
Now there are some things I am wondering about:
Did I do something wrong, so that I didn't get a summary of the dedupe result at the end?
Why am I getting all these errors even though the file is not opened anywhere else? Just guessing: could this be caused by an error in the algorithm that tries to find long runs of duplicated extents, so that the same extent is used multiple times for different blocks, or something similar?
I would expect the second run not to yield any additional deduplication. Maybe this is related to the error messages?
Why does the 2nd run take so much longer?
Could it be related to the fiemap ioctl? Although I suspect that using nofiemap could have changed the dedupe gains of the second run. I could try to run it on a reproducible setup (XFS on ZFS, using snapshots to reset).
Could it be a result of the increased fragmentation?
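The fragmentation guess could be checked by comparing the file's extent count before and after each run. This is a rough sketch, assuming filefrag (from e2fsprogs) is available and that its summary line has the form "win.raw: 1234 extents found". The helper name parse_extent_count is made up for illustration.

```shell
#!/bin/sh
# Extract the extent count from a filefrag summary line read on stdin.
# Assumed line format: "<file>: <N> extents found" (or "1 extent found").
parse_extent_count() {
    sed -n 's/.*: \([0-9][0-9]*\) extents\{0,1\} found.*/\1/p'
}

# Intended use before and after each duperemove run (left commented out):
# filefrag win.raw | parse_extent_count
```

A much higher count after the first run would at least be consistent with fragmentation slowing down the second run, although it would not prove it.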
Some more info:
The image contains Windows on NTFS with a 4 KiB "extent" size.
I use a custom build of duperemove: current master branch plus a patch for extents > 2 GiB.