markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

increasing fragmentation #168

Open guni77 opened 7 years ago

guni77 commented 7 years ago

I noticed an unusual behavior: when I run duperemove, the fragmentation measured with 'filefrag' grows very large, making the filesystem slow.

How to reproduce: create a standard btrfs filesystem, create a subvolume, fill it with some data, make a snapshot of the subvolume, run duperemove -rv.

btrfs subvolume create files
cp -r /home/test/.cache/ files/cache/
cp -r /home/test/.cache/ files/cache
mkdir files/cache/
cp -r /home/test/.cache/ files/cache/
btrfs subvolume snapshot files files.1
find -type f | xargs filefrag > /tmp/filefrag.txt
awk -v x=3 '$2 >= x' /tmp/filefrag.txt
/home/test/duperemove/duperemove -rhd
find -type f | xargs filefrag > /tmp/filefrag.txt
awk -v x=3 '$2 >= x' /tmp/filefrag.txt

./files/cache/thunderbird/xxxxxx.default/Cache/_CACHE001: 32 extents found
./files/cache/thunderbird/xxxxxx.default/Cache/_CACHE002: 32 extents found
./files/cache/thunderbird/xxxxxx.default/Cache/_CACHE003: 32 extents found
./files.1/cache/thunderbird/xxxxxx.default/Cache/_CACHE001: 32 extents found
./files.1/cache/thunderbird/xxxxxx.default/Cache/_CACHE002: 32 extents found
./files.1/cache/thunderbird/xxxxxx.default/Cache/_CACHE003: 32 extents found

guni77 commented 7 years ago

Additional information:

You can get rid of this with:
btrfs fi defrag -czlib filename
btrfs fi defrag filename

which obviously removes the shared extents.
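For illustration, the before/after check with those commands might look like this (the file path is a placeholder, not from this thread):

```sh
# Placeholder path; defragmenting rewrites the extents, which also breaks
# sharing with snapshots/reflinked copies.
filefrag ./files/cache/somefile                  # extent count before
btrfs fi defrag -czlib ./files/cache/somefile    # defrag and recompress with zlib
filefrag ./files/cache/somefile                  # extent count after
```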

markfasheh commented 7 years ago

Hi, thanks for the details. Generally speaking, fragmentation is a side effect of deduplication. That said, you can mitigate this by asking duperemove to build extents from the blocks it discovers. The tradeoff is a lot of CPU and memory. Try running with the following option:

--dedupe-options=noblock

If the process takes too long, you can Ctrl-C the program and safely rerun it with the same hashfile (and different options) afterwards.
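For illustration, one possible invocation along those lines (the target directory and hashfile path are placeholders, not from this thread):

```sh
# Placeholders: /mnt/data is the subvolume to dedupe, /tmp/dupe.hash is the hashfile.
duperemove -rhd --hashfile=/tmp/dupe.hash --dedupe-options=noblock /mnt/data

# If interrupted with Ctrl-C, rerunning with the same hashfile reuses the
# stored checksums, optionally with different options:
duperemove -rhd --hashfile=/tmp/dupe.hash /mnt/data
```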

guni77 commented 7 years ago

I'm not sure if I got this right. If I run duperemove with and without the mentioned option, a different number of files is targeted. Afterwards, with your option, all files show a small number of extents. So it's working all right.

The only thing I wondered about is why I can't just run duperemove and afterwards run btrfs fi defrag -r. This seems to break the snapshots and increase the file size, i.e. it unlinks/unclones the files.
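One way to check whether a defrag has unshared the extents is to compare shared vs. exclusive space before and after, e.g. with btrfs filesystem du (paths are placeholders, and the command needs a reasonably recent btrfs-progs):

```sh
# Placeholders: /mnt/files is the subvolume, /mnt/files.1 its snapshot.
btrfs filesystem du -s /mnt/files /mnt/files.1   # note the "Set shared" column
btrfs filesystem defragment -r /mnt/files
btrfs filesystem du -s /mnt/files /mnt/files.1   # "Set shared" drops if extents were unshared
```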

lapsio commented 7 years ago

So I shouldn't run defrag after dedupe? Can I run balance?

guni77 commented 7 years ago

Yes, as far as I know you can. I tried it once to free space after a dedupe. You can see it when you run btrfs fi df -h: the Data total shrinks and matches the Data used once the balance is finished.
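As a sketch of that check (the mount point is a placeholder; the usage filter only limits the balance to mostly-empty data chunks):

```sh
btrfs fi df -h /mnt                  # note Data total vs. used
btrfs balance start -dusage=50 /mnt  # repack data chunks that are at most 50% full
btrfs fi df -h /mnt                  # Data total should have shrunk towards Data used
```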

guni77 commented 7 years ago

I am still not sure about the defrag after the dedupe. To my knowledge, after a certain kernel version (and I am using 4.9.0.1) this should not happen: two files which "share" their content should not be split into two separate copies after a defrag.

guni77 commented 7 years ago

I got some extra information about this: snapshot-aware defrag is disabled in most kernels and will be enabled again in the future (https://btrfs.wiki.kernel.org/index.php?title=Changelog). Obviously, files which are not fragmented do not get defragged, and thus the link between shared extents does not break.

chrysn commented 7 years ago

I've made the observation that a sudo duperemove (which implies -A) does not increase fragmentation in copied 1 GiB files (files end up with about 6 extents as counted by filefrag, which is in the same order of magnitude as the previously separate files had), while an unprivileged read-write dedupe leaves me with ~500 extents (and that's with -b1M; with the default 128k, it's around 4000 extents).

The good (~5 extents) results could also be obtained by running in --fdupes mode.
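For comparison, the invocations in question would look roughly like this (the directory and file names are placeholders, not from this thread):

```sh
# Placeholder directory /mnt/data containing the copied 1 GiB files.
sudo duperemove -rhd /mnt/data              # running as root implies -A (read-only dedupe)
duperemove -rhd -b1M /mnt/data              # unprivileged read-write run, 1 MiB block size
fdupes -r /mnt/data | duperemove --fdupes   # whole-file dedupe from an fdupes file list
filefrag /mnt/data/bigfile                  # count the resulting extents
```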

To mitigate bad experiences, I'd like to suggest either detecting long stretches of similarity (at least full-file identity) even without --fdupes, or advertising the -A and --fdupes options as combatting fragmentation (as the man page reads now, --fdupes rather looks like a simplification for users already familiar with fdupes, and -A / root operation looks useful mainly when it is required, e.g. by snapshots).