markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Deduplicating snapshots of root filesystems with small files #191

Open Massimo-B opened 6 years ago

Massimo-B commented 6 years ago

Hi, I have a central backup btrfs to which snapshots of the root filesystems of two different machines are transferred via btrbk. Those snapshots are largely identical, as they are cloned Gentoo installations, though the binaries were entirely rebuilt for a different platform. I would like to deduplicate the initial snapshots of both, and later do that again periodically. For duperemove I've set both snapshots to read-write first.

I compare the result of the deduplication using compsize.

foo="/mnt/usb/mobiledata/snapshots/mob/root/root.20171024T105543+0200/"
bar="/mnt/usb/mobiledata/snapshots/bur/root/root.20171023T125129+0200/"
duperemove --hashfile=/root/.duperemove/hashfile_usbmobile_root -dhvr $foo $bar

First I was surprised that most files were reported as "Skipping small file...". The resulting hashfile size compared to the number of target files:

# duperemove --hashfile hashfile_usbmobile_root -L |wc -l
19473
# find $foo $bar |wc -l
1215710

Decreasing the blocksize to -b 4096 did not find any new files when reusing the same hashfile. After deleting the hashfile, the newly built hashfile grew a bit, but many files were still skipped. This run took ~1 day instead of ~1 hour:

# duperemove --hashfile hashfile_usbmobile_root -L |wc -l
282440

AFAIK the minimal usable blocksize is limited by the kernel and the btrfs blocksize. How can I detect the blocksize of my btrfs?
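For reference, a sketch of how this could be checked; the mount point and /dev/sdX below are placeholders for the actual backup filesystem and device:

# stat -f --format=%S /mnt/usb/mobiledata
# btrfs inspect-internal dump-super /dev/sdX | grep sectorsize

The first prints the filesystem block size as reported by statfs(); the second reads the btrfs sectorsize straight from the superblock.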

Would --dedupe-options=noblock change the amount of deduplication, or does it just optimize the number of extents?

Could --lookup-extents=yes help find more duplicates?

Finally, I still wonder why the compsize "Referenced" value equals "Uncompressed", just like before the dedup:

# compsize $foo ; compsize $bar
Processed 455445 files.
Type       Perc     Disk Usage   Uncompressed Referenced  
Data        50%      6.9G          13G          13G       
none       100%      2.6G         2.6G         2.2G       
zlib        39%      4.3G          11G          10G       
Processed 453330 files.
Type       Perc     Disk Usage   Uncompressed Referenced  
Data        47%      5.8G          12G          12G       
none       100%      1.7G         1.7G         1.7G       
zlib        39%      4.0G          10G          11G       
nefelim4ag commented 6 years ago

To dedup whole files, use fdupes mode rather than a small block size. noblock is a bad idea, because duperemove then tries to compute extents for dedup (which increases runtime and can, in theory, decrease the deduplication rate). --lookup-extents - no, that only allows avoiding re-deduplication of already-deduplicated extents (AFAIK).

Also, you are misusing compsize - it reports compressed size, it is not meant for measuring deduplication. For deduplication use: btrfs fi du -s
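A sketch of how that might look on the snapshot paths from above (using the $foo/$bar variables defined earlier); the "Set shared" column is what should grow as extents become shared:

# btrfs filesystem du -s $foo $bar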

Massimo-B commented 6 years ago

You mean this mode?

fdupes -r $foo $bar | duperemove --fdupes

Why would that find more than duperemove? I thought duperemove would always find more, as it also compares extents within files. For example (as with the $HOME snapshots later), I often have the same pictures both inside emails and as plain files, and I have davfs2 caches for webmounts that hold data duplicating the plain files. fdupes would not find those shared parts across the different file formats.

nefelim4ag commented 6 years ago

@Massimo-B, you say you have a problem with tiny files; I'm saying it is better to use fdupes for that (to handle the tiny files). You can still use the default block mode afterwards, but going through fdupes first makes things slightly better, because duperemove will not change the extent maps of those files during deduplication when they are already deduplicated.

Also, if we are talking about emails with different attachments, data in different file formats, etc., deduplication will not help, because deduplication works on block-size boundaries. I.e. if you have the same large attachment in two different emails and the two copies are not aligned to each other on the filesystem block size, deduplication will not help with that.
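A minimal sketch of that alignment problem (file names and sizes are made up): the same payload stored at different in-file offsets lands in different 4 KiB blocks, so block-based dedup finds nothing to share:

# dd if=/dev/urandom of=payload bs=1M count=4
# cat payload > a
# { head -c 100 /dev/zero; cat payload; } > b
# duperemove -d a b

Here a and b carry identical 4 MiB payloads, but because b's copy is shifted by 100 bytes, none of its 4 KiB blocks hash equal to a's and nothing gets deduplicated.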

Massimo-B commented 6 years ago

Thanks for the outline.

As for compsize, I was just using it wrong: it has to be run over all involved paths at once to determine the referenced part. After using fdupes | duperemove, a large part is now shown as referenced, though I still can't interpret all the details of it:

# compsize /mnt/usb/mobiledata/snapshots/bur/root/root.20171023T125129+0200/ /mnt/usb/mobiledata/snapshots/mob/root/root.20171024T105543+0200/
Processed 908775 files, 358748 regular extents (4422383 refs), 591502 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
Data        48%      8.5G          17G          26G       
none       100%      2.6G         2.6G         4.0G       
zlib        39%      5.9G          14G          21G       

About the /home snapshots, before:

# compsize /mnt/usb/mobiledata/snapshots/bur/home/home.20171023T125129+0200/ /mnt/usb/mobiledata/snapshots/mob/home/home.20171024T073301+0200/
Processed 621227 files, 1003180 regular extents (1228103 refs), 171877 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
Data        84%       98G         117G         150G       
none       100%       78G          78G         105G       
zlib        52%       20G          38G          44G     

After:

# compsize /mnt/usb/mobiledata/snapshots/bur/home/home.20171023T125129+0200/ /mnt/usb/mobiledata/snapshots/mob/home/home.20171024T073301+0200/
Processed 621227 files, 382966 regular extents (1168160 refs), 171877 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
Data        82%       30G          36G         150G       
none       100%       23G          23G         111G       
zlib        49%      6.3G          12G          39G       

As for duperemove in combination with fdupes: is it still advised to use the smallest -b blocksize, given that duperemove still skips the small files provided by fdupes?

Massimo-B commented 6 years ago

Why does fdupes find more duplicate files than duperemove? The drawback of periodic fdupes runs is that it doesn't support hash cache files for speeding up subsequent runs. I usually ran duperemove as a cronjob on, say, a weekly or monthly basis.
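For reference, such a periodic run might look like the sketch below (the schedule and target path are just examples; the flags mirror the command used earlier in this thread), with a persistent hashfile so unchanged files are not re-hashed:

# weekly duperemove pass, Sundays at 03:00 (example crontab entry)
0 3 * * 0  duperemove -dhr --hashfile=/root/.duperemove/hashfile_usbmobile_root /mnt/usb/mobiledata/snapshots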

kentfredric commented 3 years ago

A chain of fdupes | duperemove --fdupes still skips small files.

Wouldn't it just be better for duperemove to have a knob that controls what is considered a "small file"?

kentfredric commented 3 years ago

Also, fdupes | duperemove --fdupes tries to re-deduplicate already-deduplicated files, because fdupes has no understanding of btrfs native extents...

And duperemove just takes its output and goes "Ok", and then wastes a lot of IO doing nothing.

Like:

Queue entire file for dedupe: /mnt/btrfs/vm/m68k/chroot-0/var/cache/portage/.git/objects/pack/pack-075e6c8edaaee4b5824536ddb5795c58bcdcc4ec.idx
Queue entire file for dedupe: /mnt/btrfs/vm/m68k/chroot-0-snapshot-oct-2016/var/cache/portage/.git/objects/pack/pack-075e6c8edaaee4b5824536ddb5795c58bcdcc4ec.idx
Dedupe pass on 2 files completed

There is Zero way those files have changed since I made the snapshot.
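One way to sanity-check that, sketched here with filefrag, is to compare the physical extents of the two copies; if both report the same physical_offset ranges, the files already share extents and a dedupe pass over them is pure wasted IO:

# filefrag -v /mnt/btrfs/vm/m68k/chroot-0/var/cache/portage/.git/objects/pack/pack-075e6c8edaaee4b5824536ddb5795c58bcdcc4ec.idx
# filefrag -v /mnt/btrfs/vm/m68k/chroot-0-snapshot-oct-2016/var/cache/portage/.git/objects/pack/pack-075e6c8edaaee4b5824536ddb5795c58bcdcc4ec.idx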

lorddoskias commented 3 years ago

A chain of fdupes | duperemove --fdupes still skips small files.

Wouldn't it just be better for duperemove to have a knob that controls what is considered a "small file"?

In a way there is - it's the -b switch, which controls the blocksize that duperemove uses to scan files in block-scan mode. However, if the blocksize passed on the command line differs from the blocksize in the db file (in case an existing db file is loaded), then the value from the db takes precedence.
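In practice that means a changed block size only takes effect with a fresh hashfile; a sketch (the hashfile name is just an example, $foo/$bar as defined at the top of the thread):

# duperemove -dhr -b 4096 --hashfile=/root/.duperemove/hashfile_4k $foo $bar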

kentfredric commented 3 years ago

In a way there is - it's the -b switch, which controls the blocksize that duperemove uses to scan files in block-scan mode.

But that has a lower limit of 4K, which effectively prevents deduplicating the vast quantity of small files that are duplicates.

Here's a breakdown so you can get a sense of the problem:

 Count Size
    343 10-100
    420 10-50
    501 0-10
   6006 >4000
  13349 2000-4000
  24888 1000-2000
  35010 500-1000
  41173 100-500

Deduplication within the repo may not be possible; however, when you literally have multiple copies of the repo checked out, there will be vast amounts of duplication.

However, if the blocksize passed on the command line differs from the blocksize in the db file (in case an existing db file is loaded), then the value from the db takes precedence

Sure, but even then, fdupes mode doesn't support a database anyway, and yet it still complains about small files with the smallest permissible -b value.

lorddoskias commented 3 years ago

The sizes you show are in bytes, right? So with the current status quo you are able to dedup at most 6006 files (those above 4000 bytes), and the rest duperemove treats as way too small?

kentfredric commented 3 years ago

I suspect so. Though I guess in theory I could duct-tape together something that takes the output of fdupes and just does `cp --reflink=always a b`...
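A minimal sketch of that duct-tape idea, assuming fdupes' default output (one file per line, duplicate groups separated by blank lines) and filenames without newlines; the path is a placeholder:

first=""
fdupes -r /path/to/tree | while IFS= read -r line; do
    if [ -z "$line" ]; then
        first=""                      # blank line: end of a duplicate group
    elif [ -z "$first" ]; then
        first="$line"                 # keep the first file of each group as-is
    else
        cp --reflink=always --preserve=all -- "$first" "$line"
    fi
done

Unlike a real dedupe ioctl, this overwrites each duplicate (data and metadata) with a reflinked copy of the first file, so it is only safe when the copies really are interchangeable.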

It's hard to know exactly what is happening, in part because duperemove keeps re-deduplicating things that already share the underlying extents when run in fdupes mode. Also, duperemove combined with -v produces so much output that it ceases to be useful; it is impossible to see what is happening beyond the complaints about small files.

lorddoskias commented 3 years ago

So one possible way to fix this would be to simply not skip small files when running in fdupes mode. This makes sense because the user is expected to know what they are doing. Having said that, I wonder why anyone would want to use fdupes + duperemove and not simply fdupes directly in the context of small files? Duperemove doesn't record any hashes for subsequent runs when working in fdupes mode.

kentfredric commented 3 years ago

and not simply fdupes directly in the context of small files?

Er, fdupes is not useful here.

Deleting the duplicates is not a feature I desire. SOMETHING is using those files, and they have to continue to exist at their given path with the given content.

Avoiding the duplication of the blocks is the feature.

But I do have to say, it's not all doom and gloom ;).

I get good results from deduping my ~/.rustup and ~/.cargo dirs, which have heavy duplication: multiple copies of the same library, multiple copies of the same files (I have 48 copies of rust, and I'm doing aggressive testing on, like, 180 versions of the same thing). Actually removing those files is not an option; it would cause massive breakage.

     Total   Exclusive  Set shared  Filename
  28.58GiB     9.28GiB    11.30GiB  ./.rustup

8 GB saved (Total minus Exclusive minus Set shared)

     Total   Exclusive  Set shared  Filename
 375.95MiB   148.35MiB   165.25MiB  ./.rustup/toolchains/1.0.0-x86_64-unknown-linux-gnu
 308.04MiB    93.54MiB   144.65MiB  ./.rustup/toolchains/1.10.0-x86_64-unknown-linux-gnu
 386.50MiB   140.67MiB   183.30MiB  ./.rustup/toolchains/1.1.0-x86_64-unknown-linux-gnu
 306.83MiB    81.38MiB   155.65MiB  ./.rustup/toolchains/1.11.0-x86_64-unknown-linux-gnu
 368.39MiB   100.38MiB   200.96MiB  ./.rustup/toolchains/1.12.1-x86_64-unknown-linux-gnu
 272.55MiB    82.80MiB   101.06MiB  ./.rustup/toolchains/1.13.0-x86_64-unknown-linux-gnu
 322.07MiB    26.90MiB   151.80MiB  ./.rustup/toolchains/1.14.0-x86_64-unknown-linux-gnu
 221.20MiB    27.93MiB   108.40MiB  ./.rustup/toolchains/1.15.1-x86_64-unknown-linux-gnu
 227.55MiB    16.75MiB   123.35MiB  ./.rustup/toolchains/1.16.0-x86_64-unknown-linux-gnu
 371.36MiB    94.36MiB   180.59MiB  ./.rustup/toolchains/1.17.0-x86_64-unknown-linux-gnu
 422.30MiB    82.08MiB   230.18MiB  ./.rustup/toolchains/1.18.0-x86_64-unknown-linux-gnu
 444.09MiB    92.60MiB   233.27MiB  ./.rustup/toolchains/1.19.0-x86_64-unknown-linux-gnu
 459.69MiB    97.08MiB   242.17MiB  ./.rustup/toolchains/1.20.0-x86_64-unknown-linux-gnu
 279.43MiB    97.89MiB   128.18MiB  ./.rustup/toolchains/1.2.0-x86_64-unknown-linux-gnu
 469.19MiB   106.27MiB   244.03MiB  ./.rustup/toolchains/1.21.0-x86_64-unknown-linux-gnu
 489.62MiB   142.70MiB   226.57MiB  ./.rustup/toolchains/1.22.1-x86_64-unknown-linux-gnu
 581.41MiB   171.71MiB   256.81MiB  ./.rustup/toolchains/1.23.0-x86_64-unknown-linux-gnu
 610.45MiB   188.35MiB   265.38MiB  ./.rustup/toolchains/1.24.1-x86_64-unknown-linux-gnu
 537.54MiB   321.75MiB   137.04MiB  ./.rustup/toolchains/1.25.0-x86_64-unknown-linux-gnu
 766.87MiB   243.68MiB   413.24MiB  ./.rustup/toolchains/1.26.2-x86_64-unknown-linux-gnu
 780.83MiB   386.40MiB   276.62MiB  ./.rustup/toolchains/1.27.2-x86_64-unknown-linux-gnu
 763.04MiB   146.36MiB   501.81MiB  ./.rustup/toolchains/1.28.0-x86_64-unknown-linux-gnu
 685.93MiB   244.47MiB   324.24MiB  ./.rustup/toolchains/1.29.2-x86_64-unknown-linux-gnu
 763.38MiB   311.58MiB   331.18MiB  ./.rustup/toolchains/1.30.1-x86_64-unknown-linux-gnu
 279.83MiB   118.04MiB   110.83MiB  ./.rustup/toolchains/1.3.0-x86_64-unknown-linux-gnu
 827.60MiB   108.26MiB   591.06MiB  ./.rustup/toolchains/1.31.0-x86_64-unknown-linux-gnu
 770.02MiB   167.61MiB   475.54MiB  ./.rustup/toolchains/1.31.1-x86_64-unknown-linux-gnu
 808.75MiB   250.06MiB   432.71MiB  ./.rustup/toolchains/1.32.0-x86_64-unknown-linux-gnu
 865.49MiB   309.24MiB   350.46MiB  ./.rustup/toolchains/1.33.0-x86_64-unknown-linux-gnu
 828.39MiB   333.77MiB   288.04MiB  ./.rustup/toolchains/1.34.2-x86_64-unknown-linux-gnu
 878.77MiB   215.89MiB   446.83MiB  ./.rustup/toolchains/1.35.0-x86_64-unknown-linux-gnu
 893.25MiB   301.61MiB   370.53MiB  ./.rustup/toolchains/1.36.0-x86_64-unknown-linux-gnu
 840.39MiB   248.17MiB   372.66MiB  ./.rustup/toolchains/1.37.0-x86_64-unknown-linux-gnu
1067.11MiB   387.14MiB   480.45MiB  ./.rustup/toolchains/1.38.0-x86_64-unknown-linux-gnu
1079.76MiB   404.79MiB   474.46MiB  ./.rustup/toolchains/1.39.0-x86_64-unknown-linux-gnu
 663.15MiB   288.47MiB   240.87MiB  ./.rustup/toolchains/1.40.0-x86_64-unknown-linux-gnu
 278.62MiB   101.63MiB   117.72MiB  ./.rustup/toolchains/1.4.0-x86_64-unknown-linux-gnu
 670.31MiB   238.15MiB   296.41MiB  ./.rustup/toolchains/1.41.1-x86_64-unknown-linux-gnu
 674.76MiB   275.57MiB   262.04MiB  ./.rustup/toolchains/1.42.0-x86_64-unknown-linux-gnu
 686.07MiB   240.50MiB   305.02MiB  ./.rustup/toolchains/1.43.1-x86_64-unknown-linux-gnu
 695.68MiB   293.00MiB   260.64MiB  ./.rustup/toolchains/1.44.1-x86_64-unknown-linux-gnu
 639.38MiB   324.91MiB   250.28MiB  ./.rustup/toolchains/1.45.2-x86_64-unknown-linux-gnu
 659.86MiB   397.66MiB   190.44MiB  ./.rustup/toolchains/1.46.0-x86_64-unknown-linux-gnu
 791.61MiB    32.30MiB   684.37MiB  ./.rustup/toolchains/1.47.0-x86_64-unknown-linux-gnu
 262.83MiB    81.79MiB   120.42MiB  ./.rustup/toolchains/1.5.0-x86_64-unknown-linux-gnu
 277.99MiB    60.99MiB   165.76MiB  ./.rustup/toolchains/1.6.0-x86_64-unknown-linux-gnu
 277.78MiB   101.78MiB   116.42MiB  ./.rustup/toolchains/1.7.0-x86_64-unknown-linux-gnu
 288.34MiB    88.25MiB   126.02MiB  ./.rustup/toolchains/1.8.0-x86_64-unknown-linux-gnu
 294.76MiB    95.18MiB   132.70MiB  ./.rustup/toolchains/1.9.0-x86_64-unknown-linux-gnu
 652.27MiB   339.18MiB   295.33MiB  ./.rustup/toolchains/nightly-x86_64-unknown-linux-gnu
 962.66MiB   115.62MiB   768.41MiB  ./.rustup/toolchains/stable-x86_64-unknown-linux-gnu
     Total   Exclusive  Set shared  Filename
5953.21MiB  3641.29MiB   879.73MiB  ./.cargo/

1432.19 MB saved

     Total   Exclusive  Set shared  Filename
  74.34MiB    21.95MiB    52.40MiB  ./.cargo/registry/cache/github.com-0a35038f75765ae4
 180.96MiB   132.76MiB    48.21MiB  ./.cargo/registry/cache/github.com-1ecc6299db9ec823
  76.63MiB    31.00MiB    45.63MiB  ./.cargo/registry/cache/github.com-88ac128001ac3a9a
 728.87MiB   487.30MiB   241.57MiB  ./.cargo/registry/index/github.com-0a35038f75765ae4
 861.57MiB   621.80MiB   239.77MiB  ./.cargo/registry/index/github.com-1ecc6299db9ec823
 843.56MiB   836.08MiB     7.48MiB  ./.cargo/registry/index/github.com-88ac128001ac3a9a
 577.99MiB   143.04MiB   309.84MiB  ./.cargo/registry/src/github.com-0a35038f75765ae4
1916.00MiB  1074.80MiB   405.48MiB  ./.cargo/registry/src/github.com-1ecc6299db9ec823
 593.78MiB   248.54MiB   221.89MiB  ./.cargo/registry/src/github.com-88ac128001ac3a9a

(Though that last one has seen lots of changes since I last did a dedup pass.)