Open Massimo-B opened 6 years ago
To deduplicate whole files, use fdupes mode rather than a small block size. noblock is a bad idea, because duperemove then tries to calculate extents for dedupe (which increases runtime and can, in theory, decrease the deduplication rate). --lookup-extents won't help either: it only lets duperemove avoid deduplicating extents that are already deduplicated (AFAIK).
Also, you are misusing compsize: it reports compressed size and is not meant for measuring deduplication.
To measure deduplication, use: btrfs fi du -s
You mean this mode?
fdupes -r $foo $bar | duperemove --fdupes
Why would that find more than duperemove? I thought duperemove would always find more as it also compares the extents of files. Like (as for the $HOME snapshots later) often I have pictures inside emails and as plain files. And I have some caches for webmounts that hold duplicate data with the plain files for davfs2. fdupes would not find those parts in the different file formats.
@Massimo-B, you say that you have a problem with tiny files; I am saying it is better to use fdupes for that (to fix the tiny files). You can run the default block mode afterwards, but running fdupes first makes things slightly better: duperemove will then not change the extent maps of those files during deduplication, because the files are already deduplicated.
Also, if we are talking about emails with different attachments, data in different file formats, etc., deduplication will not help, because deduplication works on block-size boundaries. I.e., if you have two 100500 GiB files in different emails and those files (attachments) are not aligned to each other by the fs block size, deduplication will not help with that.
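To illustrate the alignment point, here is a minimal sketch (not duperemove's actual algorithm) that hashes block-aligned chunks of in-memory buffers. The 4096-byte block size and the helper name `block_hashes` are assumptions made for the example.

```python
import hashlib
import os

BLOCK = 4096  # assumed filesystem block size


def block_hashes(data: bytes, block: int = BLOCK) -> set:
    """Hash every full, block-aligned chunk of data."""
    return {
        hashlib.sha256(data[off:off + block]).digest()
        for off in range(0, len(data) - block + 1, block)
    }


payload = os.urandom(16 * BLOCK)          # the shared "attachment"
aligned_a = b"A" * BLOCK + payload        # payload starts on a block boundary
aligned_b = b"B" * (2 * BLOCK) + payload  # different header, still aligned
shifted = b"C" * 100 + payload            # payload starts 100 bytes in

common_aligned = block_hashes(aligned_a) & block_hashes(aligned_b)
common_shifted = block_hashes(aligned_a) & block_hashes(shifted)
print(len(common_aligned), len(common_shifted))
```

The two aligned buffers share all 16 payload blocks, while the shifted one shares none, even though it contains exactly the same payload bytes.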
Thanks for the outline.
As for compsize, I was just using it wrong. It must include all involved space to determine the referenced part. After using fdupes | duperemove there is a large part referenced now, though I still can't interpret all details of it:
# compsize /mnt/usb/mobiledata/snapshots/bur/root/root.20171023T125129+0200/ /mnt/usb/mobiledata/snapshots/mob/root/root.20171024T105543+0200/
Processed 908775 files, 358748 regular extents (4422383 refs), 591502 inline.
Type   Perc   Disk Usage   Uncompressed   Referenced
Data    48%         8.5G            17G          26G
none   100%         2.6G           2.6G         4.0G
zlib    39%         5.9G            14G          21G
About the /home snapshots, before:
# compsize /mnt/usb/mobiledata/snapshots/bur/home/home.20171023T125129+0200/ /mnt/usb/mobiledata/snapshots/mob/home/home.20171024T073301+0200/
Processed 621227 files, 1003180 regular extents (1228103 refs), 171877 inline.
Type   Perc   Disk Usage   Uncompressed   Referenced
Data    84%          98G           117G         150G
none   100%          78G            78G         105G
zlib    52%          20G            38G          44G
After:
# compsize /mnt/usb/mobiledata/snapshots/bur/home/home.20171023T125129+0200/ /mnt/usb/mobiledata/snapshots/mob/home/home.20171024T073301+0200/
Processed 621227 files, 382966 regular extents (1168160 refs), 171877 inline.
Type   Perc   Disk Usage   Uncompressed   Referenced
Data    82%          30G            36G         150G
none   100%          23G            23G         111G
zlib    49%         6.3G            12G          39G
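One way to read the /home numbers above is to compare Referenced against Uncompressed, which roughly says how many times each stored byte is referenced. The sketch below does that arithmetic on the two "Data" rows; the helper names are made up for illustration, and the size suffixes are assumed to be binary units.

```python
# Assumed binary units for compsize-style suffixes.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> float:
    """Convert a size such as '8.5G' into bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def summarize(row: str) -> dict:
    """Parse one row: Type, Perc, Disk Usage, Uncompressed, Referenced."""
    _type, _perc, disk, uncompressed, referenced = row.split()
    disk_b, unc_b, ref_b = map(parse_size, (disk, uncompressed, referenced))
    return {
        "disk_gib": disk_b / 1024**3,
        "sharing": ref_b / unc_b,  # references per stored (uncompressed) byte
    }


before = summarize("Data 84% 98G 117G 150G")
after = summarize("Data 82% 30G 36G 150G")
print(f"sharing before: {before['sharing']:.2f}x, after: {after['sharing']:.2f}x")
print(f"disk usage dropped by {before['disk_gib'] - after['disk_gib']:.0f} GiB")
```

With these inputs the sharing factor goes from about 1.28x to about 4.17x, and disk usage drops by 68G, while Referenced stays at 150G.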
As for duperemove in combination with fdupes, is it still advised to use the smallest -b blocksize, as duperemove would still skip the small files provided by fdupes?
Why does fdupes find more duplicate files than duperemove? A drawback of periodic fdupes runs is that fdupes doesn't support hash cache files for speeding up subsequent runs. I usually had duperemove as a cronjob on, let's say, a weekly or monthly basis.
A chain of fdupes | duperemove --fdupes still skips small files.
Wouldn't it just be better for duperemove to have a knob that controls what is considered a "small file"?
Also, fdupes | duperemove --fdupes tries to re-deduplicate already deduplicated files, because fdupes has no understanding of btrfs native extents...
And duperemove just takes its output and goes "Ok", and then wastes a lot of IO doing nothing.
Like:
Queue entire file for dedupe: /mnt/btrfs/vm/m68k/chroot-0/var/cache/portage/.git/objects/pack/pack-075e6c8edaaee4b5824536ddb5795c58bcdcc4ec.idx
Queue entire file for dedupe: /mnt/btrfs/vm/m68k/chroot-0-snapshot-oct-2016/var/cache/portage/.git/objects/pack/pack-075e6c8edaaee4b5824536ddb5795c58bcdcc4ec.idx
Dedupe pass on 2 files completed
There is zero way those files have changed since I made the snapshot.
> A chain of fdupes | duperemove --fdupes still skips small files. Wouldn't it just be better for duperemove to have a knob that controls what is considered a "small file"?
In a way there is - it's the -b switch, which controls the blocksize that duperemove uses to scan files in block-scan mode. However, if the blocksize passed on the command line differs from the blocksize in the db file (in case an existing db file is loaded), then the value from the db takes precedence.
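The precedence rule described above can be sketched as follows. This is only an illustration of the stated behaviour, not duperemove's code; `effective_blocksize` and its parameters are hypothetical names.

```python
from typing import Optional


def effective_blocksize(cli_blocksize: int, db_blocksize: Optional[int]) -> int:
    """Pick the blocksize as described above: a loaded db file wins over -b."""
    if db_blocksize is not None and db_blocksize != cli_blocksize:
        return db_blocksize  # value stored in the existing db file
    return cli_blocksize     # no db loaded, or both agree


print(effective_blocksize(4096, None))     # no existing db file
print(effective_blocksize(4096, 131072))   # existing db file created with 128K
```

Under this reading, passing a smaller -b together with an existing hashfile would have no effect until the hashfile is deleted.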
> In a way there is - it's the -b switch which controls the blocksize that duperemove would use to scan file in block-scan mode.
But that has a min cap of 4K, which effectively prohibits deduplicating the vast quantities of small files which are duplicates.
Here's a breakdown so you can get a sense of the problem:
Count Size
343 10-100
420 10-50
501 0-10
6006 >4000
13349 2000-4000
24888 1000-2000
35010 500-1000
41173 100-500
Deduplications within the repo may not be possible, however, when you literally have multiple copies of the repo checked out, there will be vast amounts of duplication.
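A breakdown like the one above could be produced with something along these lines. This is a sketch, not the script that was actually used; the bucket edges and function names are assumptions picked for illustration.

```python
import os
from collections import Counter

# Bucket edges in bytes; chosen for illustration only.
EDGES = [0, 10, 50, 100, 500, 1000, 2000, 4000]


def bucket(size: int) -> str:
    """Return a label such as '100-500' or '>=4000' for a file size."""
    for low, high in zip(EDGES, EDGES[1:]):
        if low <= size < high:
            return f"{low}-{high}"
    return f">={EDGES[-1]}"


def size_histogram(root: str) -> Counter:
    """Count regular files under root per size bucket (symlinks skipped)."""
    counts = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                counts[bucket(os.path.getsize(path))] += 1
    return counts


# Usage: print buckets sorted by count, e.g. for a checked-out repo.
# for label, count in sorted(size_histogram(".").items(), key=lambda kv: kv[1]):
#     print(f"{count:>8} {label}")
```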
> However if the passed in blocksize from command line is different than the blocksize in the db file (in case an existing db file is loaded) then the value from the db takes precedence
Sure, but even then, fdupes mode doesn't support a database anyway, and yet it still complains about small files with the smallest permissible -b value.
The sizes you show are in bytes, right? So with the current status quo you are able to dedupe at most 6006 files (those above 4000 bytes), and duperemove treats the rest as too small?
I suspect so. Though I guess in theory I could duct-tape together something that takes the output of fdupes and just does `cp --reflink=always a b`...
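A rough sketch of that duct-tape idea is below. It assumes fdupes prints one path per line with a blank line between groups of duplicates, and it only generates the cp commands as a dry run rather than executing them. The function names are made up for this example.

```python
import shlex


def parse_fdupes(text: str) -> list:
    """Split fdupes output into groups of duplicate paths (blank-line separated)."""
    groups, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def reflink_commands(groups: list) -> list:
    """For each group, reflink the first file over every other member."""
    commands = []
    for first, *rest in groups:
        for other in rest:
            commands.append(
                f"cp --reflink=always -- {shlex.quote(first)} {shlex.quote(other)}"
            )
    return commands


sample = "/a/one.idx\n/b/one.idx\n\n/a/two\n/b/two\n/c/two\n"
for cmd in reflink_commands(parse_fdupes(sample)):
    print(cmd)
```

Note that, unlike the kernel dedupe ioctl that duperemove uses, cp --reflink does not verify that both files still have identical content at the moment of the copy, so this is only safe on files that nothing is modifying.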
It's hard to know exactly what is happening, in part due to duperemove repeatedly re-deduplicating things that already share the underlying extents when run in fdupes mode. Also, duperemove combined with "-v" produces so much output that it ceases to be useful: it is impossible to see what is happening beyond the complaints about small files.
So one possible way to fix this would be to simply ignore small files when run in fdupes mode. This makes sense because it's expected that the user knows what they are doing. Having said that, I wonder why anyone would want to use fdupes + duperemove and not simply fdupes directly in the context of small files. Duperemove doesn't record any hashes for subsequent runs when working in fdupes mode.
> and not simply fdupes directly in the context of small files?
Er, fdupes is not useful here.
Deleting the duplicates is not a feature I desire. SOMETHING is using those files, and they have to continue to exist at their given path with the given content.
Avoiding the duplication of the blocks is the feature.
But I do have to say, it's not all doom and gloom ;).
I get good results from deduping my ~/.rustup and ~/.cargo dirs, which have heavy duplication: multiple copies of the same library, multiple copies of the same files. (I have 48 copies of rust, and I'm doing aggressive testing on like 180 versions of the same thing, but actually removing those files is not an option; it would cause massive breakage.)
Total Exclusive Set shared Filename
28.58GiB 9.28GiB 11.30GiB ./.rustup
8 GB saved
Total Exclusive Set shared Filename
375.95MiB 148.35MiB 165.25MiB ./.rustup/toolchains/1.0.0-x86_64-unknown-linux-gnu
308.04MiB 93.54MiB 144.65MiB ./.rustup/toolchains/1.10.0-x86_64-unknown-linux-gnu
386.50MiB 140.67MiB 183.30MiB ./.rustup/toolchains/1.1.0-x86_64-unknown-linux-gnu
306.83MiB 81.38MiB 155.65MiB ./.rustup/toolchains/1.11.0-x86_64-unknown-linux-gnu
368.39MiB 100.38MiB 200.96MiB ./.rustup/toolchains/1.12.1-x86_64-unknown-linux-gnu
272.55MiB 82.80MiB 101.06MiB ./.rustup/toolchains/1.13.0-x86_64-unknown-linux-gnu
322.07MiB 26.90MiB 151.80MiB ./.rustup/toolchains/1.14.0-x86_64-unknown-linux-gnu
221.20MiB 27.93MiB 108.40MiB ./.rustup/toolchains/1.15.1-x86_64-unknown-linux-gnu
227.55MiB 16.75MiB 123.35MiB ./.rustup/toolchains/1.16.0-x86_64-unknown-linux-gnu
371.36MiB 94.36MiB 180.59MiB ./.rustup/toolchains/1.17.0-x86_64-unknown-linux-gnu
422.30MiB 82.08MiB 230.18MiB ./.rustup/toolchains/1.18.0-x86_64-unknown-linux-gnu
444.09MiB 92.60MiB 233.27MiB ./.rustup/toolchains/1.19.0-x86_64-unknown-linux-gnu
459.69MiB 97.08MiB 242.17MiB ./.rustup/toolchains/1.20.0-x86_64-unknown-linux-gnu
279.43MiB 97.89MiB 128.18MiB ./.rustup/toolchains/1.2.0-x86_64-unknown-linux-gnu
469.19MiB 106.27MiB 244.03MiB ./.rustup/toolchains/1.21.0-x86_64-unknown-linux-gnu
489.62MiB 142.70MiB 226.57MiB ./.rustup/toolchains/1.22.1-x86_64-unknown-linux-gnu
581.41MiB 171.71MiB 256.81MiB ./.rustup/toolchains/1.23.0-x86_64-unknown-linux-gnu
610.45MiB 188.35MiB 265.38MiB ./.rustup/toolchains/1.24.1-x86_64-unknown-linux-gnu
537.54MiB 321.75MiB 137.04MiB ./.rustup/toolchains/1.25.0-x86_64-unknown-linux-gnu
766.87MiB 243.68MiB 413.24MiB ./.rustup/toolchains/1.26.2-x86_64-unknown-linux-gnu
780.83MiB 386.40MiB 276.62MiB ./.rustup/toolchains/1.27.2-x86_64-unknown-linux-gnu
763.04MiB 146.36MiB 501.81MiB ./.rustup/toolchains/1.28.0-x86_64-unknown-linux-gnu
685.93MiB 244.47MiB 324.24MiB ./.rustup/toolchains/1.29.2-x86_64-unknown-linux-gnu
763.38MiB 311.58MiB 331.18MiB ./.rustup/toolchains/1.30.1-x86_64-unknown-linux-gnu
279.83MiB 118.04MiB 110.83MiB ./.rustup/toolchains/1.3.0-x86_64-unknown-linux-gnu
827.60MiB 108.26MiB 591.06MiB ./.rustup/toolchains/1.31.0-x86_64-unknown-linux-gnu
770.02MiB 167.61MiB 475.54MiB ./.rustup/toolchains/1.31.1-x86_64-unknown-linux-gnu
808.75MiB 250.06MiB 432.71MiB ./.rustup/toolchains/1.32.0-x86_64-unknown-linux-gnu
865.49MiB 309.24MiB 350.46MiB ./.rustup/toolchains/1.33.0-x86_64-unknown-linux-gnu
828.39MiB 333.77MiB 288.04MiB ./.rustup/toolchains/1.34.2-x86_64-unknown-linux-gnu
878.77MiB 215.89MiB 446.83MiB ./.rustup/toolchains/1.35.0-x86_64-unknown-linux-gnu
893.25MiB 301.61MiB 370.53MiB ./.rustup/toolchains/1.36.0-x86_64-unknown-linux-gnu
840.39MiB 248.17MiB 372.66MiB ./.rustup/toolchains/1.37.0-x86_64-unknown-linux-gnu
1067.11MiB 387.14MiB 480.45MiB ./.rustup/toolchains/1.38.0-x86_64-unknown-linux-gnu
1079.76MiB 404.79MiB 474.46MiB ./.rustup/toolchains/1.39.0-x86_64-unknown-linux-gnu
663.15MiB 288.47MiB 240.87MiB ./.rustup/toolchains/1.40.0-x86_64-unknown-linux-gnu
278.62MiB 101.63MiB 117.72MiB ./.rustup/toolchains/1.4.0-x86_64-unknown-linux-gnu
670.31MiB 238.15MiB 296.41MiB ./.rustup/toolchains/1.41.1-x86_64-unknown-linux-gnu
674.76MiB 275.57MiB 262.04MiB ./.rustup/toolchains/1.42.0-x86_64-unknown-linux-gnu
686.07MiB 240.50MiB 305.02MiB ./.rustup/toolchains/1.43.1-x86_64-unknown-linux-gnu
695.68MiB 293.00MiB 260.64MiB ./.rustup/toolchains/1.44.1-x86_64-unknown-linux-gnu
639.38MiB 324.91MiB 250.28MiB ./.rustup/toolchains/1.45.2-x86_64-unknown-linux-gnu
659.86MiB 397.66MiB 190.44MiB ./.rustup/toolchains/1.46.0-x86_64-unknown-linux-gnu
791.61MiB 32.30MiB 684.37MiB ./.rustup/toolchains/1.47.0-x86_64-unknown-linux-gnu
262.83MiB 81.79MiB 120.42MiB ./.rustup/toolchains/1.5.0-x86_64-unknown-linux-gnu
277.99MiB 60.99MiB 165.76MiB ./.rustup/toolchains/1.6.0-x86_64-unknown-linux-gnu
277.78MiB 101.78MiB 116.42MiB ./.rustup/toolchains/1.7.0-x86_64-unknown-linux-gnu
288.34MiB 88.25MiB 126.02MiB ./.rustup/toolchains/1.8.0-x86_64-unknown-linux-gnu
294.76MiB 95.18MiB 132.70MiB ./.rustup/toolchains/1.9.0-x86_64-unknown-linux-gnu
652.27MiB 339.18MiB 295.33MiB ./.rustup/toolchains/nightly-x86_64-unknown-linux-gnu
962.66MiB 115.62MiB 768.41MiB ./.rustup/toolchains/stable-x86_64-unknown-linux-gnu
Total Exclusive Set shared Filename
5953.21MiB 3641.29MiB 879.73MiB ./.cargo/
1432.19 MB saved
Total Exclusive Set shared Filename
74.34MiB 21.95MiB 52.40MiB ./.cargo/registry/cache/github.com-0a35038f75765ae4
180.96MiB 132.76MiB 48.21MiB ./.cargo/registry/cache/github.com-1ecc6299db9ec823
76.63MiB 31.00MiB 45.63MiB ./.cargo/registry/cache/github.com-88ac128001ac3a9a
728.87MiB 487.30MiB 241.57MiB ./.cargo/registry/index/github.com-0a35038f75765ae4
861.57MiB 621.80MiB 239.77MiB ./.cargo/registry/index/github.com-1ecc6299db9ec823
843.56MiB 836.08MiB 7.48MiB ./.cargo/registry/index/github.com-88ac128001ac3a9a
577.99MiB 143.04MiB 309.84MiB ./.cargo/registry/src/github.com-0a35038f75765ae4
1916.00MiB 1074.80MiB 405.48MiB ./.cargo/registry/src/github.com-1ecc6299db9ec823
593.78MiB 248.54MiB 221.89MiB ./.cargo/registry/src/github.com-88ac128001ac3a9a
( Though that last one has seen lots of changes since I last did a dedup pass )
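The "saved" figures quoted above appear to match Total minus Exclusive minus Set shared from the summary rows. A quick check of that arithmetic (the function name is just for illustration):

```python
def saved(total: float, exclusive: float, set_shared: float) -> float:
    """Referenced size minus what is actually stored (exclusive + shared once)."""
    return total - exclusive - set_shared


rustup = saved(28.58, 9.28, 11.30)        # GiB, from the ./.rustup row
cargo = saved(5953.21, 3641.29, 879.73)   # MiB, from the ./.cargo/ row
print(f"{rustup:.2f} GiB, {cargo:.2f} MiB")
```

This gives 8.00 GiB for ./.rustup and 1432.19 MiB for ./.cargo/, in line with the numbers reported.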
Hi, I have a central backup btrfs to which snapshots of 2 different machines' root filesystems are transferred via btrbk. Those snapshots are largely identical, as they were cloned Gentoo installations, but the binaries were entirely rebuilt for a different platform. I would like to deduplicate the initial snapshots of both, and later repeat that periodically. For duperemove, I first set both snapshots to read-write.
I compare the result of the deduplication using compsize.
First, I was surprised that most files were reported as "Skipping small file...". Resulting hashfile size compared to the target files:
Decreasing the blocksize to -b 4096 did not find new files when reusing the same hashfile. However, after deleting the hashfile, the new hashfile grew a bit, but many files were still skipped. This run took ~1 day instead of ~1 hour:
AFAIK the minimal usable blocksize is limited by the kernel and the btrfs blocksize. How can I detect the blocksize of my btrfs?
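As a hedged pointer for the blocksize question: the block size that a mounted filesystem reports can be queried with statvfs. On btrfs this is assumed to correspond to the sector size; the sketch below uses "/" only as a placeholder path.

```python
import os


def fs_block_size(path: str) -> int:
    """Return the block size the filesystem reports for the given path."""
    return os.statvfs(path).f_bsize


# Replace "/" with the mount point of the btrfs in question.
print(fs_block_size("/"))
```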
Would --dedupe-options=noblock change the amount of deduplication, or just optimize the number of extents?
Could --lookup-extents=yes help finding more duplicates?
Finally, I still wonder why compsize "Referenced" is equal to "Uncompressed", just like before the dedupe: