markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Trying to only have duplication ratio #204

Open giox069 opened 6 years ago

giox069 commented 6 years ago

I'm trying to have a data duplication report of my btrfs filesystem before deciding to deduplicate it or not.

According to the duperemove man page,

When run without -d (the default) duperemove will print out one or more tables of matching extents it has determined would be ideal candidates for deduplication. As a result, readonly mode is useful for seeing what duperemove might do when run with -d. The output could also be used by some other software to submit the extents for deduplication at a later time.

But when I run

duperemove /mnt/d1/shared/tmp/

The output is

Using 128K blocks
Using hash: murmur3
Gathering file list...
Using 4 threads for file hashing phase
[1/6] (16.67%) csum: /mnt/d1/shared/tmp/dscn0808.jpg
[2/6] (33.33%) csum: /mnt/d1/shared/tmp/mm_20171020.tar.gz
[3/6] (50.00%) csum: /mnt/d1/shared/tmp/LO impress bug.odp
[4/6] (66.67%) csum: /mnt/d1/shared/tmp/MyDoc1.docx
[5/6] (83.33%) csum: /mnt/d1/shared/tmp/Cats labels.odg
[6/6] (100.00%) csum: /mnt/d1/shared/tmp/mm_20171020.tar.gz.bak
Total files:  6
Total hashes: 96
Loading only duplicated hashes from hashfile.

Where are the above-mentioned "...one or more tables of matching extents it has determined would be ideal candidates for deduplication"? I'm sure mm_20171020.tar.gz.bak is a duplicate of mm_20171020.tar.gz (870 KB in size), but I can't see duperemove reporting it anywhere.
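As a side note, a byte-for-byte comparison with cmp is one way to confirm, independently of duperemove, that two files really are exact duplicates. A minimal sketch (using throwaway files in a temp directory rather than the actual .tar.gz pair from this report):

```shell
# Create two identical files and verify byte equality with cmp -s,
# the same check you could run on mm_20171020.tar.gz and its .bak copy.
tmpdir=$(mktemp -d)
head -c 4096 /dev/urandom > "$tmpdir/archive.tar.gz"
cp "$tmpdir/archive.tar.gz" "$tmpdir/archive.tar.gz.bak"
if cmp -s "$tmpdir/archive.tar.gz" "$tmpdir/archive.tar.gz.bak"; then
    echo "identical"
else
    echo "different"
fi
rm -rf "$tmpdir"
```

Of course this only confirms whole-file duplication; duperemove is still needed to find matching extents across non-identical files.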

I also tested with -r and on a bigger filesystem (320 GB with 285,000 files), but I can't see any deduplication reports or tables.

What am I doing wrong? I'm using the stock duperemove from Ubuntu 18.04. I also tested the latest compiled master branch from this repository.

Thank you

markfasheh commented 6 years ago

I don't think you're doing anything wrong - this looks like a bug; I'll take a look.

markfasheh commented 6 years ago

Ahh, so we don't print this table when using the block dedupe algorithm, most likely because the results would be too big. Still, a summary of total duped bytes found would be useful. In the meantime, can you add --dedupe-options=noblock to your command line? It will print what you're looking for, and will probably result in better dedupe for that data set anyway.
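For reference, the suggested workaround applied to the command from the original report would look like this (not testable outside that machine, since the path is specific to this thread):

```shell
# Read-only run (no -d), with extent-based matching so the
# duplicate-extent tables are printed.
duperemove --dedupe-options=noblock /mnt/d1/shared/tmp/
```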

giox069 commented 6 years ago

Yes, --dedupe-options=noblock shows me a list of duplicate extents (including extents from different parts of different files):

...
Showing 2 identical extents of length 131072 with id 3164a4ad
Start       Filename
2359296 "/mnt/d1/shared/tmp/Optiplex755/Tools/b5bfb671-bbf5-45d8-a9c4-0baec8f7bb44.HDR"
3407872 "/mnt/d1/shared/tmp/Optiplex755/BiosUpdate_O755-A22_SLIC.exe"
Showing 2 identical extents of length 208896 with id 2c81894e
Start       Filename
0   "/mnt/d1/shared/tmp/Optiplex755/Tools/dellcm_cleanup_service/DCMStandaloneSvc1.exe"
0   "/mnt/d1/shared/tmp/Optiplex755/Tools/DCMStandaloneSvc.exe"
...

Thank you for the workaround.