markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
816 stars · 81 forks

Deduplication does not work for two folders with the same files #347

Closed Ratio2 closed 1 month ago

Ratio2 commented 2 months ago

It is very strange that deduplication does not work in such a simple case with default settings (i.e. without the -B/--batchsize option). Version: master

#!/usr/bin/env bash
set -e

mkdir tmp
btrfs property set tmp compression none
mkdir tmp/1
for i in {0..1023}; do
    dd bs=128k count=1 if=/dev/urandom of=tmp/1/$i
done
cp -a --reflink=never tmp/1 tmp/2
sync
duperemove -dr tmp
sync
sudo compsize -x tmp
Processed 2048 files, 2048 regular extents (2048 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      256M         256M         256M       
none       100%      256M         256M         256M  
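The compsize output above shows no shared extents, even though every file in tmp/2 is a byte-identical copy of its counterpart in tmp/1. The whole-file case duperemove should catch here can be sketched conceptually (this is not duperemove's actual code, just an illustration of the expected grouping): scan a tree, hash each file's full contents, and treat any digest with two or more paths as a dedupe candidate.

```python
# Conceptual sketch (not duperemove's implementation): group byte-identical
# files by content hash, the way the reproducer expects tmp/1/<i> and
# tmp/2/<i> to be paired for deduplication.
import hashlib
import os
from collections import defaultdict

def find_duplicate_files(root):
    """Map content digest -> list of paths whose contents are identical."""
    groups = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 128 KiB chunks, matching the dd block size above.
                for chunk in iter(lambda: f.read(128 * 1024), b""):
                    h.update(chunk)
            groups[h.hexdigest()].append(path)
    # Only digests seen at more than one path are dedupe candidates.
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

In the reproducer, every one of the 1024 digests should map to exactly two paths, one under tmp/1 and one under tmp/2.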
JackSlateur commented 2 months ago

@Ratio2 Thank you for the detailed bug report

I believe this is now fixed; feel free to reopen the issue if not.

Ratio2 commented 2 months ago

It looks like a similar change needs to be made for block- and extent-level matching as well:

#!/usr/bin/env bash
set -e

rm -rf tmp tmp.sqlite3
mkdir tmp
btrfs property set tmp compression none
mkdir tmp/1
for i in {0..1023}; do
    dd bs=128k count=1 if=/dev/urandom >tmp/1/$i
done
cp -a --reflink=never tmp/1 tmp/2
for i in {0..1023}; do
    dd bs=128k count=1 if=/dev/urandom >>tmp/1/$i
done
sync -f .
duperemove -dr --hashfile=tmp.sqlite3 --dedupe-options=partial tmp
sync -f .
sudo compsize -x tmp
Processed 2048 files, 2048 regular extents (2048 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced  
TOTAL      100%      384M         384M         384M       
none       100%      384M         384M         384M
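In this second reproducer, each tmp/1/<i> is 256 KiB (original block plus an appended one) while tmp/2/<i> is the original 128 KiB, so a whole-file hash no longer matches and only block-level comparison (what --dedupe-options=partial enables) can find the shared first block. A minimal sketch of that idea, assuming fixed 128 KiB blocks (hypothetical helper names, not duperemove's internals):

```python
# Conceptual sketch (not duperemove's implementation): per-block hashing
# finds duplicate regions between files whose whole-file hashes differ.
import hashlib

BLOCK = 128 * 1024  # matches the dd block size in the reproducer

def block_digests(data):
    """Hash each fixed-size block of a file's contents."""
    return [hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)]

def shared_blocks(a, b):
    """Return (block_index_in_a, block_index_in_b) pairs of identical blocks."""
    index = {}
    for i, digest in enumerate(block_digests(a)):
        index.setdefault(digest, i)
    return [(index[digest], j)
            for j, digest in enumerate(block_digests(b))
            if digest in index]
```

For a 256 KiB file and a 128 KiB file sharing only their first block, this reports a single match at block 0 of each file, which is exactly the region a partial dedupe pass should submit for extent sharing.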