markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

No duplicates for identical files found #282

Closed. JsBergbau closed this 11 months ago.

JsBergbau commented 2 years ago

Take this script

#!/bin/bash

# Remove any previous test files (-f avoids an error on the first run)
rm -f testfile1.txt
rm -f testfile2.txt

# Append a short line ~10.5 million times, giving roughly 100 MB
for i in {1..10485760}
do
    echo "Testfile1" >> testfile1.txt
done

# Make an identical copy
cp testfile1.txt testfile2.txt

to create a 100 MB file and an identical copy. It takes a while; 100 MB is just to ensure the file is large enough.
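
For reference, the same file can be built much faster without the shell loop. A minimal sketch using standard GNU coreutils (these are not the commands from the original script):

yes "Testfile1" | head -n 10485760 > testfile1.txt
cp testfile1.txt testfile2.txt

This streams the same 10485760 lines of "Testfile1" in a few seconds instead of appending them one echo at a time.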

Then run duperemove:

Gathering file list...
Using 4 threads for file hashing phase
[1/2] (50.00%) csum: /home/theuser/mnt/testfile1.txt
[2/2] (100.00%) csum: /home/theuser/mnt/testfile2.txt
Total files:  2
Total extent hashes: 4
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
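
Before suspecting duperemove, it is worth confirming that the two files really are byte-for-byte identical. A quick sanity check (not part of the original run) is:

cmp testfile1.txt testfile2.txt && echo "identical"

cmp exits with status 0 when the files match, so with two identical 100 MB files duperemove should in principle have something to deduplicate.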

When doing this with files created by dd, it works:

duperemove .
Gathering file list...
Using 4 threads for file hashing phase
[1/5] (20.00%) csum: /home/theuser/mnt/hashes.db
[2/5] (40.00%) csum: /home/theuser/mnt/testfile1.txt
[3/5] (60.00%) csum: /home/theuser/mnt/testfile2.txt
[4/5] (80.00%) csum: /home/theuser/mnt/random1
[5/5] (100.00%) csum: /home/theuser/mnt/random2
Total files:  5
Total extent hashes: 10
Loading only duplicated hashes from hashfile.
Found 2 identical extents.
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 104857600 with id 061cbee2
Start           Filename
0       "/home/theuser/mnt/random2"
0       "/home/theuser/mnt/random1"
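
The exact dd commands used for random1 and random2 are not shown above, but a sketch of the kind of setup meant here (file names and size assumed from the output) would be:

dd if=/dev/urandom of=random1 bs=1M count=100   # 100 MiB of random data
dd if=random1 of=random2 bs=1M                  # plain byte copy, never a reflink

Because dd copies the data with ordinary read/write calls rather than cloning extents, random2 gets its own physical extents and duperemove finds real duplicates.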

What's wrong here?

Zygo commented 2 years ago

Try cp --reflink=never; otherwise, there will be nothing to dedupe. Since 2020, coreutils cp uses reflinks by default.
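
One way to check whether a copy ended up as a reflink is to compare the physical extent layout of both files, for example with filefrag from e2fsprogs (a sketch, not something run in this thread):

filefrag -v testfile1.txt testfile2.txt

On btrfs, a reflinked copy reports the same physical offsets as the original (and the extents are typically flagged as shared); in that case the space is already shared and there is nothing left for duperemove to do.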

JsBergbau commented 2 years ago

Thanks for the fast answer.

cp does not use a reflink when copying the text file, because

theuser@thesystem:~/mnt$ df -h /dev/loop0
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      507M  230M  278M  46% /home/theuser/mnt

theuser@thesystem:~/mnt$ ls -lh .
total 201M
-rwxr-xr-x 1 theuser theuser  146 Jun  1 17:20 create.sh
-rw-r--r-- 1 theuser theuser 100M Jun  1 17:23 testfile1.txt
-rw-r--r-- 1 theuser theuser 100M Jun  1 17:23 testfile2.txt

So 200 MB are in use for the two text files, i.e. nothing is shared between them.

When the files created with dd are deduped afterwards:

Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      507M  130M  378M  26% /home/theuser/mnt

So only 100 MB are used because of deduplication, as expected.
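
Besides df, btrfs can report sharing per file directly. A sketch (assuming btrfs-progs is installed and the mount is a btrfs filesystem, as the reflink behaviour suggests):

sudo btrfs filesystem du -s /home/theuser/mnt

The "Exclusive" and "Set shared" columns show how much data is unique to the files versus shared via reflinks or dedupe, which is a more precise check than watching free space in df.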

EDIT: I've also changed the last line of the bash script to cp --reflink=never testfile1.txt testfile2.txt, but as expected it makes no difference:

duperemove -d .
Gathering file list...
Using 4 threads for file hashing phase
[1/2] (50.00%) csum: /home/theuser/mnt/testfile1.txt
[2/2] (100.00%) csum: /home/theuser/mnt/testfile2.txt
Total files:  2
Total extent hashes: 3
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

JackSlateur commented 11 months ago

Hello @JsBergbau

Identical files are now deduplicated, regardless of how their extents are mapped.

Thank you for your report; feel free to reopen this issue if the problem persists.