markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

duperemove and cp --reflink=always file #205

Open minecraft2048 opened 6 years ago

minecraft2048 commented 6 years ago

Does duperemove skip files that were copied with cp --reflink=always? I think it should, but the operation doesn't finish immediately.

How I tested it:

mkdir testfile
cd testfile
fallocate -l 5G test.img
cp --reflink=always test.img test1.imt
duperemove --lookup-extents=yes -d  .

I expected duperemove to start, recognize that both files already share the same extents via the reflink, and skip them. Instead it pegs one of my CPU cores and takes 5-10 minutes to finish, only to report that 0 extents were changed.
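A rough illustration of where that time goes: duperemove's hashing phase checksums every block of every file before extent ownership is considered, so two reflinked copies are each read end to end. Here is a toy sketch of that per-block hashing, using SHA-256 from the Python stdlib in place of duperemove's murmur3 and ordinary files in place of a btrfs volume (names and sizes are illustrative):

```python
import hashlib
import os
import tempfile
from collections import defaultdict

BLOCK_SIZE = 128 * 1024  # duperemove's default 128K block size

def block_hashes(path):
    """Checksum a file block by block, as the hashing phase does."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

def duplicate_blocks(paths):
    """Group (path, block index) pairs by checksum; any bucket with
    more than one entry is a dedupe candidate."""
    buckets = defaultdict(list)
    for path in paths:
        for i, h in enumerate(block_hashes(path)):
            buckets[h].append((path, i))
    return {h: locs for h, locs in buckets.items() if len(locs) > 1}

# Two byte-identical files: whether they were reflink-copied or written
# independently, this phase reads both in full.
with tempfile.TemporaryDirectory() as d:
    a = os.path.join(d, "test.img")
    b = os.path.join(d, "test1.imt")
    data = os.urandom(BLOCK_SIZE * 4)  # 512K of random data, 4 blocks
    for p in (a, b):
        with open(p, "wb") as f:
            f.write(data)
    dups = duplicate_blocks([a, b])
    print(len(dups))  # 4 duplicate checksums, each seen in both files
```

This is why the run is CPU-bound and scales with file size even when the copies already share extents; the --lookup-extents / noblock options discussed below change how the later phases treat those blocks, not whether they get hashed.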

markfasheh commented 6 years ago

Hi, can you try with --dedupe-options=noblock ?

minecraft2048 commented 6 years ago

It is much faster, and the output matches the example output in the README:

With --dedupe-options=noblock:

duperemove  --lookup-extents=yes --dedupe-options=noblock  -dhr .
Using 128K blocks
Using hash: murmur3
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/feanor/testfile/test.img
[2/2] (100.00%) csum: /home/feanor/testfile/test1.imt
Total files:  2
Total hashes: 81920
Loading only duplicated hashes from hashfile.
Hashing completed. Using 4 threads to calculate duplicate extents. This may take some time.
[########################################]
Search completed with no errors.             
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 5.0G with id 78d76f4b
Start       Filename
0.0 "/home/feanor/testfile/test1.imt"
0.0 "/home/feanor/testfile/test.img"
Using 8 threads for dedupe phase
[0x560769e78d40] (1/1) Try to dedupe extents with id 78d76f4b
[0x560769e78d40] Dedupe 1 extents (id: 78d76f4b) with target: (0.0, 5.0G), "/home/feanor/testfile/test1.imt"
Kernel processed data (excludes target files): 5.0G
Comparison of extent info shows a net change in shared extents of: 0.0

Without --dedupe-options=noblock:

Using 128K blocks
Using hash: murmur3
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/feanor/testfile/test.img
[2/2] (100.00%) csum: /home/feanor/testfile/test1.imt
Total files:  2
Total hashes: 81920
Loading only duplicated hashes from hashfile.
Using 8 threads for dedupe phase
[0x55e2165f4e80] (1/1) Try to dedupe extents with id e47862ea

and then it continues, but since my primary problem has been solved, this is good enough for me.

reedriley commented 6 years ago

+1, this is happening for me as well. It also happens with files I've previously deduplicated using duperemove if I remove the hashfile.

More specifically, I'm running BTRFS (kernel: 4.14.65-gentoo, userspace tools: v4.16.1) on my NAS. I'm using duperemove v0.11, and running it as root. Here's the specific command line I've reproduced this with:

/usr/sbin/duperemove --hashfile=/root/storage.hash -b 1M --skip-zeroes --dedupe-options=noblock -A -dhr --io-threads=2 --cpu-threads=2 /storage

reedriley commented 6 years ago

Ah, never mind - I may have answered my own question here: I'm not passing --lookup-extents=yes. If that's it, please disregard!

JackSlateur commented 9 months ago

Hello,

Extents that are already shared are always hashed: without this, we would not be able to deduplicate a new, unshared file that may store the same data.

While the existing hashes could be copied instead of being recomputed, I believe the complexity is not worth it.

Your test script is now quite fast:

$ fallocate -l 5G test.img
$ cp --reflink=always test.img test1.imt
$ /usr/bin/time -v duperemove -drh . --dedupe-options=partial |& grep -e Elapsed -e "Maximum resident set size"
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.38
    Maximum resident set size (kbytes): 21472

If I test with actual data:

$ dd if=/dev/urandom bs=1G count=5 of=test.img
$ cp --reflink=always test.img test1.imt
$ /usr/bin/time -v duperemove -drh . --dedupe-options=partial |& grep -e Elapsed -e "Maximum resident set size"
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:27.81
    Maximum resident set size (kbytes): 33476

Indeed, the time spent in the dedupe phase could be improved (in that last test, it took 4 seconds to hash the 10 GB, and 23 seconds to deduplicate the already-shared extents).
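The point above - hashes of already-shared extents must be kept so that a later, unshared file holding the same bytes can still be matched - can be sketched with a toy in-memory stand-in for duperemove's hashfile. As before, SHA-256 substitutes for murmur3 and plain temp files for a CoW filesystem; all names here are illustrative:

```python
import hashlib
import os
import tempfile

BLOCK = 128 * 1024  # 128K blocks, matching duperemove's default

def hash_blocks(path):
    """Return the per-block digests for one file."""
    with open(path, "rb") as f:
        return [hashlib.sha256(c).digest() for c in iter(lambda: f.read(BLOCK), b"")]

with tempfile.TemporaryDirectory() as d:
    data = os.urandom(BLOCK * 2)
    shared_a = os.path.join(d, "test.img")
    shared_b = os.path.join(d, "test1.imt")  # stands in for the reflink copy
    for p in (shared_a, shared_b):
        with open(p, "wb") as f:
            f.write(data)

    # Toy "hashfile" built from the two already-shared files.
    hashfile = {h for p in (shared_a, shared_b) for h in hash_blocks(p)}

    # A new, unshared file written independently with the same payload:
    # its blocks match only because the shared extents were hashed too.
    new = os.path.join(d, "copy-from-backup.img")
    with open(new, "wb") as f:
        f.write(data)
    matches = [h in hashfile for h in hash_blocks(new)]
    print(all(matches))  # True: every block of the new file is a candidate
```

Had the shared extents been skipped during hashing, the new file's blocks would find no partner in the hashfile and would stay unshared on disk.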