minecraft2048 opened this issue 6 years ago
Hi, can you try with --dedupe-options=noblock?
It is much faster, and the output matches the example output in the README.

With --dedupe-options=noblock:
duperemove --lookup-extents=yes --dedupe-options=noblock -dhr .
Using 128K blocks
Using hash: murmur3
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/feanor/testfile/test.img
[2/2] (100.00%) csum: /home/feanor/testfile/test1.imt
Total files: 2
Total hashes: 81920
Loading only duplicated hashes from hashfile.
Hashing completed. Using 4 threads to calculate duplicate extents. This may take some time.
[########################################]
Search completed with no errors.
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 5.0G with id 78d76f4b
Start Filename
0.0 "/home/feanor/testfile/test1.imt"
0.0 "/home/feanor/testfile/test.img"
Using 8 threads for dedupe phase
[0x560769e78d40] (1/1) Try to dedupe extents with id 78d76f4b
[0x560769e78d40] Dedupe 1 extents (id: 78d76f4b) with target: (0.0, 5.0G), "/home/feanor/testfile/test1.imt"
Kernel processed data (excludes target files): 5.0G
Comparison of extent info shows a net change in shared extents of: 0.0
Without --dedupe-options=noblock:
Using 128K blocks
Using hash: murmur3
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/feanor/testfile/test.img
[2/2] (100.00%) csum: /home/feanor/testfile/test1.imt
Total files: 2
Total hashes: 81920
Loading only duplicated hashes from hashfile.
Using 8 threads for dedupe phase
[0x55e2165f4e80] (1/1) Try to dedupe extents with id e47862ea
and then it continues, but as my primary problem has been solved, this is good enough.
+1, this is happening for me as well. It also happens with files I've previously deduplicated using duperemove if I remove the hashfile.
More specifically, I'm running BTRFS (kernel: 4.14.65-gentoo, userspace tools: v4.16.1) on my NAS. I'm using duperemove v0.11, and running it as root. Here's the specific command line I've reproduced this with:
/usr/sbin/duperemove --hashfile=/root/storage.hash -b 1M --skip-zeroes --dedupe-options=noblock -A -dhr --io-threads=2 --cpu-threads=2 /storage
Ah, never mind - I may have answered my own question here: I'm not passing --lookup-extents=yes. If that's it, please disregard!
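For reference, the earlier command with the missing flag added would look like this (a sketch only; the paths, block size, and thread counts are the reporter's, and whether this resolves the issue is unconfirmed in the thread):

```shell
/usr/sbin/duperemove --hashfile=/root/storage.hash -b 1M --skip-zeroes \
    --lookup-extents=yes --dedupe-options=noblock -A -dhr \
    --io-threads=2 --cpu-threads=2 /storage
```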
Hello,

Extents that are already shared are always hashed: without this, we would not be able to deduplicate a new, unshared file that may store the same data.

While the existing hashes could be copied instead of being recomputed, I believe the complexity is not worth it. Your test script is now quite fast:
$ fallocate -l 5G test.img
$ cp --reflink=always test.img test1.imt
$ /usr/bin/time -v duperemove -drh . --dedupe-options=partial |& grep -e Elapsed -e "Maximum resident set size"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.38
Maximum resident set size (kbytes): 21472
If I test with actual data:
$ dd if=/dev/urandom bs=1G count=5 of=test.img
$ cp --reflink=always test.img test1.imt
$ /usr/bin/time -v duperemove -drh . --dedupe-options=partial |& grep -e Elapsed -e "Maximum resident set size"
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:27.81
Maximum resident set size (kbytes): 33476
Indeed, the time spent on the dedupe phase could be improved: in that last test, it took 4 seconds to hash the 10GB, but 23 seconds to deduplicate the already-shared extents.
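The point above, that a new unshared file holding the same data can only be matched if already-shared extents are hashed, can be sketched as follows (a minimal illustration; file names mirror the test above, and --reflink=auto is used instead of --reflink=always so the copy degrades to a plain copy on filesystems without reflink support rather than erroring):

```shell
# Reflinked copy vs. plain copy of identical data.
mkdir -p /tmp/dedupe-demo && cd /tmp/dedupe-demo
dd if=/dev/urandom of=test.img bs=1M count=4 status=none
cp --reflink=auto  test.img test1.imt   # shares extents where supported
cp --reflink=never test.img test2.img   # same bytes, separate extents
# test2.img is byte-identical but unshared: only by hashing the
# already-shared extents of test.img/test1.imt can duperemove match it.
cmp test.img test2.img && echo "identical data, unshared extents"
```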
Does duperemove skip files that are copied with cp --reflink=always? I think that it should, but it doesn't finish the operation immediately.

How I tested it:

I expected it to start, recognize that both files already share the same reflinked extents, and skip them. Instead, duperemove pegs one of my CPUs and takes 5-10 minutes to finish, only to report a net change of 0 shared extents.
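One way to check whether two files already share extents before running duperemove at all is with filefrag from e2fsprogs (a sketch; file names follow the test earlier in the thread, and this only works on filesystems that support the FIEMAP ioctl, such as btrfs, XFS, and ext4):

```shell
# filefrag -v prints each extent's physical offset: a reflinked copy
# reports the same physical offsets as its source, and btrfs
# additionally marks such extents with a "shared" flag.
filefrag -v test.img
filefrag -v test1.imt
```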