markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

Doesn't find duplicates without `--dedupe-options=nofiemap` #267

Closed: derobert closed this issue 3 years ago

derobert commented 3 years ago

For some reason, duperemove doesn't notice duplicates unless I run it with `--dedupe-options=nofiemap`. In particular:

I've got two files, in different snapshots on a btrfs volume:

anthony@Watt:~$ /usr/sbin/filefrag -Xe '/media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi' '/media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi' 
Filesystem type is: 9123683e
File size of /media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi is 183486794 (44797 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    7fff:  1468394f3.. 1468414f2:   8000:             shared
   1:     8000..    aefc:  14684c3f2.. 14684f2ee:   2efd:  1468414f3: last,shared,eof
/media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi: 2 extents found
Filesystem type is: 9123683e
File size of /media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi is 183486794 (44797 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     9bf:  152ed1134.. 152ed1af3:    9c0:            
   1:      9c0..    89bf:  15cb2cebe.. 15cb34ebd:   8000:  152ed1af4:
   2:     89c0..    aefc:  15cb41cfe.. 15cb4423a:   253d:  15cb34ebe: last,eof
/media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi: 3 extents found

These files are perfect duplicates:

anthony@Watt:~$ cmp -l '/media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi' '/media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi'; echo "exit code = $?"
exit code = 0

duperemove doesn't find them:

anthony@Watt:~$ duperemove '/media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi' '/media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi' 
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi
[2/2] (100.00%) csum: /media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi
Total files:  2
Total extent hashes: 5
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.

... unless I use nofiemap:

anthony@Watt:~$ duperemove --dedupe-options=nofiemap '/media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi' '/media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi' 
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi
[2/2] (100.00%) csum: /media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi
Total files:  2
Total extent hashes: 2800
Loading only duplicated hashes from hashfile.
Hashing completed. Using 4 threads to calculate duplicate extents. This may take some time.
[########################################]
Search completed with no errors.
Simple read and compare of file data found 1 instances of extents that might benefit from deduplication.
Showing 2 identical extents of length 183486794 with id 69f5c272
Start           Filename
0       "/media/anthony/BigBackup/backup/srv_videos/Sorted/00_Show_Complete/name/name.avi"
0       "/media/anthony/BigBackup/snap/2020-09-07T02:06:21-04:00/backup/srv_videos/NEWS/name/name.avi"

Seems like something is amiss there.
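
The two "Total extent hashes" figures above are consistent with one hash per extent in the default run (2 + 3 = 5, per the filefrag output) and one hash per fixed-size block with nofiemap. A quick check of the second figure, assuming the block size is 128 KiB (an assumption based on the documented default for `-b`, not something shown in the output):

```python
import math

FILE_SIZE = 183486794      # bytes, from the filefrag output above
BLOCK_SIZE = 128 * 1024    # assumption: 128 KiB hashing block size

# Number of fixed-size blocks needed to cover one file (last block is partial)
blocks_per_file = math.ceil(FILE_SIZE / BLOCK_SIZE)

# Both files have the same size, so the total is twice the per-file count
total_hashes = 2 * blocks_per_file

print(blocks_per_file)  # 1400
print(total_hashes)     # 2800
```

That matches the 2800 reported by the nofiemap run, which suggests the two runs hash the data at very different granularities.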

lorddoskias commented 3 years ago

Check the man page, in particular the question: "I got two identical files, why are they not deduped?"
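
As I understand that FAQ entry, duperemove by default works at extent granularity, so two files with identical content but different extent boundaries produce no matching hashes. The sketch below only models the layouts reported by filefrag in the original report. The `spans` helper is hypothetical, and it assumes a per-extent hash can match only when both files have an extent covering the same logical span:

```python
# Extent layouts copied from the filefrag output (hex, in 4096-byte blocks).
# Each tuple is (first logical block, last logical block).
snapshot_extents = [(0x0, 0x7fff), (0x8000, 0xaefc)]
sorted_extents = [(0x0, 0x9bf), (0x9c0, 0x89bf), (0x89c0, 0xaefc)]

def spans(extents):
    """Return the set of (start, length) spans, in blocks, for a list of extents."""
    return {(first, last - first + 1) for first, last in extents}

snapshot_spans = spans(snapshot_extents)
sorted_spans = spans(sorted_extents)

# Both layouts cover the same 44797 blocks...
snapshot_blocks = sum(length for _, length in snapshot_spans)
sorted_blocks = sum(length for _, length in sorted_spans)

# ...but no extent in one file lines up with an extent in the other.
common = snapshot_spans & sorted_spans
total_extents = len(snapshot_extents) + len(sorted_extents)

print(snapshot_blocks, sorted_blocks)  # 44797 44797
print(total_extents)                   # 5
print(common)                          # set()
```

With no aligned extents there is nothing for an extent-level comparison to match, which fits the "Found 0 identical extents" result in the default run.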