finddupe v1.25 incorrectly detects duplicate files on Windows 10

jeremitu / finddupe

Port of finddupe duplicate file detector for Windows by Matthias Wandel http://www.sentex.net/~mwandel/finddupe/


finddupe v1.25 incorrectly detects duplicate files on Windows 10 #1

Closed: vocatus closed this issue 4 years ago

vocatus commented 5 years ago

Hi, I run the Tron project and the project has used your excellent finddupe.exe port for a few years now to clean up duplicate files in users' download folders.

However, I recently got a user report that the tool is incorrectly marking files as duplicates, even files that don't have the same extension.

See the post with details here.

Would it be possible to add a switch to instruct finddupe.exe to read the ENTIRE file for hash computation instead of just the first 32KB?

jeremitu commented 4 years ago

Hi, finddupe.exe does not look at the file name or extension; only the file content matters. The first 32KB are used to find candidates for comparison, and ENTIRE files are compared byte for byte before they are marked as duplicates. Perhaps the files in question were empty?
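To make the two stages concrete, here is a minimal Python sketch of the idea. It is not the finddupe source (which is written in C). The function names, the use of SHA-256, and the inclusion of the file size in the candidate key are assumptions made for illustration only.

```python
import hashlib
import os
from collections import defaultdict

PREFIX_BYTES = 32 * 1024  # size of the prefix used to pick candidates
CHUNK_BYTES = 64 * 1024   # read size for the full comparison (arbitrary choice)


def prefix_signature(path):
    """Return (file size, digest of the first 32KB) as a cheap candidate key.

    Assumption: using the size and SHA-256 here is illustrative, not
    necessarily what finddupe itself computes.
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read(PREFIX_BYTES)).hexdigest()
    return os.path.getsize(path), digest


def same_content(path_a, path_b):
    """Compare two files byte for byte over their entire length."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(CHUNK_BYTES)
            chunk_b = fb.read(CHUNK_BYTES)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def find_duplicates(paths):
    """Return (duplicate, original) pairs confirmed by a full comparison."""
    # Stage 1: group files by the cheap signature to find candidates.
    candidates = defaultdict(list)
    for path in paths:
        candidates[prefix_signature(path)].append(path)

    # Stage 2: within each group, confirm every match by reading whole files.
    pairs = []
    for group in candidates.values():
        originals = []
        for path in group:
            match = next((o for o in originals if same_content(o, path)), None)
            if match is None:
                originals.append(path)
            else:
                pairs.append((path, match))
    return pairs
```

In this sketch the name and extension never enter the comparison, so two empty files are reported as duplicates of each other regardless of what they are called.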

vocatus commented 4 years ago

Hi @jeremitu,

Thanks for the response. I understand finddupe reads the first 32KB, but we were having the issue where it was deleting duplicate files that were entirely different from each other. Here is one example from the thread I linked:

Deleted duplicate
Duplicate: 'C:\Users\splin\Downloads\Formacao do Brasil Contemporane - Caio Prado Jr.pdf'
With: 'C:\Users\splin\Downloads\StataCorp Stata 14.2 (Revision May 4, 2017)\utilities\java\windows-i586\jre1.8.0_121\lib\security\trusted.libraries'

I don't think these files would be empty? One is a PDF and the other is the trusted.libraries text file.

jeremitu commented 4 years ago

Hi @vocatus,

Your understanding is not backed by enough evidence. Please refer to the function EliminateDuplicate(), which reads and compares the entire files. A bug is always possible, but you have not verified that the contents of the files you mentioned were actually different. You do not even know whether they were empty.
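One way to gather that evidence is to check a reported pair independently of finddupe, before anything is deleted. The following is a small Python sketch using only the standard library. The helper name `describe_pair` and the keys of the returned dictionary are hypothetical.

```python
import filecmp
import os


def describe_pair(path_a, path_b):
    """Report the sizes of two files and whether their contents are identical."""
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    # shallow=False makes filecmp read and compare the file contents
    # instead of trusting os.stat() metadata alone.
    identical = filecmp.cmp(path_a, path_b, shallow=False)
    return {
        "size_a": size_a,
        "size_b": size_b,
        "both_empty": size_a == 0 and size_b == 0,
        "identical": identical,
    }
```

If `both_empty` is true, the pair is a legitimate content duplicate even though the names and extensions differ. If `identical` is false for a pair that finddupe reported, that would be evidence of a real bug.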

I created some test files: bla.pdf, trusted.txt, test1.txt, test1 - Copy.txt, test1 - different.txt. Please go ahead and check with finddupe.

Perhaps it would make sense to add an option like -sameext to also compare by extension. Feel free to add it; it is FOSS.
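As a rough illustration of what such an option could do, here is a Python sketch that filters already-confirmed duplicate pairs by extension. Note that -sameext is only a proposal in this thread, not an existing finddupe switch. The helper names and the case-insensitive comparison are assumptions.

```python
import ntpath


def extension_of(path):
    """Return the lowercased extension, e.g. '.pdf', or '' when there is none.

    ntpath is used so that Windows-style paths are split the same way
    on any platform. Lowercasing is an assumption: Windows file names
    are normally treated as case-insensitive.
    """
    return ntpath.splitext(path)[1].lower()


def filter_same_extension(pairs):
    """Keep only (duplicate, original) pairs whose extensions match."""
    return [
        (dup, orig)
        for dup, orig in pairs
        if extension_of(dup) == extension_of(orig)
    ]
```

With a filter like this, a .pdf file and a .libraries file would never be paired, even if both were empty.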