vocatus closed this issue 4 years ago
Hi, finddupe.exe does not look at the file name or extension; only file content matters. The first 32KB are used to find candidates for comparison, and ENTIRE files are then compared byte for byte before they are marked as duplicates. Perhaps the files in question were empty?
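To make the two stages concrete, here is a minimal Python sketch of the general approach described above. It is not finddupe's actual code (finddupe is written in C); the function names, the use of SHA-256 for the prefix, and including the file size in the candidate key are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict

PREFIX_BYTES = 32 * 1024  # the 32KB prefix mentioned above
CHUNK = 64 * 1024         # read size for the full comparison (arbitrary)


def prefix_key(path):
    """Cheap candidate key: file size plus a hash of the first 32KB.

    Assumption: the size and SHA-256 are illustrative choices only.
    """
    with open(path, "rb") as f:
        head = f.read(PREFIX_BYTES)
    return os.path.getsize(path), hashlib.sha256(head).hexdigest()


def same_bytes(a, b):
    """Compare two files byte for byte, reading both to the end."""
    with open(a, "rb") as fa, open(b, "rb") as fb:
        while True:
            ca, cb = fa.read(CHUNK), fb.read(CHUNK)
            if ca != cb:
                return False
            if not ca:  # both files ended at the same point
                return True


def find_duplicates(paths):
    """Return (duplicate, original) pairs confirmed by a full comparison."""
    candidates = defaultdict(list)
    for p in paths:
        candidates[prefix_key(p)].append(p)

    pairs = []
    for group in candidates.values():
        originals = []
        for p in group:
            match = next((o for o in originals if same_bytes(p, o)), None)
            if match is None:
                originals.append(p)
            else:
                pairs.append((p, match))
    return pairs
```

Note that in a scheme like this, two empty files always compare as identical regardless of their names or extensions, which is why the question about empty files matters.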
Hi @jeremitu,
Thanks for the response. I understand finddupe reads the first 32KB, but we were having the issue where it was deleting duplicate files that were entirely different from each other. Here is one example from the thread I linked:
Deleted duplicate
Duplicate: 'C:\Users\splin\Downloads\Formacao do Brasil Contemporane - Caio Prado Jr.pdf'
With: 'C:\Users\splin\Downloads\StataCorp Stata 14.2 (Revision May 4, 2017)\utilities\java\windows-i586\jre1.8.0_121\lib\security\trusted.libraries'"
I don't think these files would be empty? One is a PDF and the other is the trusted.libraries text file.
Hi @vocatus,
your understanding is not backed by enough evidence. Please refer to the function EliminateDuplicate(), which reads and compares the entire files. A bug is always possible, but you have not verified that the contents of the files you mentioned were actually different. You do not even know whether they were empty or not.
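One way to gather that evidence, assuming both files are still available (for example from a backup), is a small independent check of their sizes and contents. The sketch below uses Python's standard filecmp module; the helper name describe_pair is made up for this example.

```python
import filecmp
import os


def describe_pair(path_a, path_b):
    """Report sizes, emptiness, and byte-for-byte equality of two files."""
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    return {
        "size_a": size_a,
        "size_b": size_b,
        "both_empty": size_a == 0 and size_b == 0,
        # shallow=False forces a comparison of the actual contents
        "identical": filecmp.cmp(path_a, path_b, shallow=False),
    }
```

If "identical" comes back False for a pair that finddupe reported as duplicates, that would be a solid basis for a bug report.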
I created some test files:
bla.pdf
trusted.txt.
test1.txt
test1 - Copy.txt
test1 - different.txt
Please go ahead and check with finddupe.
Perhaps it would make sense to add an option like -sameext to also compare by extension. Feel free to add it; it is FOSS.
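As a rough idea of what such an option could do, the Python sketch below partitions files by extension so that only files sharing an extension would ever be compared with each other. The -sameext flag is only a proposal in this thread and does not exist in finddupe; the function name and the case-insensitive matching are assumptions.

```python
import os
from collections import defaultdict


def group_by_extension(paths):
    """Partition paths by lowercased extension (hypothetical -sameext step)."""
    groups = defaultdict(list)
    for p in paths:
        ext = os.path.splitext(p)[1].lower()  # "" when there is no extension
        groups[ext].append(p)
    return dict(groups)
```

Each resulting group would then go through the usual content comparison on its own, so a .pdf could never be reported as a duplicate of a .libraries file even if both were empty.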
Hi, I run the Tron project, and the project has used your excellent finddupe.exe port for a few years now to clean up duplicate files in users' download folders. However, we recently got a user report that the tool is incorrectly marking files as duplicates, even ones that don't have the same file extension.
See the post with details here.
Would it be possible to add a switch to instruct finddupe.exe to read the ENTIRE file for hash computation instead of just the first 32KB?
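For reference, hashing a whole file rather than a prefix could look roughly like the Python sketch below. This only illustrates the request and is not an existing finddupe switch; the function name, the choice of SHA-256, and the chunk size are assumptions.

```python
import hashlib


def full_file_digest(path, chunk_size=1024 * 1024):
    """Hash the entire file in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with different digests are certainly different. Files with equal digests are almost certainly identical, although a byte-for-byte comparison is the only check that rules out a collision entirely.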