pauldreik / rdfind

find duplicate files utility

Is rdfind safe for sha1 (or other) checksum collisions? #126

Open slavanap opened 1 year ago

slavanap commented 1 year ago

I've attempted to read the code and haven't found a part that reads and compares two full files. So duplicate detection seems to be based only on checksums (similar to ZFS dedup=on), not on contents (ZFS dedup=verify). Is that correct?
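
For clarity, here's a toy sketch of what I mean by "checksum only" detection (not rdfind's actual code; `checksumOf` is a stand-in for a real digest such as SHA-1 or SHA-256). This is the policy where a hash collision produces a false positive:

```cpp
// Toy illustration of checksum-only duplicate detection (ZFS dedup=on):
// files with equal checksums are declared duplicates without their
// bytes ever being compared, so a hash collision is a false positive.
#include <fstream>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for a real cryptographic digest (SHA-1, SHA-256, ...).
static std::string checksumOf(const std::string& path) {
  std::ifstream f(path, std::ios::binary);
  std::ostringstream contents;
  contents << f.rdbuf();
  return std::to_string(std::hash<std::string>{}(contents.str()));
}

// Groups paths by checksum; every group of two or more is reported
// as a set of "duplicates" purely on the word of the hash.
static std::vector<std::vector<std::string>>
findDuplicatesChecksumOnly(const std::vector<std::string>& paths) {
  std::map<std::string, std::vector<std::string>> byDigest;
  for (const auto& p : paths) byDigest[checksumOf(p)].push_back(p);

  std::vector<std::vector<std::string>> groups;
  for (auto& entry : byDigest)
    if (entry.second.size() > 1) groups.push_back(std::move(entry.second));
  return groups;
}
```

A dedup=verify-style tool would additionally compare the bytes within each group before reporting it.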

fire-eggs commented 1 year ago

That is my interpretation as well. Admittedly, SHA-256 is fairly collision resistant.

slavanap commented 1 year ago

Then I'd suggest pointing this out somewhere in the documentation (as the ZFS documentation does).

GerHobbelt commented 1 year ago

May I suggest BLAKE3 as a (very fast and collision-resistant) alternative to SHA256?

Anyway, SHA-1 is known to be broken, and real-world collision examples exist: https://shattered.io/ provides two different PDFs that hash to the same SHA-1 digest.

There's also https://github.com/corkami/collisions to consider, so I'd say MD5 is definitely out.
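
For reference, streaming a file through the official BLAKE3 C API (https://github.com/BLAKE3-team/BLAKE3) is straightforward; a sketch, assuming the program is linked against libblake3:

```cpp
// Sketch: incremental BLAKE3 hash of a file using the official C API.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

#include "blake3.h"

static std::string blake3HexOfFile(const std::string& path) {
  blake3_hasher hasher;
  blake3_hasher_init(&hasher);

  // Feed the file through in fixed-size chunks: constant memory use
  // regardless of file size.
  std::ifstream f(path, std::ios::binary);
  char buf[65536];
  while (f.read(buf, sizeof(buf)) || f.gcount() > 0) {
    blake3_hasher_update(&hasher, buf, static_cast<size_t>(f.gcount()));
  }

  uint8_t digest[BLAKE3_OUT_LEN];  // 32 bytes by default
  blake3_hasher_finalize(&hasher, digest, BLAKE3_OUT_LEN);

  // Hex-encode the digest for display/comparison.
  std::string hex;
  for (uint8_t b : digest) {
    char pair[3];
    std::snprintf(pair, sizeof(pair), "%02x", b);
    hex += pair;
  }
  return hex;
}
```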

GerHobbelt commented 1 year ago

BTW: I have collected some of these colliding PDF files as part of a large, slowly growing test corpus for another application, and I'd love it if rdfind were safe by default when run over such a set.

:thinking: Ultimately, "safe" would mean settling for an additional final verification round where file contents are compared byte-for-byte, since you can never be absolutely sure with a (cryptographically secure) hash. Ah well, performance be damned.
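
A minimal sketch of that verification round, assuming it only runs on pairs that already match on size and checksum (so the extra I/O is confined to suspected duplicates):

```cpp
// Sketch: byte-for-byte comparison of two candidate duplicate files,
// reading both in fixed-size blocks.
#include <algorithm>
#include <array>
#include <fstream>
#include <string>

static bool sameContents(const std::string& a, const std::string& b) {
  std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
  if (!fa || !fb) return false;  // unreadable: refuse to call it a duplicate

  std::array<char, 65536> bufA, bufB;
  for (;;) {
    fa.read(bufA.data(), bufA.size());
    fb.read(bufB.data(), bufB.size());
    const std::streamsize na = fa.gcount();
    const std::streamsize nb = fb.gcount();
    if (na != nb) return false;  // lengths differ
    if (na == 0) return true;    // both exhausted without a mismatch
    if (!std::equal(bufA.begin(), bufA.begin() + na, bufB.begin()))
      return false;              // first differing block
  }
}
```

With size and checksum prefiltering, this comparison almost always reads files that really are identical, so the verification pass roughly doubles the read cost for actual duplicates rather than for the whole tree.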

Surely SHA-256 has been tested and investigated quite thoroughly (BLAKE3 to a lesser extent), so the chances of a collision with SHA-256 are astronomically small, but then my paranoid brain thinks of Sir Terry Pratchett (✝RIP): a million-to-one chance succeeds nine times out of ten, so better safe than sorry for the paranoid = file content comparison. :smile: