Open slavanap opened 1 year ago
That is my interpretation as well. Admittedly, SHA-256 is fairly collision resistant.
Then I'd suggest pointing this out somewhere in the documentation (as the ZFS documentation does).
May I suggest BLAKE3 as a (very fast and collision-resistant) alternative to SHA256?
Anyway, SHA-1 is known to be broken, and real-world collision examples exist, such as two different PDFs with the same SHA-1 hash: https://shattered.io/
There's also https://github.com/corkami/collisions to consider, so I'd say MD5 is definitely out.
BTW: I have collected some of these colliding PDF files as part of a large, slowly growing test corpus for another application, and I'd love for rdfind's defaults to be safe when running over such a set.
:thinking: Ultimately, "safe" would mean we'd have to settle for an additional final verification round where the file contents are compared byte for byte, since you can never be absolutely certain with a (cryptographically secure) hash. Ah well, performance be damned.
Surely SHA-256 has been tested and investigated quite thoroughly (BLAKE3 to a lesser extent), so the chances of a collision happening with SHA-256 are astronomically thin. But then my paranoid brain thinks of Sir Pratchett (✝RIP): a million-to-one chance succeeds nine times out of ten, so better safe than sorry; for the paranoid, that means file content comparison. :smile:
I've attempted to read the code and haven't found a part that reads and compares two full files. So it seems the determination of duplicates is done only based on checksums (similarly to ZFS dedup=on), not on contents (ZFS dedup=verify). Is that correct?