pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

Why hash whole files? #266

Open matthiasgoergens opened 4 months ago

matthiasgoergens commented 4 months ago

Thanks for providing this useful tool!

The README says:

> Note that there is no byte-by-byte comparison of files anywhere. All available hash functions are at least 128-bit wide, and you don't need to worry about hash collisions. At 10^15 files, the probability of collision is 0.000000001 when using a 128-bit hash, without taking into account the requirement for the files to also match by size.
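(As an aside, that figure is consistent with the standard birthday-bound approximation p ≈ n * (n - 1) / 2 / 2^128 for n files and a 128-bit hash. A quick sanity check — my own sketch, not from the README:)

```rust
fn main() {
    // Birthday-bound approximation for the probability that at least two
    // of n uniformly random 128-bit hashes collide: p ≈ n * (n - 1) / 2 / 2^128.
    let n: f64 = 1e15; // the 10^15 files from the README
    let p = n * (n - 1.0) / 2.0 / 2f64.powi(128);
    println!("collision probability ≈ {p:e}"); // prints ≈ 1.5e-9
}
```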

When you have a group of, say, N files of uniform size S bytes each, then you need to read N * S bytes to compute all the hashes. In contrast, to compare them byte-by-byte for equality, you need to read N * S bytes.

You might notice that both values are the same.

To make the byte-by-byte comparison work, you need to be somewhat careful about the order in which you read: I assume you want to avoid having the whole N * S bytes in memory at the same time. On the one hand, that's a complication. On the other hand, you can simplify because you no longer rely on the hash for correctness, only for speed.

Here's a sketch:

If N is less than outrageously large, you can (conceptually) divide the files into chunks, 'zip' up the chunks, and then compare all the first chunks of all files, then all the second chunks of all files, etc. (A smaller chunk size needs less memory; a bigger chunk size might be more efficient for reading. But the benefits of bigger chunks diminish quickly, even for spinning rust.)

If N * chunk size is still too large, you can further split and compare some number M << N of chunks at a time, and use divide and conquer. (I can elaborate, if you are interested.)
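To make the zipped-chunk idea concrete, here is a rough sketch — my own, not fclones code; `group_identical`, `read_full`, and `CHUNK_SIZE` are made-up names — assuming all files in the group already have the same size:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use std::path::PathBuf;

// Hypothetical chunk size; the trade-off is the one described above.
const CHUNK_SIZE: usize = 64 * 1024;

/// Splits a group of same-sized candidate files into sub-groups of
/// byte-identical files, reading one chunk per file at a time so that
/// at most roughly N * CHUNK_SIZE bytes are in memory at once.
fn group_identical(paths: Vec<PathBuf>) -> io::Result<Vec<Vec<PathBuf>>> {
    // Start with a single group containing every candidate file.
    let files = paths
        .into_iter()
        .map(|p| File::open(&p).map(|f| (p, f)))
        .collect::<io::Result<Vec<_>>>()?;
    let mut groups = vec![files];

    loop {
        let mut next = Vec::new();
        let mut at_eof = true;
        for group in groups {
            // Read the next chunk of every file in the group and
            // partition the group by the chunk's contents.
            let mut by_chunk: HashMap<Vec<u8>, Vec<(PathBuf, File)>> = HashMap::new();
            for (path, mut file) in group {
                let mut buf = vec![0u8; CHUNK_SIZE];
                let n = read_full(&mut file, &mut buf)?;
                buf.truncate(n);
                if n == CHUNK_SIZE {
                    at_eof = false; // there may be more data to compare
                }
                by_chunk.entry(buf).or_default().push((path, file));
            }
            // A file that is alone in its partition has no duplicate here.
            next.extend(by_chunk.into_values().filter(|g| g.len() > 1));
        }
        groups = next;
        if at_eof || groups.is_empty() {
            // Every surviving group matched on every chunk, including the last.
            return Ok(groups
                .into_iter()
                .map(|g| g.into_iter().map(|(p, _)| p).collect())
                .collect());
        }
    }
}

/// Reads until `buf` is full or EOF; returns the number of bytes read.
fn read_full(file: &mut File, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        let n = file.read(&mut buf[total..])?;
        if n == 0 {
            break;
        }
        total += n;
    }
    Ok(total)
}
```

Note that the HashMap here only groups chunks that are byte-identical (it falls back to a full key comparison on collision), so its internal hash affects speed, not correctness — which is the point above.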

Does this seem reasonable? Did I miss anything?

kapitainsky commented 4 months ago

The cost disadvantage of byte-by-byte comparison is not about the number of bytes the program has to read (that is indeed the same as with hash calculation) but about the cost of comparing N arrays (files) of S bytes versus calculating hashes. The latter scales linearly with the number of files, while the former's computational cost scales exponentially.

matthiasgoergens commented 4 months ago

How do you make it scale exponentially?

Even the most naive version I can come up with would only scale quadratically (comparing every file against every other is N * (N - 1) / 2 comparisons).

And it's fairly easy to come up with a linear version. I can elaborate on that, if you are interested.