matthiasgoergens opened 4 months ago
The byte-by-byte comparison's cost disadvantage is not about the number of bytes the program has to read (that is indeed the same as for hash calculation), but about the cost of comparing N arrays (files) of S bytes each versus the cost of computing hashes. The latter scales linearly with the number of files, while the former's computational cost scales exponentially.
How do you make it scale exponentially?
Even the most naive version I can come up with would only scale quadratically.
And it's fairly easy to come up with a linear version. I can elaborate on that, if you are interested.
Thanks for providing this useful tool!
The Readme writes:
When you have a group of, say, N files of uniform size S bytes each, then you need to read N * S bytes to compute all the hashes. In contrast, to compare them byte-by-byte for equality, you also need to read N * S bytes.
You might notice that both values are the same.
To make the byte-by-byte comparison work you need to be somewhat careful about the order you read in: I assume you want to avoid having the whole N * S bytes in memory at the same time. On the one hand, that's a complication. On the other hand, you can simplify because you no longer rely on the hash for correctness, only for speed.
Here's a sketch:
If N is less than outrageously large, you can (conceptually) divide the files into chunks, 'zip' up the chunks, and then compare all the first chunks of all files, then all the second chunks of all files, etc. (A smaller chunk size needs less memory; a bigger chunk size might be more efficient for reading. But the benefits of bigger chunks diminish quickly, even for spinning rust.)
If N * chunk size is still too large, you can split further and compare some number M << N of chunks at the same time, using divide and conquer. (I can elaborate, if you are interested.)
Does this seem reasonable? Did I miss anything?