pkolaczk / fclones

Efficient Duplicate File Finder

Sort file chunks instead of hashing #256

Closed · verhovsky closed 5 months ago

verhovsky commented 5 months ago

My suggestion is to try a different algorithm. After you've grouped the files by size (step 3), instead of hashing the entire file in each group, read all the files in a group CHUNK_SIZE bytes at a time into an array and then essentially sort that array. If all the chunks are the same, that's just a linear pass of byte comparisons. If they're not the same, you split into new groups at that very chunk. If one of the groups has only one file after this sorting + re-grouping step, you don't need to process that file further since it has no duplicates.

In fact you don't need a full sort; you just need chunks with identical bytes to cluster together, which is computationally equivalent to sorting.

CHUNK_SIZE would have to be set dynamically depending on the number of files in the group and the amount of available memory. If there are too many files in a group then I guess you have to fall back to hashing each file or you'll run out of memory. A rough sketch of what I mean is below.
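
Roughly something like this (just a sketch to show the shape of the idea, not based on fclones' actual code; the names, the fixed CHUNK_SIZE and re-opening every file on each pass are all simplifications I'm making up here):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::PathBuf;

const CHUNK_SIZE: usize = 64 * 1024; // would really be derived from group size and available memory

/// Split one size-group into sub-groups with identical contents, comparing one
/// chunk at a time and dropping any sub-group that shrinks to a single file.
fn split_group(paths: Vec<PathBuf>, file_len: u64) -> std::io::Result<Vec<Vec<PathBuf>>> {
    let mut groups = vec![paths];
    let mut offset = 0u64;
    while offset < file_len && !groups.is_empty() {
        let mut next: Vec<Vec<PathBuf>> = Vec::new();
        for group in groups {
            // Bucket files by the raw bytes of their chunk at `offset`.
            // A hash map does the "clustering"; sorting the chunks would work too.
            let mut buckets: HashMap<Vec<u8>, Vec<PathBuf>> = HashMap::new();
            for path in group {
                let mut f = File::open(&path)?;
                f.seek(SeekFrom::Start(offset))?;
                let mut chunk = Vec::with_capacity(CHUNK_SIZE);
                f.take(CHUNK_SIZE as u64).read_to_end(&mut chunk)?;
                buckets.entry(chunk).or_default().push(path);
            }
            // A sub-group with a single member has no duplicates: stop reading it.
            next.extend(buckets.into_values().filter(|g| g.len() > 1));
        }
        groups = next;
        offset += CHUNK_SIZE as u64;
    }
    Ok(groups)
}
```

Keeping all the chunks of a group in memory at once is what ties CHUNK_SIZE to the group size; a real version would also keep file handles open and order the reads for the disk instead of seeking per file on every pass.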

Why I think this will be faster:

Why it might not be faster:

Finally, what we're really trying to do is find a chunk that proves the files are different in as few file reads as possible. You could have some sort of tiny machine learning model (or just check the first chunk, the last chunk and a bunch of random chunks) that looks at the first chunk of the file and the shared file size (and possibly the file name?) and generates 3-10 guesses about which chunk might contain a difference. If all of the guesses fail to find different data, you're already statistically pretty sure it's a duplicated file. You could even take advantage of that and add a flag called something like --guess or --unsafe that returns after this step, for people whose data isn't really important, but that's probably a bad idea. You obviously still have to check the duplicate files in their entirety (by sorting each file chunk, as I described above) if you want 100% certainty. A sketch of the probing idea is below.
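
For instance (again just a sketch with made-up names and constants, and only for a pair of files; a real version would probe every file in the group and pick the offsets more cleverly):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
use std::path::Path;

const PROBE_SIZE: usize = 4096; // made-up probe size

fn read_probe(path: &Path, offset: u64) -> std::io::Result<Vec<u8>> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(offset))?;
    let mut buf = Vec::with_capacity(PROBE_SIZE);
    f.take(PROBE_SIZE as u64).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Returns Ok(false) as soon as one probed offset differs. Ok(true) only means
/// "statistically probably duplicates"; a full comparison is still needed for certainty.
fn probably_equal(a: &Path, b: &Path, len: u64) -> std::io::Result<bool> {
    // First chunk, last chunk and a few arbitrary offsets stand in for the
    // "guesses" that a smarter model (or an RNG) would produce.
    let offsets = [
        0,
        len.saturating_sub(PROBE_SIZE as u64),
        len / 3,
        len / 2,
        len / 7 * 5,
    ];
    for off in offsets {
        if read_probe(a, off)? != read_probe(b, off)? {
            return Ok(false);
        }
    }
    Ok(true)
}
```

A flag like --guess would stop after this; everyone else would still get the full chunk-by-chunk comparison.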

I haven't read any of your code or used your tool (I just use fdupes), so apologies if you already do early stopping.

verhovsky commented 5 months ago

Actually, reading https://news.ycombinator.com/item?id=28360866 and https://pkolaczk.github.io/disk-access-ordering/, you clearly know about this idea and have thought about it more than I have. Still, I could imagine a system that does both efficient sequential reading and early stopping, if you checkpoint the hash state and periodically compare the hashes. You'd keep track that at file chunk 23, say, the hash state of each file in the group was <whatever>, and whenever all files in the group have been read up to that chunk you could compare the hashes so far and prune the tree / stop reading files that are unique. That means extra comparisons for the case where the files really are duplicates, but it would be worth it for large files.
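
Something along these lines (just a sketch, not how fclones is actually structured; I'm using the blake3 crate here only because its Hasher can hand out an intermediate digest without being consumed, and the chunk/checkpoint sizes are made up):

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::Read;
use std::path::PathBuf;

const CHUNK_SIZE: usize = 64 * 1024; // made-up
const CHECKPOINT_CHUNKS: usize = 16; // compare hash states every 16 chunks (made-up)

/// Read every file in a size-group sequentially, but after each checkpoint drop
/// files whose intermediate digest is unique (they can't have a duplicate).
/// Returns the paths that still match after the whole file has been hashed.
fn prune_by_checkpoints(paths: Vec<PathBuf>) -> std::io::Result<Vec<PathBuf>> {
    let mut states: Vec<(PathBuf, File, blake3::Hasher, bool)> = paths
        .into_iter()
        .map(|p| File::open(&p).map(|f| (p, f, blake3::Hasher::new(), false)))
        .collect::<Result<_, _>>()?;
    let mut buf = vec![0u8; CHUNK_SIZE];
    loop {
        // Advance every remaining file by one checkpoint's worth of chunks.
        let mut all_done = true;
        for (_, file, hasher, done) in states.iter_mut() {
            for _ in 0..CHECKPOINT_CHUNKS {
                let n = file.read(&mut buf)?;
                if n == 0 {
                    *done = true;
                    break;
                }
                hasher.update(&buf[..n]);
            }
            all_done &= *done;
        }
        // Count how many files share each intermediate digest; a file whose
        // digest is already unique is pruned and never read again.
        let mut counts: HashMap<[u8; 32], usize> = HashMap::new();
        for (_, _, hasher, _) in &states {
            *counts.entry(*hasher.finalize().as_bytes()).or_insert(0) += 1;
        }
        states.retain(|(_, _, hasher, _)| counts[hasher.finalize().as_bytes()] > 1);
        if all_done || states.is_empty() {
            break;
        }
    }
    Ok(states.into_iter().map(|(p, _, _, _)| p).collect())
}
```

The per-checkpoint finalize and comparison is the "more unnecessary comparisons" part when the files really are duplicates, but it should be cheap next to the reads for big files.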