pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

Performance gains from even more small hash tests? #230

Open clunkclunk opened 9 months ago

clunkclunk commented 9 months ago

I really dig fclones - thank you for writing it!

This is purely a thought and question right now. I've not actually tested it on any real data, but it came to me when I was looking for some duplicates in a data set that contained a lot of relatively large files (10 to 80 GB - usually video files or backup images).

The algorithm's steps 4 and 5 calculate a hash of a tiny block of data at the start and at the end of the file, respectively, and prune the results based on those hashes. Given the description of 'tiny', I assume these reads are very short and the calculation very fast. The remaining files then get a hash of the entire file contents, which can take some time for large files; I assume it's fairly I/O bound on most systems, especially on spinning disks.

Why not hash a few more tiny blocks at various points in the file, beyond the first and last ones, to determine uniqueness before doing the time-consuming whole-file hash? We're already hashing the start and the end of the file; what kind of overhead would it add to check a block at every 10% increment of the file, compare those nine hashes (10%, 20%, and so on up to 90%) to filter the list further, and only then proceed with the full-file hash if necessary?

(side note - how big are the 'tiny blocks' referenced in the algorithm?)

Also, in some situations, matching hashes at 11 points (start, end, and nine points at 10% intervals) may be enough confidence to optionally skip the whole-file hash check. I know a good portion of my duplicate files are non-critical, mostly duplicated by simple user mistakes copying things to multiple places, and eleven matching checks would be enough confidence for me to call them a match.
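
For concreteness, here is a rough sketch of what the extra sampling could look like. This is not fclones' actual code; the 4 KiB block size, the fixed 10% sample points, and the use of DefaultHasher are just illustrative assumptions.

    use std::collections::hash_map::DefaultHasher;
    use std::fs::File;
    use std::hash::Hasher;
    use std::io::{Read, Seek, SeekFrom};
    use std::path::Path;

    // Illustrative assumptions: a 4 KiB block and fixed 10% sample points.
    const BLOCK_SIZE: usize = 4096;

    /// Hash a small block at 10%, 20%, ..., 90% of the file so that same-size
    /// candidates can be compared on these samples before committing to a
    /// full-content hash.
    fn sampled_hash(path: &Path) -> std::io::Result<u64> {
        let mut file = File::open(path)?;
        let len = file.metadata()?.len();
        let mut hasher = DefaultHasher::new();
        let mut buf = vec![0u8; BLOCK_SIZE];
        for i in 1..=9u64 {
            let offset = len / 10 * i;            // fixed fractional offsets
            file.seek(SeekFrom::Start(offset))?;  // one extra seek per sample
            let n = file.read(&mut buf)?;
            hasher.write(&buf[..n]);
        }
        Ok(hasher.finish())
    }

Each extra sample costs an additional seek per candidate file, which is presumably where most of the overhead would come from on spinning disks.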

kapitainsky commented 9 months ago

(side note - how big are the 'tiny blocks' referenced in the algorithm?)

They are also configurable:

      --max-prefix-size <BYTES>
          Maximum prefix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs

      --max-suffix-size <BYTES>
          Maximum suffix size to check in bytes
          Units like KB, KiB, MB, MiB, GB, GiB are supported.
          Default: 16KiB for hard drives, 4KiB for SSDs
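
For example, you could widen both checks with something like the following (the path and the sizes are just illustrative):

    fclones group /data --max-prefix-size 1MiB --max-suffix-size 1MiB
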
pkolaczk commented 8 months ago

Why not hash a few more tiny blocks at various points in the file, beyond the first and last ones, to determine uniqueness before doing the time-consuming whole-file hash?

Can you give me an example of a situation where files of the same size would match at the beginning and the end, but not in the middle? Even matching the ends turns out to filter out very few files. I believe such situations would be extremely rare, and that wouldn't justify the added complexity and cost.

gcflymoto commented 8 months ago

@clunkclunk the algorithm you describe is implemented by https://github.com/kornelski/dupe-krill

johnpyp commented 8 months ago

For reference, here is how I typically speed up dedupes:

fclones group --cache /my/media/folder --min 500MiB --max-prefix-size 128MiB --max-suffix-size 64MiB --skip-content-hash 

I set a large prefix size and suffix size and skip the content hash altogether. I also skip small files, where there's a higher false negative likelihood and less benefit from this approach. I've processed many thousands of media files and seen zero false negatives in cursory manual investigation.


For what it's worth, I do see value in random / "statistical" checks, though as mentioned it would add a lot of complexity. What I like about it is that it gives you a quantifiable statistical confidence instead of just the beginning and the end. You get to set a threshold for how comfortable you are that is mostly independent of the kind of file, rather than relying on the assumption that the middle chunks haven't been tampered with.
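
(As a rough way to quantify that confidence, under the simplifying assumption that the sampled blocks land independently and that a fraction p of the two files' bytes differ: the chance that all k samples miss every differing region is roughly (1 - p)^k. With k = 11 samples that is about 0.90 when only 1% of the file differs, but about 0.09 when 20% differs, so the confidence you actually get depends heavily on how much of the file differs.)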

bmfrosty commented 8 months ago

I can't say what files they are (I can't seem to find a verbose option), but I see this when scanning a very large directory:

[2023-10-27 13:54:46.198] fclones:  info: Found 2303 (276.5 GB) candidates after grouping by prefix
[2023-10-27 13:54:53.994] fclones:  info: Found 2298 (274.9 GB) candidates after grouping by suffix
[2023-10-27 14:50:36.020] fclones:  info: Found 2186 (271.0 GB) redundant files

These should all be zip and rar files.

I should open a ticket asking for a verbose option to show what gets eliminated between steps 4 and 5, but I definitely have files with the same size, prefix, and suffix that don't have the same content - or maybe I should just update to the latest version first.

Another option would be to match random chunks covering a configurable percentage of the file size, using a seed derived from the file size to pick which chunks.
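
To sketch that idea (the names and constants here are assumptions, not anything fclones implements), the offsets could be derived deterministically from the file length, so every file of the same size is sampled at the same positions and the sample hashes are directly comparable:

    /// Pick `count` block offsets pseudo-randomly but deterministically from
    /// the file length, so all files of the same size get sampled at the same
    /// positions. (splitmix64 is used here only as a small, dependency-free PRNG.)
    fn sample_offsets(file_len: u64, block_size: u64, count: usize) -> Vec<u64> {
        let mut state = file_len;
        let mut next = move || {
            // splitmix64 step: advance and mix the 64-bit state.
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            z ^ (z >> 31)
        };
        let max_start = file_len.saturating_sub(block_size);
        let mut offsets: Vec<u64> = (0..count).map(|_| next() % (max_start + 1)).collect();
        offsets.sort_unstable(); // read in ascending order to limit seeking
        offsets
    }

A percentage-based sample count could then be derived from file_len / block_size, which would cover the "configurable percentage of the file size" part.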