pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

Feat: --skip-content-hash, --max-prefix-size, --max-suffix-size options #202

Closed johnpyp closed 1 year ago

johnpyp commented 1 year ago

Partially fixes: https://github.com/pkolaczk/fclones/issues/201

(Completely open to changes in the naming/wording/API/etc.)

Changes

Potential Follow-up

Random chunk checks:

--random-chunk-checks=5 --random-chunk-size=16MiB

Though prefix and suffix checks are a great pre-filtering step, they of course cover exactly the parts of a file that are most likely to be identical across files that are not true duplicates (e.g. shared format headers and footers). However, there are still cases where fully hashing a file would take prohibitively long or simply be too expensive.

Instead of a full hash, we could use the file's byte size as a seed to deterministically select n random chunks to read, and group files on those chunks in the same fashion as the prefix and suffix checks. This should make it very unlikely that non-duplicate files end up grouped as duplicates, while still being orders of magnitude faster than full content hashing. It also has the nice side effect of being a continuous tuning lever for finding a balance between safety and speed.
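
To make the idea a bit more concrete, here is a rough Rust sketch of seeded chunk selection, assuming the file length is used as the seed as described above. The function name, signature, and the inline splitmix64 PRNG are illustrative only; none of this is part of this PR or the existing fclones code:

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Tiny splitmix64 PRNG so chunk positions are deterministic for a
/// given seed (avoids pulling in an external crate for the sketch).
fn splitmix64(state: &mut u64) -> u64 {
    *state = (*state).wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Read `checks` chunks of up to `chunk_size` bytes at offsets derived
/// deterministically from the file length, and return them concatenated.
/// Hashing this buffer (with whatever hash fclones already uses) would
/// group files the same way the prefix/suffix steps do.
fn read_random_chunks(path: &str, len: u64, checks: u32, chunk_size: u64) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let mut out = Vec::new();
    // Seed with the file length so every file of the same size reads the
    // same offsets -- chunks from equal-size files stay directly comparable.
    let mut state = len;
    for _ in 0..checks {
        let max_start = len.saturating_sub(chunk_size);
        let start = splitmix64(&mut state) % (max_start + 1);
        file.seek(SeekFrom::Start(start))?;
        let mut buf = vec![0u8; chunk_size.min(len - start) as usize];
        file.read_exact(&mut buf)?;
        out.extend_from_slice(&buf);
    }
    Ok(out)
}
```

Because two files of the same length derive identical offsets, the chunks read from genuine duplicates always line up, so the result can be fed into the same grouping logic as the prefix and suffix hashes.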