(Completely open to changes in the naming/wording/API/etc.)
Changes
Upgrade deps
Add --max-prefix-size and --max-suffix-size options
These options set the maximum prefix and suffix sizes read during the prescan, reducing the chance of false duplicates surviving to the full hash scan.
Add --skip-content-hash option
Skips the final-stage content hash and returns the result right after the suffix stage (not implemented for --transform).
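A possible invocation combining the new options (the size-suffix syntax and exact flag spelling are assumptions based on this PR, not confirmed against the released CLI):

```shell
# Read at most 1 MiB from each end of a file during the prescan,
# and skip the final full-content hash entirely.
fclones group . --max-prefix-size 1MiB --max-suffix-size 1MiB --skip-content-hash
```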
Potential Follow-up
Random chunk checks:
--random-chunk-checks=5 --random-chunk-size=16MiB
Though prefix and suffix checks are a great pre-filtering step, they cover exactly the parts of a file most likely to be identical across different files. And there are still cases where fully hashing a file would take prohibitively long or be too expensive.
Instead of a full hash, we could use the file's byte size as a seed to deterministically select n random chunks to read, and group on them in the same fashion as the prefix and suffix checks. Doing this should make false matches very unlikely while still being orders of magnitude faster than full content hashing. It also has the nice side effect of being a continuous tuning lever for trading off safety against speed.
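A minimal sketch of the chunk-selection idea (all names here are hypothetical, not existing fclones APIs): seeding a small PRNG with the file's byte size means files of equal size are always sampled at the same offsets, so their chunk hashes are directly comparable.

```rust
// Hypothetical sketch: pick `n` chunk offsets deterministically from the
// file's byte size, so equal-sized candidates are sampled at identical
// positions. Uses splitmix64 so no external crates are needed.
fn chunk_offsets(file_len: u64, n: u64, chunk_size: u64) -> Vec<u64> {
    // splitmix64 PRNG seeded by the file length.
    let mut state = file_len;
    let mut next = move || {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    };
    if file_len <= chunk_size {
        return vec![0]; // small file: a single chunk covers it all
    }
    let max_start = file_len - chunk_size;
    let mut offsets: Vec<u64> = (0..n).map(|_| next() % (max_start + 1)).collect();
    offsets.sort_unstable(); // read in ascending order to keep I/O sequential
    offsets.dedup();
    offsets
}

fn main() {
    // Same file size => same offsets, so candidate files of equal length
    // get compared at identical positions.
    let a = chunk_offsets(1 << 30, 5, 16 << 20);
    let b = chunk_offsets(1 << 30, 5, 16 << 20);
    assert_eq!(a, b);
    println!("{:?}", a);
}
```

The actual reads would then hash `chunk_size` bytes at each offset and group files by the resulting digests, exactly as the prefix/suffix stages do today.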
Partially fixes: https://github.com/pkolaczk/fclones/issues/201