donmor opened 3 months ago
The speed of a hash function is rarely an issue. The tool is IO bound most of the time. Have you done any benchmarking?
I'll do it later.
Made some modifications to #148, making `--ignore=same-hash` much faster by calling `short_hashes_all` before fully hashing each file.
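The gist, as a minimal Python sketch (the sampling offsets, chunk sizes, and function bodies are illustrative assumptions, not the actual mergerfs.dedup code):

```python
import hashlib
import os

def short_hash_file(path, samples=4, chunk=4096):
    # Hash the size plus a few sampled chunks -- cheap, but only a heuristic.
    size = os.path.getsize(path)
    h = hashlib.md5(str(size).encode())
    with open(path, 'rb') as f:
        for i in range(samples):
            f.seek(size * i // samples)
            h.update(f.read(chunk))
    return h.hexdigest()

def short_hashes_all(paths):
    # One cheap pass over every candidate up front.
    return {p: short_hash_file(p) for p in paths}

def same_hash(paths):
    # Full (expensive) hashing only runs when the short hashes already agree.
    if len(set(short_hashes_all(paths).values())) > 1:
        return False
    full = set()
    for p in paths:
        h = hashlib.md5()
        with open(p, 'rb') as f:
            for block in iter(lambda: f.read(1 << 20), b''):
                h.update(block)
        full.add(h.hexdigest())
    return len(full) == 1
```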
Before:
```
$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real	0m14.265s
user	0m13.363s
sys	0m0.900s
```
After:
```
$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real	0m6.724s
user	0m6.286s
sys	0m0.432s
```
MD5 and SHA1 are considered cryptographically broken, so one may prefer SHA256, which is slower:
```
$ time mergerfs.dedup -v --ignore=same-hash --hash=sha256 /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real	0m16.079s
user	0m15.569s
sys	0m0.500s
```
Sometimes only a few bits of a file are corrupted, and such corruption can slip past the random sampling done by `short_hash_file`. A `--hash=crc32` pass can be specified before `--hash=sha256` as an acceleration.
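For example, a two-stage check could look like this (a sketch only; `crc32_file`, `sha256_file`, and `files_equal` are names made up for the example, assuming full, non-sampled passes):

```python
import hashlib
import zlib

def crc32_file(path):
    # Full (non-sampled) CRC32 pass: fast, and catches single-bit corruption.
    crc = 0
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            crc = zlib.crc32(block, crc)
    return crc

def sha256_file(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

def files_equal(a, b):
    # A differing CRC32 proves the files differ, so the slow hash is skipped.
    if crc32_file(a) != crc32_file(b):
        return False
    # CRC32 collisions are cheap to produce, so confirm with SHA256.
    return sha256_file(a) == sha256_file(b)
```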
Add an option `-H, --hashing-algorithm=` used along with `-i, --ignore`. That way we could pick a faster algorithm like CRC32, a safer one like SHA256, or run multiple algorithms in turn, skipping the later ones whenever an earlier one already shows a difference.
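A rough sketch of what the flag could do (the option name, the comma-separated chain syntax, and `same_hash_chain` are assumptions, not existing mergerfs.dedup behaviour):

```python
import argparse
import hashlib

parser = argparse.ArgumentParser()
parser.add_argument('-H', '--hashing-algorithm', default='md5',
                    help='comma-separated chain of hashes, fastest first')
args = parser.parse_args(['-H', 'md5,sha256'])

def full_hash(path, algo):
    # hashlib covers md5/sha1/sha256; crc32 would need zlib as above.
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1 << 20), b''):
            h.update(block)
    return h.digest()

def same_hash_chain(a, b, chain):
    # Run the algorithms in turn; all() stops at the first mismatch,
    # so the slower, later hashes never run when a fast one differs.
    return all(full_hash(a, algo) == full_hash(b, algo) for algo in chain)

# e.g. same_hash_chain('/tmp/A/1', '/tmp/B/1', args.hashing_algorithm.split(','))
```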