trapexit / mergerfs-tools

Optional tools to help manage data in a mergerfs pool
ISC License

Feature Request: Add more hashing algorithms to mergerfs.dedup #147

Open donmor opened 3 months ago

donmor commented 3 months ago

Add an option, -H, --hashing-algorithm=, to be used along with -i, --ignore. This would allow faster algorithms such as CRC32, safer ones such as SHA256, or several algorithms applied in turn (skipping the later ones if an earlier one already differs).
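
The "multiple algorithms in turn" idea could look roughly like the sketch below. This is only an illustration, not the code in mergerfs.dedup or in the linked implementation: the helper names file_digest and same_content are hypothetical, and CRC32 is taken from zlib because hashlib does not provide it.

```python
import hashlib
import zlib

CHUNK = 1 << 20  # read files in 1 MiB chunks


def file_digest(path, algo):
    """Hex digest of the whole file for the named algorithm (hypothetical helper)."""
    if algo == "crc32":
        crc = 0
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                crc = zlib.crc32(chunk, crc)  # running CRC over all chunks
        return format(crc & 0xFFFFFFFF, "08x")
    h = hashlib.new(algo)  # e.g. "md5", "sha1", "sha256"
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.hexdigest()


def same_content(path_a, path_b, algos=("crc32", "sha256")):
    """Compare two files with each algorithm in turn; stop at the first mismatch."""
    for algo in algos:
        if file_digest(path_a, algo) != file_digest(path_b, algo):
            return False  # later (slower) algorithms are skipped
    return True
```

With this ordering, files that differ are rejected after the cheap pass, and only files that agree on every earlier algorithm pay for the stronger one.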

donmor commented 3 months ago

#148 is an implementation.

trapexit commented 3 months ago

The speed of a hash function is rarely an issue. The tool is IO bound most of the time. Have you done any benchmarking?

donmor commented 3 months ago

I'll do it later.

donmor commented 2 months ago

Made some modifications to #148, making same-hash much faster by calling short_hashes_all before hashing each file.
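
A rough sketch of why a short-hash pass first helps, under stated assumptions: short_hash below is a hypothetical stand-in for the tool's sampled hash, not its actual code, and all_same_hash and full_hash are illustrative names. Sample offsets are derived from the file size so that equal-sized files are sampled at the same positions.

```python
import hashlib
import os
import random

BLOCK = 65536  # bytes read per sample


def short_hash(path, samples=8):
    """Hash a few sampled blocks of the file (hypothetical sampled hash)."""
    size = os.path.getsize(path)
    rng = random.Random(size)  # same size -> same offsets, so results are comparable
    last = max(size - BLOCK, 0)
    offsets = sorted(rng.randint(0, last) for _ in range(samples))
    h = hashlib.md5()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            h.update(f.read(BLOCK))
    return h.hexdigest()


def full_hash(path, algo="md5"):
    """Hash the entire file contents."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest()


def all_same_hash(paths, algo="md5"):
    """True only if every file matches; the short hashes of all files are checked first."""
    if len({os.path.getsize(p) for p in paths}) > 1:
        return False
    if len({short_hash(p) for p in paths}) > 1:
        return False  # cheap rejection: no full read needed
    return len({full_hash(p, algo) for p in paths}) == 1
```

The sampled pass can only reject candidates. A match on the samples still needs the full hash, since a difference outside the sampled blocks would go unnoticed.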

Before:

$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m14.265s
user    0m13.363s
sys     0m0.900s

After:

$ time mergerfs.dedup -v --ignore=same-hash /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m6.724s
user    0m6.286s
sys     0m0.432s

MD5 and SHA1 are considered unsafe, so SHA256 can be used instead (slower):

$ time mergerfs.dedup -v --ignore=same-hash --hash=sha256 /tmp/C
rm -vf /tmp/B/2
rm -vf /tmp/B/4
rm -vf /tmp/B/5
rm -vf /tmp/B/6
rm -vf /tmp/B/7
rm -vf /tmp/B/8
rm -vf /tmp/B/A
rm -vf /tmp/B/C
# Total savings: 2.6GB

real    0m16.079s
user    0m15.569s
sys     0m0.500s

Sometimes only a few bits in a file are corrupted, so the difference slips past the random sampling in short_hash_file. --hash=crc32 can be specified before --hash=sha256 to speed things up.