Closed — spock closed this 10 months ago
This is somewhat related to the (closed) #3
The Rust library's Python bindings do exist; they default to 1 thread and accept a max_threads parameter: https://pypi.org/project/blake3/
from blake3 import blake3
# Hash a large input using multiple threads. Note that this can be slower for
# inputs shorter than ~1 MB, and it's a good idea to benchmark it for your use
# case on your platform.
large_input = bytearray(1_000_000)
hash_single = blake3(large_input).digest()
hash_two = blake3(large_input, max_threads=2).digest()
hash_many = blake3(large_input, max_threads=blake3.AUTO).digest()
assert hash_single == hash_two == hash_many
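For chkbit's use case (hashing whole files) the same incremental update()/hexdigest() API applies. A minimal sketch of chunked file hashing, using the stdlib's hashlib.blake2b as a stand-in since blake3 is not in the standard library (the PyPI blake3 package exposes the same incremental API, so swapping it in is a one-line change):

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    # blake2b stands in for blake3 here; with the PyPI package installed,
    # replace hashlib.blake2b() with blake3.blake3() -- same API.
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Quick check: chunked hashing matches hashing everything at once.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(1_000_000))
    name = f.name
assert file_digest(name) == hashlib.blake2b(bytes(1_000_000)).hexdigest()
os.unlink(name)
print("ok")
```

The 1 MiB chunk size is an arbitrary illustrative choice, not chkbit's actual read size.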
You can take a look at the blake3 branch. I have not had time to test it yet, so please let me know how it performs.
Thank you for suggesting blake3. It really has a lot of improvements over md5 so I've made it the default.
Wow, thank you for such a quick integration! And sorry for not yet reacting to your request for testing - I would have done that within the next few days, as chkbit is my current favorite for collection hash checks.
Now I'm definitely going to try the new algorithm 😋 I did record hashing/checking times with md5 :)
Thank you!
A small update on speeds: chkbit is now 6 minutes (20%) faster on my dataset (was: 30m, is: 24m).

Processed 48409 files in readonly mode.
- 120.24 files/second
- 1104.05 MB/second
Hmm, I forgot to include elapsed so I had to fix that first ;)
With md5 (10 workers)
Processed 41417 files in readonly mode.
- 0:02:21 elapsed
- 292.57 files/second
- 2439.38 MB/second
With blake3
Processed 41417 files in readonly mode.
- 0:01:59 elapsed
- 345.36 files/second
- 2879.54 MB/second
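Working out the improvement from those two runs over the same 41417-file dataset:

```python
# md5 run:    0:02:21 elapsed, 292.57 files/s, 2439.38 MB/s
# blake3 run: 0:01:59 elapsed, 345.36 files/s, 2879.54 MB/s
md5_s = 2 * 60 + 21   # 141 s
b3_s = 1 * 60 + 59    # 119 s
print(f"elapsed speedup:    {md5_s / b3_s:.2f}x")        # ~1.18x
print(f"throughput speedup: {2879.54 / 2439.38:.2f}x")   # ~1.18x
```

Both ratios land around 1.18x, far below BLAKE3's CPU-bound advantage over md5, which is consistent with the IO-bottleneck observation below.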
@spock I think your IO is not able to keep up.
I agree, it looks like IO is my bottleneck with blake3. Hopefully that's just a peculiarity of accessing a Windows (NTFS) encrypted folder from within WSL2 😄
Hi, thank you for a promising-looking file bitrot/hash checker. I especially like the built-in logic of "modified content and date are fine, modified content alone is not" - this is exactly what I've been looking for!
For file integrity checking there is a rather new BLAKE3 algorithm that is significantly faster (around 9x) than md5, but also claims to be better; they published an article with more details and benchmarks. It was designed specifically for file (content) hashing.
The primary (binary) implementation is in Rust (with parallelization), but there are also reference/educational non-parallel implementations in C and pure Python.
If you think this could be a nice --algo option, what would be the best way to integrate it? As you already have multi-worker support, I guess calling their single-threaded C library (or asking for single-threaded processing from the main Rust library) would be best? I haven't yet checked if Python bindings exist, but I'd assume they do.
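The integration idea above, one single-threaded hasher per worker with parallelism across files, can be sketched with stdlib stand-ins (hashlib.blake2b in place of blake3, ThreadPoolExecutor in place of chkbit's worker pool; all names here are illustrative and not chkbit's actual code):

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def hash_file(path, chunk_size=1 << 20):
    """Hash one file with a single, single-threaded incremental hasher."""
    h = hashlib.blake2b()  # stand-in for blake3.blake3(); same API
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return path.name, h.hexdigest()

# Demo on three temporary files; each hash stays single-threaded,
# and the pool provides the parallelism across files.
tmp = Path(tempfile.mkdtemp())
for i in range(3):
    (tmp / f"file{i}.bin").write_bytes(bytes([i]) * 100_000)

with ThreadPoolExecutor(max_workers=4) as pool:
    digests = dict(pool.map(hash_file, sorted(tmp.glob("*.bin"))))

for name, digest in sorted(digests.items()):
    print(name, digest[:16])
```

With this shape, per-file hashing stays cheap and deterministic while throughput scales with the worker count, matching the multi-worker design the question assumes.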