laktak / chkbit-py

Check your files for data corruption
MIT License

question: `--algo blake3` #6

Closed: spock closed this issue 6 months ago

spock commented 6 months ago

Hi, thank you for a promising-looking file bitrot/hash checker. I especially like the built-in logic of "modified content and date are fine, modified content alone is not" - this is exactly what I've been looking for!
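The rule quoted above can be sketched in a few lines. This is only an illustration of the idea, not chkbit's actual code: the `Record` type, the `Status` names, and the `classify` function are hypothetical, and it assumes a stored digest and modification time are available for each file.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"            # content unchanged
    UPDATED = "updated"  # content and modification time both changed
    DAMAGED = "damaged"  # content changed, modification time did not


@dataclass(frozen=True)
class Record:
    digest: str   # hex digest of the file content
    mtime: float  # modification time, seconds since the epoch


def classify(stored: Record, current: Record) -> Status:
    """Compare a stored index entry with the file's current state (hypothetical helper)."""
    if stored.digest == current.digest:
        return Status.OK
    if stored.mtime != current.mtime:
        # A changed date suggests a deliberate edit, so the new content is accepted.
        return Status.UPDATED
    # Content differs although the date is the same: likely silent corruption.
    return Status.DAMAGED
```

For example, `classify(Record("aa", 1.0), Record("bb", 1.0))` yields `Status.DAMAGED`, while the same digests with different modification times yield `Status.UPDATED`.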

For file integrity checking there is a fairly new algorithm, BLAKE3, that is significantly faster than md5 (around 9x) and also claims to be better; its authors published an article with more details and benchmarks. It was designed specifically for file (content) hashing.

The primary (binary) implementation is in Rust (with parallelization), but there are also reference/educational non-parallel implementations in C and pure Python.

If you think this could be a nice --algo option, what could be the best way to integrate it? As you already have multi-worker support, I guess calling their single-threaded C library (or asking for single-thread processing from the main Rust library) would be the best? I haven't yet checked if Python bindings exist, but I'd assume they do.

spock commented 6 months ago

This is somewhat related to closed #3

spock commented 6 months ago

Python bindings for the Rust library do exist. They default to 1 thread and accept a max_threads parameter: https://pypi.org/project/blake3/

```python
from blake3 import blake3

# Hash a large input using multiple threads. Note that this can be slower for
# inputs shorter than ~1 MB, and it's a good idea to benchmark it for your use
# case on your platform.
large_input = bytearray(1_000_000)
hash_single = blake3(large_input).digest()
hash_two = blake3(large_input, max_threads=2).digest()
hash_many = blake3(large_input, max_threads=blake3.AUTO).digest()
assert hash_single == hash_two == hash_many
```

laktak commented 6 months ago

You can take a look in the blake3 branch. I have not had time to test it so please let me know how it performs.

laktak commented 6 months ago

Thank you for suggesting blake3. It really has a lot of improvements over md5 so I've made it the default.
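A minimal sketch of how a selectable hash algorithm with blake3 as the default might look. This is not taken from chkbit's source: `new_hasher`, `hash_file`, and `CHUNK_SIZE` are hypothetical names, and it assumes the `blake3` package from PyPI, falling back to an error if it is not installed.

```python
import hashlib

try:
    from blake3 import blake3  # third-party package: pip install blake3
except ImportError:
    blake3 = None

CHUNK_SIZE = 1024 * 1024  # hypothetical read size; tune for your storage


def new_hasher(algo="blake3"):
    """Return a hasher object exposing update() and hexdigest()."""
    if algo == "blake3":
        if blake3 is None:
            raise RuntimeError("blake3 requested but the 'blake3' package is not installed")
        return blake3()  # single-threaded by default
    return hashlib.new(algo)  # e.g. "md5" or "sha256"


def hash_file(path, algo="blake3"):
    """Hash a file in chunks so large files are never loaded into memory at once."""
    hasher = new_hasher(algo)
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Keeping each hasher single-threaded fits a tool that already runs several workers in parallel, since each worker can hash a different file.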

spock commented 6 months ago

Wow, thank you for such a quick integration! And sorry for not yet reacting to your request for testing - I would have done that within the next few days, as chkbit is my current favorite for collection hash checks.

Now I'm definitely going to try the new algorithm 😋 I did record hashing/checking times with md5 :)

Thank you!

spock commented 6 months ago

A small update on speeds:

Processed 48409 files in readonly mode.

  • 120.24 files/second
  • 1104.05 MB/second

laktak commented 6 months ago

Hmm, I forgot to include the elapsed time in the output, so I had to fix that first ;)

With md5 (10 workers)

Processed 41417 files in readonly mode.
- 0:02:21 elapsed
- 292.57 files/second
- 2439.38 MB/second

With blake3

Processed 41417 files in readonly mode.
- 0:01:59 elapsed
- 345.36 files/second
- 2879.54 MB/second

@spock I think your IO is not able to keep up.
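The figures above can be cross-checked with some quick arithmetic. The sketch below uses only the numbers reported in this thread; the helper names `speedup` and `implied_seconds` are made up for illustration.

```python
def speedup(baseline_mb_per_s, candidate_mb_per_s):
    """Throughput ratio of the candidate run over the baseline run."""
    return candidate_mb_per_s / baseline_mb_per_s


def implied_seconds(files, files_per_second):
    """Elapsed time implied by a file count and a files/second rate."""
    return files / files_per_second


# Numbers reported above for the same 41417 files.
md5_mb_s, blake3_mb_s = 2439.38, 2879.54
print(f"blake3 vs md5 throughput: {speedup(md5_mb_s, blake3_mb_s):.2f}x")
print(f"md5 run:    {implied_seconds(41417, 292.57):.1f} s")
print(f"blake3 run: {implied_seconds(41417, 345.36):.1f} s")
```

The ratio works out to roughly an 18% gain on this machine, and the implied run times agree with the reported 0:02:21 and 0:01:59. The earlier run at 1104.05 MB/second is well below both figures, which is consistent with storage being the limit there.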

spock commented 6 months ago

I agree, it looks like IO is my bottleneck with blake3. Hopefully it's just a peculiarity of accessing a Windows (NTFS) encrypted folder from within WSL2 😄
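One rough way to check whether IO or hashing is the limit is to measure how fast a hash runs on data that is already in memory and compare that with the MB/second reported for a real run. The sketch below is only a rough single-threaded probe using the standard library; `hash_throughput_mb_s` is a hypothetical helper, and it counts 1 MB as 1024 * 1024 bytes.

```python
import hashlib
import time


def hash_throughput_mb_s(algo="md5", size_mb=64):
    """Hash an in-memory buffer and return throughput in MB/second (no disk IO involved)."""
    data = bytes(size_mb * 1024 * 1024)  # zero-filled buffer
    hasher = hashlib.new(algo)
    start = time.perf_counter()
    hasher.update(data)
    hasher.digest()
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


if __name__ == "__main__":
    print(f"md5 in memory: {hash_throughput_mb_s('md5'):.0f} MB/second")
```

If the in-memory figure, multiplied by the number of workers, is far above what a real run reports, reading the files is the more likely bottleneck than the hash itself.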