idrassi / DirHash

Windows command line utility to compute hash of directories and files
BSD 3-Clause "New" or "Revised" License

cpu utilization #8

Closed bluelayer closed 3 years ago

bluelayer commented 3 years ago

hello!

Maybe it's not really an issue, or it's a problem on my side, but is DirHash multithread capable? I can only achieve 25% CPU utilization with Blake3.

idrassi commented 3 years ago

DirHash is not multithreaded, and for a single hash computation this is a limitation of the underlying hash algorithms. Apart from Blake3, none of the hash algorithms supported by DirHash can be parallelized because of their design. Only Blake3 can be parallelized, thanks to its design based on a Merkle tree, but unfortunately the official Blake3 C source code (which DirHash uses) doesn't implement this.

For now, the Blake3 authors don't plan to provide a parallelized version of their C code. Others can work on it, but it will require a lot of work and also knowledge of Blake3 internals. I personally don't plan to spend time on this.

Hopefully, a parallelized C implementation of Blake3 will be available in the near future so that DirHash can benefit from it.
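
As a rough illustration of why a tree-based design allows parallelism while a chained design does not, here is a toy sketch in Python. It is not Blake3 and not what DirHash does: SHA-256 stands in for the compression function, and `chain_hash`, `tree_hash` and `CHUNK_SIZE` are hypothetical names made up for this example. In the chained version every step needs the previous state, so the work is inherently serial. In the tree version the leaves are independent of each other and can be handed to a thread pool.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1024  # illustrative chunk size for this toy example


def chain_hash(data: bytes) -> bytes:
    """Toy chained hash: each step depends on the previous state (serial)."""
    state = b"\x00" * 32
    for i in range(0, len(data), CHUNK_SIZE):
        state = hashlib.sha256(state + data[i:i + CHUNK_SIZE]).digest()
    return state


def tree_hash(data: bytes, pool=None) -> bytes:
    """Toy Merkle-tree hash: leaves are independent, so they can run in parallel."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)] or [b""]

    def leaf(chunk: bytes) -> bytes:
        # The 0x00 prefix separates leaf hashes from parent hashes.
        return hashlib.sha256(b"\x00" + chunk).digest()

    # Leaf hashing is the part that can be spread over several threads.
    mapper = pool.map if pool is not None else map
    level = list(mapper(leaf, chunks))

    # Combine pairs of nodes until a single root remains.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(b"\x01" + pair[0] + pair[1]).digest())
            else:
                next_level.append(pair[0])  # odd node is carried up unchanged
        level = next_level
    return level[0]


if __name__ == "__main__":
    data = bytes(range(256)) * 64
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Same root whether the leaves are hashed serially or in a pool.
        print(tree_hash(data) == tree_hash(data, pool))
```

The sketch only shows the structure. With chunks this small, Python threads will not give a real speedup; a native implementation would hash much larger subtrees per thread.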

idrassi commented 3 years ago

Maybe your question is more about hashing multiple files in a directory in parallel than about making the hash computation itself parallel. If so, then this is indeed something that can be implemented in DirHash to speed up SUM computation (the -sum and -verify switches), since this is the only case where multithreading is possible regardless of the hash algorithm used.

I will have a look at it as it is definitely doable. I will update this issue when I have more to say about it.
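
To make the idea concrete, here is a minimal sketch of per-file parallelism in Python. It is only an illustration under assumptions, not DirHash's actual implementation (DirHash is a native Windows tool): `hash_file` and `hash_tree` are hypothetical names, and SHA-256 from the standard library stands in for whichever algorithm is selected. Each file is still hashed sequentially; the parallelism comes only from hashing several files at the same time.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def hash_file(path, algo="sha256", bufsize=1 << 20):
    """Hash one file sequentially, reading it in 1 MiB blocks."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()


def hash_tree(root, threads=4, algo="sha256"):
    """Return {path: hex digest} for every file under root, hashing files in parallel."""
    paths = sorted(
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    )
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with paths.
        digests = pool.map(lambda p: hash_file(p, algo), paths)
        return dict(zip(paths, digests))


if __name__ == "__main__":
    for path, digest in hash_tree(".").items():
        print(digest, path)
```

Because every file gets its own independent digest, this works with any hash algorithm. It only helps when per-file digests are what is wanted, as in a checksum list.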

bluelayer commented 3 years ago

> Maybe your question is related to hashing multiple files in a directory in parallel more than making hash computation parallel in itself.

I only care about speed; I don't care where it comes from =)

idrassi commented 3 years ago

@bluelayer I have published version 1.15, which implements multithreading through a dedicated switch, -threads. The -threads switch applies only when combined with the -sum or -verify switches.

The speedup depends on the disk speed, the number of files and their size (e.g. on slow disks or if there are many small files, most of the time is spent on I/O and not hash computation).

Can you please give it a try and report your results?

On my test systems, the speedup is between 2x and 8x depending on the configuration used.

bluelayer commented 3 years ago

I'm sorry but, speed limit is about to exxxxpppplllllooooodddddeeeeee 💯 💯 💯 💯 💯 💯

🥰 🥰

bluelayer commented 3 years ago

On my Ryzen 3 I now get 100% CPU utilization and about 5 seconds to -sum/-verify 6897 files 🥇 with zero disk I/O (maybe everything was cached)

idrassi commented 3 years ago

Thank you for this feedback. I'm glad the results are up to your expectations. This is an important enhancement to DirHash and I'm grateful to you for proposing this. I will close this issue. Don't hesitate to report any issues you may find.