Closed GitHubGeek closed 9 years ago
Probably reasonable for md5deep, but not for hashdeep, sha1deep, sha256deep, etc.
Oh, I should add that this wasn't the case when the original multi-threading was done. CPUs have gotten faster at a faster rate than disks.
Done some more testing: I can get 160MB/s running sha256deep -j1 -r /foo. No fancy hardware, just a 3TB SATA drive and a $60 Pentium CPU.
My argument about excessive seeks on mechanical HDDs still stands. I still think that the default should be -j 1 and that the manual should advise users to experiment with a higher number of threads based on their needs. For example, multi-threading would be beneficial for many small files stored on SSDs.
Edit: Fixed typos; Added hardware info
It's a relatively simple change to make -j1 the default in all cases. Might be a reasonable thing to do.
Another (much more complicated :smile: ) fix is to implement a producer/consumer pattern: a single producer reads from disk and consumer threads do the number crunching.
It does implement a producer/consumer pattern. However the hash algorithms it implements cannot be parallelized within a file.
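For readers unfamiliar with the pattern being discussed, here is a minimal Python sketch of a single reader feeding several hashing threads. It is not hashdeep's actual implementation; the function name `hash_tree`, the worker count, and the choice to read each file fully into memory are assumptions made only to keep the example short.

```python
import hashlib
import queue
import threading
from pathlib import Path

_DONE = None  # sentinel telling a hashing thread to exit


def hash_tree(root, workers=4, algorithm="sha256"):
    """Hash every regular file under root; returns {path: hex digest}."""
    # Bounded queue so the reader cannot run far ahead of the hashers.
    jobs = queue.Queue(maxsize=workers * 2)
    digests = {}
    lock = threading.Lock()

    def reader():
        # Single producer: files are read one after another, so the disk
        # sees one sequential stream instead of several competing readers.
        try:
            files = sorted(p for p in Path(root).rglob("*") if p.is_file())
            for path in files:
                # Whole file in memory: acceptable for a sketch only.
                jobs.put((str(path), path.read_bytes()))
        finally:
            for _ in range(workers):
                jobs.put(_DONE)

    def hasher():
        # Consumers: each file is hashed by exactly one thread, because a
        # single digest cannot be split across threads.
        while True:
            job = jobs.get()
            if job is _DONE:
                break
            path, data = job
            digest = hashlib.new(algorithm, data).hexdigest()
            with lock:
                digests[path] = digest

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=hasher) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return digests
```

In this layout only the reader touches the disk, so adding hashing threads does not add seeks. A real tool would stream large files in chunks rather than hold them in memory.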
Today, in 2023, I installed md5deep (hashdeep) with apt install md5deep on Ubuntu 22.04 server, and the behaviour is still the same: on a 24-thread server, md5deep -r /dir/ gives 30 MB/s, and the same command with -j0 (or -j1) gives ~200 MB/s. A weird default.
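A comparison like the one above can be repeated with a small timing harness. The sketch below is only an illustration: the helper names (`build_command`, `measure`) and the example path are made up, and it uses only the -r and -j flags already mentioned in this thread.

```python
import subprocess
import time
from pathlib import Path


def tree_size_bytes(root):
    """Total size of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())


def build_command(tool, threads, root):
    """Build e.g. ['md5deep', '-r', '-j1', '/dir']; threads=None keeps the default."""
    cmd = [tool, "-r"]
    if threads is not None:
        cmd.append(f"-j{threads}")
    cmd.append(str(root))
    return cmd


def measure(tool, threads, root):
    """Run one pass over root and return the throughput in MB/s."""
    size = tree_size_bytes(root)
    start = time.perf_counter()
    subprocess.run(build_command(tool, threads, root),
                   stdout=subprocess.DEVNULL, check=True)
    elapsed = time.perf_counter() - start
    return size / elapsed / 1e6


# Example usage (hypothetical path; requires md5deep to be installed):
# for j in (None, 0, 1, 4):
#     print(j, round(measure("md5deep", j, "/some/dir"), 1), "MB/s")
```

Repeated runs over the same tree may be served from the page cache rather than the disk, so the numbers are only meaningful if the data set is larger than RAM or the cache is cleared between runs.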
TL;DR: -j should have a sane default of 1 thread for checksumming and matching operations
Running md5deep 3.9.2 on Ubuntu 14.04. I noticed that during matching mode (md5deep -x md5.txt -r /some/dir) the I/O throughput is way below what my disk is capable of (25MB/s vs 160MB/s). In iotop I can see that 2 md5deep threads are utilising the bandwidth. Adding -j1
to the command solves the problem.
I can't see the benefit of multi-threading when checksumming / matching files on a single disk. Modern CPUs can compute MD5 at hundreds of MB/s per core, which is more than a single mechanical disk can deliver. So, unless the user has an exotic multi-SSD RAID-0 setup, the bottleneck is in the disks and multi-threading would not improve performance. Moreover, performance on mechanical HDDs would almost certainly suffer when the threads are competing for reads and causing excessive seeks.
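The CPU side of this argument is easy to check on any given machine. The following sketch measures single-threaded, in-memory hashing speed with Python's hashlib. It says nothing about md5deep's own code paths, and the function name and buffer sizes are arbitrary choices for illustration.

```python
import hashlib
import time


def hash_throughput_mb_s(algorithm="md5", total_mb=64, chunk_mb=1):
    """Hash total_mb of zero bytes in memory and return the rate in MB/s."""
    chunk = bytes(chunk_mb * 1024 * 1024)
    rounds = total_mb // chunk_mb
    h = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(rounds):
        h.update(chunk)  # no disk I/O involved: this is pure CPU work
    elapsed = time.perf_counter() - start
    return rounds * len(chunk) / elapsed / 1e6


# Example usage:
# for name in ("md5", "sha1", "sha256"):
#     print(name, round(hash_throughput_mb_s(name)), "MB/s")
```

If the figure for one core is already above the disk's sequential read speed (160MB/s in the report above), extra hashing threads cannot speed up a single-disk run.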
Probably related to #298
EDIT: Done some more testing, this issue applies to checksumming as well.