jessek / hashdeep

Default `-j 2` value hurting performance #325

Closed by GitHubGeek 9 years ago

GitHubGeek commented 9 years ago

TL;DR: `-j` should have a sane default of 1 thread for checksumming and matching operations.

Running md5deep 3.9.2 on Ubuntu 14.04, I noticed that during matching mode (`md5deep -x md5.txt -r /some/dir`) the I/O throughput is way below what my disk is capable of (25 MB/s vs 160 MB/s). In iotop I can see that two md5deep threads are utilising the bandwidth.

Adding `-j1` to the command solves the problem.
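
A rough way to reproduce the comparison is sketched below, reusing the paths from above; dropping the page cache between runs keeps the second pass from being served from RAM, and needs root:

```sh
# Flush dirty pages and drop the page cache so both runs actually hit the disk
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# Default thread count
time md5deep -x md5.txt -r /some/dir > /dev/null

sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

# Single hashing thread
time md5deep -j1 -x md5.txt -r /some/dir > /dev/null
```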

I can't see the benefit of multi-threading when checksumming / matching files on a single disk. Modern CPUs can crunch MD5 checksums at tens of GB/s, so unless the user has an exotic multi-SSD RAID-0 setup, the bottleneck is the disk and multi-threading would not improve performance. Moreover, performance on mechanical HDDs would almost certainly suffer when the threads compete for reads and cause excessive seeks.
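
One way to sanity-check that reasoning on a given machine is sketched below; `/dev/sda` is a placeholder for the drive being hashed, and `openssl speed` measures OpenSSL's MD5, so it only gives a ballpark for md5deep's implementation:

```sh
# Rough single-core MD5 throughput at several block sizes
openssl speed md5

# Raw sequential read speed of the disk (needs root)
sudo hdparm -t /dev/sda
```

If a single core already hashes faster than the disk can read sequentially, extra hashing threads can only add seek contention on a spinning disk.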

Probably related to #298

EDIT: I've done some more testing; this issue applies to checksumming as well.

simsong commented 9 years ago

Probably reasonable for md5deep, but not for hashdeep, sha1deep, sha256deep, etc.

simsong commented 9 years ago

Oh, I should add that this wasn't the case when the original multi-threading was done. CPUs have gotten faster at a faster rate than disks.

GitHubGeek commented 9 years ago

I've done some more testing: I can get 160 MB/s running `sha256deep -j1 -r /foo`. No fancy hardware, just a 3 TB SATA drive and a $60 Pentium CPU.

My argument about excessive seeks on mechanical HDDs still stands. I still think the default should be `-j 1` and the manual should advise users to experiment with a higher number of threads based on their needs. For example, multi-threading would be beneficial for many small files stored on SSDs.
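
Something like the following loop would make that experiment easy (a sketch; `/foo` is the placeholder path from above, and the cache drop needs root):

```sh
# Time sha256deep at several thread counts, with a cold page cache each run
for j in 1 2 4 8; do
  sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
  echo "-j $j:"
  time sha256deep -j "$j" -r /foo > /dev/null
done
```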

Edit: Fixed typos; Added hardware info

simsong commented 9 years ago

It's a relatively simple change to make `-j1` the default in all cases. Might be a reasonable thing to do.

GitHubGeek commented 9 years ago

Another (much more complicated :smile:) fix is to implement a producer/consumer pattern: a single producer thread to read from disk, and consumer threads to do the number crunching.

simsong commented 9 years ago

It does implement a producer/consumer pattern. However, the hash algorithms it implements cannot be parallelized within a single file.

dima-stefantsov commented 9 months ago

Today, in 2023, I installed md5deep (the hashdeep package) via `apt install md5deep` on an Ubuntu 22.04 server, and the behaviour is still the same: on a 24-thread server, `md5deep -r /dir/` gives 30 MB/s, while the same command with `-j0` (likewise `-j1`) gives ~200 MB/s. A weird default.