jessek / hashdeep


Support xxHash algorithm #409

Open bhagemeier opened 1 year ago

bhagemeier commented 1 year ago

Hi there,

At Juelich Supercomputing Centre, we've recently been looking into convenient tools for generating and verifying hash sums of large collections of data, typically several TB up to PB in size. We've found hashdeep convenient: it offers a good interface, including parallelisation options that may be important when checksumming and verifying many small files.
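To illustrate why parallelisation matters for many small files, here is a minimal Python sketch that hashes every file under a directory using a thread pool. It is not how hashdeep is implemented; the names `hash_file` and `hash_tree`, the 1 MiB chunk size, the SHA-256 default, and the worker count are all assumptions made for this example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CHUNK_SIZE = 1 << 20  # 1 MiB per read; an arbitrary choice for this sketch


def hash_file(path, algorithm="sha256"):
    """Return the hex digest of one file, read in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root, algorithm="sha256", workers=4):
    """Hash all regular files below root, several files at a time."""
    files = sorted(p for p in Path(root).rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so files and digests line up.
        digests = pool.map(lambda p: hash_file(p, algorithm), files)
        return dict(zip(files, digests))
```

Calling `hash_tree("/some/dir")` returns a mapping from each path to its digest. Threads are used here because the work per file is mostly I/O and C-level hashing; whether this scales on a parallel file system is something one would have to measure.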

We've also come across the xxHash algorithm, which has been specifically designed to create checksums over extremely large amounts of data.

We have found that the command-line tools provided for xxHash lack some functionality offered by hashdeep. We therefore propose integrating xxHash into hashdeep to better support use cases involving extremely large volumes of data. We also support the idea of integrating Blake3, as mentioned in #397.

In the spirit of Open Source, we offer to do the integration ourselves, but would first like to know whether you would be willing to include the code in the main branch afterwards. If there are good reasons to omit algorithms such as xxHash or Blake3, please let us know about them.

To back our request with numbers, here is a comparison of xxHash against various algorithms supported by hashdeep, run on a 155GB data set consisting of two files.

| Tool | Duration | Speed (approx.) |
| --- | --- | --- |
| xxHash | 36s | 4.3GB/s |
| hashdeep (default md5 and sha256) | 564s | 275MB/s |
| hashdeep (md5) | 184s | 840MB/s |
| hashdeep (sha1) | 294s | 530MB/s |
| hashdeep (tiger) | 272s | 570MB/s |
| hashdeep (whirlpool) | 789s | 200MB/s |
| hashdeep (mmap,md5,sha256) | 629s | 250MB/s |

As you can see, xxHash is at least 5 times faster than the fastest algorithm supported by hashdeep.
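For readers who want to check the arithmetic, the speeds and the "5 times" figure follow directly from the durations and the 155GB data set size. The short Python sketch below recomputes them; it assumes decimal units (1 GB = 1000 MB), and the helper name `throughput_mb_s` is made up for this example.

```python
DATASET_GB = 155  # size of the data set used in the comparison

# Durations in seconds, copied from the comparison above.
durations_s = {
    "xxHash": 36,
    "hashdeep (default md5 and sha256)": 564,
    "hashdeep (md5)": 184,
    "hashdeep (sha1)": 294,
    "hashdeep (tiger)": 272,
    "hashdeep (whirlpool)": 789,
    "hashdeep (mmap,md5,sha256)": 629,
}


def throughput_mb_s(size_gb, seconds):
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / seconds


for tool, seconds in durations_s.items():
    print(f"{tool}: {throughput_mb_s(DATASET_GB, seconds):.0f} MB/s")

# Fastest hashdeep run, and how much longer it took than xxHash.
fastest_hashdeep = min(
    (tool for tool in durations_s if tool != "xxHash"),
    key=durations_s.get,
)
speedup = durations_s[fastest_hashdeep] / durations_s["xxHash"]
print(f"xxHash vs {fastest_hashdeep}: {speedup:.1f}x faster")
```

The fastest hashdeep run is the md5-only one at 184s, and 184 / 36 is roughly 5.1, which matches the claim above.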

keybreak commented 1 year ago

Still a very much needed feature! @jessek any plans for it?

bhagemeier commented 1 year ago

We have someone working on it now. The performance gain is not yet as much as we would have expected. Please stay tuned for updates.

keybreak commented 1 year ago

> The performance gain is not yet as much as we would have expected.

That's weird... Hopefully it will be optimized! :+1:

oneEyedCharlie commented 9 months ago

> We have someone working on it now. The performance gain is not yet as much as we would have expected. Please stay tuned for updates.

How is your project coming along? I am CPU-bottlenecked when using hashdeep, and would love an "xxhashdeep" or similar. Even small improvements would be helpful.
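One rough way to check whether hashing itself is the CPU bottleneck is to measure how fast each algorithm runs on data that is already in memory, and compare that with the disk's read speed. The Python sketch below does this with the standard library's `hashlib`; it does not measure hashdeep or xxHash, and the function name `measure_throughput`, the 16 MiB payload, and the repeat count are assumptions for this example.

```python
import hashlib
import time


def measure_throughput(algorithm, payload, repeats=3):
    """Return the best observed in-memory hashing speed in MB/s."""
    best = float("inf")
    for _ in range(repeats):
        digest = hashlib.new(algorithm)
        start = time.perf_counter()
        digest.update(payload)
        digest.hexdigest()
        best = min(best, time.perf_counter() - start)
    return len(payload) / best / 1e6


payload = bytes(16 * 1024 * 1024)  # 16 MiB of zeros, no disk I/O involved
for name in ("md5", "sha1", "sha256"):
    print(f"{name}: {measure_throughput(name, payload):.0f} MB/s")
```

If the numbers printed here are well below what the storage can deliver, the hash computation is likely the limiting factor on that machine, and a faster algorithm would be expected to help.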