wangyi-fudan / wyhash

The FASTEST QUALITY hash function, random number generators (PRNG) and hash map.
The Unlicense
970 stars · 73 forks

Wyhash v4 is not optimal for stream processing #45

Closed ManDeJan closed 4 years ago

ManDeJan commented 4 years ago

While updating my Zig port of wyhash I noticed that, unlike the first version, the fourth version of wyhash is not optimal for stream processing. This is because the fourth version requires the total length at the start of the hashing algorithm: it immediately processes the trailing (length mod 32) bytes, whereas the previous version processed the 32-byte chunks first. This makes implementing a streaming version much more complex.
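The structural difference can be sketched as follows. This is an illustration only: `mix`, `v1_style`, and `v4_style` are hypothetical stand-ins for wyhash's real mixing math, chosen just to show where the length is needed.

```rust
// Stand-in for wyhash's mixing step -- NOT the real arithmetic.
fn mix(state: u64, chunk: &[u8]) -> u64 {
    chunk.iter().fold(state, |h, &b| h.rotate_left(5) ^ b as u64)
}

// v1-style layout: bulk 32-byte chunks first, tail and length only at the
// end, so a streaming hasher can feed chunks as they arrive.
fn v1_style(data: &[u8], seed: u64) -> u64 {
    let bulk_len = data.len() - data.len() % 32;
    let mut state = seed;
    for chunk in data[..bulk_len].chunks(32) {
        state = mix(state, chunk);
    }
    mix(state, &data[bulk_len..]) ^ data.len() as u64
}

// v4-style layout: the tail (len mod 32 bytes) is consumed *before* the
// bulk loop, so the total length must be known at the very first call.
fn v4_style(data: &[u8], seed: u64) -> u64 {
    let tail_start = data.len() - data.len() % 32;
    let mut state = mix(seed, &data[tail_start..]); // needs len up front
    for chunk in data[..tail_start].chunks(32) {
        state = mix(state, chunk);
    }
    state ^ data.len() as u64
}

fn main() {
    let data: Vec<u8> = (0u8..100).collect();
    // v1-style can be streamed: feed full chunks as they arrive,
    // finalize with the tail and length at the end.
    let mut state = 7u64;
    for chunk in data[..96].chunks(32) {
        state = mix(state, chunk);
    }
    let streamed = mix(state, &data[96..]) ^ data.len() as u64;
    assert_eq!(streamed, v1_style(&data, 7));
    // v4-style already needed data.len() at its first step.
    let _ = v4_style(&data, 7);
}
```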

wangyi-fudan commented 4 years ago

Yes, this comes as expected. I had this problem in mind when designing it, but I need a "quick return" for short strings; it is a trade-off. Sorry about that. If the length is known before streaming, it is still OK, but if the length is not known at the time of the call, it is not easy. This feature distinguishes wyhash from the traditional hash construction (Merkle–Damgård construction), and maybe it makes a length-extension attack a bit harder.

eldruin commented 4 years ago

This is also a problem for my Rust port, since implementing the Hasher trait from Rust's standard library requires streamed processing, i.e. a number of data chunks are written into the Hasher and then the hash is calculated with finish().

This makes it impossible to swap wyhash in as the hash function in Rust algorithms that are generic over the hasher, since these expect something that implements the Hasher trait.
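For readers unfamiliar with the trait, here is a minimal sketch of the contract the standard library imposes; the generic helper `hash_with` is hypothetical, and the stdlib's own `DefaultHasher` stands in for a wyhash-backed hasher:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Generic code only sees the Hasher trait: data arrives via write() in
// however many chunks the caller chooses, and finish() yields the hash.
fn hash_with<H: Hasher>(mut hasher: H, value: &impl Hash) -> u64 {
    value.hash(&mut hasher); // may invoke hasher.write() several times
    hasher.finish()
}

fn main() {
    let h1 = hash_with(DefaultHasher::new(), &"hello");
    let h2 = hash_with(DefaultHasher::new(), &"hello");
    // The result must not depend on how the input was split across
    // write() calls -- exactly what wyhash v4's length-first layout
    // made hard to guarantee.
    assert_eq!(h1, h2);
}
```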

wangyi-fudan commented 4 years ago

let me work harder...

wangyi-fudan commented 4 years ago

done! SMHasher report 1byte/cycle more bulk speed! I love you all!

wangyi-fudan commented 4 years ago

corpus: /usr/share/dict/words

| hash          | short   | hashmap | bulk16M |
|---------------|---------|---------|---------|
| wyhash_v4     | 264.111 | 47.523  | 17.348  |
| wyhash_v4_old | 241.616 | 44.990  | 18.686  |
| wyhash_v3     | 248.903 | 46.319  | 17.605  |
| XXH3_scalar   | 187.113 | 44.173  | 13.048  |

eldruin commented 4 years ago

Thank you for addressing this @wangyi-fudan! I think it should be possible to implement streamed hashing now.

Hint for streamed-hash implementers in other languages: with wyhash v4, given this situation (in Rust, but you get the idea):

```rust
let mut hasher = WyHash::with_seed(3);
hasher.write(&[0, 1, 2, /* ... */]); // some length <= 64
hasher.write(&[3, 4, 5, /* ... */]); // some length <= 64
let hash = hasher.finish();
```

Calculating the hash after each write will not produce the same result as calculating the hash of the whole concatenated data in one step, because the length is also factored in for inputs <= 64 bytes. I would recommend keeping a buffer larger than 64 bytes, so that if the algorithm is called with small pieces of data you can fill the buffer on each call and only calculate the hash when the buffer is full or when hashing is finished. Of course, receiving longer data (data bigger than the buffer), tracking the total length of the data, and so on needs proper handling.

The point is to keep your hasher running in the main hashing loop until finish is called.
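The buffering strategy above can be sketched like this. Everything here is hypothetical: a seeded FNV-1a stands in for the real one-shot wyhash, and `BufferedHasher` keeps the whole input for simplicity, whereas a real port would keep only a fixed buffer and run the bulk loop on completed blocks:

```rust
// Stand-in one-shot hash (FNV-1a with the seed mixed in); substitute the
// real one-shot wyhash here.
fn one_shot_hash(data: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical streaming wrapper: writes are buffered so that finish()
/// sees the full input, and therefore the full length, in one piece.
struct BufferedHasher {
    seed: u64,
    buf: Vec<u8>, // simplification: a real port keeps a fixed-size buffer
}

impl BufferedHasher {
    fn with_seed(seed: u64) -> Self {
        Self { seed, buf: Vec::new() }
    }
    fn write(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
    }
    fn finish(&self) -> u64 {
        // Only here is the total length known, so only here can the
        // length-dependent short-input path run correctly.
        one_shot_hash(&self.buf, self.seed)
    }
}

fn main() {
    // Splitting the input across writes must match one-shot hashing.
    let mut hasher = BufferedHasher::with_seed(3);
    hasher.write(&[0, 1, 2]);
    hasher.write(&[3, 4, 5]);
    assert_eq!(hasher.finish(), one_shot_hash(&[0, 1, 2, 3, 4, 5], 3));
}
```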