rurban / smhasher

Hash function quality and speed tests
https://rurban.github.io/smhasher/
Other
1.83k stars 177 forks source link

Better Hashmap tests #82

Open rurban opened 4 years ago

rurban commented 4 years ago

Originally I also checked various other much faster linear probing hashmaps, such as e.g. from rigtorp or the two swisstable variants (Google and Facebook), but I settled with the slow standard (seperate chaining) for fair comparison purposes. Maybe add other ones, and document them.

@dumblob Speaking about currently worldwide available hashmaps, definitely skim the comparison from Martin Ankerl https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-05-conclusion/ to notice one very important thing. Namely that some hashmaps are almost not at all sensitive to collisions and throughput and latency (Throughput versus latency) of the hash function, but other hashmaps rely completely on the quality and throughput and/or latency of the hash function. To be fair in showing real use case scenarios, we would actually need to test one representative from the sensitive corner and one from the insensitive corner.

@rurban True. There should be a fast, sensitive linear probing also, besides the usual insensitive std chaining. About a robin_hood I'm not sure. It's much more stable than linear probing, and from outside just a fast chaining variant. we need to add the stddev to the mean cycles. I'll also add the initial insert/delete timings, but without trials and outlier removal. For poor hash functions add better security measures, like collision counting or tree-conversion. None of the usual hashmaps are secure, so maybe add fixed versions and test the overhead.

The goal should be to choose a good hash function, small (inlinable!) and not too many collisions, based on the timings, for slow and fast hash tables. With string keys, not numbers.

Parts taken from the discussions at https://github.com/rurban/smhasher/pull/80 and #61

data-man commented 4 years ago

https://gist.github.com/Tessil/72e11891fc155f5b2eb53de22cbc4053 https://tessil.github.io/2016/08/29/benchmark-hopscotch-map.html https://tessil.github.io/2017/06/22/hat-trie.html

wangyi-fudan commented 4 years ago

these results are a bit old @data-man . https://github.com/ktprime/emhash is a good place for hashmap benchmark. I stick on ska::bytell_hash_map for speed and space.

data-man commented 4 years ago

@wangyi-fudan

these results are a bit old

You can always generate new results. :)

dumblob commented 4 years ago

@wangyi-fudan just FYI Martin Ankerl in the link above compared emhash and commented it like this:

If you are dealing with small maps and integral data types as keys, emilib1::HashMap might be fastest. But be aware that I cannot vouch for it’s stability. It is very new, and while integrating it the author still had to fix a few kinks to get it working in my benchmark. If that does not bother you, it’s search performance is very fast.

If you want a more stable map with very good performance, I’d go for absl::flat_hash_map as the default map. It is very stable, has lots of features, and definitely well tested.

rurban commented 4 years ago

I'm more leaning to https://github.com/greg7mdp/parallel-hashmap, an improved abseil/swisstable variant, an open-addressing header-only C++ hashtable with quadratic probing. But it does it's own mixing for bad hash functions. Maybe disable that.

I'm not so sure our tested hashfuncs are actually inlined with the current HashMapTest. Definitely need to be templated with new pfhash classes, and not via std:function indirect calls.

thomasahle commented 4 years ago

I think it would be nice to disable quadratic probing as well. Being able to handle a pure linear probing is a classical hash function test, which probably requires a quite strong (5 independent) hash function. Multiply shift should certainly not be able to handle linear probing, but it seems to do alright in the current parallel hashtable tests.

rurban commented 4 years ago

Yes. The idea was to test it with the common default hashtable, and a good modern one. Adding a trivial linear probing or even a chained one is a good idea, because so many people are still using them.

thomasahle commented 4 years ago

Some would even say that if you have a good enough hash function, quadratic probing and tree-fallbacks only slow you down 😉

rurban commented 4 years ago

Others like me would argue that only the poor hash functions lead to acceptable hash table speed. Good hashes are for the weak. But it depends entirely on the usecase and the hash table impl.