martinus / robin-hood-hashing

Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20
https://gitter.im/martinus/robin-hood-hashing
MIT License
1.5k stars 142 forks source link

Not faster than libstdc++ anymore? #82

Closed oscarfv closed 3 years ago

oscarfv commented 3 years ago

libstdc++'s unordered_set<unsigned> shows similar performance to robin_hood::unordered_set<unsigned> when inserting 100 million consecutive numbers. This is not what I expect after reading https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/

I'm using gcc 9.3 on Debian Testing 64 bits compiling with -O2 -march=native on Intel Skylake.

Memory efficiency is huge, though.

Did libstdc++ got better on gcc 9.3 or I'm using robin_hood incorrectly?

martinus commented 3 years ago

Well, benchmarking is hard. If you insert consecutive numbers, the libstdc++'s implementation has a huge advantage, because it uses an identity hash. So all numbers are consecutively placed, and all fit perfectly without getting any collision.

A better test is to use a (very fast) random generator to insert random numbers. But that is not fair either, because in that case no hashing would be necessary so identity hash has a clear advantage as well.

To get rid of the unfair advantage of the identity hash, you can create random numbers and shift them to the left by a few bits. A good hashmap will perform as fast as before, but with identity hash you will get dramatic slowdown.

martinus commented 3 years ago

I just ran 3 insertion benchmarks with g++ 10.1.0, -O3:

insert consecutive numbers
        4.01631s, size=100000000 for robin_hood::unordered_flat_set<uint64_t>
        12.4555s, size=100000000 for robin_hood::unordered_node_set<uint64_t>
        7.7876s, size=100000000 for std::unordered_set<uint64_t>
insert random numbers
        8.25911s, size=100000000 for robin_hood::unordered_flat_set<uint64_t>
        15.4305s, size=100000000 for robin_hood::unordered_node_set<uint64_t>
        34.7533s, size=100000000 for std::unordered_set<uint64_t>
insert consecutive, shifted left 32 bits
        3.32331s, size=100000000 for robin_hood::unordered_flat_set<uint64_t>
        11.5695s, size=100000000 for robin_hood::unordered_node_set<uint64_t>
        6.35094s, size=100000000 for std::unordered_set<uint64_t>

Inserting consecutive numbers or shifted numbers is indeed quite fast for std::unordered_set, so it seems they really have done some great improvements. But it's only faster when compared to robin_hood::unordered_node_set. robin_hood::unordered_flat_set is still by far the fastest in these benchmarks, for random data it's 4 times faster.

oscarfv commented 3 years ago

Thank you for the detailed response. Using the identity hash with robin_hood::unordered_set puts it ahead of std::unordered_set by about 18% (is there any reason for not using the identity hash with robin_hood?). Operations such as erase seem not affected and remain roughly as fast or slightly slower than std:: for my use case (32 bit unsigned integers which typically are added or removed on ascending order, with arbitrary gaps).

Since I'm mainly after an improvement on memory usage over std::unordered_set, robin_hood::unordered_set is good enough as long as it is not an speed regression over std::.

Thank you again.