martinus / robin-hood-hashing

Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20
https://gitter.im/martinus/robin-hood-hashing
MIT License
1.52k stars 146 forks source link

Usage for avoiding duplicates #61

Closed e4lam closed 4 years ago

e4lam commented 4 years ago

In my benchmarks for using robin_hood for avoiding duplicate pointers/integers, the performance seems almost magical. In a test with 8 random unique integers repeated once (for a total of 16 insertions), even robin_hood::unordered_set<size_t> it is about 1.2x faster than an array that uses linear search to avoid duplicates. So much so that it feels like I'm doing something wrong. Is that expected?

How does robin_hood::unordered_set manage to be so fast at these small sizes while maintaining iterator stability? Or does it give that up?

On a separate note, I'm a bit worried that the README.md says that it falls back to std::hash in other cases. But as Malte Skarupke notes, this can be bad on Windows because Visual C++'s std::hash uses stronger hash functions to compensate for the fact that its unordered_set implementation uses power of 2 sizes and then only keeps the lower bits of the hash result. In fact, not only does it use FNV1 hash for integers, but also for enum, float, double, and pointers. So it feels dangerous to fallback to std::hash. Is there a customization point to use say boost::hash as the fallback instead?

martinus commented 4 years ago

Well, robin_hood map doesn't have iterator stability. unordered_node_map and unordered_node_set have reference and pointer stability, but not iterator stability. Note std::unordered_map also doesn't have iterator stability.

I don't know your benchmark code, but maybe the compiler has optimized something completely away? This tends to happen for small benchmarks. Usually robin_hood map shouldn't be faster than an array :)

Concerning the hashing, robin_hood has implementations for all basic types that std::hash also has specializations for, and std::string. (see here: https://en.cppreference.com/w/cpp/utility/hash)

It mostly only falls back to std::hash if you have provided a user defined implementation for a special type.

e4lam commented 4 years ago

Ok, right. But does both robin_hood::unordered_{map,set} and std::unordered_{map,set} have reference/pointer stability?

That is a good point about things being optimized away. I'll have to double-check. The benchmark is like the following, with lists of items that have varying number of unique items and duplicates.

void addIgnoringDups(size_t item)
{
    bool inserted = myItemSet.insert(item).second; // eg. unordered_set<size_t>
    if (!inserted)
         return;
    myItems.push_back(item); // this is a std::vector<size_t>
}

vs

void addIgnoringDups(size_t item)
{
    for (size_t i = 0, n = myItems.size(); i < n; ++i)
    {
        if (myItems[i] == item)
            return;
    }
    myItems.push_back(item);  // this is a std::vector<size_t>
}

Thanks for the answers!

martinus commented 4 years ago

std::unordered_{map,set} has reference & pointer stability, for robin_hood::unordered_{map,set} it depends. This type is basically either robin_hood::unordered_flat_{map,set} or robin_hood::unordered_node_{map,set}, based on what it thinks will be faster.

So in short:

You might want to have a look at my nanobench project, it has a doNotOptimizeAway to make sure a result is not optimized away by the compiler: https://github.com/martinus/nanobench

e4lam commented 4 years ago

Yep, I had started using nanobench for precisely for these current set of benchmarks (which lead me to making requests like https://github.com/martinus/nanobench/issues/14).

The benchmark code I'm using currently looks like this:

    SetNodeT node(items.size());
    cfg.run(name,
            [&]() {
                for (auto item : items)
                    node.addIgnoringDups(item);
            }).doNotOptimizeAway(node);
e4lam commented 4 years ago

Ah, I see the difference now, robin_hood::unordered_node_{map,set} chooses based on a heuristic around the size of the key type and its properties.

On reference & pointer stability, I was confused because I had first added ska::flat_hash_set and ska::unordered_set to my benchmark which I think doesn't map the same way to robin_hood::unordered_flat_set / robin_hood::unordered_set does for this.