ktprime / emhash

Fast and memory efficient c++ flat hash map/set
MIT License
435 stars 30 forks source link

Time complexity of iteration #34

Closed cookiestarfish closed 8 months ago

cookiestarfish commented 8 months ago

Is the time complexity of iterating all keys in a HashSet or HashMap a function of the number of buckets or of the number of stored items? If we insert and erase many keys, the number of buckets can become very large compared to the number of items. When we iterate all items, do we also have to iterate over all empty buckets?

For example: 1) insert 10M keys. 2) erase all keys except 100. 3) iterate over the 100 remaining keys.

how much time does (3) take?

ktprime commented 8 months ago

It depends on the kind of hash table used in your Git. If you are concerned about iterator performance, emhash8 is a good choice. Its memory usage is compact, and it does not create any holes in continuous memory even with frequent key insertions and deletions. The iteration performance is as fast as std::vector, and its complexity is O(size).

you also can call shrink_to_fit() to save memory . for emahsh6/7 the complexity of iteration is O(buckets/64) which is also very fast. some benchmark code in bench\martin_bench.cpp


template<class MAP>
static void bench_IterateIntegers(MAP& map)
{
    auto map_name = find_hash(typeid(MAP).name());
    if (!map_name)
        return;
    printf("\t%20s", map_name);

    size_t const num_iters = 50000;
    uint64_t result = 0;
    auto ts = now2sec();

    {
        MRNG rng(123);
        for (size_t n = 0; n < num_iters; ++n) {
            map[rng()] = n;
            for (const auto & keyVal : map)
#ifndef CK_HMAP
                result += keyVal.second;
#else
                result += 1;
#endif
        }
        assert(result == 20833333325000ull);
    }

    auto ts1 = now2sec();
    {
        MRNG rng(123);
        for (size_t n = 0; n < num_iters; ++n) {
            map.erase(rng());
            for (auto const& keyVal : map)
#ifndef CK_HMAP
                result += keyVal.second;
#else
                result += 1;
#endif
        }
    }
    assert(result == 62498750000000ull + 20833333325000ull);
    printf(", add/removing time = %.2f, %.2f|%lu\n", (ts1 - ts), now2sec() - ts1, result);
}

bench_IterateIntegers:
          std::unordered_map, add/removing time = 5.48, 5.31|2680875904
                     emhash5, add/removing time = 4.95, 7.16|2680875904
                     emhash8, add/removing time = 0.26, 0.26|2680875904
                     emhash7, add/removing time = 0.98, 1.15|2680875904
                     emhash6, add/removing time = 0.99, 1.07|2680875904
                martin dense, add/removing time = 0.26, 0.26|2680875904
                     emilib1, add/removing time = 1.44, 1.50|2680875904
                     emilib3, add/removing time = 3.12, 3.48|2680875904
                     emilib2, add/removing time = 2.39, 2.65|2680875904
                  boost flat, add/removing time = 2.95, 3.88|2680875904
                   absl flat, add/removing time = 4.31, 4.63|2680875904