It depends on which hash table implementation you use. If you care about iteration performance, emhash8 is a good choice: its memory layout is compact, and it leaves no holes in the contiguous element storage even under frequent key insertions and deletions. Iterating it is as fast as iterating a std::vector, and the complexity is O(size()).
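A minimal usage sketch of that behavior (this assumes the emhash8::HashMap type and the hash_table8.hpp header name from this repository; adjust the include to your setup):

    #include "hash_table8.hpp"   // assumed header providing emhash8::HashMap
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        emhash8::HashMap<uint64_t, uint64_t> map;

        // Insert many keys, then erase most of them.
        for (uint64_t i = 0; i < 1'000'000; ++i)
            map[i] = i;
        for (uint64_t i = 100; i < 1'000'000; ++i)
            map.erase(i);

        // emhash8 keeps its elements contiguous, so this loop visits only the
        // 100 remaining items; the cost tracks size(), not the bucket count.
        uint64_t sum = 0;
        for (const auto& kv : map)
            sum += kv.second;

        printf("size=%llu sum=%llu\n",
               (unsigned long long)map.size(), (unsigned long long)sum);

        // Optionally release the now-unused capacity (mentioned in the answer above).
        map.shrink_to_fit();
        return 0;
    }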
You can also call shrink_to_fit() to save memory. For emhash6/7, the complexity of a full iteration is O(buckets / 64), which is also very fast. Some benchmark code from bench/martin_bench.cpp:
template<class MAP>
static void bench_IterateIntegers(MAP& map)
{
    auto map_name = find_hash(typeid(MAP).name());
    if (!map_name)
        return;
    printf("\t%20s", map_name);

    size_t const num_iters = 50000;
    uint64_t result = 0;

    // Phase 1: insert one random key per round, then iterate the whole map.
    auto ts = now2sec();
    {
        MRNG rng(123);
        for (size_t n = 0; n < num_iters; ++n) {
            map[rng()] = n;
            for (const auto& keyVal : map)
#ifndef CK_HMAP
                result += keyVal.second;
#else
                result += 1;
#endif
        }
        assert(result == 20833333325000ull);
    }

    // Phase 2: erase one random key per round (same seed, so the same keys),
    // then iterate the whole map again.
    auto ts1 = now2sec();
    {
        MRNG rng(123);
        for (size_t n = 0; n < num_iters; ++n) {
            map.erase(rng());
            for (auto const& keyVal : map)
#ifndef CK_HMAP
                result += keyVal.second;
#else
                result += 1;
#endif
        }
    }
    assert(result == 62498750000000ull + 20833333325000ull);

    printf(", add/removing time = %.2f, %.2f|%lu\n", (ts1 - ts), now2sec() - ts1, result);
}
bench_IterateIntegers results (times in seconds):
std::unordered_map, add/removing time = 5.48, 5.31|2680875904
emhash5, add/removing time = 4.95, 7.16|2680875904
emhash8, add/removing time = 0.26, 0.26|2680875904
emhash7, add/removing time = 0.98, 1.15|2680875904
emhash6, add/removing time = 0.99, 1.07|2680875904
martin dense, add/removing time = 0.26, 0.26|2680875904
emilib1, add/removing time = 1.44, 1.50|2680875904
emilib3, add/removing time = 3.12, 3.48|2680875904
emilib2, add/removing time = 2.39, 2.65|2680875904
boost flat, add/removing time = 2.95, 3.88|2680875904
absl flat, add/removing time = 4.31, 4.63|2680875904
Is the time complexity of iterating all keys in a HashSet or HashMap a function of the number of buckets or of the number of stored items? If we insert and erase many keys, the number of buckets can become very large compared to the number of items. When we iterate all items, do we also have to iterate over all empty buckets?
For example: 1) insert 10M keys. 2) erase all keys except 100. 3) iterate over the 100 remaining keys.
How much time does step (3) take?
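For reference, a minimal sketch of steps (1)-(3) with timing of the final iteration, using std::unordered_map and std::chrono purely as placeholders (any hash map could be substituted):

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <unordered_map>

    int main()
    {
        std::unordered_map<uint64_t, uint64_t> map;

        // 1) insert 10M keys
        for (uint64_t i = 0; i < 10'000'000; ++i)
            map[i] = i;

        // 2) erase all keys except 100
        for (uint64_t i = 100; i < 10'000'000; ++i)
            map.erase(i);

        // 3) iterate over the 100 remaining keys and time it
        auto t0 = std::chrono::steady_clock::now();
        uint64_t sum = 0;
        for (const auto& kv : map)
            sum += kv.second;
        auto t1 = std::chrono::steady_clock::now();

        printf("sum=%llu, iterate time = %.6f s\n",
               (unsigned long long)sum,
               std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }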