greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0
2.47k stars 234 forks source link

Thread safe way to check existence of items #227

Closed easypickings closed 8 months ago

easypickings commented 8 months ago

I'm not quite familiar with thread safety issues, but suppose I have a hash map that has been initialized with some key-value pairs. Is it okay to store the data in a flat_hash_map and use STL-like ops to check if a key exists in the map with multi-threading (the data is assured not to be altered during the checking)?

Specifically, what I want to do is something like the code below:

phmap::flat_hash_map<int, int> map;

// fill map with some kv pairs

#pragma omp parallel for
for (int i = beg; i < end; ++i) {
  auto it = map.find(i);
  if (it != map.end()) {
    std::cout << it->second << std::endl;
  } else {
    std::cout << "not found\n";
  }
}
greg7mdp commented 8 months ago

Yes, if the map is not modified, it is perfectly safe to check whether it contains keys from multiple threads.

easypickings commented 8 months ago

Thanks Greg! Another question please: if I have a lot of (tens of millions maybe) simple <int, int> pairs to insert into a hash map, is it a good idea to use a parallel_flat_hash_map and insert using multiple threads? My concern is that the number of data to insert is huge, and a single insert operation is lightweight, which may make the mutex lock a overhead.

greg7mdp commented 8 months ago

Sure, it still would be much faster to use a parallel_flat_hash_map and multiple threads, just make the N template parameter larger than the default 4, maybe 10 or something, and there will be very little mutex contention.

Another possibility. Where are all those pairs you want to insert coming from? If you can iterate very quickly over them, you can actually insert in a parallel_flat_hash_map without locking at all.

easypickings commented 8 months ago

Another possibility. Where are all those pairs you want to insert coming from? If you can iterate very quickly over them, you can actually insert in a parallel_flat_hash_map without locking at all.

Hmm... I have an array storing m offsets of a file. What I want to do is iterate over the array, read some bytes starting at the offset into memory, then store the pair <offset, index> in the map. So I don't think it can be done quickly. But how can iterating quickly make the insertion lock-free anyway?

greg7mdp commented 8 months ago

Actually that would work well. The parallel-flat-hash internally has an array of submaps (When N=4 you have 16 submaps). What you would do is start 16 threads, and each thread would populate its own submap (thread # 0 populate submap 0, etc..... So each thread would:

that's it, no locking necessary.

This is what the bench does here. I really should write a better example.

easypickings commented 8 months ago

That's cool! By the way, when using a lock-free parallel hash map, is there a difference among insertion methods, like operator[]/insert/emplace?

greg7mdp commented 8 months ago

If you use the method I indicated above, use emplace_with_hash so you pass the hashval and it is not recomputed. Otherwise it doesn't make much difference when you insert a pair of integers.

easypickings commented 8 months ago

Good to know! And thanks again for this awesome work and all your help!