greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0

Computing std::hash once per key for the hash_bench example #201

Closed · marioroy closed this issue 1 year ago

marioroy commented 1 year ago

Ah, I only discovered the hash_bench example today. Thank you for the small demonstration. The following is not an issue, mainly curiosity: I tried hash_bench2.cc, which computes std::hash once per key.

```
hash_bench   7.016 seconds
hash_bench2  2.813 seconds

lkfm: 7360
elkl: 7720
pnjj: 7860
mloo: 7860
iokk: 7650
```

Seeing a 2.5x speedup is mind-boggling to me. So much so that I will try the same thing for the Long List is Long demonstration and report back.

greg7mdp commented 1 year ago

That's not really surprising; the whole purpose of the test was to measure the speed of the hash function. I wanted to try polymur-hash. The test is in my gtl repo, which is equivalent to phmap except that it requires C++17.

BTW your test has an incorrect comparison function, you should compare the str not the hash.

I'm curious to see how you would use this in llil. phmap has some APIs that take the hash value if precomputed (look for with_hash).
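For reference, a minimal sketch of the precomputed-hash idea (editor's example, not code from this thread; the exact parameter order of the with_hash overloads is an assumption to verify against your phmap version):

```cpp
// Hedged sketch: compute the hash once with the map's own hasher, then
// pass it to a *_with_hash overload so the key is not hashed again.
// Verify the exact lazy_emplace_with_hash signature in phmap.h.
#include <parallel_hashmap/phmap.h>
#include <string>

int main() {
    using Map = phmap::parallel_flat_hash_map<std::string, int>;
    Map counts;
    std::string key = "lkfm";

    // hash_function() is the standard unordered-container accessor.
    size_t hashval = counts.hash_function()(key);

    // Same update as lazy_emplace, minus a second hash computation.
    counts.lazy_emplace_with_hash(key, hashval,
        [&](const Map::constructor& ctor) { ctor(key, 1); });
}
```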

marioroy commented 1 year ago

Oh... The mind-boggling part was not so much the variant running faster, but the possibility of Long List is Long itself running faster. And the joy in trying things.

> BTW your test has an incorrect comparison function, you should compare the str not the hash.

I will try both ways for the LLIL demonstration. I do not understand why one cannot use the hash value, as it is tied to the key. Am I wrong to assume the hash value is unique per key?

> phmap has some APIs that take the hash value if precomputed (look for with_hash)

I saw them at a quick glance but did not realize the possibility. I will try. Thank you.

marioroy commented 1 year ago

I commented out the if statement and the output matches between hash_bench and hash_bench2.

```cpp
// if (++cnt == 6) break;
```

marioroy commented 1 year ago

Please disregard my hash_bench2 demonstration. I see now that the purpose of the hash_bench example is to compare std::hash against polymur-hash.

marioroy commented 1 year ago

I'm closing the issue. The hash value is already computed only one time. It has been an interesting journey. Again, thank you for the hash map library.

marioroy commented 1 year ago

I compared equality using l.str == r.str and l.hash == r.hash. The output is the same. Regardless, I changed the gist to l.str == r.str per your suggestion.

```
hash_bench                      7.016 seconds
hash_bench2 (l.str  == r.str)   4.318 seconds
hash_bench2 (l.hash == r.hash)  2.813 seconds
```

marioroy commented 1 year ago

The motivation for doing this was nothing more than curiosity: wondering whether std::hash is computed more than once per key, and if so, how much time is saved by computing it only once. I saw no measurable time difference for llil. There was no issue either with the equality predicate returning l.hash == r.hash; the output is the same.

Edit: Changed the equality predicate to return l.str == r.str;.

> I'm curious to see how you would use this in llil.


```diff
@@ -126,13 +126,19 @@

 using str_int_type     = std::pair<str_type, int_type>;
 using vec_str_int_type = std::vector<str_int_type>;
+using hash_type        = std::size_t;
+
+struct S {
+   str_type  str;
+   hash_type hash;
+};

 // create the parallel_flat_hash_map with internal mutexes
 using map_str_int_type = phmap::parallel_flat_hash_map<
-   str_type, int_type,
-   phmap::priv::hash_default_hash<str_type>,
-   phmap::priv::hash_default_eq<str_type>,
-   phmap::priv::Allocator<phmap::priv::Pair<const str_type, int_type>>,
+   S, int_type,
+   decltype( [](const S& k) { return k.hash; } ),
+   decltype( [](const S& l, const S& r) { return l.str == r.str; } ),
+   phmap::priv::Allocator<phmap::priv::Pair<const S, int_type>>,
    10, std::mutex
 >;

@@ -224,16 +230,19 @@
 #else
             str_type s(beg_ptr, klen);
 #endif
+            hash_type h = std::hash<std::basic_string_view<char>>{}(
+               std::basic_string_view<char>{ reinterpret_cast<const char*>(&s[0]), s.size() }
+            );
             // Use lazy_emplace to modify the map while the mutex is locked.
             map_ret.lazy_emplace_l(
-               s,
+               S{ s, h },
                [&](map_str_int_type::value_type& p) {
                   // called only when key was already present
                   p.second += count;
                },
                [&](const map_str_int_type::constructor& ctor) {
                   // construct value_type in place when key not present
-                  ctor(std::move(s), count);
+                  ctor(S{ std::move(s), h }, count);
                }
             );
          }
@@ -323,7 +332,7 @@

    // Store the properties into a vector
    vec_str_int_type propvec; propvec.reserve(8 + map.size());
-   for (auto const& x : map) propvec.emplace_back(x);
+   for (auto const& x : map) propvec.emplace_back(x.first.str, x.second);
    map.clear();

   cend2 = high_resolution_clock::now();
```
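One portability note on the diff above (editor's observation): using decltype of a lambda as a template argument requires C++20, where lambdas are allowed in unevaluated contexts and stateless closures are default-constructible. A C++17-compatible sketch replaces them with named functors:

```cpp
// C++17-friendly equivalents of the decltype(lambda) functors in the diff.
struct S_hash {
   std::size_t operator()(const S& k) const { return k.hash; }
};
struct S_eq {
   bool operator()(const S& l, const S& r) const { return l.str == r.str; }
};

using map_str_int_type = phmap::parallel_flat_hash_map<
   S, int_type, S_hash, S_eq,
   phmap::priv::Allocator<phmap::priv::Pair<const S, int_type>>,
   10, std::mutex
>;
```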

It has been a lot of fun trying things, and I learned a lot.

greg7mdp commented 1 year ago

> Am I wrong to assume the hash value is unique per key?

Yes, there is no guarantee that this is the case. Maybe it happens to hold here because we have a small number of distinct strings, but again, it is not a given.
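To make the consequence concrete, an editor's sketch of a safe equality predicate for the S struct from the diff above: the cached hash can serve only as a cheap early-out, never as the equality test itself.

```cpp
// Unequal hashes do imply unequal strings, so the hash test is a sound
// early-out. Equal hashes do NOT imply equal strings (pigeonhole), so the
// string comparison is what actually decides equality.
struct S_eq_fast {
   bool operator()(const S& l, const S& r) const {
      return l.hash == r.hash && l.str == r.str;
   }
};
```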

marioroy commented 1 year ago

Thank you for the extra clarity on why not to do that. In the end, I omitted the change altogether.

I came across with_submap inside the tests folder. That is helpful in llil for iterating the submaps in parallel (phmap to vector).

Before:

```cpp
   // Store the properties into a vector
   vec_str_int_type propvec; propvec.reserve(8 + map.size());
   for (auto const& x : map) propvec.emplace_back(x);
   map.clear();
```

After:

```cpp
   // Store the properties into a vector
   vec_str_int_type propvec; propvec.reserve(8 + map.size());

   #pragma omp parallel for schedule(static, 1) num_threads(4)
   for (size_t i = 0; i < map.subcnt(); ++i) {
      map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
         if (set.size()) {
            vec_str_int_type v; v.reserve(8 + set.size());
            for (auto& p : set) v.emplace_back(p);
            #pragma omp critical
            propvec.insert(propvec.end(), v.begin(), v.end());
         }
      });
   }

   map.clear();
```

The llil4map.cc code is final.

marioroy commented 1 year ago

The emhash entry recently appeared under "Explore repositories" on my GitHub page. Like flat_hash_map, emhash is not thread-safe. So, I thought about having the application handle locking, including the sub-maps. Doing this gave me a new appreciation for parallel_flat_hash_map.

  1. llil4map.cc (phmap::parallel_flat_hash_map)
  2. llil4hmap.cc (phmap::flat_hash_map)
  3. llil4emh.cc (emhash8::HashMap)

emhash performs similarly to flat_hash_map, but flies when processing input containing duplicate keys. I ran get_properties with 1 thread. Behind the scenes, the examples consume 4 threads during the hash map to vector step. Sorting consumes at most 24 threads.

Running 1 thread (52 input files; 26 * 2):

```
$ NUM_THREADS=1 ./llil4map in/biga* in/biga* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties        43.310 secs
phmap to vector        0.290 secs
vector stable sort     0.687 secs
write stdout           1.608 secs
total time            45.897 secs
3291503642 712085184
$ NUM_THREADS=1 ./llil4hmap in/biga* in/biga* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort
get properties        33.839 secs
hashmap to vector      0.439 secs
vector stable sort     0.663 secs
write stdout           1.530 secs
total time            36.472 secs
3291503642 712085184
$ NUM_THREADS=1 ./llil4emh in/biga* in/biga* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort
get properties        27.888 secs
emhash to vector       0.293 secs
vector stable sort     0.672 secs
write stdout           1.583 secs
total time            30.438 secs
3291503642 712085184
```

Running many threads (52 input files; 26 * 2):

```
$ NUM_THREADS=24 ./llil4map in/biga* in/biga* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         3.296 secs
phmap to vector        0.318 secs
vector stable sort     0.671 secs
write stdout           1.540 secs
total time             5.827 secs
3291503642 712085184
$ NUM_THREADS=24 ./llil4hmap in/biga* in/biga* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort
get properties         3.145 secs
hashmap to vector      0.297 secs
vector stable sort     0.673 secs
write stdout           1.580 secs
total time             5.697 secs
3291503642 712085184
$ NUM_THREADS=24 ./llil4emh in/biga* in/biga* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort
get properties         3.094 secs
emhash to vector       0.219 secs
vector stable sort     0.667 secs
write stdout           1.535 secs
total time             5.516 secs
3291503642 712085184
```
greg7mdp commented 1 year ago

> That is helpful in llil for iterating the submaps in parallel (phmap to vector).

It would be more efficient to iterate once (not parallel) through the submaps to get their sizes, so you know exactly where to insert their content in the final (resized) vector, and you don't need the intermediate vector and the double copy.
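A sketch of that suggestion (editor's code, reusing the types from the snippet above): a serial pass records each submap's starting offset, then the parallel pass writes every submap directly into its slice of the final vector.

```cpp
// Pass 1 (serial): prefix sums of submap sizes give each submap's offset.
std::vector<size_t> offset(map.subcnt() + 1, 0);
for (size_t i = 0; i < map.subcnt(); ++i)
   map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
      offset[i + 1] = offset[i] + set.size();
   });

vec_str_int_type propvec(offset.back());   // final size known up front

// Pass 2 (parallel): each submap copies into its own disjoint slice,
// so no intermediate vector and no critical section are needed.
#pragma omp parallel for schedule(static, 1) num_threads(4)
for (size_t i = 0; i < map.subcnt(); ++i)
   map.with_submap(i, [&](const map_str_int_type::EmbeddedSet& set) {
      size_t pos = offset[i];
      for (auto const& p : set)
         propvec[pos++] = str_int_type(p.first, p.second);
   });

map.clear();
```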

> So, I thought about having the application handle locking, including the sub-maps.

Yes, you can certainly do that. It is equivalent to what the parallel_hash_map does; it just requires a little extra work.

marioroy commented 1 year ago

> It would be more efficient to iterate once (not parallel) through the submaps to get their sizes, so you know exactly where to insert their content in the final (resized) vector, and you don't need the intermediate vector and the double copy.

Ah, yes. Thank you! This is fixed in the gist's code. I added the run times for llil4hmap and llil4emh to the summary section in llil_results.txt.