greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0

question about performance of `at(...)` #210

Closed samuelpmish closed 11 months ago

samuelpmish commented 11 months ago

Hi, I'm trying out this library for a project where I have a performance bottleneck on a piece of code that uses a std::unordered_map to renumber some indices:

template < typename Map >
void renumber(const std::vector< uint64_t > & vertex_ids, 
              std::vector< std::array< uint64_t, 4 > > elements,
              int num_threads) {

  bool supports_parallel_insertion = 
    !std::is_same< Map, std::unordered_map<uint64_t, uint64_t> >::value;

  Map new_ids;
  std::atomic< uint64_t > new_id = 0;

  threadpool pool((supports_parallel_insertion) ? num_threads : 1);

  // insert vertex ids into a map, and give them a new compact index
  pool.parallel_for(vertex_ids.size(), [&](uint64_t i){
    auto id = new_id++;
    new_ids[vertex_ids[i]] = id;
  });

  // update connectivity lists to use the new indices
  pool.parallel_for(elements.size(), [&](uint64_t i){
    auto & elem = elements[i];
    elem[0] = new_ids.at(elem[0]);
    elem[1] = new_ids.at(elem[1]);
    elem[2] = new_ids.at(elem[2]);
    elem[3] = new_ids.at(elem[3]);
  });
}
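(The threadpool type used here isn't shown in the thread; for readers who want to run the snippet, a minimal stand-in with the same interface might look like the sketch below. It spawns fresh threads on every call rather than pooling them, so it is only meant to make the benchmark compile, not to represent the real implementation.)

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// hypothetical stand-in for the benchmark's thread pool: splits [0, count)
// into contiguous chunks and runs one worker thread per chunk
struct threadpool {
  explicit threadpool(int num_threads) : nthreads(num_threads) {}

  void parallel_for(uint64_t count, const std::function<void(uint64_t)> & body) {
    std::vector<std::thread> workers;
    uint64_t chunk = (count + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; t++) {
      uint64_t begin = t * chunk;
      uint64_t end   = std::min(count, begin + chunk);
      if (begin >= end) break;
      workers.emplace_back([=, &body]() {
        for (uint64_t i = begin; i < end; i++) { body(i); }
      });
    }
    for (auto & w : workers) { w.join(); }
  }

  int nthreads;
};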

Of course std::unordered_map doesn't let you insert things in parallel, so while every other part of the code benefits from 32 threads, this one becomes a bottleneck because of the serial insertion step. When I time these two operations, I get (first timing is insertion, second is updating indices):

std::unordered_map, (single threaded insertion, single threaded update): 1609.12ms 6296.07ms
std::unordered_map, (single threaded insertion, multithreaded update): 1815.36ms 608.119ms

So, the insertion can't benefit from multiple threads, but the update step can (~10x speedup).

When I try using

template < int n >
using pmap = phmap::parallel_node_hash_map< 
                uint64_t, 
                uint64_t,
                std::hash<uint64_t>,
                std::equal_to<uint64_t>, 
                std::allocator<std::pair<const uint64_t, uint64_t>>, 
                n, 
                std::mutex >;

as a drop-in replacement for std::unordered_map<uint64_t, uint64_t>, I get better insertion times

pmap<4>, 1 thread: 709.521ms 13461.7ms
pmap<4>, 32 threads: 617.853ms 5174.52ms
pmap<6>, 1 thread: 1216.25ms 13500.4ms
pmap<6>, 32 threads: 442.394ms 2610.72ms
pmap<9>, 32 threads: 194.461ms 1679.37ms

but considerably worse access times in the update step.
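(For context: phmap's n template parameter selects the number of internal submaps, 2^n, each guarded by its own mutex, so pmap<4>, pmap<6>, and pmap<9> shard across 16, 64, and 512 submaps respectively. More submaps reduce contention during insertion, but with std::mutex every at() call still locks the submap it hits, which is what the answer below addresses.)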

Is this expected, or am I perhaps using the wrong kind of container (or configuring them with the wrong parameters)?

The complete source file for this benchmark is available here.

Thank you

greg7mdp commented 11 months ago

Hi @samuelpmish , thank you for using phmap.

I have a fix for you. You'll be happy with the speed:

  1. use phmap::parallel_flat_hash_map, it is faster.
  2. create another type alias (I called it pmap_nullmutex) which has phmap::NullMutex as the mutex type (a sketch follows the code below).
  3. call the renumber function with this extra type, as in: renumber< pmap<4>, pmap_nullmutex<4> >(vertex_ids, elements, 1);
  4. before the update part, swap the map that uses std::mutex with the one that uses phmap::NullMutex, then run the update through the mutex-free map (locking is unneeded there, as we access the map only for reads), as in:
     stopwatch.start();
     Map_nomutex new_ids_nc;
     new_ids_nc.swap(new_ids);
     pool.parallel_for(elements.size(), [&](uint64_t i){
       auto & elem = elements[i];
       elem[0] = new_ids_nc.at(elem[0]);
       elem[1] = new_ids_nc.at(elem[1]);
       elem[2] = new_ids_nc.at(elem[2]);
       elem[3] = new_ids_nc.at(elem[3]);
     });
     stopwatch.stop();
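For reference, the pmap_nullmutex alias from step 2 could look like the following sketch (this assumes pmap itself is also switched to phmap::parallel_flat_hash_map per step 1; the exact aliases are in the updated benchmark mentioned below):

template < int n >
using pmap_nullmutex = phmap::parallel_flat_hash_map< 
                uint64_t, 
                uint64_t,
                std::hash<uint64_t>,
                std::equal_to<uint64_t>, 
                std::allocator<std::pair<const uint64_t, uint64_t>>, 
                n, 
                phmap::NullMutex >;   // no-op mutex: locking compiles away

Because the two aliases differ only in the mutex type, the swap above can hand the populated submaps over without copying, and every subsequent at() call skips locking entirely, which is where the read-side speedup comes from.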

Here are the results I got:

generating dataset .. done
std::unordered_map, 1 thread: 1073.95ms 4651.66ms
std::unordered_map, 32 threads (single threaded insertion): 1253.99ms 464.183ms
pmap4, 1 thread: 649.7ms 2574.57ms
pmap4, 32 threads: 353.275ms 217.594ms
pmap6, 1 thread: 365.098ms 2615.59ms
pmap6, 32 threads: 204.003ms 219.373ms

greg7mdp commented 11 months ago

Here is your updated benchmark.

PS: would you mind if I add it as an example in phmap?
PS2: your benchmark feels very familiar. I used to develop a graphics library for FE analysis at Altair Engineering :-).

samuelpmish commented 11 months ago

thanks for the update, that was what I was missing!

> would you mind if I add it as an example in phmap?

sounds good to me

> your benchmark feels very familiar. I used to develop a graphics library for FE analysis at Altair Engineering :-)

you've figured me out, it's taken from a finite element project. Thanks again for the help!

greg7mdp commented 11 months ago

Oh, I just thought that your benchmark would be a bit faster if you reserve space in the map up front:

    new_ids.reserve(vertex_ids.size() * 110 / 100);
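Reserving capacity for the expected number of keys avoids rehashes while the parallel insertion is running. The call would go right after constructing the map in renumber, e.g.:

    Map new_ids;
    new_ids.reserve(vertex_ids.size() * 110 / 100);  // ~10% headroom over the input size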