Closed: samuelpmish closed this issue 11 months ago
Hi @samuelpmish , thank you for using phmap.
I have a fix for you. You'll be happy with the speed: phmap::parallel_flat_hash_map is indeed faster. I defined a second map type (pmap_nullmutex) which has phmap::NullMutex as the mutex type, and templated the renumber function with this extra type, as in:

```cpp
renumber< pmap<4>, pmap_nullmutex<4> >(vertex_ids, elements, 1);
```
Then, in the update part, swap the pmap that has the real mutex with the one that has the NullMutex, and use the new map with no mutex locking (which is unneeded, as we access the map only for reads), as in:
```cpp
stopwatch.start();
Map_nomutex new_ids_nc;
new_ids_nc.swap(new_ids);
pool.parallel_for(elements.size(), [&](uint64_t i){
    auto & elem = elements[i];
    elem[0] = new_ids_nc.at(elem[0]);
    elem[1] = new_ids_nc.at(elem[1]);
    elem[2] = new_ids_nc.at(elem[2]);
    elem[3] = new_ids_nc.at(elem[3]);
});
stopwatch.stop();
```
Here are the results I got:
```
generating dataset .. done
std::unordered_map, 1 thread: 1073.95ms 4651.66ms
std::unordered_map, 32 thread (single threaded insertion): 1253.99ms 464.183ms
pmap4, 1 thread: 649.7ms 2574.57ms
pmap4, 32 threads: 353.275ms 217.594ms
pmap6, 1 thread: 365.098ms 2615.59ms
pmap6, 32 threads: 204.003ms 219.373ms
```
Here is your updated benchmark
PS: would you mind if I add it as an example in phmap?
PS2: your benchmark feels very familiar. I used to develop a graphics library for finite element analysis at Altair Engineering :-).
thanks for the update, that was what I was missing!
would you mind if I add it as an example in phmap?
sounds good to me
your benchmark feels very familiar. I used to develop a graphics library for finite element analysis at Altair Engineering :-)
you've figured me out, it's taken from a finite element project. Thanks again for the help!
Oh, I just thought that your benchmark would be a bit faster if you reserve the size in your map:

```cpp
new_ids.reserve(vertex_ids.size() * 110 / 100);
```
Hi, I'm trying out this library for a project where I have a performance bottleneck in a piece of code that uses a std::unordered_map to renumber some indices. Of course, std::unordered_map doesn't let you insert things in parallel, so every other part of the code benefits from 32 threads, but this part becomes a bottleneck because of the serial insertion step. When I time these two operations (the first timing is insertion, the second is updating the indices), the insertion can't benefit from multiple threads, but the update step can (~10x speedup).
When I try using phmap::parallel_flat_hash_map as a drop-in replacement for std::unordered_map<uint64_t, uint64_t>, I get better insertion times but considerably worse access times in the update step.
Is this expected, or am I perhaps using the wrong kind of container (or configuring them with the wrong parameters)?
The complete source file for this benchmark is available here.
Thank you