Closed marioroy closed 7 months ago
I reached out to NVIDIA. The parallel-hashmap usage in llil4hmap
is not a typical pattern, but something I tried out of curiosity at the time. I suggested trying llil4map.cc instead to investigate why nvc++ is noticeably slower than clang++.
https://forums.developer.nvidia.com/t/nvfortran-23-9-problem/270618/13?u=marioeroy
Thank you @marioroy, I'm very busy right now, but I'll look into this when I can, I promise.
I ran into the same warnings with nvcc. Simply including the header seems to trigger them:
#include <phmap.h>
int main() {}
Compiled with:
nvcc -Iinclude/parallel_hashmap test.cu -o test
Warnings:
include/parallel_hashmap/phmap.h(434): warning #68-D: integer conversion resulted in a change of sign
include/parallel_hashmap/phmap.h(442): warning #68-D: integer conversion resulted in a change of sign
include/parallel_hashmap/phmap.h(434): warning #68-D: integer conversion resulted in a change of sign
include/parallel_hashmap/phmap.h(442): warning #68-D: integer conversion resulted in a change of sign
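For readers unfamiliar with this class of diagnostic, here is a minimal sketch of code that commonly draws a "change of sign" warning, along with spellings that avoid the implicit conversion. It is an illustration only: the names `kAllOnes` and `kAllOnesAlt` are hypothetical, and nothing here is taken from lines 434 or 442 of phmap.h, whose contents are not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// A form that typically draws the diagnostic (kept as a comment on purpose):
//     std::uint32_t mask = -1;   // negative constant converted to unsigned
//
// Two spellings of the same "all bits set" value that avoid the implicit
// signed-to-unsigned conversion. Both names are hypothetical.
constexpr std::uint32_t kAllOnes = std::numeric_limits<std::uint32_t>::max();
constexpr std::uint32_t kAllOnesAlt = ~std::uint32_t{0};

// An explicit cast states the intent and yields the same value.
static_assert(kAllOnes == kAllOnesAlt, "both spellings agree");
static_assert(kAllOnes == static_cast<std::uint32_t>(-1),
              "explicit cast yields the same bit pattern");
```

The value is identical in every case; only the way the conversion is expressed changes, which is why such warnings are usually harmless.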
I tried the suggestion emitted by nvc++: "Individual warnings can be suppressed with --diag_suppress <warning-name>".
nvc++ --diag_suppress=integer_sign_change ...
I also tried nvc++ to build a Python extension (see "Accelerating Python on GPUs with nvc++ and Cython").
nvc++ --diag_suppress=declared_but_not_referenced,identifier_not_keyword,set_but_not_used ...
The warnings are fixed in the latest version.
Hi, @greg7mdp
The NVIDIA nvc++ compiler is interesting. I fixed the three map demonstrations llil4map.cc, llil4hmap.cc, and llil4emh.cc. No more slowness using nvc++.
Switched from std::mutex to the included spinlock_mutex. Noticeably better performance. This also resolves nvc++ taking more than 40 seconds to compile llil4hmap.cc and llil4emh.cc.
out_properties: this ran poorly using nvc++, taking > 4 seconds. The program built with clang++ outputs in 0.7 seconds. It turns out that allocating a basic string inside a parallel loop causes nvc++ OpenMP to perform worse than non-parallel.
#ifdef MAX_STR_LEN_L
// std::basic_string<char> s { it->first.data(), MAX_STR_LEN_L };
// str.append(s.c_str());
str.append(it->first.data());
#else
str.append(it->first.data(), it->first.size());
#endif
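The change above avoids constructing a temporary string for every key. A minimal, single-threaded sketch of the same idea follows. It assumes keys are fixed-width, NUL-padded character arrays; the names `FixedKey`, `kMaxLen`, `key_length`, `append_keys`, and `make_key` are hypothetical and do not come from the llil4map sources.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: keys are fixed-width char arrays padded with NUL bytes.
constexpr std::size_t kMaxLen = 12;
using FixedKey = std::array<char, kMaxLen>;

// Length up to the first NUL, or the full width if the key has no NUL.
inline std::size_t key_length(const FixedKey& k) {
    return static_cast<std::size_t>(
        std::find(k.begin(), k.end(), '\0') - k.begin());
}

// Append each key plus a newline straight into `out`. No per-key
// std::string is constructed, so the loop body does not allocate once
// `out` has been reserved.
inline void append_keys(const std::vector<FixedKey>& keys, std::string& out) {
    out.reserve(out.size() + keys.size() * (kMaxLen + 1));
    for (const FixedKey& k : keys) {
        out.append(k.data(), key_length(k));
        out.push_back('\n');
    }
}

// Hypothetical helper that builds a NUL-padded key from a C string,
// truncating anything longer than kMaxLen.
inline FixedKey make_key(const char* s) {
    assert(s != nullptr);
    FixedKey k{};  // zero-filled
    for (std::size_t i = 0; i < kMaxLen && s[i] != '\0'; ++i) k[i] = s[i];
    return k;
}
```

Passing an explicit length also keeps the append safe when a key fills the whole array and has no terminating NUL.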
Results:
llil4map consumes the least memory. The llil4hmap demonstration is similar to llil4map, but computes the hash_value one time per key. This was done out of curiosity. The llil4emh demonstration uses emhash7::HashMap for comparison.
$ NUM_THREADS=60 ./llil4map in/big* in/big* in/big* | cksum
llil4map (fixed string length=12) start
use OpenMP
use boost sort       g++          clang++      nvc++
get properties       7.864 secs   7.866 secs   7.973 secs
map to vector        1.140 secs   0.881 secs   0.876 secs
vector stable sort   1.681 secs   1.098 secs   1.105 secs
write stdout         0.624 secs   0.595 secs   0.590 secs
total time          11.311 secs  10.441 secs  10.546 secs
count lines         970195200
count unique        200483043
2057246516 1811140689
$ NUM_THREADS=60 ./llil4hmap in/big* in/big* in/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort       g++          clang++      nvc++
get properties       7.245 secs   7.197 secs   7.617 secs
map to vector        1.179 secs   0.841 secs   0.809 secs
vector stable sort   1.682 secs   1.103 secs   1.113 secs
write stdout         0.647 secs   0.573 secs   0.571 secs
total time          10.754 secs   9.715 secs  10.112 secs
count lines         970195200
count unique        200483043
2057246516 1811140689
$ NUM_THREADS=60 ./llil4emh in/big* in/big* in/big* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort       g++          clang++      nvc++
get properties       6.150 secs   6.250 secs   6.392 secs
map to vector        1.016 secs   0.877 secs   0.832 secs
vector stable sort   1.681 secs   1.127 secs   1.094 secs
write stdout         0.608 secs   0.561 secs   0.595 secs
total time           9.457 secs   8.817 secs   8.914 secs
count lines         970195200
count unique        200483043
2057246516 1811140689
Thank you for the tip to reclaim memory.
MyMap().swap(map); // swap map with an empty temporary, which is immediately destroyed
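The swap idiom in the line above can be tried in isolation. The sketch below uses `std::unordered_map` as a stand-in for the thread's `MyMap` (an assumption, since the actual alias is defined in the gist), and the helper name `release_memory` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Stand-in for the thread's MyMap; any container with swap() works the same.
using MyMap = std::unordered_map<std::string, std::size_t>;

// Release the map's storage by swapping it with an empty temporary.
// Returns the bucket count observed before the swap, for comparison.
inline std::size_t release_memory(MyMap& map) {
    const std::size_t buckets_before = map.bucket_count();
    // The temporary takes ownership of the old storage and is destroyed
    // at the end of this statement, which frees that storage.
    MyMap().swap(map);
    assert(map.empty());
    return buckets_before;
}
```

Unlike `clear()`, which is not required to give back capacity, the swap leaves `map` holding only what a freshly constructed container would hold.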
Thanks, interesting data, emhash seems to be pretty good. Personally, I'm not really focusing on benchmarks, but more on having the hash map perform well in most cases.
Previously, the program ran poorly using nvc++, particularly "write stdout". The spinlock_mutex sped up "get properties".
$ ./llil4hmap /data1/input/big* | cksum
llil4hmap (fixed string length=12) start
use OpenMP
use boost sort       Before       After
get properties       3.804 secs   2.797 secs
hmap to vector       0.697 secs   0.792 secs
vector stable sort   1.104 secs   1.148 secs
write stdout         5.071 secs   0.564 secs
total time          10.678 secs   5.302 secs
count lines         323398400
count unique        200483043
701308064 1804347429
That's great, thanks for doing that. Maybe I should add the spinlock_mutex to phmap so people can easily try it out?
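For context, a spinlock that can stand in wherever a `std::mutex`-style lock type is expected might look like the sketch below. This is not the spinlock_mutex from the gist: the class name `spinlock_mutex_sketch` and the demo function `count_with_spinlock` are hypothetical, and yielding while waiting is just one simple policy.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal spinlock. It provides lock/try_lock/unlock, so it
// works with std::lock_guard and similar wrappers.
class spinlock_mutex_sketch {
public:
    void lock() noexcept {
        // Spin until the flag was previously clear; yield between attempts.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
    }
    bool try_lock() noexcept {
        return !flag_.test_and_set(std::memory_order_acquire);
    }
    void unlock() noexcept { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Demo: several threads increment a shared counter under the spinlock.
inline long count_with_spinlock(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations >= 0);
    spinlock_mutex_sketch mtx;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<spinlock_mutex_sketch> guard(mtx);
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

A type with this interface can be supplied as the mutex template argument of phmap's parallel maps in place of `std::mutex`. Spinning tends to pay off when critical sections are very short, as with a single submap update, and can hurt when threads outnumber cores.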
I tried nvc++ (NVIDIA HPC SDK) and thought to pass this along. I ran git pull and have the latest from master. The llil4hmap.cc demonstration is found in my gist repo.
https://gist.github.com/marioroy/3924c48e140f8330f25f67cd98a815ef