Closed: kc9jud closed this issue 2 years ago
Some timing information:
Without packing:
reading file mfdn_MBgroups001 (1/89)
num_groups: 794363
size:794363 load_factor: 0.0059185107
elapsed time: 0.08 average time: 0.08
reading file mfdn_MBgroups002 (2/89)
num_groups: 794363
size:1588726 load_factor: 0.011837021
elapsed time: 0.12 average time: 0.10
reading file mfdn_MBgroups003 (3/89)
num_groups: 794363
size:2383089 load_factor: 0.017755533
elapsed time: 0.26 average time: 0.15
reading file mfdn_MBgroups004 (4/89)
num_groups: 794363
size:3177452 load_factor: 0.023674043
elapsed time: 0.43 average time: 0.22
reading file mfdn_MBgroups005 (5/89)
num_groups: 794363
size:3971815 load_factor: 0.029592553
elapsed time: 0.61 average time: 0.30
reading file mfdn_MBgroups006 (6/89)
num_groups: 794363
size:4766178 load_factor: 0.035511065
elapsed time: 0.79 average time: 0.38
reading file mfdn_MBgroups007 (7/89)
num_groups: 794363
size:5560541 load_factor: 0.041429576
elapsed time: 0.94 average time: 0.46
reading file mfdn_MBgroups008 (8/89)
num_groups: 794363
size:6354904 load_factor: 0.047348086
elapsed time: 1.11 average time: 0.54
reading file mfdn_MBgroups009 (9/89)
num_groups: 794363
size:7149267 load_factor: 0.053266596
elapsed time: 1.23 average time: 0.62
reading file mfdn_MBgroups010 (10/89)
num_groups: 794363
size:7943630 load_factor: 0.059185106
elapsed time: 1.36 average time: 0.69
reading file mfdn_MBgroups011 (11/89)
num_groups: 794363
size:8737993 load_factor: 0.06510362
elapsed time: 1.54 average time: 0.77
reading file mfdn_MBgroups012 (12/89)
num_groups: 794363
size:9532356 load_factor: 0.07102213
elapsed time: 1.66 average time: 0.84
[...]
With packing:
reading file mfdn_MBgroups001 (1/89)
num_groups: 794363
size:794363 load_factor: 0.0059185107
elapsed time: 0.11 average time: 0.11
reading file mfdn_MBgroups002 (2/89)
num_groups: 794363
size:1588726 load_factor: 0.011837021
elapsed time: 0.06 average time: 0.08
reading file mfdn_MBgroups003 (3/89)
num_groups: 794363
size:2383089 load_factor: 0.017755533
elapsed time: 0.06 average time: 0.07
reading file mfdn_MBgroups004 (4/89)
num_groups: 794363
size:3177452 load_factor: 0.023674043
elapsed time: 0.06 average time: 0.07
reading file mfdn_MBgroups005 (5/89)
num_groups: 794363
size:3971815 load_factor: 0.029592553
elapsed time: 0.05 average time: 0.07
reading file mfdn_MBgroups006 (6/89)
num_groups: 794363
size:4766178 load_factor: 0.035511065
elapsed time: 0.06 average time: 0.07
reading file mfdn_MBgroups007 (7/89)
num_groups: 794363
size:5560541 load_factor: 0.041429576
elapsed time: 0.05 average time: 0.06
reading file mfdn_MBgroups008 (8/89)
num_groups: 794363
size:6354904 load_factor: 0.047348086
elapsed time: 0.06 average time: 0.06
reading file mfdn_MBgroups009 (9/89)
num_groups: 794363
size:7149267 load_factor: 0.053266596
elapsed time: 0.06 average time: 0.06
reading file mfdn_MBgroups010 (10/89)
num_groups: 794363
size:7943630 load_factor: 0.059185106
elapsed time: 0.07 average time: 0.06
reading file mfdn_MBgroups011 (11/89)
num_groups: 794363
size:8737993 load_factor: 0.06510362
elapsed time: 0.06 average time: 0.06
reading file mfdn_MBgroups012 (12/89)
num_groups: 794363
size:9532356 load_factor: 0.07102213
elapsed time: 0.06 average time: 0.06
[...]
System info: 2x AMD EPYC 7763 (Milan), 128 total cores, 256 total threads, 512 GiB total DDR4 memory (https://docs.nersc.gov/systems/perlmutter/system_details/#cpu-only-compute-nodes)
Compiler:
> CC --version
g++ (GCC) 11.2.0 20210728 (Cray Inc.)
Hum, I suspect it is the hash function with your larger keys causing lots of collisions. It is critical that different keys mostly hash to different values. If you don't think it is the case, please provide a sample program reproducing the issue and I'll look at it.
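A quick way to sanity-check the collision hypothesis (a sketch only, assuming the keys have been collected into a std::vector<groupid_t> named keys and that the std::hash<groupid_t> specialization under test is in scope) is to count how many distinct hash values the key set produces:

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

// Rough collision check: with a good hash, the number of distinct hash values
// should stay very close to the number of distinct keys.
std::size_t count_distinct_hashes(const std::vector<groupid_t>& keys)
{
    std::unordered_set<std::size_t> seen;
    seen.reserve(keys.size());
    std::hash<groupid_t> hasher;
    for (const auto& k : keys)
        seen.insert(hasher(k));
    return seen.size();
}

If this returns a number much smaller than keys.size(), distinct keys are collapsing onto the same hash value and lookups fall back on expensive key comparisons.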
That seems to be most of it, yeah -- with this hash function
namespace std
{
    template<> struct hash<groupid_t>
    {
        std::size_t operator()(groupid_t const &g) const
        {
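            // Pack the successive 12-bit fields of the key into 64-bit words
            // (a field that straddles a word boundary is split across two words)
            // and combine the packed words with phmap::HashState().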
            return phmap::HashState().combine(0,
                (uint64_t(0xFFFu & g[0]))
                | (uint64_t(0xFFFu & g[1])<<12)
                | (uint64_t(0xFFFu & g[2])<<24)
                | (uint64_t(0xFFFu & g[3])<<36)
                | (uint64_t(0xFFFu & g[4])<<48)
                | (uint64_t(0xFFFu & g[5])<<60),
                (uint64_t(0xFFFu & g[5])>>4)
                | (uint64_t(0xFFFu & g[6])<<8)
                | (uint64_t(0xFFFu & g[7])<<20)
                | (uint64_t(0xFFFu & g[8])<<32)
                | (uint64_t(0xFFFu & g[9])<<44)
                | (uint64_t(0xFFFu & g[10])<<56),
                (uint64_t(0xFFFu & g[10])>>8)
                | (uint64_t(0xFFFu & g[11])<<4)
                | (uint64_t(0xFFFu & g[12])<<16)
                | (uint64_t(0xFFFu & g[13])<<28)
                | (uint64_t(0xFFFu & g[14])<<40)
                | (uint64_t(0xFFFu & g[15])<<52),
                (uint64_t(0xFFFu & g[16]))
                | (uint64_t(0xFFFu & g[17])<<12)
                | (uint64_t(0xFFFu & g[18])<<24)
                | (uint64_t(0xFFFu & g[19])<<36)
                | (uint64_t(0xFFFu & g[20])<<48)
            );
        }
    };
}
it works just as well as using the bit-packed key.
I guess that you can get away with a bad hash function for longer as long as you can vectorize the key comparison. If you have a bad hash function and the keys are larger than 32 bytes, the hash collisions become very visible because key comparison is expensive.
I also expect that with my old hash function, a sufficiently large map would start to show O(n) slowdowns -- it's just that, with a vectorized comparison, the coefficient on n is quite small.
Hum, can you try this as a hash function? Maybe there is a problem when using a tuple of small integers?
namespace std
{
    template<> struct hash<groupid_t>
    {
        std::size_t operator()(groupid_t const &g) const
        {
            const std::string_view bv{reinterpret_cast<const char*>(g.data()), sizeof(g)};
            return std::hash<std::string_view>()(bv);
        }
    };
}
> Hum, can you try this as a hash function? Maybe there is a problem when using a tuple of small integers?
Yes, that hash function works just as well as my bit-shift monstrosity. Something does seem off with a tuple of small integers.
I believe I fixed the issue. It should now work correctly with your original hash function.
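The original specialization isn't shown in the thread, but a straightforward hash of that kind (a sketch, assuming groupid_t is the std::array<uint16_t,20> key described in the issue and that parallel_hashmap/phmap_utils.h is available) might simply combine the elements one at a time:

#include <array>
#include <cstddef>
#include <cstdint>
#include <parallel_hashmap/phmap_utils.h>

using groupid_t = std::array<uint16_t, 20>;  // hypothetical; matches the key described below

namespace std
{
    template<> struct hash<groupid_t>
    {
        std::size_t operator()(groupid_t const &g) const
        {
            // Fold each small-integer field through phmap::HashState
            // instead of hand-packing bits into 64-bit words.
            std::size_t seed = 0;
            for (auto v : g)
                seed = phmap::HashState().combine(seed, v);
            return seed;
        }
    };
}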
I'm trying to build a large parallel_flat_hash_map and I'm noticing a distinct change in behavior when the key becomes larger than 32 bytes. My naive key would be std::array<uint16_t,20>, where I've defined types and a hash function [...]. My map typedef is [...] and I'm using lazy_emplace_l (where I know as a precondition that my initial keys are unique).

If I insert ~50M elements, I notice that the time it takes to insert an element rises by about a factor of 100 between when the map is empty and when the map is "full", and rises approximately linearly with the number of elements already in the map. (Note: I know the initial number of elements, so I am doing groupid_map.reserve() and I can see that the map never gets much above ~50% load_factor.)
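As a point of reference, a minimal sketch of that setup (with hypothetical type and variable names, and assuming a std::hash<groupid_t> specialization like the ones shown earlier in this thread is visible to the map's default hasher) might look like:

#include <array>
#include <cstdint>
#include <vector>
#include <parallel_hashmap/phmap.h>

using groupid_t = std::array<uint16_t, 20>;  // hypothetical 40-byte "naive" key
using groupid_map_t = phmap::parallel_flat_hash_map<groupid_t, uint64_t>;

void build_map(groupid_map_t& groupid_map, const std::vector<groupid_t>& groups)
{
    // The element count is known up front, so reserve once; the load factor
    // then stays around 50% for the whole build.
    groupid_map.reserve(groups.size());
    for (uint64_t index = 0; index < groups.size(); ++index)
    {
        // The initial keys are unique, so the "key already present" lambda is a no-op;
        // lazy_emplace_l only constructs the value when the key is absent.
        groupid_map.lazy_emplace_l(
            groups[index],
            [](groupid_map_t::value_type&) {},
            [&](const groupid_map_t::constructor& ctor) { ctor(groups[index], index); });
    }
}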
If I run perf, I see that over 50% of the CPU time is spent in __memcmp_avx2_movbe, which I believe is glibc's implementation of std::memcmp and is probably used by std::equal_to for standard-layout types like std::array.

However, my keys are actually really only 12-bit integers, so if I redefine groupid_t and do some fancy bit-packing [...], I get no slowdown as the map increases in size.
I've also tested using std::array<uint16_t,16> as my key type, and I see "good" behavior, i.e. no linear slowdown. However, as soon as my key goes over 32 bytes/256 bits, insertion/emplacement slows down dramatically. I am guessing that this has to do with the fact that, if sizeof(key) <= 32, a bunch of comparisons can be done using AVX2 instructions, but as soon as the key gets even one byte larger, the comparison changes to something non-vectorized.

1) Is this a known issue/limitation? I didn't see anything here like this.
2) I would expect that having big keys would cause the comparisons to be slower, but I don't see why the insertion time should grow like O(n).
For now, I can just bit-pack my keys, but it would be nice to know if this is a fundamental limitation or just a weird bug/edge case.