greg7mdp / parallel-hashmap

A family of header-only, very fast and memory-friendly hashmap and btree containers.
https://greg7mdp.github.io/parallel-hashmap/
Apache License 2.0

linear slowdown with keys larger than 32 bytes #158

Closed: kc9jud closed this issue 2 years ago

kc9jud commented 2 years ago

I'm trying to build a large parallel_flat_hash_map and I'm noticing a distinct change in behavior when the key becomes larger than 32 bytes. My naive key type is std::array<uint16_t,21>, for which I've defined the following aliases and hash function

#include <array>
#include <cstdint>
#include <tuple>
#include <parallel_hashmap/phmap.h>

#ifndef MAX_A
#define MAX_A 21
#endif

using groupid_t = std::array<uint16_t,MAX_A>;
using value_t = uint32_t;

namespace std
{
  // Hash the array by converting it to a tuple and using phmap's tuple hash.
  template<> struct hash<groupid_t>
  {
    std::size_t operator()(groupid_t const &g) const
    {
      return phmap::Hash<decltype(std::tuple_cat(g))>()(std::tuple_cat(g));
    }
  };
}

My map typedef is

using hash_map_t = phmap::parallel_flat_hash_map<
    groupid_t, value_t,
    phmap::priv::hash_default_hash<groupid_t>,
    phmap::priv::hash_default_eq<groupid_t>,
    phmap::priv::Allocator<std::pair<const groupid_t, value_t>>, // alias for std::allocator
    10,         // 2^10 submaps
    std::mutex  // one mutex per submap
  >;

and I'm using lazy_emplace_l (where I know as a precondition that my initial keys are unique)

  groupid_map.lazy_emplace_l(
      groupid,
      [](hash_map_t::value_type&) {},                                   // key already present: nothing to do
      [&](const hash_map_t::constructor& ctor) { ctor(groupid, dim); }  // key absent: construct in place
    );

If I insert ~50M elements, I notice that the time it takes to insert an element rises by about a factor of 100 between when the map is empty and when the map is "full", and rises approximately linearly with the number of elements already in the map. (Note: I know the initial number of elements, so I am doing groupid_map.reserve() and I can see that the map never gets much above ~50% load_factor.) If I run perf, I see that over 50% of the CPU time is spent in __memcmp_avx2_movbe, which I believe is glibc's implementation of std::memcmp and is probably used by std::equal_to for standard-layout types like std::array.
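
For reference, a stripped-down sketch of the kind of loop I'm timing (with synthetic random keys standing in for the real file input, so the key generation and constants here are illustrative; it assumes the definitions above are in scope):

#include <array>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <random>

// Stripped-down timing loop (illustrative). Synthetic random keys stand in
// for the real file input; groupid_t, value_t and hash_map_t are the
// definitions shown above.
int main()
{
    constexpr std::size_t kTotal = 50'000'000;
    constexpr std::size_t kBatch = 1'000'000;

    hash_map_t groupid_map;
    groupid_map.reserve(kTotal);   // initial number of elements is known

    std::mt19937_64 rng(12345);
    for (std::size_t batch = 0; batch * kBatch < kTotal; ++batch)
    {
        auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < kBatch; ++i)
        {
            groupid_t groupid;
            for (auto &e : groupid) e = uint16_t(rng() & 0xFFFu);   // 12-bit entries
            value_t dim = value_t(rng());
            groupid_map.lazy_emplace_l(
                groupid,
                [](hash_map_t::value_type&) {},
                [&](const hash_map_t::constructor& ctor) { ctor(groupid, dim); });
        }
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
        std::printf("batch %zu  size %zu  load_factor %f  elapsed %.2f s\n",
                    batch + 1, groupid_map.size(),
                    double(groupid_map.load_factor()), dt.count());
    }
}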

However, the elements of my key are really only 12-bit integers, so if I redefine groupid_t and do some fancy bit-packing

using groupid_t = std::array<uint64_t,4>;

// Pack the 21 12-bit values into four 64-bit words (252 bits total).
inline groupid_t ConvertGroupID(const std::array<uint16_t,MAX_A>& g)
{
    groupid_t pg{};
    pg[0] = (uint64_t(0xFFFu & g[0]))
            | (uint64_t(0xFFFu & g[1])<<12)
            | (uint64_t(0xFFFu & g[2])<<24)
            | (uint64_t(0xFFFu & g[3])<<36)
            | (uint64_t(0xFFFu & g[4])<<48)
            | (uint64_t(0xFFFu & g[5])<<60);
    pg[1] = (uint64_t(0xFFFu & g[5])>>4)
            | (uint64_t(0xFFFu & g[6])<<8)
            | (uint64_t(0xFFFu & g[7])<<20)
            | (uint64_t(0xFFFu & g[8])<<32)
            | (uint64_t(0xFFFu & g[9])<<44)
            | (uint64_t(0xFFFu & g[10])<<56);
    pg[2] = (uint64_t(0xFFFu & g[10])>>8)
            | (uint64_t(0xFFFu & g[11])<<4)
            | (uint64_t(0xFFFu & g[12])<<16)
            | (uint64_t(0xFFFu & g[13])<<28)
            | (uint64_t(0xFFFu & g[14])<<40)
            | (uint64_t(0xFFFu & g[15])<<52);
    pg[3] = (uint64_t(0xFFFu & g[16]))
            | (uint64_t(0xFFFu & g[17])<<12)
            | (uint64_t(0xFFFu & g[18])<<24)
            | (uint64_t(0xFFFu & g[19])<<36)
            | (uint64_t(0xFFFu & g[20])<<48);

    return pg;
}
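
As an aside, a round-trip check along these lines (purely illustrative, not part of my actual code) shows how the 12-bit fields land in the 64-bit words:

// Extract the i-th 12-bit field back out of the packed representation,
// treating pg as a little-endian 256-bit integer with field i at bit 12*i.
inline uint16_t ExtractField(const groupid_t& pg, std::size_t i)
{
    const std::size_t bit  = 12 * i;
    const std::size_t word = bit / 64;
    const std::size_t off  = bit % 64;
    uint64_t v = pg[word] >> off;
    if (off > 64 - 12 && word + 1 < pg.size())   // field straddles a word boundary
        v |= pg[word + 1] << (64 - off);
    return uint16_t(v & 0xFFFu);
}

// Sanity check: packing followed by extraction gives back the low 12 bits of
// every original element.
inline bool PackingRoundTrips(const std::array<uint16_t,MAX_A>& g)
{
    const groupid_t pg = ConvertGroupID(g);
    for (std::size_t i = 0; i < MAX_A; ++i)
        if (ExtractField(pg, i) != (g[i] & 0xFFFu))
            return false;
    return true;
}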

I get no slowdown as the map increases in size.

I've also tested using std::array<uint16_t,16> as my key type, and I see "good" behavior, i.e. no linear slowdown. However, as soon as my key goes over 32-bytes/256-bits, insertion/emplacement slows down dramatically.

I am guessing that this has to do with the fact that, if sizeof(key)<=32, a bunch of comparisons can be done using AVX2 instructions, but as soon as the key gets even one byte larger, the comparison changes to something non-vectorized.

1. Is this a known issue/limitation? I didn't see anything like this here.
2. I would expect that having big keys would cause the comparisons to be slower, but I don't see why the insertion time should grow like O(n).

For now, I can just bit-pack my keys, but it would be nice to know if this is a fundamental limitation or just a weird bug/edge case.

kc9jud commented 2 years ago

Some timing information:

Without packing:

reading file mfdn_MBgroups001 (1/89)
  num_groups: 794363
  size:794363  load_factor: 0.0059185107
elapsed time:     0.08  average time:     0.08

reading file mfdn_MBgroups002 (2/89)
  num_groups: 794363
  size:1588726  load_factor: 0.011837021
elapsed time:     0.12  average time:     0.10

reading file mfdn_MBgroups003 (3/89)
  num_groups: 794363
  size:2383089  load_factor: 0.017755533
elapsed time:     0.26  average time:     0.15

reading file mfdn_MBgroups004 (4/89)
  num_groups: 794363
  size:3177452  load_factor: 0.023674043
elapsed time:     0.43  average time:     0.22

reading file mfdn_MBgroups005 (5/89)
  num_groups: 794363
  size:3971815  load_factor: 0.029592553
elapsed time:     0.61  average time:     0.30

reading file mfdn_MBgroups006 (6/89)
  num_groups: 794363
  size:4766178  load_factor: 0.035511065
elapsed time:     0.79  average time:     0.38

reading file mfdn_MBgroups007 (7/89)
  num_groups: 794363
  size:5560541  load_factor: 0.041429576
elapsed time:     0.94  average time:     0.46

reading file mfdn_MBgroups008 (8/89)
  num_groups: 794363
  size:6354904  load_factor: 0.047348086
elapsed time:     1.11  average time:     0.54

reading file mfdn_MBgroups009 (9/89)
  num_groups: 794363
  size:7149267  load_factor: 0.053266596
elapsed time:     1.23  average time:     0.62

reading file mfdn_MBgroups010 (10/89)
  num_groups: 794363
  size:7943630  load_factor: 0.059185106
elapsed time:     1.36  average time:     0.69

reading file mfdn_MBgroups011 (11/89)
  num_groups: 794363
  size:8737993  load_factor: 0.06510362
elapsed time:     1.54  average time:     0.77

reading file mfdn_MBgroups012 (12/89)
  num_groups: 794363
  size:9532356  load_factor: 0.07102213
elapsed time:     1.66  average time:     0.84
[...]

with packing:

reading file mfdn_MBgroups001 (1/89)
  num_groups: 794363
  size:794363  load_factor: 0.0059185107
elapsed time:     0.11  average time:     0.11

reading file mfdn_MBgroups002 (2/89)
  num_groups: 794363
  size:1588726  load_factor: 0.011837021
elapsed time:     0.06  average time:     0.08

reading file mfdn_MBgroups003 (3/89)
  num_groups: 794363
  size:2383089  load_factor: 0.017755533
elapsed time:     0.06  average time:     0.07

reading file mfdn_MBgroups004 (4/89)
  num_groups: 794363
  size:3177452  load_factor: 0.023674043
elapsed time:     0.06  average time:     0.07

reading file mfdn_MBgroups005 (5/89)
  num_groups: 794363
  size:3971815  load_factor: 0.029592553
elapsed time:     0.05  average time:     0.07

reading file mfdn_MBgroups006 (6/89)
  num_groups: 794363
  size:4766178  load_factor: 0.035511065
elapsed time:     0.06  average time:     0.07

reading file mfdn_MBgroups007 (7/89)
  num_groups: 794363
  size:5560541  load_factor: 0.041429576
elapsed time:     0.05  average time:     0.06

reading file mfdn_MBgroups008 (8/89)
  num_groups: 794363
  size:6354904  load_factor: 0.047348086
elapsed time:     0.06  average time:     0.06

reading file mfdn_MBgroups009 (9/89)
  num_groups: 794363
  size:7149267  load_factor: 0.053266596
elapsed time:     0.06  average time:     0.06

reading file mfdn_MBgroups010 (10/89)
  num_groups: 794363
  size:7943630  load_factor: 0.059185106
elapsed time:     0.07  average time:     0.06

reading file mfdn_MBgroups011 (11/89)
  num_groups: 794363
  size:8737993  load_factor: 0.06510362
elapsed time:     0.06  average time:     0.06

reading file mfdn_MBgroups012 (12/89)
  num_groups: 794363
  size:9532356  load_factor: 0.07102213
elapsed time:     0.06  average time:     0.06
[...]

System info: 2x AMD EPYC 7763 (Milan), 128 total cores, 256 total threads, 512 GiB total DDR4 memory (https://docs.nersc.gov/systems/perlmutter/system_details/#cpu-only-compute-nodes)

Compiler:

> CC --version
g++ (GCC) 11.2.0 20210728 (Cray Inc.)

greg7mdp commented 2 years ago

Hum, I suspect it is the hash function for your larger keys that is causing lots of collisions. It is critical that different keys mostly hash to different values. If you don't think that is the case, please provide a sample program reproducing the issue and I'll look at it.
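
For example, a quick check along these lines (just a sketch) would show whether your keys mostly hash to distinct values:

#include <cstddef>
#include <cstdio>
#include <vector>
#include <parallel_hashmap/phmap.h>

// Sketch: hash a sample of real keys and count distinct hash values.
// With a good hash and N distinct keys you should see close to N distinct hashes.
void CheckHashQuality(const std::vector<groupid_t>& sample_keys)
{
    phmap::flat_hash_set<std::size_t> hashes;
    for (const auto& k : sample_keys)
        hashes.insert(std::hash<groupid_t>()(k));
    std::printf("keys sampled: %zu, distinct hashes: %zu\n",
                sample_keys.size(), hashes.size());
}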

kc9jud commented 2 years ago

That seems to be most of it, yeah -- with this hash function

namespace std
{
  template<> struct hash<groupid_t>
  {
    std::size_t operator()(groupid_t const &g) const
    {
      // Combine the four bit-packed 64-bit words in a single HashState call.
      return phmap::HashState().combine(0,
          (uint64_t(0xFFFu & g[0]))
            | (uint64_t(0xFFFu & g[1])<<12)
            | (uint64_t(0xFFFu & g[2])<<24)
            | (uint64_t(0xFFFu & g[3])<<36)
            | (uint64_t(0xFFFu & g[4])<<48)
            | (uint64_t(0xFFFu & g[5])<<60),
          (uint64_t(0xFFFu & g[5])>>4)
            | (uint64_t(0xFFFu & g[6])<<8)
            | (uint64_t(0xFFFu & g[7])<<20)
            | (uint64_t(0xFFFu & g[8])<<32)
            | (uint64_t(0xFFFu & g[9])<<44)
            | (uint64_t(0xFFFu & g[10])<<56),
          (uint64_t(0xFFFu & g[10])>>8)
            | (uint64_t(0xFFFu & g[11])<<4)
            | (uint64_t(0xFFFu & g[12])<<16)
            | (uint64_t(0xFFFu & g[13])<<28)
            | (uint64_t(0xFFFu & g[14])<<40)
            | (uint64_t(0xFFFu & g[15])<<52),
          (uint64_t(0xFFFu & g[16]))
            | (uint64_t(0xFFFu & g[17])<<12)
            | (uint64_t(0xFFFu & g[18])<<24)
            | (uint64_t(0xFFFu & g[19])<<36)
            | (uint64_t(0xFFFu & g[20])<<48)
        );
    }
  };
}

it works just as well as using the bit-packed key.

I guess you can get away with a bad hash function for longer (i.e., at larger map sizes) as long as the key comparison is vectorized. If you have a bad hash function and the keys are larger than 32 bytes, the hash collisions become very visible because each key comparison is expensive.

I also expect that with my old hash function, a sufficiently large map would start to show O(n) slowdowns -- it's just that, with a vectorized comparison, the coefficient on n is quite small.

greg7mdp commented 2 years ago

Hum, can you try this as a hash function? Maybe there is a problem when using a tuple of small integers?

namespace std
{
  template<> struct hash<groupid_t>
  {
    std::size_t operator()(groupid_t const &g) const
    {
        // Hash the raw bytes of the array via std::string_view.
        const std::string_view bv{reinterpret_cast<const char*>(g.data()), sizeof(g)};
        return std::hash<std::string_view>()(bv);
    }
  };
}

kc9jud commented 2 years ago

> Hum, can you try this as a hash function? Maybe there is a problem when using a tuple of small integers?

Yes, that hash function works just as well as my bit-shift monstrosity. Something does seem off with a tuple of small integers.
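
For what it's worth, this is roughly how I'd probe the two hash variants side by side on the same key type (purely illustrative, not part of my actual code; HashAsTuple / HashAsBytes are made-up names, and MAX_A is the macro defined above):

#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string_view>
#include <tuple>
#include <parallel_hashmap/phmap.h>

using raw_key_t = std::array<uint16_t, MAX_A>;   // the original 21-element key

// Original approach: hash the array as a tuple of small integers.
std::size_t HashAsTuple(const raw_key_t& g)
{
    return phmap::Hash<decltype(std::tuple_cat(g))>()(std::tuple_cat(g));
}

// Suggested approach: hash the raw bytes of the array.
std::size_t HashAsBytes(const raw_key_t& g)
{
    const std::string_view bv{reinterpret_cast<const char*>(g.data()), sizeof(g)};
    return std::hash<std::string_view>()(bv);
}

int main()
{
    raw_key_t a{}, b{};
    b[MAX_A - 1] = 1;   // two keys differing only in the last element
    std::printf("tuple hash: %zx vs %zx\n", HashAsTuple(a), HashAsTuple(b));
    std::printf("byte  hash: %zx vs %zx\n", HashAsBytes(a), HashAsBytes(b));
}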

greg7mdp commented 2 years ago

I believe I fixed the issue. It should now work correctly with your original hash function.