ktprime / emhash

Fast and memory efficient c++ flat hash map/set
MIT License
435 stars 30 forks source link

Best way to have multiple same keys? #31

Closed ghost closed 1 year ago

ghost commented 1 year ago

Hi, I currently define a hash which may have multiple same values for the same u64 key. I use emhash6 but will benchmark the others as well.

Currently using a vector as the value but it means that we have another indirection.

The issue is that I am not sure about the collisions but I need yo handle them anyway. The alternative is not to cache the value but it may become annoying.

I can't use an array because I have no clue of all the collisions.

Is using a vector the only way to solve this? Thanks

ktprime commented 1 year ago

emhash use std::hash as default hash which is not a good hash function if key is integer in gcc. you can try use the other hash function.

but emhash also offers some good hash function by set compile marco EMH_INT_HASH = 1-6 if key is intege


#if EMH_INT_HASH
    static constexpr uint64_t KC = UINT64_C(11400714819323198485);
    inline static uint64_t hash64(uint64_t key)
    {
#if __SIZEOF_INT128__ && EMH_INT_HASH == 1
        __uint128_t r = key; r *= KC;
        return (uint64_t)(r >> 64) + (uint64_t)r;
#elif EMH_INT_HASH == 2
        //MurmurHash3Mixer
        uint64_t h = key;
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccd;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53;
        h ^= h >> 33;
        return h;
#elif _WIN64 && EMH_INT_HASH == 1
        uint64_t high;
        return _umul128(key, KC, &high) + high;
#elif EMH_INT_HASH == 3
        auto ror  = (key >> 32) | (key << 32);
        auto low  = key * 0xA24BAED4963EE407ull;
        auto high = ror * 0x9FB21C651E98DF25ull;
        auto mix  = low + high;
        return mix;
#elif EMH_INT_HASH == 1
        uint64_t r = key * UINT64_C(0xca4bcaa75ec3f625);
        return (r >> 32) + r;
#elif EMH_WYHASH64
        return wyhash64(key, KC);
#else
        uint64_t x = key;
        x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
        x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
        x = x ^ (x >> 31);
        return x;
#endif
    }
#endif

    template<typename UType, typename std::enable_if<std::is_integral<UType>::value, size_type>::type = 0>
    inline size_type hash_key(const UType key) const
    {
#if EMH_INT_HASH
        return hash64(key);
#elif EMH_SAFE_HASH
        return _hash_inter == 0 ? _hasher(key) : hash64(key);
#elif EMH_IDENTITY_HASH
        return key + (key >> 24);
#else
        return (size_type)_hasher(key);
#endif
    }

most hashmap don't supports multiple same keys like std::unordered_multimap. you can try change emhash's code to implement your ideal, it's not a hard work. I have no any good ideal for this issiue beacuse of no any collisions information.

ktprime commented 1 year ago

by set EMH_STATIS=10000 in emhash7, it'll dump detail collision for dubug if rehash happens.

/****************
    under random hashCodes, the frequency of nodes in bins follows a Poisson
distribution(http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5
on average for the default resizing threshold of 0.75, although with a large variance because
of resizing granularity. Ignoring variance, the expected occurrences of list size k are
(exp(-0.5) * pow(0.5, k)/factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006

  ============== buckets size ration ========
    1   1543981  0.36884964|0.36787944  36.885
    2    768655  0.36725597|0.36787944  73.611
    3    256236  0.18364065|0.18393972  91.975
    4     64126  0.06127757|0.06131324  98.102
    5     12907  0.01541710|0.01532831  99.644
    6      2050  0.00293841|0.00306566  99.938
    7       310  0.00051840|0.00051094  99.990
    8        49  0.00009365|0.00007299  99.999
    9         4  0.00000860|0.00000913  100.000
========== collision miss ration ===========
  _num_filled aver_size k.v size_kv = 4185936, 1.58, x.x 24
  collision,possion,cache_miss hit_find|hit_miss, load_factor = 36.73%,36.74%,31.31%  1.50|2.00, 1.00
============== buckets size ration ========
*******************************************************/
ktprime commented 1 year ago

emhash7 dump collision

============== buckets size ration ========
   1    753769  0.44928123|-0.01%  44.928
   2    302722  0.36087266|0.39%  81.015
   3     79659  0.14244136|-0.93%  95.260
   4     16102  0.03839015|0.12%  99.099
   5      2567  0.00765025|-0.24%  99.864
   6       341  0.00121951|-0.61%  99.986
   7        29  0.00012100|-26.04%  99.998
   8         5  0.00002384|27.52%  100.000
========== collision miss ration ===========
   0    190118  36.38  36.38
   1    128187  24.53  60.92
   2     99678  19.08  79.99
   3     40340  7.72  87.71
   4     20921  4.00  91.72
   5     12613  2.41  94.13
   6      8049  1.54  95.67
   7      5508  1.05  96.72
   8      3930  0.75  97.48
   9      2808  0.54  98.01
  10      1999  0.38  98.40
  11      1596  0.31  98.70
  12      1180  0.23  98.93
  13       936  0.18  99.11
  14       691  0.13  99.24
  15       497  0.10  99.33
  16       238  0.05  99.38
  17        79  0.02  99.40
  18        57  0.01  99.41
  19        37  0.01  99.41
  20        19  0.00  99.42
  21        19  0.00  99.42
  23        12  0.00  99.42
  127      2963  0.57  100.00
  _num_filled aver_size k.v size_kv = 1677722, 1.45, x.i 16
  collision, poisson, cache_miss hit_find|hit_miss, load_factor = 31.15%,31.17%,19.81% 1.40|1.64, 0.80
ghost commented 1 year ago

Thanks a lot, I will try to find a better way to avoid the vectors. The collisions right now are unpredictable unfortunately..