Closed ghost closed 1 year ago
emhash use std::hash as default hash which is not a good hash function if key is integer in gcc. you can try use the other hash function.
but emhash also offers some good hash function by set compile marco EMH_INT_HASH = 1-6 if key is intege
#if EMH_INT_HASH
static constexpr uint64_t KC = UINT64_C(11400714819323198485);
inline static uint64_t hash64(uint64_t key)
{
#if __SIZEOF_INT128__ && EMH_INT_HASH == 1
__uint128_t r = key; r *= KC;
return (uint64_t)(r >> 64) + (uint64_t)r;
#elif EMH_INT_HASH == 2
//MurmurHash3Mixer
uint64_t h = key;
h ^= h >> 33;
h *= 0xff51afd7ed558ccd;
h ^= h >> 33;
h *= 0xc4ceb9fe1a85ec53;
h ^= h >> 33;
return h;
#elif _WIN64 && EMH_INT_HASH == 1
uint64_t high;
return _umul128(key, KC, &high) + high;
#elif EMH_INT_HASH == 3
auto ror = (key >> 32) | (key << 32);
auto low = key * 0xA24BAED4963EE407ull;
auto high = ror * 0x9FB21C651E98DF25ull;
auto mix = low + high;
return mix;
#elif EMH_INT_HASH == 1
uint64_t r = key * UINT64_C(0xca4bcaa75ec3f625);
return (r >> 32) + r;
#elif EMH_WYHASH64
return wyhash64(key, KC);
#else
uint64_t x = key;
x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
x = x ^ (x >> 31);
return x;
#endif
}
#endif
template<typename UType, typename std::enable_if<std::is_integral<UType>::value, size_type>::type = 0>
inline size_type hash_key(const UType key) const
{
#if EMH_INT_HASH
return hash64(key);
#elif EMH_SAFE_HASH
return _hash_inter == 0 ? _hasher(key) : hash64(key);
#elif EMH_IDENTITY_HASH
return key + (key >> 24);
#else
return (size_type)_hasher(key);
#endif
}
most hashmap don't supports multiple same keys like std::unordered_multimap. you can try change emhash's code to implement your ideal, it's not a hard work. I have no any good ideal for this issiue beacuse of no any collisions information.
by set EMH_STATIS=10000 in emhash7, it'll dump detail collision for dubug if rehash happens.
/****************
under random hashCodes, the frequency of nodes in bins follows a Poisson
distribution(http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5
on average for the default resizing threshold of 0.75, although with a large variance because
of resizing granularity. Ignoring variance, the expected occurrences of list size k are
(exp(-0.5) * pow(0.5, k)/factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
============== buckets size ration ========
1 1543981 0.36884964|0.36787944 36.885
2 768655 0.36725597|0.36787944 73.611
3 256236 0.18364065|0.18393972 91.975
4 64126 0.06127757|0.06131324 98.102
5 12907 0.01541710|0.01532831 99.644
6 2050 0.00293841|0.00306566 99.938
7 310 0.00051840|0.00051094 99.990
8 49 0.00009365|0.00007299 99.999
9 4 0.00000860|0.00000913 100.000
========== collision miss ration ===========
_num_filled aver_size k.v size_kv = 4185936, 1.58, x.x 24
collision,possion,cache_miss hit_find|hit_miss, load_factor = 36.73%,36.74%,31.31% 1.50|2.00, 1.00
============== buckets size ration ========
*******************************************************/
emhash7 dump collision
============== buckets size ration ========
1 753769 0.44928123|-0.01% 44.928
2 302722 0.36087266|0.39% 81.015
3 79659 0.14244136|-0.93% 95.260
4 16102 0.03839015|0.12% 99.099
5 2567 0.00765025|-0.24% 99.864
6 341 0.00121951|-0.61% 99.986
7 29 0.00012100|-26.04% 99.998
8 5 0.00002384|27.52% 100.000
========== collision miss ration ===========
0 190118 36.38 36.38
1 128187 24.53 60.92
2 99678 19.08 79.99
3 40340 7.72 87.71
4 20921 4.00 91.72
5 12613 2.41 94.13
6 8049 1.54 95.67
7 5508 1.05 96.72
8 3930 0.75 97.48
9 2808 0.54 98.01
10 1999 0.38 98.40
11 1596 0.31 98.70
12 1180 0.23 98.93
13 936 0.18 99.11
14 691 0.13 99.24
15 497 0.10 99.33
16 238 0.05 99.38
17 79 0.02 99.40
18 57 0.01 99.41
19 37 0.01 99.41
20 19 0.00 99.42
21 19 0.00 99.42
23 12 0.00 99.42
127 2963 0.57 100.00
_num_filled aver_size k.v size_kv = 1677722, 1.45, x.i 16
collision, poisson, cache_miss hit_find|hit_miss, load_factor = 31.15%,31.17%,19.81% 1.40|1.64, 0.80
Thanks a lot, I will try to find a better way to avoid the vectors. The collisions right now are unpredictable unfortunately..
Hi, I currently define a hash which may have multiple same values for the same u64 key. I use emhash6 but will benchmark the others as well.
Currently using a vector as the value but it means that we have another indirection.
The issue is that I am not sure about the collisions but I need yo handle them anyway. The alternative is not to cache the value but it may become annoying.
I can't use an array because I have no clue of all the collisions.
Is using a vector the only way to solve this? Thanks