ktprime / emhash

Fast and memory efficient c++ flat hash map/set
MIT License
435 stars 30 forks source link

I doubled string hash map speed #7

Closed wangyi-fudan closed 3 years ago

wangyi-fudan commented 3 years ago

Hi, I found a simple trade off to double string hash map speed. The idea is to store hash(key) as surrogate key and drop the real key. Because the basic function of hashmap is to insert and query, we may not need the real key. And if the number of items is not larger than 1e9, we are sure there will be no collision in surrogate key. Just to discuss it here.

template<class  Key,    class   Value,  class   Hasher>
class   wymap{
private:
    uint64_t    mod;
public:
    std::vector<std::pair<uint64_t,Value>   >   data;
    void    resize(size_t   Bits){  mod=(1ull<<Bits)-1; std::vector<std::pair<uint64_t,Value>   >().swap(data); data.resize(mod+1); }
    inline  Value&  operator[](const    Key &k){
        Hasher  hasher; uint64_t    h=hasher(k),    c=h&mod;
        for(;;c=(c+1)&mod)  if(data[c].first==h||!data[c].first){   data[c].first=h;    return  data[c].second; }

    }
};
ktprime commented 3 years ago

And if the number of items is not larger than 1e9, we are sure there will be no collision in surrogate key, I don't know what's your mean. I have do a calculate hash collosion ration

under random hashCodes, the frequency of nodes in bins follows a Poisson distribution(http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k)/factorial(k)). The first values are: 0: 0.60653066 1: 0.30326533 2: 0.07581633 3: 0.01263606 4: 0.00157952 5: 0.00015795 6: 0.00001316 7: 0.00000094 8: 0.00000006

the follow is calculated by my hash map function(dump_statis) with load_factor = 1.0 , the result is same as Poisson distribution.

  ============== buckets size ration ========
    1   1543981  0.36884964|0.36787944  36.885
    2    768655  0.36725597|0.36787944  73.611
    3    256236  0.18364065|0.18393972  91.975
    4     64126  0.06127757|0.06131324  98.102
    5     12907  0.01541710|0.01532831  99.644
    6      2050  0.00293841|0.00306566  99.938
    7       310  0.00051840|0.00051094  99.990
    8        49  0.00009365|0.00007299  99.999
    9         4  0.00000860|0.00000913  100.000
========== collision miss ration ===========

_num_filled aver_size k.v size_kv = 4185936, 1.58, x.x 24 collision,possion,cache_miss hit_find|hit_miss, load_factor = 36.73%,36.74%,31.31% 1.50|2.00, 1.00 ============== buckets size ration ========