Closed wangyi-fudan closed 3 years ago
And if the number of items is not larger than 1e9, we are sure there will be no collision in surrogate key, I don't know what's your mean. I have do a calculate hash collosion ration
under random hashCodes, the frequency of nodes in bins follows a Poisson distribution(http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) * pow(0.5, k)/factorial(k)). The first values are: 0: 0.60653066 1: 0.30326533 2: 0.07581633 3: 0.01263606 4: 0.00157952 5: 0.00015795 6: 0.00001316 7: 0.00000094 8: 0.00000006
the follow is calculated by my hash map function(dump_statis) with load_factor = 1.0 , the result is same as Poisson distribution.
============== buckets size ration ========
1 1543981 0.36884964|0.36787944 36.885
2 768655 0.36725597|0.36787944 73.611
3 256236 0.18364065|0.18393972 91.975
4 64126 0.06127757|0.06131324 98.102
5 12907 0.01541710|0.01532831 99.644
6 2050 0.00293841|0.00306566 99.938
7 310 0.00051840|0.00051094 99.990
8 49 0.00009365|0.00007299 99.999
9 4 0.00000860|0.00000913 100.000
========== collision miss ration ===========
_num_filled aver_size k.v size_kv = 4185936, 1.58, x.x 24 collision,possion,cache_miss hit_find|hit_miss, load_factor = 36.73%,36.74%,31.31% 1.50|2.00, 1.00 ============== buckets size ration ========
Hi, I found a simple trade off to double string hash map speed. The idea is to store hash(key) as surrogate key and drop the real key. Because the basic function of hashmap is to insert and query, we may not need the real key. And if the number of items is not larger than 1e9, we are sure there will be no collision in surrogate key. Just to discuss it here.