martinus / robin-hood-hashing

Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20
https://gitter.im/martinus/robin-hood-hashing
MIT License
1.5k stars 142 forks source link

Potentially critical bug in Robin Hood invariant checking #164

Closed thecppzoo closed 1 year ago

thecppzoo commented 1 year ago

Hello, my friend @scottbruceheart and I studied this code:

       do {
            // unrolling this twice gives a bit of a speedup. More unrolling did not help.
            if (info == mInfo[idx] &&
                ROBIN_HOOD_LIKELY(WKeyEqual::operator()(key, mKeyVals[idx].getFirst()))) {
                return idx;
            }
            next(&info, &idx);
            if (info == mInfo[idx] &&
                ROBIN_HOOD_LIKELY(WKeyEqual::operator()(key, mKeyVals[idx].getFirst()))) {
                return idx;
            }
            next(&info, &idx);
        } while (info <= mInfo[idx]);

from https://github.com/martinus/robin-hood-hashing/blob/master/src/include/robin_hood.h#L1424-L1436 Because both info, from the "needle" under search, and mInfo[idx] from the haystack being searched are made of both hoisted hash bits and probe-sequence-length bits, the termination of the Robin Hood search when the needle has a "poorer" PSL than the haystack is broken according to the conventional meaning of the Robin Hood invariant: For an equal PSL, but the hoisted hash of the needle being greater than the hoisted hash in the haystack, the loop will exit without having detected the real termination of a haystack with a PSL strictly smaller than the needle. If this is a bug, it would be difficult to build a regression test that meets all the criteria; additionally, if the rest of the code is consistent with this modified interpretation of the Robin Hood invariant, then it would not be a bug.

martinus commented 1 year ago

Hi! I'm not 100% sure where you think there is a bug. In my robin_hood map implementation the entries with equal PSL are sorted according to the hoisted hash bits, so it is ok to stop the loop before actually reaching the next PSL.

That's one of the tricks of robin_hood that I think are quite nice, because when PSL are used as the higher bits and the hoised hash bits used for the lower bits it is enough use info <= mInfo[idx] as a stopping criteria

martinus commented 1 year ago

By the way I'm using exactly the same trick in my unordered_dense map: https://github.com/martinus/unordered_dense I think that code might be easier to read. There I am always using 24bit as PSL and 8 bit for the hoisted hash, both stored in dist_and_fingerprint: https://github.com/martinus/unordered_dense/blob/main/include/ankerl/unordered_dense.h#L317-L323

thecppzoo commented 1 year ago

Thanks for the explanation. I will close the issue. Is there a place in the source code where you state this property, of augmenting the Robin Hood invariant with the fingerprint of the key? I will study your unordered_dense

martinus commented 1 year ago

Hm I don't think I have written down anywhere. Some years ago I wrote a blog post, but this is outdated: https://martin.ankerl.com/2016/09/21/very-fast-hashmap-in-c-part-2/

One more note, the encoded PSL value has an offset of 1, so 0 means the bucket is empty, 1 means this bucket is taken and actually is at the correct location, 2 means its one bucket away from its intended location.