Consider further optimisation on `mod_function`.

Supposing your hash values are uniformly distributed "random" values in the range of 0..2**64-1, you can skip the modulo and instead use random-number scaling techniques to distribute the result (the bucket number) evenly throughout that space. I don't think there's any reason to stick with modulo, is there?

The quickest-and-dirtiest of these would be:

size_t bucket(uint64_t hash) {
    return uint64_t(uint128_t(hash) * K >> 64);
}

But if the incoming hashes are not already uniform then try this one weird trick:

size_t bucket(uint64_t hash) {
     crc32(hash) * K >> 32;
}

(which only works when K is less than 2**32), and you'll have to find the right intrinsic for the target to get the hardware-accelerated CRC instruction)

This one is more thorough:

static inline uint64_t murmurmix64(uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;

    return h;
}

size_t bucket(uint64_t hash) {
    hash = murmurmix64(hash);
    return uint64_t(uint128_t(hash) * K >> 64);
}

And this one is probably fine, covers up to 64-bit ranges, and is probably still faster than mod by a constant:

size_t bucket(uint64_t hash) {
    hash *= 0xc4ceb9fe1a85ec53ULL;
    return uint64_t(uint128_t(hash) * K >> 64);
}

Plus it probably makes no difference if K is not constant, and so you can avoid calling via a function pointer and just load the table size into a register and perform the multiply by that inline.

Actually, let's take this a step further:

size_t bucket(uint64_t& hash, size_t nbuckets) {
    uint128_t longhash = hash;
    longhash = hash * nbuckets;
    hash = uint64_t(longhash);
    return size_t(longhash >> 64);
}

This returns a bucket number for the current table, and if things go wrong and you feel like trying a different hash -- even if it's a different length -- just call it again! It has also updated the input hash to give the unused entropy from the original hash properly distributed across the next range.

Constraints are:

original input hashes must be uniformly distributed across all 2**64 possible values
Product of nbuckets values for all calls on the same hash must be less than 2**64.
if all values of nbuckets are odd then the function will continue to hallucinate viable bucket choices even after it runs out of entropy (after the product exceeds 2**64)

If you don't trust the caller to provide uniformly distributed hashes (eg., where STL will just pass raw ints, or whatever) then you can precondition it with one of the mixing operations in the original post here.

Here, I tried to explain it all: https://www.tīkōuka.dev/hash-map-reduction-operations/

skarupke / flat_hash_map

Consider further optimisation on `mod_function`. #50