Random sampling - Githubissues

martinus / robin-hood-hashing

Fast & memory efficient hashtable based on robin hood hashing for C++11/14/17/20

https://gitter.im/martinus/robin-hood-hashing

MIT License

1.5k stars 143 forks source link

Random sampling #38

Closed mikemccarty-vertex closed 5 years ago

mikemccarty-vertex commented 5 years ago

I have a branch of this repo that adds a sample() method which, given a PRNG, returns a random entry in the map. The value of this method is that it enables probabilistic cache eviction without requiring a separate random-access dictionary of the entries.

Is this a feature you would be interested in adding to the class?

martinus commented 5 years ago

Hi! It's an interesting use case. I usually try to be standard conform to std::unordered_map when possible, but I think this should be a simple extension that might be worth it. Can I have a look at the code?

mikemccarty-vertex commented 5 years ago

I'm having a little trouble with github authentication at the moment so I'll just paste the code here. I think it is pretty self-explanatory:

    template<typename PRNG>
    size_t sampleIdx(PRNG& prng) const {
        size_t idx;
        InfoType info;

        // Pre-condition:  !empty()
        do {
            idx = prng();
            info = mInfoInc + static_cast<InfoType>(idx >> mInfoHashShift);
            idx &= mMask;

            while (mInfo[idx]==0) {
                next(&info, &idx);
            }

            if (mInfo[idx] != 0) {
                return idx;
            }
        } while ( true );
    }

    template<typename PRNG>
    const_iterator sample(PRNG& prng) const {
        ROBIN_HOOD_TRACE(this);
        if (empty()) {
            return cend();
        }
        const size_t idx = sampleIdx(prng);
        return const_iterator{mKeyVals + idx, mInfo + idx};
    }

The do ... while(true) is leftover from the initial (broken) implementation. It can probably be removed...

martinus commented 5 years ago

I'm afraid your sampling is not uniform. E.g. a hashmap's content could look like this (x means no element there)

x x x x x x 1 2

if you sample that way the number 1 would be sampled with probability 7/8=0.875, and number 2 only 1/8=0,125.

I think a correct sampling could would be this:

    template <typename R>
    iterator uniform_sample(R& rng) {
        if (empty()) {
            return end();
        }

        // to guarantee uniform random sampling, we have to randomly try spots until we find
        // an element that is not empty. It's slow though for maps with low fill factor.
        size_t idx;
        do {
            idx = (std::uniform_int_distribution<size_t>{}(rng)) & mMask;
        } while (0 == mInfo[idx]);
        return iterator{mKeyVals + idx, mInfo + idx};
    }

mikemccarty-vertex commented 5 years ago

That's a good point -- there's a tradeoff between speed (minimal calls to the RNG) and distribution. One of my implementations had similar behavior to yours in that it sometimes made a lot of RNG calls in order to score a hit. I see now I may have over-compensated in the other direction.

I wonder if there's a reasonable compromise to be made here...

martinus commented 5 years ago

I think a better solution would be to have a storage vector separately from the unordered map, like this:

    std::vector<std::pair<size_t, std::string>> keyAndData;
    robin_hood::unordered_map<size_t, size_t> keyToDataidx;

So the keyToDataidx map points from a key to an index of the keyAndData vector. That way accessing data is O(1), e.g. value = keyAndData[keyToDataidx[key]].

Removing an element can also be done in O(1): find the element's index with idx = keyToDataidx[key], then replace keyAndData[idx] with keyAndData.back(), then update keyToDataIdx[keyAndData[idx].first] = idx.

Random sampling can then be easily done with the elements in keyToDataidx.