Closed mikemccarty-vertex closed 5 years ago
Hi! It's an interesting use case. I usually try to be standard conform to std::unordered_map
when possible, but I think this should be a simple extension that might be worth it. Can I have a look at the code?
I'm having a little trouble with github authentication at the moment so I'll just paste the code here. I think it is pretty self-explanatory:
template<typename PRNG>
size_t sampleIdx(PRNG& prng) const {
size_t idx;
InfoType info;
// Pre-condition: !empty()
do {
idx = prng();
info = mInfoInc + static_cast<InfoType>(idx >> mInfoHashShift);
idx &= mMask;
while (mInfo[idx]==0) {
next(&info, &idx);
}
if (mInfo[idx] != 0) {
return idx;
}
} while ( true );
}
template<typename PRNG>
const_iterator sample(PRNG& prng) const {
ROBIN_HOOD_TRACE(this);
if (empty()) {
return cend();
}
const size_t idx = sampleIdx(prng);
return const_iterator{mKeyVals + idx, mInfo + idx};
}
The do ... while(true)
is leftover from the initial (broken) implementation. It can probably be removed...
I'm afraid your sampling is not uniform. E.g. a hashmap's content could look like this (x
means no element there)
x x x x x x 1 2
if you sample that way the number 1 would be sampled with probability 7/8=0.875, and number 2 only 1/8=0,125.
I think a correct sampling could would be this:
template <typename R>
iterator uniform_sample(R& rng) {
if (empty()) {
return end();
}
// to guarantee uniform random sampling, we have to randomly try spots until we find
// an element that is not empty. It's slow though for maps with low fill factor.
size_t idx;
do {
idx = (std::uniform_int_distribution<size_t>{}(rng)) & mMask;
} while (0 == mInfo[idx]);
return iterator{mKeyVals + idx, mInfo + idx};
}
That's a good point -- there's a tradeoff between speed (minimal calls to the RNG) and distribution. One of my implementations had similar behavior to yours in that it sometimes made a lot of RNG calls in order to score a hit. I see now I may have over-compensated in the other direction.
I wonder if there's a reasonable compromise to be made here...
I think a better solution would be to have a storage vector separately from the unordered map, like this:
std::vector<std::pair<size_t, std::string>> keyAndData;
robin_hood::unordered_map<size_t, size_t> keyToDataidx;
So the keyToDataidx
map points from a key to an index of the keyAndData
vector. That way accessing data is O(1), e.g. value = keyAndData[keyToDataidx[key]]
.
Removing an element can also be done in O(1): find the element's index with idx = keyToDataidx[key]
, then replace keyAndData[idx]
with keyAndData.back()
, then update keyToDataIdx[keyAndData[idx].first] = idx
.
Random sampling can then be easily done with the elements in keyToDataidx
.
I have a branch of this repo that adds a
sample()
method which, given a PRNG, returns a random entry in the map. The value of this method is that it enables probabilistic cache eviction without requiring a separate random-access dictionary of the entries.Is this a feature you would be interested in adding to the class?