Open teryror opened 11 months ago
My apologies for not getting to this sooner. This is functionality it would be nice to see included somehow.
We have the rand::seq::index
module which is vaguely analogous (functionality related to implementing the methods on sequence traits). I could see rand::seq::reservoir
living alongside this, although...
... given the amount of existing and proposed behaviour surrounding weighted sampling (WeightedIndex
in rand
, WeightedAliasIndex
in rand_distr
, IndexedRandom::choose_weighted
and choose_multiple_weighted
; this issue, #1053, and #1459), it may be time for a new rand_weighted
crate or some such. Or maybe this is the wrong partitioning. @newpavlov? @vks?
We already have (I think) the A-res algorithm implemented (rand::seq::index::sample_efraimidis_spirakis
). This could perhaps be made more generic to use arbitrary elements instead of an index.
Summary
Recently, while doing research for a blog post, I implemented two different schemes for weighted reservoir sampling: the A-Res algorithm by Efraimidis and Spirakis, and VarOptK by Cohen et al..
I'd like to see both of these upstreamed here, as they use different interpretations of weights: A-Res is equivalent to (though much more efficient than) collecting the data elements and their weights into a discrete distribution, then sampling $n$ elements from that distribution, removing the selected element from it each time. VarOptK doesn't have an obvious sequential analog, but it has the very nice property that the probability of an element being included in the sample will be proportional to that element's relative weight (
weight_i / total_weights
), if possible.At the very least, I'd like to add two new methods to
IteratorRandom
:choose_multiple_weighted
andchoose_multiple_weighted_proportionally
. The names are obviously subject to bikeshedding, I'm not entirely happy with them either.I also propose exposing an advanced API that these methods would simply wrap. While I personally don't have a use case for this, a major reason to want a reservoir sampling algorithm is that they can take in the sampled population one element at a time, and at any point extract a sample of the part of the population seen so far. Essentially, I want to decouple the implementation of these algorithms from the
Iterator
trait, and allow users to do this, which would add indirect support for sampling infinite iterators or async streams, for example. I also suggest giving the existingIteratorRandom::choose_multiple
the same treatment.Details
I'd add a module
rand::reservoir
containing three types corresponding to these three algorithms, each with essentially the same API:Again, I'm not sure what to name these types. The best descriptive names I can come up with would be along the lines of
Uniform
,Weighted
, andProportionallyWeighted
, but there's the unfortunate name clashing. Naming the structs after the corresponding algorithms or their inventors would be an option, but that's very opaque, I feel. Suggestions would be appreciated.Anyway, from this, we'd get:
Alternatives and Concluding Notes
I guess there's a case to be made for making a new trait
ReservoirSampler
(or similar) that all three structs would implement, and making thechoose_multiple
method generic over the reservoir type, but that seems over-engineered to me.I think it's also worth discussing whether to also extend the
SliceRandom
trait. If I'm reading the implementation ofSliceRandom::choose_multiple_weighted
right, it's equivalent toslice.iter().choose_multiple_weighted()
, so maybe it should also get an equivalent toslice.iter().choose_multiple_weighted_proportionally()
? That doesn't seem necessary to me either, but users may expect that kind of symmetry.I could probably bang out a pull request for all this relatively quickly - I already have working, if rudimentary implementations, after all - , I just wanted to gauge interest first, and get some opinions on these surface level questions.