rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand
Other
1.67k stars 431 forks source link

More options for (weighted) reservoir sampling #1365

Open teryror opened 11 months ago

teryror commented 11 months ago

Summary

Recently, while doing research for a blog post, I implemented two different schemes for weighted reservoir sampling: the A-Res algorithm by Efraimidis and Spirakis, and VarOptK by Cohen et al..

I'd like to see both of these upstreamed here, as they use different interpretations of weights: A-Res is equivalent to (though much more efficient than) collecting the data elements and their weights into a discrete distribution, then sampling $n$ elements from that distribution, removing the selected element from it each time. VarOptK doesn't have an obvious sequential analog, but it has the very nice property that the probability of an element being included in the sample will be proportional to that element's relative weight (weight_i / total_weights), if possible.

At the very least, I'd like to add two new methods to IteratorRandom: choose_multiple_weighted and choose_multiple_weighted_proportionally. The names are obviously subject to bikeshedding, I'm not entirely happy with them either.

I also propose exposing an advanced API that these methods would simply wrap. While I personally don't have a use case for this, a major reason to want a reservoir sampling algorithm is that they can take in the sampled population one element at a time, and at any point extract a sample of the part of the population seen so far. Essentially, I want to decouple the implementation of these algorithms from the Iterator trait, and allow users to do this, which would add indirect support for sampling infinite iterators or async streams, for example. I also suggest giving the existing IteratorRandom::choose_multiple the same treatment.

Details

I'd add a module rand::reservoir containing three types corresponding to these three algorithms, each with essentially the same API:

struct Reservoir<T> { /* ... */ }
impl<T> Reservoir<T> {
    fn with_sample_size(n: usize) -> Self { /* ... */ }
    fn feed<R: Rng + ?Sized>(&mut self, rng: &mut R, item: T, weight: f64) -> ??? {
        // Should this simply return `bool` to indicate whether this item was selected?
        // Or a more complex type, like `Result<Option<T>, T>`, to return `Err(item)` if it wasn't selected,
        // and `Ok` with the potentially replaced item from the reservoir if it was?
        // Do we need a return value at all?
    }
    fn current_sample(&self) -> impl Iterator<Item = &T> { /* ... */ }

    // Maybe also something like this:
    fn feed_to_next_replacement<I, F>(&mut self, rng: &mut R, items: &mut I, weights: F) -> Option<T>
    where
        R: Rng + ?Sized,
        I: Iterator<Item = T>,
        F: FnMut(&T) -> f64,
    { /* return `None` if `items` is exhausted before the sample is updated, `Some(replaced_item)` otherwise */ }

    // Other convenience functions?
}

impl<T> IntoIterator for &Reservoir<T> { /* equivalent to `current_sample()` */ }
impl<T> IntoIterator for Reservoir<T> { /* consume the Reservoir to get an iterator of owned items */ }
impl<T> Into<Vec<T>> for Reservoir<T> { /* equivalent to, but potentially more efficient than, `into_iter().collect()` */ }

// Other traits to implement?

Again, I'm not sure what to name these types. The best descriptive names I can come up with would be along the lines of Uniform, Weighted, and ProportionallyWeighted, but there's the unfortunate name clashing. Naming the structs after the corresponding algorithms or their inventors would be an option, but that's very opaque, I feel. Suggestions would be appreciated.

Anyway, from this, we'd get:

pub trait IteratorRandom: Iterator + Sized {
    /* other methods omitted */

    fn choose_multiple<R>(mut self, rng: &mut R, amount: usize) -> Vec<Self::Item>
    where
        R: Rng + ?Sized,
    {
        let mut reservoir = crate::reservoir::Uniform::with_sample_size(amount);
        for item in self {
            reservoir.feed(rng, item);
        }
        reservoir.into()
    }

    // and analogously:
    fn choose_multiple_weighted<R, F>(mut self, rng: &mut R, amount: usize, weight: F) -> Vec<Self::Item> where ...
    fn choose_multiple_weighted_proportionally<R, F>(/* ... */) -> Vec<Self::item> where ...
}

Alternatives and Concluding Notes

I guess there's a case to be made for making a new trait ReservoirSampler (or similar) that all three structs would implement, and making the choose_multiple method generic over the reservoir type, but that seems over-engineered to me.

I think it's also worth discussing whether to also extend the SliceRandom trait. If I'm reading the implementation of SliceRandom::choose_multiple_weighted right, it's equivalent to slice.iter().choose_multiple_weighted(), so maybe it should also get an equivalent to slice.iter().choose_multiple_weighted_proportionally()? That doesn't seem necessary to me either, but users may expect that kind of symmetry.

I could probably bang out a pull request for all this relatively quickly - I already have working, if rudimentary implementations, after all - , I just wanted to gauge interest first, and get some opinions on these surface level questions.

dhardy commented 4 months ago

My apologies for not getting to this sooner. This is functionality it would be nice to see included somehow.

API organisation

We have the rand::seq::index module which is vaguely analogous (functionality related to implementing the methods on sequence traits). I could see rand::seq::reservoir living alongside this, although...

... given the amount of existing and proposed behaviour surrounding weighted sampling (WeightedIndex in rand, WeightedAliasIndex in rand_distr, IndexedRandom::choose_weighted and choose_multiple_weighted; this issue, #1053, and #1459), it may be time for a new rand_weighted crate or some such. Or maybe this is the wrong partitioning. @newpavlov? @vks?

Algorithms

We already have (I think) the A-res algorithm implemented (rand::seq::index::sample_efraimidis_spirakis). This could perhaps be made more generic to use arbitrary elements instead of an index.