richardstartin / richardstartin.github.io

12 stars 4 forks source link

posts/reservoir-sampling #26

Open utterances-bot opened 3 years ago

utterances-bot commented 3 years ago

Reservoir Sampling | Richard Startin’s Blog

In my last post I covered a technique to infer distribution parameters from a sample taken from a system with the aim of calibrating a simulation. This post is about how to take samples, using reservoir sampling algorithms.

https://richardstartin.github.io/posts/reservoir-sampling

neillrobson commented 3 years ago

Hi! Thank you for this blog post. It helped me greatly in building an intuition for various reservoir sampling algorithms, as part of an algorithms course I'm taking currently.

If I may, I think that there are a few typos in the "Skipping Records" section, and was wondering if you could correct my understanding if I'm mistaken:

  1. You define n and N as follows:

    Vitter starts by defining the random variable S(n,N) as the number of observations to skip, where n is the number of records selected so far, and N is the number of records remaining in the file.

    Based on the subsequent equations, shouldn't n be the number of records remaining to be selected? So, n begins as equal to k and falls to zero over the course of each algorithm?

  2. The random variable S is bounded as such:

    We just need to be sure we define S(n,N) ∈ [0,N−n) such that we do get k samples by the time the input has been scanned, without introducing bias.

    Couldn't that upper bound be closed (i.e. S(n,N) ∈ [0,N−n]) since, if s = N-n, then the last n records are guaranteed to be chosen? Or is this edge case the "bias" that is mentioned in that sentence?

Thank you again for the post.

richardstartin commented 3 years ago

Hi @neillrobson - you are quite right on both points. I would expect to find more mistakes in the derivations because these were just my rough notes from reading the papers mentioned. I'm glad you found some use in the post though. I will correct the post now.

renaatd commented 2 years ago

Great article, learned a lot from it ! One thing that surprises me: it looks like element k+1 is always copied in the reservoir. E.g. for algorithm R, the statement "long replacementIndex = ThreadLocalRandom.current().nextLong(0, counter);" generates a random number from 0..counter-1. So when element k+1 is added, counter has value reservoir.length, and the resulting random number is 0..reservoir.length-1. Is this a small bias error, or am I overlooking sth? The other algorithms seem to have similar behaviour.