The random.choice/random.sample functions will only accept a sequence to select from. Can there be a function in the random module for selecting from an arbitrary iterable?
It is possible to make an efficient function that can make random selections from an arbitrary iterable e.g.:
from math import exp, log, floor
from random import random, randrange
from itertools import islice
def sample_iter(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values
https://en.wikipedia.org/wiki/Reservoir_sampling#An_optimal_algorithm
This could be used for random sampling from sets/dicts or also to choose something like a random line from a text file. The algorithm needs to fully consume the iterable but does so efficiently using islice. In the case of a dict this is faster than converting to a list and using random.choice:
In [2]: n = 6
In [3]: d = dict(zip(range(10**n), range(10**n)))
In [4]: %timeit sample_iter(d)
15.5 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit list(d)
26.1 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [6]: %timeit sample_iter(d, 2)
15.8 ms ± 427 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit sample_iter(d, 20)
17.6 ms ± 2.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit sample_iter(d, 100)
19.9 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
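For illustration, here is a rough usage sketch (mine, not part of the proposal; the file name is made up): the same function works unchanged on sets, dicts and lazily-read files.

# Usage sketch for the sample_iter above; "server.log" is a hypothetical file.
colours = {"red", "green", "blue", "yellow"}
print(sample_iter(colours, 2))        # two distinct colours, chosen uniformly

squares = {i: i * i for i in range(1000)}
print(sample_iter(squares, 3))        # three keys chosen uniformly

with open("server.log") as f:
    print(sample_iter(f, 5))          # five lines, read in a single pass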
This was already discussed on python-ideas: https://mail.python.org/archives/list/python-ideas@python.org/thread/4OZTRD7FLXXZ6R6RU4BME6DYR3AXHOBD/
Thanks for the suggestion. I'll give it some thought over the next few days.
Here are a few initial thoughts:
The use of islice() really helps going through a small population quickly.
The current sample() uses _randbelow() instead of random() to guarantee equidistribution and to eliminate the small biases that come from using random(). So this is a step backwards.
The current sample() guarantees that slices of the sample are all in randomized order. The proposed algorithm doesn't:
# Fully shuffled
>>> sample(range(20), k=19)
[3, 19, 11, 16, 5, 12, 10, 7, 14, 0, 6, 18, 8, 9, 4, 13, 15, 2, 1]

# Not random at all
>>> sample_iter(range(20), k=19)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 19, 11, 12, 13, 14, 15, 16, 17, 18]
The current sample() runs in time proportional to k rather than n. This means that the proposed algorithm slows down considerably for large populations:
In [12]: %timeit sample(range(100_000_000), k=20)
15.7 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit sample_iter(range(100_000_000), k=20)
1.59 s ± 9.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For an apples-to-apples comparison with choices(), use the version in Python 3.9, which was updated to use floor(); that is several times faster than int().
Instead of listing the islice object, consider using next(). That would likely be faster:
def sample_iter(iterable, k=1):
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    try:
        while True:
            # skip is geometrically distributed
            skip = floor( log(random())/log(1-W) )
            values[randrange(k)] = next(islice(iterator, skip, skip+1))
            W *= exp(log(random())/k)
    except StopIteration:
        return values
Pro: focus on the "iterable" part of the title. If you want to, e.g., select 3 lines "at random" out of a multi-million-line text file, this kind of reservoir sampling allows you to do that while holding no more than one line in memory at a time. Materializing an iterable into a sequence is sometimes impossible. Yup, it takes O(number of elements) time, but so does anything else that has to look at every element of an iterable (including materializing an iterable into a sequence!).
So this would make most sense as a different function.
Con: over the years we've worked diligently to purge these algorithms of minor biases, leaving them as unbiased as the core generator. Introducing floating-point vagaries goes against that, and results may vary at times due to tiny differences in platform exp() and log() implementations.
Worse, I'm not sure that the _approach_ is beyond reproach: the claim that skips "follow a geometric distribution" is baffling to me. If the reservoir has size K, then when we're looking at the N'th element it should be skipped with probability 1-K/N. But when looking at the N+1'th, that changes to 1-K/(N+1). Then 1-K/(N+2), and so on. These probabilities are all different. In an actual geometric distribution, the skip probability is a constant.
Now 1-K/(N+i) is approximately equal to 1-K/(N+j) for i and j sufficiently close, and for K sufficiently smaller than N, so the skips may be well _approximated_ by a geometric distribution. But that's quite different than saying they _do_ follow a geometric distribution, and I see no reason to trust that the difference can't matter.
The simple algorithm on the Wikipedia page suffers none of these potential problems (it implements an obviously-sound approach, & our randrange()
is unbiased and platform-independent). But, for selecting 3 out of a million-element iterable, it would invoke the core generator hundreds of thousands of times more often.
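For reference, a minimal sketch of that simple algorithm (Algorithm R in Wikipedia's terms), using only randrange():

from itertools import islice
from random import randrange

def sample_iter_simple(iterable, k=1):
    # Fill the reservoir with the first k items, then replace a uniformly
    # chosen slot with item i (0-based) with probability k/(i+1).
    iterator = iter(iterable)
    values = list(islice(iterator, k))
    for i, item in enumerate(iterator, k):
        j = randrange(i + 1)        # one core-generator call per element
        if j < k:
            values[j] = item
    return values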
To be clear, I am suggesting that this could be a separate function from the existing sample() rather than a replacement or a routine used internally.
The intended use-cases for the separate function are:
Select from something where you really do not want to build a list and just want/need to use a single pass. For example in selecting a random line from a text file it is necessary to read the entire file in any case just to know how many lines there are. The method here would mean that you could make the selection in a single pass in O(k) memory. The same can apply to many situations involving generators etc.
Convenient, possibly faster selection from a most-likely small dict/set (this was the original context from python-ideas).
The algorithm naturally gives something in between the original order and a randomised order. There are two possibilities for changing that:
a. Call random.shuffle or equivalent either to get the original k items in a random order or at the end before returning.
b. Preserve the original ordering from the iterable: append all new items and use a sentinel for removed items (remove sentinels at the end).
Since randomised order does not come naturally and requires explicit shuffling my preference would be to preserve the original order (b.) because there is no other way to recover the information lost by shuffling (assuming that only a single pass is possible). The user can call shuffle if they want.
To explain what "geometrically distributed" means I need to refer to the precursor algorithm from which this is derived. A simple Python version could be:
def sample_iter_old(iterable, k=1):
    uvals_vals = []
    # Associate a uniform (0, 1) with each element:
    for uval, val in zip(iter(random, None), iterable):
        uvals_vals.append((uval, val))
        uvals_vals.sort()
        uvals_vals = uvals_vals[:k]  # keep the k items with smallest uval
    return [val for uval, val in uvals_vals]
In sample_iter_old each element val of the iterable is associated with a uniform (0, 1) variate uval. At each step we keep the k elements having the smallest uval variates. This is relatively inefficient because we need to generate a uniform variate for each element val of the iterable. Most of the time during the algorithm the new val is simply discarded so sample_iter tries instead to calculate how many items to discard.
The quantity W in sample_iter is the max of the uvals from sample_iter_old:
W := max(uval for uval, val in uvals_vals)
A new item from the iterable will be swapped in if its uval is less than W. The number of items skipped before finding a uval < W is geometrically distributed with parameter W.
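As a throwaway sanity check (not part of the proposal), the closed-form skip used in sample_iter can be compared against counting rejections directly for a fixed W:

from math import floor, log
from random import random
from statistics import mean

W = 0.3

def skip_by_counting():
    # number of uvals >= W drawn before the first uval < W
    n = 0
    while random() >= W:
        n += 1
    return n

def skip_closed_form():
    # the inverse-CDF draw used in sample_iter
    return floor(log(random()) / log(1 - W))

# both averages should be close to (1 - W)/W ~= 2.33
print(mean(skip_by_counting() for _ in range(100_000)))
print(mean(skip_closed_form() for _ in range(100_000)))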
Thanks! That explanation really helps explain where "geometric distribution" comes from. Although why it keeps taking k'th roots remains a mystery to me ;-)
Speaking of which, the two instances of
exp(log(random())/k)
are numerically suspect. Better written as
random()**(1/k)
The underlying pow()
implementation will, in effect, compute log(random()) with extra bits of precision for internal use. Doing log(random()) forces it to use a 53-bit approximation. Not to mention that it's more obvious to write a k'th root as a k'th root. Note: then the 1/k can be computed once outside the loop.
Perhaps worse is
log(1-W)
which should be written
log1p(-W)
instead. W is between 0 and 1, and the closer it is to 0 the more its trailing bits are entirely lost in computing 1-W. It's the purpose of log1p to combat this very problem.
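A tiny illustration of the effect (my own, not from the code above):

from math import log, log1p

# For small W the subtraction 1 - W rounds away W's trailing bits, so the
# argument passed to log() carries an absolute error on the order of 2**-53;
# log1p(-W) avoids the subtraction entirely.  Both values should be near -W.
W = 1e-12
print(log(1 - W))    # limited by the rounding of 1 - W
print(log1p(-W))     # accurate to full precision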
All good points :)
Here's an implementation with those changes and that shuffles but gives the option to preserve order. It also handles the case W == 1.0, which can happen at the first step with probability 1 - (1 - 2**-53)**k.
Attempting to preserve order makes the storage requirements expected O(k*log(k)) rather than deterministic O(k) but note that the log(k) part just refers to the values list growing larger with references to None: only k of the items from iterable are stored at any time. This can be simplified by removing the option to preserve order which would also make it faster in the small-iterable case.
There are a few timings below for choosing from a dict vs converting to a list and using sample (I don't have a 3.9 build immediately available to use choices). Note that these benchmarks are not the primary motivation for sample_iter though which is the case where the underlying iterable is much more expensive in memory and/or time and where the length is not known ahead of time.
from math import exp, log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice


def sample_iter(iterable, k=1, shuffle=True):
    """Choose a sample of k items from iterable

    shuffle=True (default) gives the items in random order
    shuffle=False preserves the original ordering of the items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    irange = range(len(values))
    indices = dict(zip(irange, irange))

    kinv = 1 / k
    W = 1.0
    while True:
        W *= random() ** kinv
        # random() < 1.0 but random() ** kinv might not be
        # W == 1.0 implies "infinite" skips
        if W == 1.0:
            break
        # skip is geometrically distributed with parameter W
        skip = floor( log(random())/log1p(-W) )
        try:
            newval = next(islice(iterator, skip, skip+1))
        except StopIteration:
            break
        # Append new, replace old with dummy, and keep track of order
        remove_index = randrange(k)
        values[indices[remove_index]] = None
        indices[remove_index] = len(values)
        values.append(newval)

    values = [values[indices[i]] for i in irange]

    if shuffle:
        _shuffle(values)

    return values
Timings for a large dict (1,000,000 items):
In [8]: n = 6
In [9]: d = dict(zip(range(10**n), range(10**n)))
In [10]: %timeit sample_iter(d, 10)
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit sample(list(d), 10)
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timings for a small dict (5 items):
In [14]: d2 = dict(zip(range(5), range(5)))
In [15]: %timeit sample_iter(d2, 2)
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [16]: %timeit sample(list(d2), 2)
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The crossover point for this benchmark is around 10,000 items with k=2. Profiling at 10,000 items with k=2 shows that in either case the time is dominated by list/next so the time difference is just about how efficiently we can iterate vs build the list. For small dicts it is probably possible to get a significant speed-up by removing the no-shuffle option and simplifying the routine.
Although why it keeps taking k'th roots remains a mystery to me ;-)
Thinking of sample_iter_old, before doing a swap the uvals in our reservoir look like:
U0 = {u[1], u[2], ... u[k-1], W0}
W0 = max(U0)
Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[k] < W0 which we swap in while removing W0 and afterwards we have
U1 = {u[1], u[2], ... u[k-1], u[k]}
W1 = max(U1)
Given that U1 is k iid uniform variates in (0, W0) we have that
W1 = W0 * max(random() for _ in range(k)) = W0 * W'
Here W' has cdf x**k and so by the inverse sampling method we can generate it as random()**(1/k). That gives the update rule for sample_iter:
W *= random() ** (1/k)
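A quick empirical check of that claim (a throwaway sketch):

from random import random
from statistics import mean

# max of k uniforms and random()**(1/k) should have the same distribution
# (CDF x**k on (0, 1)); both sample means should be close to k/(k + 1).
k = 5
a = [max(random() for _ in range(k)) for _ in range(100_000)]
b = [random() ** (1 / k) for _ in range(100_000)]
print(mean(a), mean(b))    # both ~ 5/6 = 0.833...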
Other implementations aren't directly comparable, but I thought I would check to see what others were doing:
Scikit-learn uses reservoir sampling but only when k / n > 0.99. Also, it requires a follow-on step to shuffle the selections.
numpy does not use reservoir sampling.
Julia's randsubseq() does not use reservoir sampling. The docs guarantee that, "Complexity is linear in p*length(A), so this function is efficient even if p is small and A is large."
Julia's randsubseq() doesn't allow you to specify the _size_ of the output desired. It picks each input element independently with probability p, and the output can be of any size from 0 through the input's size (with mean output length p*length(A)). Reservoir sampling is simply irrelevant to that, although they almost certainly use a form of skip-generation internally.
The quoted docs don't make much sense: for any given p, O(p*N) = O(N). I'm guessing they're trying to say that the mean of the number of times the RNG is called is p*N.
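In Python terms the semantics described above are roughly this (a sketch of the idea only, not of Julia's actual implementation):

from random import random

def randsubseq_like(seq, p):
    # keep each element independently with probability p; the output length
    # is random with mean p * len(seq)
    return [x for x in seq if random() < p]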
I've put more thought into the proposal and am going to recommend against it.
At its heart, this is a CPython optimization to take advantage of list() being slower than a handful of islice() calls. It also gains a speed benefit by dropping the antibias logic because random() is faster than _randbelow(). IMO, this doesn't warrant an API extension. I'm not looking forward to years of teaching people that there are two separate APIs for sampling without replacement and that the second one is almost never what they should use.
A few years ago, GvR rejected adding a pre-sizing argument to dicts even though there were some cases where it gave improved performance. His rationale was that it was challenging for a user to know when they were better off and when they weren't. It added a new complication that easily led to suboptimal choices. IMO, this new API puts the users in a similar situation. There are a number of cases where a person is worse off, sometimes much worse off.
This new code runs in O(n) instead of O(k). It eats more entropy. It loses the antibias protections. The API makes it less explicit that the entire input iterable is consumed. It can only be beneficial if the input is not a sequence. When k gets bigger, the repeated calls to islice() become more expensive than a single call to list(). And given the math library functions involved, I'm not even sure that this code can guarantee it gives the same results across platforms.
Even if the user makes a correct initial decision about which API to use, the decision can become invalidated when the population sizes or sample sizes change over time.
Lastly, giving users choices between two substantially similar tools typically makes them worse off. It creates a new burden to learn, remember, and distinguish the two. It's really nice that we currently have just one sample() and that it behaves well across a broad range of cases — you generally get a good result without having to think about it. Presumably, that was the wisdom behind having one-way-to-do-it.
More thoughts:
If sample_iter() were added, people would expect a choices_iter() as well.
Part of the reason that Set support was being dropped from sample() is that it was rarely used and that it was surprising that it was an O(n) operation instead of O(k).
At its heart, this is a CPython optimization to take advantage of list() being slower than a handful of islice() calls.
This comment suggests that you have missed the general motivation for reservoir sampling. Of course the stdlib cannot satisfy all use cases, so this can be out of scope as a feature. It is not the case, though, that this is a CPython optimisation.
The idea of reservoir sampling is that you want to sample from an iterator, you only get one chance to iterate over it, and you don't know a priori how many items it will yield. The classic example of that situation is reading from a text file but in general it maps neatly onto Python's concept of iterators. The motivation for generators/iterators in Python is that there are many situations where it is better to avoid building a concrete in-memory data structure and it can be possible to avoid doing so with appropriately modified algorithms (such as this one).
The core use case for this feature is not sampling from an in-memory data structure but rather sampling from an expensive generator or an iterator over a file/database. The premise is that it is undesirable or perhaps impossible to build a list out of the items of the iterable. In those contexts the comparative properties of sample/choices are to some extent irrelevant because those APIs cannot be used or should be avoided because of their overhead in terms of memory or other resources.
This comment suggests that you have missed the general motivation for reservoir sampling.
Please don't get personal. I've devoted a good deal of time thinking about your proposal. Tim is also giving it an honest look. Please devote some time to honestly thinking about what we have to say.
FWIW, this is an area of expertise for me. I too have been fascinated with reservoir sampling for several decades, have done a good deal of reading on the topic, and routinely performed statistical sampling as part of my job (where it needed to be done in a legally defensible manner).
The idea of reservoir sampling is that you want to sample from an iterator, you only get one chance to iterate over it, and you don't know a priori how many items it will yield.
Several thoughts:
The need for sampling a generator or one-time stream of data is in the "almost never" category. Presumably, that is why you don't find it in numpy or Julia.
The examples you gave involved dicts or sets. These aren't one-chance examples and we do know the length in advance.
Whether talking about sets, dicts, generators, or arbitrary iterators, "sample(list(it), k)" would still work. Both ways still have to consume the entire input before returning. So really this is just an optimization, one that under some circumstances runs a bit faster, but one that forgoes a number of desirable characteristics of the existing tool.
IMO, sample_iter() is hard to use correctly. In most cases, the users would be worse off than they are now and it would be challenging to communicate clearly under what circumstances they would be marginally better off.
At any rate, my recommendation stands. This should not be part of standard library random module API. Perhaps it could be a recipe or a see-also link. We really don't have to do this.
Please don't get personal.
Sorry, that didn't come across with the intended tone :)
I agree that this could be out of scope for the random module but I wanted to make sure the reasons were considered.
Reading between the lines I get the impression that you'd both be happier with it if the algorithm was exact (rather than using fp). That would at least give the possibility for it to be used internally by e.g. sample/choice if there was a benefit for some cases.
The lack of exactness (and possibility of platform-dependent results, including, e.g., when a single platform changes its math libraries) certainly works against it.
But I think Raymond is more bothered by that there's no apparently _compelling_ use case, in the sense of something frequent enough in real life to warrant including it in the standard library.
For example, there's really no problem right now if you have a giant iterable _and_ you know its length. I had a case where I had to sample a few hundred doubles from giant binary files of floats. The "obvious" solution worked great:
for o in sorted(random.sample(range(0, file_size, 8), 1000)):
    f.seek(o)          # f is the open binary file
    value = f.read(8)  # one 8-byte double
Now random access to that kind of iterable is doable, but a similar approach works fine too if it's sequential-only access to a one-shot iterator of known length: pick the indices in advance, and skip over the iterator until each index is hit in turn.
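A rough sketch of that approach (names are mine; it assumes a one-shot iterator whose length n is known in advance):

from random import sample

def sample_known_length(iterator, n, k):
    # Choose the k target indices up front, then walk the iterator once,
    # keeping an item each time a target index is reached.  The result is in
    # stream order; shuffle afterwards if a randomized order is needed.
    targets = iter(sorted(sample(range(n), k)))
    want = next(targets, None)
    picked = []
    for i, item in enumerate(iterator):
        if i == want:
            picked.append(item)
            want = next(targets, None)
            if want is None:
                break
    return picked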
It doesn't score many points for being faster than materializing a set or dict into a sequence first, since that's a micro-optimization justified only by current CPython implementation accidents.
Where it's absolutely needed is when there's a possibly-giant iterable of unknown length. Unlike Raymond, I think that's possibly worth addressing (it's not hard to find people asking about it on the web). But it's not a problem I've had in real life, so, ya, it's hard to act enthusiastic ;-)
PS: you should also guard against W > 1.0. No high-quality math library will create such a thing given these inputs, but testing W >= 1.0 is no more expensive than testing for exact equality to 1.0, which is needlessly optimistic.
I agree that this could be out of scope for the random module but I wanted to make sure the reasons were considered.
I think we've done that. Let's go ahead and close this one down.
In general, better luck can be had by starting with a common real world problem not adequately solved by the library, then creating a clean API for it, and lastly searching for the best algorithm to implement it. It is much tougher the other way around, starting with an algorithm you like, then hoping to find a use case to justify it, and hoping to find an API that isn't a footgun for everyday users.
FWIW, reservoir sampling was considered at the outset when sample() was first designed. Subsequent to that we've also evaluated a high quality PR for switching the internals to reservoir sampling, but it proved to be inferior to the current implementation in most respects (code complexity, computational overhead, speed, and entropy consumed); the only gain was some memory savings.
Yeah, I guess it's a YAGNI.
Thanks Raymond and Tim for looking at this!
I was contacted by someone interested in this so I've posted the last version above as a GitHub gist under the MIT license: https://gist.github.com/oscarbenjamin/4c1b977181f34414a425f68589e895d1