Augment random.choices() with the alias method #85303

Closed rhettinger closed 4 years ago

rhettinger commented 4 years ago
BPO 41131
Nosy @tim-one, @rhettinger, @mdickinson
Files
  • choices_binomial.py: Binomial variant
  • choices_proposal.py: Alias variant
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

        assignee = 'https://github.com/rhettinger'
        closed_at =
        created_at =
        labels = ['library', '3.10', 'performance']
        title = 'Augment random.choices() with the alias method'
        updated_at =
        user = 'https://github.com/rhettinger'

    bugs.python.org fields:

        activity =
        actor = 'tim.peters'
        assignee = 'rhettinger'
        closed = True
        closed_date =
        closer = 'rhettinger'
        components = ['Library (Lib)']
        creation =
        creator = 'rhettinger'
        dependencies = []
        files = ['49269', '49270']
        hgrepos = []
        issue_num = 41131
        keywords = []
        message_count = 6.0
        messages = ['372441', '372443', '374622', '374624', '374625', '374627']
        nosy_count = 3.0
        nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson']
        pr_nums = []
        priority = 'normal'
        resolution = 'wont fix'
        stage = 'resolved'
        status = 'closed'
        superseder = None
        type = 'performance'
        url = 'https://bugs.python.org/issue41131'
        versions = ['Python 3.10']

    rhettinger commented 4 years ago

    For n unequal weights and k selections, sample selection with the inverse-cdf method is O(k log₂ n). Using the alias method, it improves to O(k). The proportionality constants also favor the alias method, so if the setup times were the same, the alias method would always win (even when n=2).

    However, the setup times are not the same. For the inverse-cdf method, setup is O(1) if cum_weights are given; otherwise, it is O(n) with a fast loop. Setup for the alias method is also O(n), but with a much larger constant factor.
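
    For reference, a minimal sketch of the inverse-cdf approach (names here are illustrative; this is not the actual random.choices() code): setup is one cumulative-sum pass over the weights, and each selection is one call to random() plus an O(log n) bisect.

        from bisect import bisect
        from itertools import accumulate
        from random import random

        def choices_inverse_cdf(population, weights, *, k=1):
            # Setup: O(n) cumulative sum (skippable when cum_weights are supplied).
            cum_weights = list(accumulate(weights))
            total = cum_weights[-1]
            # Selection: k calls to random(), each followed by an O(log n) bisect.
            return [population[bisect(cum_weights, random() * total)]
                    for _ in range(k)]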

    So, there would need to be a method selection heuristic based on the best trade-off between setup time and sample selection time.

    Both methods make k calls to random().

    See: https://en.wikipedia.org/wiki/Alias_method
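
    For readers following the link, here is a minimal sketch in the style of Vose's alias method (illustrative only, not the attached choices_proposal.py): an O(n) table build followed by O(1) work per selection. The float rounding issues discussed later in the thread arise in the pairing loop.

        from random import random

        def build_alias_table(weights):
            # Scale the weights so the average bin holds exactly 1.0.
            n = len(weights)
            total = sum(weights)
            scaled = [w * n / total for w in weights]
            prob = [1.0] * n                    # acceptance thresholds
            alias = list(range(n))              # default: every bin aliases itself
            small = [i for i, s in enumerate(scaled) if s < 1.0]
            large = [i for i, s in enumerate(scaled) if s >= 1.0]
            while small and large:              # pair an under-full bin with an over-full one
                s, l = small.pop(), large.pop()
                prob[s], alias[s] = scaled[s], l
                scaled[l] -= 1.0 - scaled[s]    # the large bin donates the shortfall
                (small if scaled[l] < 1.0 else large).append(l)
            return prob, alias                  # leftovers keep prob 1.0 and self-alias

        def alias_choice(population, prob, alias):
            # One call to random() per selection, O(1) work.
            u = random() * len(prob)
            i = int(u)
            return population[i if u - i < prob[i] else alias[i]]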

    Notes on the attached draft implementation:

    rhettinger commented 4 years ago

    I also looked at another method using binomial variates but couldn't get it to run faster than the alias method:

        from random import shuffle

        def choices(population, weights, *, k=1):
            # Draw a binomial count for each element in turn, conditioning on the
            # selections and probability mass still remaining.  Assumes the weights
            # sum to 1.0; binomial_variate(n, p) is a helper (not in the stdlib)
            # returning a Binomial(n, p) draw.
            r = 1.0                          # probability mass not yet consumed
            n = k                            # selections still to be made
            selections = []
            for elem, p in zip(population, weights):
                v = binomial_variate(n, p / r)
                selections += [elem] * v
                n -= v
                r -= p
            shuffle(selections)              # counts come out grouped; randomize order
            return selections

    The shuffle step took as much time as the alias method. Also, the binomial variate was hard to compute quickly and without overflow/underflow issues for large inputs.
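
    For context, the simplest possible stand-in for the binomial_variate() helper assumed above (not the attached choices_binomial.py) is a sum of Bernoulli trials; it is unbiased but O(n) per call, which is exactly why a fast, overflow-safe binomial generator is the hard part:

        from random import random

        def binomial_variate(n, p):
            # Naive Binomial(n, p) draw: count successes in n independent trials.
            # Correct but linear in n; a competitive version needs a smarter algorithm.
            return sum(random() < p for _ in range(n))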

    tim-one commented 4 years ago

    I'm not clear that the alias method is valuable here. Because of the preprocessing expense, it cries out for a class instead: an object that can retain the preprocessed tables, be saved to disk (pickled), restored later, and used repeatedly to make O(1) selections. I have such a class, which uses exact integer arithmetic (e.g., during preprocessing, there's at least one "too small" bin if and only if there's at least one "too large" bin), and is as unbiased as choice() and randrange(). Reasoning about float vagaries is much harder, and I see no error analysis here.
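
    As a rough illustration of that shape (class and method names are hypothetical; this is not Tim's class, and it keeps an exact integer bisect table rather than an alias table, so selections are O(log n) instead of O(1)):

        from bisect import bisect
        from itertools import accumulate
        from random import randrange

        class WeightedSampler:
            """Build once, pickle and reuse freely; exact for integer weights."""

            def __init__(self, population, weights):
                # Preprocessed table: integer cumulative weights, so every
                # selection is as unbiased as randrange() itself.
                self.population = list(population)
                self.cum_weights = list(accumulate(weights))
                self.total = self.cum_weights[-1]

            def choice(self):
                return self.population[bisect(self.cum_weights, randrange(self.total))]

        # Usage: WeightedSampler('abc', [3, 1, 6]).choice()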

    Most troubling: due to rounding, reducing a "too large" bin can result in a weight that spuriously compares <= bin_size. Then a "too small" bin with a genuinely tiny weight can be left with nothing to pair with, and so ends up aliasing itself, giving it a massively too-high probability of being picked.

    So if you're determined to stick to "a function", I'd stick to the simple inverse CDF sampling method. Unless the number of choices is "very" large, the log factor doesn't much matter.

    That said, the alias numerics here could be improved some by working with the weights directly, "normalizing" them only at the end. Instead of comparing weights to bin_size, compare them to the mean:

    mean = total / n

    (note: the next step to getting rid of floats entirely is to pre-multiply the weights by lcm(total, n) // total - then their mean is exactly the integer lcm(total, n) // n).
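
    A tiny worked example of that parenthetical, with hypothetical integer weights (math.lcm is Python 3.9+):

        from math import lcm

        weights = [3, 1, 4, 1, 5]               # hypothetical integer weights
        total, n = sum(weights), len(weights)   # 14 and 5
        scale = lcm(total, n) // total          # 5
        scaled = [w * scale for w in weights]   # [15, 5, 20, 5, 25]
        mean = lcm(total, n) // n               # 14 -- an exact integer, no floats
        assert sum(scaled) == mean * n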

    The mean (instead of bin_size) suffers rounding errors then, but the original weights don't during the "hard part" of preprocessing. At the end,

    fn = n + 0.0
    U = [weights[i] / total + i / fn for i in range(n)]

    instead. The "i / fn" also suffers just one rounding error as opposed to the two in the current "bin_width * i". That can make a difference for n as small as 5:

    >>> 3/5
    0.6
    >>> (1/5)*3
    0.6000000000000001

    rhettinger commented 4 years ago

    Okay, thank you for looking!

    rhettinger commented 4 years ago

    FWIW, I paid close attention to rounding. The reason for the "while small and large" is that the loop stops making aliases unless we have both something at or below 0.5 and something known to be above 0.5. If we end up with only two bins slightly above or below 0.5, they are considered close enough and will alias to themselves.

    By giving the bins a default alias to themselves, there is always a reasonable dispatch in the main sampling loop even if the random() value compares to an unfavorably rounded entry in U.

    The code assumes that weird roundings will happen throughout.

    tim-one commented 4 years ago

    Oh yes - I understood the intent of the code. It's as good an approach to living with floating-point slop in this context as I've seen. It's not obviously broken. But neither is it obviously correct, and after a few minutes I didn't see a clear path to rigorously proving it can't break. In FP, if there's one chance in 2**53 that it _can_ break, it will ;-)

    The suggestions I made change none of that: reducing the number of rounding errors can only help.

    OTOH, an all-integer version of the algorithm _can_ be "obviously correct", and can embed asserts to verify that it is. It can all be done in a single fixed-size list of 3-lists ([weight, value, alias]), swapping entries as needed to keep the 3 regions of interest ("too small", "on the nose", "too large") each contiguous. Then it's straightforward to prove that you run out of "too small" entries at the same time you run out of "too large" ones.
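
    As a hypothetical sketch of that all-integer approach (using separate worklists rather than the in-place 3-list swapping described above), the key invariant can be asserted directly:

        from random import randrange

        def build_exact_alias_table(weights):
            # weights must be positive integers.  Multiplying each by n makes the
            # per-bin target exactly `total`, with no rounding anywhere.
            n = len(weights)
            total = sum(weights)
            scaled = [w * n for w in weights]
            prob = [total] * n                  # acceptance threshold out of `total`
            alias = list(range(n))              # default: self-alias
            small = [i for i, s in enumerate(scaled) if s < total]
            large = [i for i, s in enumerate(scaled) if s > total]
            while small and large:
                s, l = small.pop(), large.pop()
                prob[s], alias[s] = scaled[s], l
                scaled[l] -= total - scaled[s]  # exact integer transfer
                if scaled[l] < total:
                    small.append(l)
                elif scaled[l] > total:
                    large.append(l)
            # With exact arithmetic, "too small" and "too large" run dry together.
            assert not small and not large
            return prob, alias, total

        def exact_alias_choice(population, prob, alias, total):
            i = randrange(len(prob))            # pick a bin uniformly
            return population[i if randrange(total) < prob[i] else alias[i]]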