allow weights in random.choice

5db38da4-1243-41dd-8f8a-1a42f116f8a7 commented 11 years ago

BPO	18844
Nosy	@tim-one, @rhettinger, @mdickinson, @pitrou, @serhiy-storchaka, @NeilGirdhar, @applio
PRs	python/cpython#552
Files	weighted_choice.diff: Preliminary implementation of random.choice optional arg "weights"; weighted_choice_v2.diff: Move cumulative distribution calculation to separate function that returns an index generator; weighted_choice_generator.patch; wcg_bench.py: Benchmarking different methods; weighted_choice_generator_2.patch; weighted_choice_v3.diff: a new implementation of weighted choice; weighted_choice_v3.patch: weighted choice function; weighted_choice_v4.patch; weighted_choice_v5.patch: weighted choice function v5; weighted_choice.diff; weighted_choice2.diff: Add docs and test

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', 'library'] title = 'allow weights in random.choice' updated_at = user = 'https://bugs.python.org/aisaac' ``` bugs.python.org fields: ```python activity = actor = 'dstufft' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Library (Lib)'] creation = creator = 'aisaac' dependencies = [] files = ['31479', '31547', '31732', '31734', '36331', '42322', '42323', '42386', '42393', '44394', '44407'] hgrepos = [] issue_num = 18844 keywords = ['patch', 'needs review'] message_count = 69.0 messages = ['196229', '196234', '196235', '196252', '196551', '196567', '196709', '196711', '196716', '196721', '196728', '196731', '196741', '196750', '196761', '196767', '197507', '197512', '197540', '197862', '197865', '197866', '198367', '198372', '223750', '224947', '224949', '224953', '224954', '224957', '225128', '225133', '225137', '225140', '225148', '226891', '262625', '262626', '262642', '262649', '262652', '262656', '262678', '262744', '262967', '262970', '262971', '262981', '262982', '262983', '262994', '262995', '267782', '272767', '272785', '274538', '274677', '274684', '274686', '274760', '274907', '274964', '277485', '277486', '277487', '278516', '278633', '279701', '279702'] nosy_count = 14.0 nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'pitrou', 'aisaac', 'westley.martinez', 'python-dev', 'serhiy.storchaka', 'NeilGirdhar', 'madison.may', 'dkorchem', 'Christian.Kleineidam', 'davin', 'xksteven'] pr_nums = ['552'] priority = 'normal' resolution = 'fixed' stage = 'patch review' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue18844' versions = ['Python 3.6'] ```

5db38da4-1243-41dd-8f8a-1a42f116f8a7 commented 11 years ago

The need for weighted random choices is so common that it is addressed as a "common task" in the docs: http://docs.python.org/dev/library/random.html

This enhancement request is to add an optional argument to random.choice, which must be a sequence of non-negative numbers (the weights) having the same length as the main argument.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

+1. I've found myself in need of this feature often enough to wonder why it's not part of the stdlib.

pitrou commented 11 years ago

Agreed with the feature request. The itertools dance won't be easy to understand, for many people.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

I realize it's probably quite early to begin putting a patch together, but here's some preliminary code for anyone interested. It builds off of the "common task" example in the docs and adds in validation for the weights list.

There are a few design decisions I'd like to hash out.
In particular:

Should negative weights cause a ValueError to be raised, or should they be converted to 0s?
Should passing a list full of zeros as the weights arg raise a ValueError or be treated as if no weights arg was passed?

mdickinson commented 11 years ago

[Madison May]

> Should negative weights cause a ValueError to be raised, or should they be converted to 0s?
>
> Should passing a list full of zeros as the weights arg raise a ValueError or be treated as if no weights arg was passed?

Both those seem like clear error conditions to me, though I think it would be fine if the second condition produced a ZeroDivisionError rather than a ValueError.
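The two error conditions discussed here could be checked roughly as follows. This is a minimal sketch, not the code from the attached patch; `validate_weights` is a hypothetical helper name, and raising `ValueError` for an all-zero list is only one of the two options mentioned (a `ZeroDivisionError` would also be acceptable).

```python
def validate_weights(population, weights):
    """Hypothetical helper: check the weights and return their total."""
    if len(population) != len(weights):
        raise ValueError("population and weights must have the same length")
    # Negative weights are rejected rather than silently converted to 0.
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    # An all-zero list leaves nothing to choose from, so it is an error too.
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return total
```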

I'm not 100% sold on the feature request. For one thing, the direct implementation is going to be inefficient for repeated sampling, building the table of cumulative sums each time random.choice is called. A more efficient approach for many use-cases would do the precomputation once, returning some kind of 'distribution' object from which samples can be generated. (Walker's aliasing method is one route for doing this efficiently, though there are others.) I agree that this is a commonly needed and commonly requested operation; I'm just not convinced either that an efficient implementation fits well into the random module, or that it makes sense to add an inefficient implementation.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

[Mark Dickinson]

> Both those seem like clear error conditions to me, though I think it would be fine if the second condition produced a ZeroDivisionError rather than a ValueError.

Yeah, in hindsight it makes sense that both of those conditions should raise errors. After all: "Explicit is better than implicit".

As far as optimization goes, could we potentially use functools.lru_cache to cache the cumulative distribution produced by the weights argument and optimize repeated sampling?

Without @lru_cache:
>>> timeit.timeit("x = choice(list(range(100)), list(range(100)))", setup="from random import choice", number=100000)
36.7109281539997

With @lru_cache(maxsize=128):
>>> timeit.timeit("x = choice(list(range(100)), list(range(100)))", setup="from random import choice", number=100000)
6.6788657720007905

Of course it's a contrived example, but you get the idea.

Walker's aliasing method looks intriguing. I'll have to give it a closer look.

I agree that an efficient implementation would be preferable but would feel out of place in random because of the return type. I still believe a relatively inefficient addition to random.choice would be valuable, though.

rhettinger commented 11 years ago

+1 for the overall idea. I'll take a detailed look at the patch when I get a chance.

rhettinger commented 11 years ago

The sticking point is going to be that we don't want to recompute the cumulative weights for every call to weighted_choice.

So there should probably be two functions:

  cw = make_cumulate_weights(weight_list) 
  x = choice(choice_list, cw)

This is similar to what was done with string.maketrans() and str.translate().
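A minimal sketch of that two-function split, under stated assumptions: `make_cumulative_weights` and `choice_from_cumulative` are hypothetical names chosen for illustration, not functions in the random module.

```python
import bisect
import itertools
import random


def make_cumulative_weights(weights):
    """Hypothetical precomputation step: done once per distribution."""
    cum = list(itertools.accumulate(weights))
    if not cum or cum[-1] <= 0:
        raise ValueError("total weight must be positive")
    return cum


def choice_from_cumulative(population, cum_weights):
    """Hypothetical sampling step: O(log n) per call via binary search."""
    x = random.random() * cum_weights[-1]
    return population[bisect.bisect(cum_weights, x, 0, len(cum_weights) - 1)]


colors = ["red", "green", "blue"]
cw = make_cumulative_weights([5, 3, 2])   # [5, 8, 10]
picks = [choice_from_cumulative(colors, cw) for _ in range(5)]
```

For comparison, Python 3.6 and later accept precomputed cumulative weights through the `cum_weights` keyword of `random.choices`.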

serhiy-storchaka commented 11 years ago

> A more efficient approach for many use-cases would do the precomputation once, returning some kind of 'distribution' object from which samples can be generated.

I like the idea about adding a family of distribution generators. They should check input parameters and make a precomputation and then generate infinite sequence of specially distributed random numbers.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

[Raymond Hettinger]

> The sticking point is going to be that we don't want to recompute the cumulative weights for every call to weighted_choice.
>
> So there should probably be two functions:
>
>   cw = make_cumulate_weights(weight_list)
>   x = choice(choice_list, cw)

That's pretty much how I broke things up when I decided to test out optimization with lru_cache. That version of the patch is now attached.

[Serhiy Storchaka]

> I like the idea about adding a family of distribution generators. They should check input parameters and make a precomputation and then generate infinite sequence of specially distributed random numbers.

Would these distribution generators be implemented internally (see attached patch) or publicly exposed?

serhiy-storchaka commented 11 years ago

> Would these distribution generators be implemented internally (see attached patch) or publicly exposed?

See bpo-18900. Even if that proposal is rejected, I think we should publicly expose weighted_choice_generator(). A generator, or a builder that returns a function, is the only way to implement this feature efficiently. Using lru_cache isn't good because several choice generators can be used in a program and because it leaves large data in the cache long after it was used.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

> Using lru_cache isn't good because several choice generators can be used in a program and because it leaves large data in the cache long after it was used.

Yeah, I just did a quick search of the stdlib and only found one instance of lru_cache in use -- another sign that lru_cache is a bad choice.

rhettinger commented 11 years ago

> I like the idea about adding a family of distribution generators

Let's stay focused on the OP's feature request for a weighted version of choice().

For the most part, it's not a good idea to "just add" a family of anything to the standard library. We wait for user requests and use cases to guide the design and err on the side of less, rather than more. This helps avoid bloat. Also, it would be a good idea to start something like this as a third-party module to let it iterate and mature before deciding whether there was sufficient user uptake to warrant inclusion in the standard library.

For the current request, we should also do some research on existing solutions in other languages. This isn't new territory. What do R, SciPy, Fortran, Matlab or other statistical packages already do? Their experiences can be used to inform our design. Alan Kay's big criticism of Python developers is that they have a strong propensity to invent from scratch rather than taking advantage of the mountain of work done by the developers who came before them.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

> What do R, SciPy, Fortran, Matlab or other statistical packages already do?

Numpy avoids recalculating the cumulative distribution by introducing a 'size' argument to numpy.random.choice(). The cumulative distribution is calculated once, then 'size' random choices are generated and returned.

Their overall implementation is quite similar to the method suggested in the python docs.

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]

The addition of a 'size' argument to random.choice() has already been discussed (and rejected) in bpo-18414, but this was on the grounds that the standard idiom for generating a list of random choices ([random.choice(seq) for i in range(k)]) is obvious and efficient.
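To illustrate the 'size' idea in plain Python, the sketch below computes the cumulative distribution once and then draws `size` samples from it. It is not NumPy's implementation; `weighted_choices_batch` is a hypothetical name, and the `rng` parameter is an assumption added so that a seeded `random.Random` instance can be passed in.

```python
import bisect
import itertools
import random


def weighted_choices_batch(population, weights, size, rng=random):
    """Hypothetical batch helper: one accumulation, `size` binary searches."""
    cum = list(itertools.accumulate(weights))
    total = cum[-1]
    hi = len(cum) - 1
    return [
        population[bisect.bisect(cum, rng.random() * total, 0, hi)]
        for _ in range(size)
    ]
```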

7fb66088-1777-4bc1-b120-d9516c88dfaa commented 11 years ago

Honestly, I think adding weights to any of the random functions is trivial enough to implement as is. Just because something becomes a common task does not mean it ought to be added to the stdlib.

Anyway, from a user point of view, I think it'd be useful to be able to send a sequence to a function that'll weight the sequence for use by random.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

Just ran across a great blog post on the topic of weighted random generation from Eli Bendersky for anyone interested: http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/

serhiy-storchaka commented 11 years ago

The proposed patch adds two methods to the Random class and two module-level functions: weighted_choice() and weighted_choice_generator().

weighted_choice(data) accepts either a mapping or a sequence and returns a key or index x with probability proportional to data[x].

If you need several elements with the same distribution, use weighted_choice_generator(data), which returns an iterator that produces random keys or indices of the data. It is faster than calling weighted_choice(data) repeatedly and more flexible than generating a list of random values of a specified size (as in NumPy).
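The behaviour described above can be sketched as follows. This is an illustration written from the description, not the code in the attached patch; the `rng` parameter and the exact validation are assumptions.

```python
import bisect
import itertools
import random
from collections.abc import Mapping


def weighted_choice_generator(data, rng=random):
    """Yield random keys (mapping) or indices (sequence) of `data` forever."""
    if isinstance(data, Mapping):
        keys = list(data)
        weights = [data[k] for k in keys]
    else:
        weights = list(data)
        keys = range(len(weights))
    # Being a generator, these checks run on the first next() call.
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    cum = list(itertools.accumulate(weights))
    if not cum or cum[-1] <= 0:
        raise ValueError("total weight must be positive")
    total = cum[-1]
    hi = len(cum) - 1
    while True:
        yield keys[bisect.bisect(cum, rng.random() * total, 0, hi)]


# Usage: take as many samples as needed from one precomputed distribution.
gen = weighted_choice_generator({"spam": 3, "eggs": 1})
first_five = list(itertools.islice(gen, 5))
```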

2c693aaf-b711-4c74-9243-f6794805e1ba commented 11 years ago

Should this really be implemented using the cumulative distribution and binary search algorithm? Vose's Alias Method has the same initialization and memory usage cost (O(n)), but is constant time to generate each sample.

An excellent tutorial is here: http://www.keithschwarz.com/darts-dice-coins/

serhiy-storchaka commented 11 years ago

Thank you Neil. It is interesting.

Vose's alias method has the following disadvantages (in comparison with the roulette wheel selection proposed above):

- It operates with probabilities and uses floats, therefore it can be a little less accurate.
- It consumes two random numbers (an integer and a float) to generate one sample. This can be fixed, however, at the cost of additional precision loss.
- While it has the same O(n) time and memory cost for initialization, it has a larger constant factor: Vose's alias method requires several times more time and memory for initialization.
- It requires more memory while generating samples.

However, it has one advantage: the cost of generating each sample really is constant time.
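For reference, a compact sketch of Vose's alias method is shown below. It is written from the general description of the algorithm, not taken from the benchmarked implementations; `build_alias_table` and `alias_sample` are hypothetical names. It shows both points made above: each sample uses one integer and one float, and sampling involves no search.

```python
import random


def build_alias_table(weights):
    """Return (prob, alias) lists for Vose's alias method; O(n) setup."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must contain at least one positive value")
    # Scale the weights so that the average entry is exactly 1.
    scaled = [w * n / total for w in weights]
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    prob = [1.0] * n
    alias = list(range(n))
    while small and large:
        small_i = small.pop()
        large_i = large.pop()
        prob[small_i] = scaled[small_i]
        alias[small_i] = large_i
        # The large entry gives up the mass needed to fill the small slot.
        scaled[large_i] = (scaled[large_i] + scaled[small_i]) - 1.0
        (small if scaled[large_i] < 1.0 else large).append(large_i)
    # Any leftover entries keep prob 1.0, which absorbs rounding error.
    return prob, alias


def alias_sample(prob, alias, rng=random):
    """Draw one index in O(1): one integer and one float per sample."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```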

Here are some benchmark results. "Roulette Wheel" is the implementation proposed above. "Roulette Wheel 2" is a modification of it with normalized cumulative sums; it takes twice as long to initialize but generates each sample 1.5-2x faster. "Vose's Alias" is an implementation of Vose's alias method translated directly from Java. "Vose's Alias 2" is an optimized implementation that uses Python-specific features.

The second column is the size of the distribution, the third is the initialization time (in milliseconds), the fourth is the time to generate each sample (in microseconds), and the fifth is the number of generated samples after which the method overtakes "Roulette Wheel" (including initialization time).

| Method | Size | Init (ms) | Per sample (µs) | Samples to overtake "Roulette Wheel" |
|---|---|---|---|---|
| Roulette Wheel | 10 | 0.059 | 7.165 | 0 |
| Roulette Wheel 2 | 10 | 0.076 | 4.105 | 5 |
| Vose's Alias | 10 | 0.129 | 13.206 | - |
| Vose's Alias 2 | 10 | 0.105 | 6.501 | 69 |
| Roulette Wheel | 100 | 0.128 | 8.651 | 0 |
| Roulette Wheel 2 | 100 | 0.198 | 4.630 | 17 |
| Vose's Alias | 100 | 0.691 | 12.839 | - |
| Vose's Alias 2 | 100 | 0.441 | 6.547 | 148 |
| Roulette Wheel | 1000 | 0.719 | 10.949 | 0 |
| Roulette Wheel 2 | 1000 | 1.458 | 5.177 | 128 |
| Vose's Alias | 1000 | 6.614 | 13.052 | - |
| Vose's Alias 2 | 1000 | 3.704 | 6.531 | 675 |
| Roulette Wheel | 10000 | 7.495 | 13.249 | 0 |
| Roulette Wheel 2 | 10000 | 14.961 | 6.051 | 1037 |
| Vose's Alias | 10000 | 69.937 | 13.830 | - |
| Vose's Alias 2 | 10000 | 37.017 | 6.746 | 4539 |
| Roulette Wheel | 100000 | 73.988 | 16.180 | 0 |
| Roulette Wheel 2 | 100000 | 148.176 | 8.182 | 9275 |
| Vose's Alias | 100000 | 690.099 | 13.808 | 259716 |
| Vose's Alias 2 | 100000 | 391.367 | 7.095 | 34932 |
| Roulette Wheel | 1000000 | 743.415 | 19.493 | 0 |
| Roulette Wheel 2 | 1000000 | 1505.409 | 8.930 | 72138 |
| Vose's Alias | 1000000 | 7017.669 | 13.798 | 1101673 |
| Vose's Alias 2 | 1000000 | 4044.746 | 7.152 | 267507 |

As you can see, Vose's alias method has a very large initialization time. The non-optimized version never overtakes "Roulette Wheel" on small distributions (<100000), and even the optimized version never overtakes "Roulette Wheel 2" on small distributions (<100000). Vose's alias method has an advantage only with very large distributions, and only when you need a very large number of samples.

Because generating a single sample calls for the method with the fastest initialization, we need the "Roulette Wheel" implementation. And because large distributions are rare, I think there is no need for an alternative implementation. In the worst case, generating 1000000 samples from a 1000000-element distribution, the difference between "Roulette Wheel" and "Vose's Alias 2" is the difference between 20 and 11 seconds.
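The 20 versus 11 seconds figure follows directly from the 1000000-element rows of the benchmark. The short calculation below reproduces it, assuming total time is initialization time plus the per-sample time multiplied by the number of samples; the helper name `total_seconds` is made up for this illustration.

```python
SAMPLES = 1_000_000


def total_seconds(init_ms, per_sample_us, samples=SAMPLES):
    """Initialization (ms) plus sampling (µs per sample), in seconds."""
    return init_ms / 1_000 + per_sample_us * samples / 1_000_000


roulette = total_seconds(743.415, 19.493)    # "Roulette Wheel"
vose2 = total_seconds(4044.746, 7.152)       # "Vose's Alias 2"

# Samples needed before the faster per-sample rate pays back the
# extra initialization cost (both sides converted to microseconds).
break_even = (4044.746 - 743.415) * 1_000 / (19.493 - 7.152)
```

Here `roulette` comes to roughly 20.2 seconds and `vose2` to roughly 11.2 seconds, while `break_even` is about 267,500 samples, in line with the last column of the benchmark.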

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

Serhiy, from a technical standpoint, your latest patch looks like a solid solution. From a module design standpoint we still have a few options to think through, though. What if random.weighted_choice_generator was moved to random.choice_generator and refactored to take an array of weights as an optional argument? Likewise, random.weighted_choice could still be implemented with an optional arg to random.choice. Here are the pros and cons of each implementation as I see them.

Implementation: weighted_choice_generator + weighted_choice

Pros:
- Distinct functions help indicate that weighted_choice should be used in a different manner than choice -- [weighted_choice(x) for _ in range(n)] isn't efficient.
- Can take Mapping or Sequence as argument.
- Has a single parameter.

Cons:
- Key, not value, is returned.
- Requires two new functions.
- Dissimilar to random.choice.
- Long function name (weighted_choice_generator).

Implementation: choice_generator + optional arg to choice

Pros:
- Builds on existing code layout.
- Value returned directly.
- Only a single new function required.
- More compact function name.

Cons:
- Difficult to support Mappings.
- Two args required for choice_generator and random.choice.
- Users may use [choice(x, weights) for _ in range(n)] expecting efficient results.

7fb66088-1777-4bc1-b120-d9516c88dfaa commented 11 years ago

I think Storchaka's solution is more transparent and I agree with him on the point that the choice generator should be exposed.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

> I think Storchaka's solution is more transparent and I agree with him on the point that the choice generator should be exposed.

Valid point -- transparency should be priority #1

serhiy-storchaka commented 11 years ago

Most existing implementations produce just an index. That is why weighted_choice() accepts a single list of weights and returns an index. On the other hand, I think working with mappings will be a desired feature too (especially because Counter is in the stdlib). Indexable sequences and mappings are similar. In both cases weighted_choice() returns a value which can be used as an index/key of the input argument.

If you need to choose an element from some sequence, just use seq[weighted_choice(weights)]. Actually weighted_choice() has no common code with choice() and has quite different use cases. They should be as dissimilar as possible. Perhaps we should even avoid the "choice" part in the function names (are there any ideas?) to emphasize this.

bed36d65-e285-4d18-acf3-5dd5d6a271d3 commented 11 years ago

You have me convinced, Serhiy. I see the value in making the two functions distinct.

For naming purposes, perhaps weighted_index() would be more descriptive.

mdickinson commented 10 years ago

Closed bpo-22048 as a duplicate of this one.