Re-implemented with suggested improvements taken into account. Thanks @mark.dickinson and @pitrou for the suggestions.
I also removed the redundant "fast path" portion of this code since it doesn't deal with generators anyway.
Let me know if you have additional thoughts about it.
Left in a line of code that was supposed to be removed. Fixed.
Raymond, do you have time for this issue?
Raymond, any chance of getting a weighted random choices generator into 3.6? Less than a month is left until the feature code freeze.
FWIW, I have four full days set aside for the upcoming pre-feature release sprint, which is dedicated to taking time to thoughtfully evaluate pending feature requests. In the meantime, I'm contacting Alan Downey for a consultation on the best API for this.
As mentioned previously, the generator version isn't compatible with the design of the rest of the module, which allows streams to have their state saved and restored at arbitrary points in the sequence. One API would be to create a list all at once (like random.sample does). Another would be to have two steps (like str.maketrans and str.translate). Ideally, the API should integrate neatly with collections.Counter as a possible input for the weighting. Hopefully, Alan can also comment on the relative frequency of small integer weightings versus the general case (the former benefits from a design using random.choice() applied to Counter.elements(), and the latter benefits from a design with accumulate() and bisect()).
Note, this is a low-priority feature: there is no real demonstrated need, there is already a recipe for it in the docs, and once the best API has been determined, the code is so simple that any of us could implement it in only a few minutes.
Latest draft patch attached (w/o tests or docs). Incorporates consultation from Alan Downey and Jake Vanderplas.
Population and weights are separate arguments (like numpy.random.choice() and sample() in R). Matches the way data would arrive in Pandas. Easily extracted from a Counter or dict using keys() and values(). Suitable for applications that sample the population multiple times but using different weights. See https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
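As a minimal sketch (the Counter contents below are made up for illustration), extracting the population and weights from a Counter for use with choices() could look like this:

```python
from collections import Counter
from random import choices

# Illustrative tally of survey answers.
tally = Counter({'yes': 35, 'no': 12, 'undecided': 3})

# keys() and values() iterate in the same order, so they pair up cleanly
# as the population and its matching weights.
population = list(tally.keys())
weights = list(tally.values())
sample = choices(population, weights, k=10)
```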
Excludes a replacement=False option. That use case necessarily has integer weights and may be better suited to the existing random.sample() rather than trying to recompute a CDF on every iteration as we would have to in this function.
Allows cumulative_weights to be submitted instead of individual weights. This supports use cases where the CDF already exists (as in the ThinkBayes examples) and where we want to periodically reuse the same CDF for repeated samples of the same population -- this occurs in resampling applications, Gibbs sampling, and Markov chain Monte Carlo (MCMC) applications. Per Jake, "MCMC/Gibbs Sampling approaches generally boil down to a simple weighted coin toss at each step" and "It's definitely common to do aggregation of multiple samples, e.g. to compute sample statistics"
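A minimal sketch of that reuse pattern, assuming the cum_weights keyword of the eventual random.choices() API and made-up states and weights:

```python
from itertools import accumulate
from random import choices

population = ['A', 'B', 'C', 'D']        # illustrative states
weights = [1, 3, 4, 2]                   # illustrative relative weights

# Build the CDF once ...
cum_weights = list(accumulate(weights))  # [1, 4, 8, 10]

# ... then reuse it across repeated draws from the same population,
# as in resampling or MCMC-style loops.
batches = [choices(population, cum_weights=cum_weights, k=100) for _ in range(10)]
```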
The API allows the weights to be integers, fractions, decimals, or floats. Likewise, the population and weights can be any Sequence. Population elements need not be hashable.
Returning a list means that we don't have to save state in mid-stream (that is why we can't use a generator). A list feeds nicely into Counters, mean, median, stdev, etc. for summary statistics. Returning a list parallels what random.sample() does, keeping the module internally consistent.
Default uniform weighting falls back to random.choice() which would be more efficient than bisecting.
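As a rough sketch (not the actual implementation), that unweighted fallback amounts to k independent calls to random.choice():

```python
from random import choice

def uniform_choices(population, k=1):
    # k independent uniform picks; no cumulative distribution, no bisecting.
    return [choice(population) for _ in range(k)]
```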
Bisecting tends to beat other approaches in the general case. See http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python
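The bisecting technique reduces to the following sketch (a per-call version for illustration; choices() can instead reuse a precomputed cum_weights):

```python
from bisect import bisect
from itertools import accumulate
from random import random

def weighted_choice(population, weights):
    # Build the running totals, draw a uniform value below the grand total,
    # then bisect to find which bucket it lands in.
    cum_weights = list(accumulate(weights))
    total = cum_weights[-1]
    return population[bisect(cum_weights, random() * total)]
```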
Incorporates error checks for len(population)==len(cum_weights) and for conflicting specification of both weights and cumulative weights.
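A sketch of what those two checks amount to (the helper name is hypothetical, and the exact exception types and messages in the committed code may differ):

```python
def _validate(population, weights=None, cum_weights=None):
    # Hypothetical helper illustrating the two argument checks described above.
    if weights is not None and cum_weights is not None:
        raise TypeError('Cannot specify both weights and cumulative weights')
    if cum_weights is not None and len(cum_weights) != len(population):
        raise ValueError('The number of weights does not match the population')
```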
The API is not perfect and there are some aspects that give me heartburn.
1) Not saving the computed CDF is wasteful and forces the user to pre-build the CDF if they want to save it for later use (the API could return both the selections and the CDF, but that would be awkward and atypical).
2) For the common case of having small integer weights on a small population, the bisecting approach is slower than using random.choice on a population expanded to include each element multiple times in proportion to its weight (that said, short of passing in a flag, there is no cheap, easy way for this function to detect that case and give it a fast path; see the sketch below).
3) Outputting a list is inefficient if all you're doing with the result is summarizing it with a Counter, histogram tool, mean, median, or stdev.
4) There is no cheap way to check whether the user-supplied cum_weights is sorted or whether the weights contain negative values.
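For reference, the fast path described in point 2 (plain random.choice() over an expanded population) looks roughly like this; the roulette-style weights are only an illustration:

```python
from collections import Counter
from random import choice

# Illustrative small population with small integer weights.
weighted = Counter({'red': 18, 'black': 18, 'green': 2})

# Expanding the population so each element appears as many times as its
# weight lets plain random.choice() do the work, with no bisecting.
expanded = list(weighted.elements())      # 38 elements in total
spins = [choice(expanded) for _ in range(10)]
```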
I've gone through the patch -- looks good to me.
New changeset a5856153d942 by Raymond Hettinger in branch 'default': Issue bpo-18844: Add random.weighted_choices() https://hg.python.org/cpython/rev/a5856153d942
Thanks Davin.
Returning a list instead of an iterator looks unpythonic to me. The values are generated sequentially, so there is no advantage to returning a list.
The implementation lacks the optimizations used in my patch.
The documentation still contains a recipe for weighted choice. It is incompatible with the new function.
There isn't really an option to return a generator because it conflicts with the rest of the module, which uses lists elsewhere and allows state to be saved and restored before and after any function call. One of the design reviewers also said that the generator form would be harder for students to use.
I left the text in the examples section unchanged because it is still valid (showing how to make a cumulative distribution and how to build a fast alternative for the special case of small integer weights). Before the 3.6 release, I expect to expand this section to provide recipes for an MCMC application (using choices() with a passed-in CDF) and some other examples suggested by the design reviewers.
The optimization hacks in the other patch don't seem worth it. The current code is readable and runs fast (the principal steps are all C functions).
Using a generator doesn't prevent state from being saved and restored.
New changeset 39a4be5e003d by Raymond Hettinger in branch '3.6': Issue bpo-18844: Make the number of selections a keyword-only argument for random.choices(). https://hg.python.org/cpython/rev/39a4be5e003d
Equidistributed examples:
choices(c.execute('SELECT name FROM Employees').fetchall(), k=20)
choices(['hearts', 'diamonds', 'spades', 'clubs'], k=5)
choices(list(product(card_facevalues, suits)), k=5)
Weighted selection examples:
Counter(choices(['red', 'black', 'green'], [18, 18, 2], k=3800)) # american roulette
Counter(choices(['hit', 'miss'], [5, 1], k=600)) # russian roulette
choices(fetch('employees'), fetch('years_of_service'), k=100) # tenure weighted
choices(cohort, map(cancer_risk, map(risk_factors, cohort)), k=50) # risk weighted
Star unpacking example:
transpose = lambda s: zip(*s)
craps = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 5), (9, 4), (10, 3), (11, 2), (12, 1)]
print(choices(*transpose(craps), k=10))
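For clarity, the star-unpacked call above is equivalent to passing the two transposed sequences explicitly (the names totals and ways are just illustrative):

```python
from random import choices

totals = (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)   # dice totals
ways = (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)        # number of ways to roll each total
print(choices(totals, ways, k=10))
```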
Comparative APIs from other languages:
http://www.mathworks.com/help/stats/randsample.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
https://reference.wolfram.com/language/ref/RandomChoice.html
###################################################################
# Flipping a biased coin
from collections import Counter
from random import choices
print(Counter(choices(range(2), [0.9, 0.1], k=1000)))
###################################################################
# Bootstrapping
'From a small statistical sample infer a 90% confidence interval for the mean'
# http://statistics.about.com/od/Applications/a/Example-Of-Bootstrapping.htm
from statistics import mean
from random import choices
data = 1, 2, 4, 4, 10
means = sorted(mean(choices(data, k=5)) for i in range(20))
print('The sample mean of {:.1f} has a 90% confidence interval from {:.1f} to {:.1f}'.format(
mean(data), means[1], means[-2]))
New changeset 433cff92d565 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Fix-up examples for random.choices(). Remove over-specified test. https://hg.python.org/cpython/rev/433cff92d565
New changeset d4e715e725ef by Raymond Hettinger in branch '3.6': Issue bpo-18844: Add more tests https://hg.python.org/cpython/rev/d4e715e725ef
New changeset 32bfc81111b6 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Make the various ways for specifing weights produce the same results. https://hg.python.org/cpython/rev/32bfc81111b6
New changeset 09a87b16d5e5 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Strengthen tests to include a case with unequal weighting https://hg.python.org/cpython/rev/09a87b16d5e5
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/rhettinger'
closed_at =
created_at =
labels = ['type-feature', 'library']
title = 'allow weights in random.choice'
updated_at =
user = 'https://bugs.python.org/aisaac'
```
bugs.python.org fields:
```python
activity =
actor = 'dstufft'
assignee = 'rhettinger'
closed = True
closed_date =
closer = 'rhettinger'
components = ['Library (Lib)']
creation =
creator = 'aisaac'
dependencies = []
files = ['31479', '31547', '31732', '31734', '36331', '42322', '42323', '42386', '42393', '44394', '44407']
hgrepos = []
issue_num = 18844
keywords = ['patch', 'needs review']
message_count = 69.0
messages = ['196229', '196234', '196235', '196252', '196551', '196567', '196709', '196711', '196716', '196721', '196728', '196731', '196741', '196750', '196761', '196767', '197507', '197512', '197540', '197862', '197865', '197866', '198367', '198372', '223750', '224947', '224949', '224953', '224954', '224957', '225128', '225133', '225137', '225140', '225148', '226891', '262625', '262626', '262642', '262649', '262652', '262656', '262678', '262744', '262967', '262970', '262971', '262981', '262982', '262983', '262994', '262995', '267782', '272767', '272785', '274538', '274677', '274684', '274686', '274760', '274907', '274964', '277485', '277486', '277487', '278516', '278633', '279701', '279702']
nosy_count = 14.0
nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'pitrou', 'aisaac', 'westley.martinez', 'python-dev', 'serhiy.storchaka', 'NeilGirdhar', 'madison.may', 'dkorchem', 'Christian.Kleineidam', 'davin', 'xksteven']
pr_nums = ['552']
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue18844'
versions = ['Python 3.6']
```