Re-implemented with suggested improvements taken into account. Thanks @mark.dickinson and @pitrou for the suggestions.
I also removed the redundant "fast path" portion of this code since it doesn't deal with generators anyway.
Let me know if you have additional thoughts about it.
Left in a line of code that was supposed to be removed. Fixed.
Raymond, do you have time for this issue?
Raymond, any chance of getting a weighted random choices generator into 3.6? Less than a month is left until the feature code freeze.
FWIW, I have four full days set aside for the upcoming pre-feature release sprint, which is dedicated to taking time to thoughtfully evaluate pending feature requests. In the meantime, I'm contacting Alan Downey for a consultation on the best API for this.
As mentioned previously, the generator version isn't compatible with the design of the rest of the module, which allows streams to have their state saved and restored at arbitrary points in the sequence. One API would be to create a list all at once (like random.sample does). Another would be to have two steps (like str.maketrans and str.translate). Ideally, the API should integrate neatly with collections.Counter as a possible input for the weighting. Hopefully, Alan can also comment on the relative frequency of small integer weightings versus the general case (the former benefits from a design using random.choice() applied to Counter.elements(), and the latter benefits from a design with accumulate() and bisect()).
Note, this is a low-priority feature: there is no real demonstrated need, there is already a recipe for it in the docs, and once the best API has been determined, the code is so simple that any of us could implement it in only a few minutes.
Latest draft patch attached (w/o tests or docs). Incorporates consultation from Alan Downey and Jake Vanderplas.
Population and weights are separate arguments (like numpy.random.choice() and sample() in R). Matches the way data would arrive in Pandas. Easily extracted from a Counter or dict using keys() and values(). Suitable for applications that sample the population multiple times but using different weights. See https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html and http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
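As a minimal sketch (the Counter contents below are made up for illustration), extracting the population and weights from a Counter for use with choices() could look like this:

```python
from collections import Counter
from random import choices

# Illustrative tally of survey answers.
tally = Counter({'yes': 35, 'no': 12, 'undecided': 3})

# keys() and values() iterate in the same order, so they pair up cleanly
# as the population and its matching weights.
population = list(tally.keys())
weights = list(tally.values())
sample = choices(population, weights, k=10)
```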
Excludes a replacement=False option. That use case necessarily has integer weights and may be better suited to the existing random.sample() rather than trying to recompute a CDF on every iteration as we would have to in this function.
Allows cumulative_weights to be submitted instead of individual weights. This supports use cases where the CDF already exists (as in the ThinkBayes examples) and where we want to periodically reuse the same CDF for repeated samples of the same population -- this occurs in resampling applications, Gibbs sampling, and Markov chain Monte Carlo (MCMC) applications. Per Jake, "MCMC/Gibbs Sampling approaches generally boil down to a simple weighted coin toss at each step" and "It's definitely common to do aggregation of multiple samples, e.g. to compute sample statistics"
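A minimal sketch of that reuse pattern, assuming the cum_weights keyword of the eventual random.choices() API and made-up states and weights:

```python
from itertools import accumulate
from random import choices

population = ['A', 'B', 'C', 'D']        # illustrative states
weights = [1, 3, 4, 2]                   # illustrative relative weights

# Build the CDF once ...
cum_weights = list(accumulate(weights))  # [1, 4, 8, 10]

# ... then reuse it across repeated draws from the same population,
# as in resampling or MCMC-style loops.
batches = [choices(population, cum_weights=cum_weights, k=100) for _ in range(10)]
```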
The API allows the weights to be integers, fractions, decimals, or floats. Likewise, the population and weights can be any Sequence. Population elements need not be hashable.
Returning a list means that we don't have to save state in mid-stream (that is why we can't use a generator). A list feeds nicely into Counters, mean, median, stdev, etc. for summary statistics. Returning a list parallels what random.sample() does, keeping the module internally consistent.
Default uniform weighting falls back to random.choice() which would be more efficient than bisecting.
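As a rough sketch (not the actual implementation), that unweighted fallback amounts to k independent calls to random.choice():

```python
from random import choice

def uniform_choices(population, k=1):
    # k independent uniform picks; no cumulative distribution, no bisecting.
    return [choice(population) for _ in range(k)]
```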
Bisecting tends to beat other approaches in the general case. See http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python
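The bisecting technique reduces to the following sketch (a per-call version for illustration; choices() can instead reuse a precomputed cum_weights):

```python
from bisect import bisect
from itertools import accumulate
from random import random

def weighted_choice(population, weights):
    # Build the running totals, draw a uniform value below the grand total,
    # then bisect to find which bucket it lands in.
    cum_weights = list(accumulate(weights))
    total = cum_weights[-1]
    return population[bisect(cum_weights, random() * total)]
```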
Incorporates error checks for len(population)==len(cum_weights) and for conflicting specification of both weights and cumulative weights.
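A sketch of what those two checks amount to (the helper name is hypothetical, and the exact exception types and messages in the committed code may differ):

```python
def _validate(population, weights=None, cum_weights=None):
    # Hypothetical helper illustrating the two argument checks described above.
    if weights is not None and cum_weights is not None:
        raise TypeError('Cannot specify both weights and cumulative weights')
    if cum_weights is not None and len(cum_weights) != len(population):
        raise ValueError('The number of weights does not match the population')
```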
The API is not perfect and there are some aspects that give me heartburn.
1) Not saving the computed CDF is wasteful and forces the user to pre-build the CDF if they want to save it for later use (the API could return both the selections and the CDF, but that would be awkward and atypical).
2) For the common case of having small integer weights on a small population, the bisecting approach is slower than using random.choice on a population expanded to include each element multiple times in proportion to its weight (that said, short of passing in a flag, there is no cheap, easy way for this function to detect that case and give it a fast path; see the sketch below).
3) Outputting a list is inefficient if all you're doing with the result is summarizing it with a Counter, histogram tool, mean, median, or stdev.
4) There is no cheap way to check whether the user-supplied cum_weights is sorted or whether the weights contain negative values.
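For reference, the fast path described in point 2 (plain random.choice() over an expanded population) looks roughly like this; the roulette-style weights are only an illustration:

```python
from collections import Counter
from random import choice

# Illustrative small population with small integer weights.
weighted = Counter({'red': 18, 'black': 18, 'green': 2})

# Expanding the population so each element appears as many times as its
# weight lets plain random.choice() do the work, with no bisecting.
expanded = list(weighted.elements())      # 38 elements in total
spins = [choice(expanded) for _ in range(10)]
```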
I've gone through the patch -- looks good to me.
New changeset a5856153d942 by Raymond Hettinger in branch 'default': Issue bpo-18844: Add random.weighted_choices() https://hg.python.org/cpython/rev/a5856153d942
Thanks Davin.
Returning a list instead of an iterator looks unpythonic to me. The values are generated sequentially, so there is no advantage to returning a list.
The implementation lacks the optimizations used in my patch.
The documentation still contains a recipe for weighted choice. It is incompatible with the new function.
There isn't really an option to return a generator because it conflicts with the rest of the module, which uses lists elsewhere and allows state to be saved and restored before and after any function call. One of the design reviewers also said that the generator form would be harder for students to use.
I left the text in the examples section unchanged because it is still valid (showing how to make a cumulative distribution and how to build a fast alternative for the special case of small integer weights). Before the 3.6 release, I expect to expand this section to provide recipes for an MCMC application (using choices() with a passed-in CDF) and some other examples suggested by the design reviewers.
The optimization hacks in the other patch don't seem worth it. The current code is readable and runs fast (the principal steps are all C functions).
Using a generator doesn't prevent state from being saved and restored.
New changeset 39a4be5e003d by Raymond Hettinger in branch '3.6': Issue bpo-18844: Make the number of selections a keyword-only argument for random.choices(). https://hg.python.org/cpython/rev/39a4be5e003d
Equidistributed examples:
choices(c.execute('SELECT name FROM Employees').fetchall(), k=20)
choices(['hearts', 'diamonds', 'spades', 'clubs'], k=5)
choices(list(product(card_facevalues, suits)), k=5)
Weighted selection examples:
Counter(choices(['red', 'black', 'green'], [18, 18, 2], k=3800)) # american roulette
Counter(choices(['hit', 'miss'], [5, 1], k=600)) # russian roulette
choices(fetch('employees'), fetch('years_of_service'), k=100) # tenure weighted
choices(cohort, map(cancer_risk, map(risk_factors, cohort)), k=50) # risk weighted
Star unpacking example:
transpose = lambda s: zip(*s)
craps = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 5), (9, 4), (10, 3), (11, 2), (12, 1)]
print(choices(*transpose(craps), k=10))
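For clarity, the star-unpacked call above is equivalent to passing the two transposed sequences explicitly (the names totals and ways are just illustrative):

```python
from random import choices

totals = (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)   # dice totals
ways = (1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)        # number of ways to roll each total
print(choices(totals, ways, k=10))
```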
Comparative APIs from other languages:
http://www.mathworks.com/help/stats/randsample.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
https://reference.wolfram.com/language/ref/RandomChoice.html
###################################################################
# Flipping a biased coin
from collections import Counter
from random import choices
print(Counter(choices(range(2), [0.9, 0.1], k=1000)))
###################################################################
# Bootstrapping
'From a small statistical sample infer a 90% confidence interval for the mean'
# http://statistics.about.com/od/Applications/a/Example-Of-Bootstrapping.htm
from statistics import mean
from random import choices
data = 1, 2, 4, 4, 10
means = sorted(mean(choices(data, k=5)) for i in range(20))
print('The sample mean of {:.1f} has a 90% confidence interval from {:.1f} to {:.1f}'.format(
mean(data), means[1], means[-2]))
New changeset 433cff92d565 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Fix-up examples for random.choices(). Remove over-specified test. https://hg.python.org/cpython/rev/433cff92d565
New changeset d4e715e725ef by Raymond Hettinger in branch '3.6': Issue bpo-18844: Add more tests https://hg.python.org/cpython/rev/d4e715e725ef
New changeset 32bfc81111b6 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Make the various ways for specifing weights produce the same results. https://hg.python.org/cpython/rev/32bfc81111b6
New changeset 09a87b16d5e5 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Strengthen tests to include a case with unequal weighting https://hg.python.org/cpython/rev/09a87b16d5e5
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/rhettinger'
closed_at =
created_at =
labels = ['type-feature', 'library']
title = 'allow weights in random.choice'
updated_at =
user = 'https://bugs.python.org/aisaac'
```
bugs.python.org fields:
```python
activity =
actor = 'dstufft'
assignee = 'rhettinger'
closed = True
closed_date =
closer = 'rhettinger'
components = ['Library (Lib)']
creation =
creator = 'aisaac'
dependencies = []
files = ['31479', '31547', '31732', '31734', '36331', '42322', '42323', '42386', '42393', '44394', '44407']
hgrepos = []
issue_num = 18844
keywords = ['patch', 'needs review']
message_count = 69.0
messages = ['196229', '196234', '196235', '196252', '196551', '196567', '196709', '196711', '196716', '196721', '196728', '196731', '196741', '196750', '196761', '196767', '197507', '197512', '197540', '197862', '197865', '197866', '198367', '198372', '223750', '224947', '224949', '224953', '224954', '224957', '225128', '225133', '225137', '225140', '225148', '226891', '262625', '262626', '262642', '262649', '262652', '262656', '262678', '262744', '262967', '262970', '262971', '262981', '262982', '262983', '262994', '262995', '267782', '272767', '272785', '274538', '274677', '274684', '274686', '274760', '274907', '274964', '277485', '277486', '277487', '278516', '278633', '279701', '279702']
nosy_count = 14.0
nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'pitrou', 'aisaac', 'westley.martinez', 'python-dev', 'serhiy.storchaka', 'NeilGirdhar', 'madison.may', 'dkorchem', 'Christian.Kleineidam', 'davin', 'xksteven']
pr_nums = ['552']
priority = 'normal'
resolution = 'fixed'
stage = 'patch review'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue18844'
versions = ['Python 3.6']
```