python / cpython

The Python programming language
https://www.python.org
Other
63.42k stars 30.37k forks source link

allow weights in random.choice #63044

Closed 5db38da4-1243-41dd-8f8a-1a42f116f8a7 closed 8 years ago

5db38da4-1243-41dd-8f8a-1a42f116f8a7 commented 11 years ago
BPO 18844
Nosy @tim-one, @rhettinger, @mdickinson, @pitrou, @serhiy-storchaka, @NeilGirdhar, @applio
PRs
  • python/cpython#552
  • Files
  • weighted_choice.diff: Preliminary implementation of random.choice optional arg "weights"
  • weighted_choice_v2.diff: Move cumulative distribution calculation to separate function that returns an index generator
  • weighted_choice_generator.patch
  • wcg_bench.py: Benchmarking different methods
  • weighted_choice_generator_2.patch
  • weighted_choice_v3.diff: a new implementation of weighted choice
  • weighted_choice_v3.patch: weighted choice function
  • weighted_choice_v4.patch
  • weighted_choice_v5.patch: weighted choice function v5
  • weighted_choice.diff
  • weighted_choice2.diff: Add docs and test
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', 'library'] title = 'allow weights in random.choice' updated_at = user = 'https://bugs.python.org/aisaac' ``` bugs.python.org fields: ```python activity = actor = 'dstufft' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Library (Lib)'] creation = creator = 'aisaac' dependencies = [] files = ['31479', '31547', '31732', '31734', '36331', '42322', '42323', '42386', '42393', '44394', '44407'] hgrepos = [] issue_num = 18844 keywords = ['patch', 'needs review'] message_count = 69.0 messages = ['196229', '196234', '196235', '196252', '196551', '196567', '196709', '196711', '196716', '196721', '196728', '196731', '196741', '196750', '196761', '196767', '197507', '197512', '197540', '197862', '197865', '197866', '198367', '198372', '223750', '224947', '224949', '224953', '224954', '224957', '225128', '225133', '225137', '225140', '225148', '226891', '262625', '262626', '262642', '262649', '262652', '262656', '262678', '262744', '262967', '262970', '262971', '262981', '262982', '262983', '262994', '262995', '267782', '272767', '272785', '274538', '274677', '274684', '274686', '274760', '274907', '274964', '277485', '277486', '277487', '278516', '278633', '279701', '279702'] nosy_count = 14.0 nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'pitrou', 'aisaac', 'westley.martinez', 'python-dev', 'serhiy.storchaka', 'NeilGirdhar', 'madison.may', 'dkorchem', 'Christian.Kleineidam', 'davin', 'xksteven'] pr_nums = ['552'] priority = 'normal' resolution = 'fixed' stage = 'patch review' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue18844' versions = ['Python 3.6'] ```

    8be6dc97-7e75-4bae-bcbd-0928118040f1 commented 8 years ago

    Re-implemented with suggested improvements taken into account. Thanks @mark.dickinson and @pitrou for the suggestions.

    I also removed the redundant "fast path" portion for this code since it doesn't deal with generators anyways.

    Let me know additional thoughts about it.

    8be6dc97-7e75-4bae-bcbd-0928118040f1 commented 8 years ago

    Left in a line of code that was supposed to be removed. Fixed.

    serhiy-storchaka commented 8 years ago

    Raymond, do you have a time for this issue?

    serhiy-storchaka commented 8 years ago

    Raymond, any chance to get weighted random choices generator in 3.6? Less than month is left to feature code freeze.

    rhettinger commented 8 years ago

    FWIW, I have four full days set aside for the upcoming pre-feature release sprint which is dedicated to taking time to thoughtfully evaluate pending feature requests. In the meantime, I'm contacting Alan Downey for a consultation for the best API for this. As mentioned previously, the generator version isn't compatible with the design of the rest of the module that allows streams to have their state saved and restored at arbitrary points in the sequence. One API would be to create a list all at once (like random.sample does). Another would be to have two steps (like str.maketrans and str.translate). Ideally, the API should integrate neatly with collections.Counter as a possible input for the weighting. Hopefully, Alan can also comment on the relative frequency of small integer weightings versus the general case (the former benefits from a design using random.choice() applied to Counter.elements() and the latter benefits from a design with accumulate() and bisect()). Note, this is a low priority feature (no real demonstrated need, there is already a recipe for it in the docs, and once the best API have been determined, the code is so simple that any of us could implement it in only a few minutes).

    rhettinger commented 8 years ago

    Latest draft patch attached (w/o tests or docs). Incorporates consultation from Alan Downey and Jake Vanderplas.

    There API is not perfect and there are some aspects that give me heartburn. 1) Not saving the computed CDF is waste and forces the user to pre-build the CDF if they want to save it for later use (the API could return both the selections and the CDF but that would be awkward and atypical). 2) For the common case of having small integer weights on a small population, the bisecting approach is slower than using random.choice on a population expanded to include the selections multiple times in proportion to their weights (that said, short of passing in a flag, there is no cheap easy way for this function to detect that case and give it a fast path). 3) Outputting a list is inefficient if all you're doing with result is summarizing it with a Counter, histogram tool, mean, median, or stdev. 4) There is no cheap way to check to see if the user supplied cum_weights is sorted or if the weights contain negative values.

    applio commented 8 years ago

    I've gone through the patch -- looks good to me.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset a5856153d942 by Raymond Hettinger in branch 'default': Issue bpo-18844: Add random.weighted_choices() https://hg.python.org/cpython/rev/a5856153d942

    rhettinger commented 8 years ago

    Thanks Davin.

    serhiy-storchaka commented 8 years ago
    1. Returning a list instead of an iterator looks unpythonic to me. Values generated sequentially, there are no advantages of returning a list.

    2. An implementation lacks optimizations used in my patch.

    3. The documentation still contains a receipt for weighted choice. It is incompatible with new function.

    rhettinger commented 8 years ago

    There isn't really an option to return a generator because it conflicts the rest of the module that uses lists elsewhere and that allows state to be saved and restored before and after any function call. One of the design reviewers also said that the generator form would harder for students to use.

    I left the text in the examples section unchanged because it is still valid (showing how to make a cumulative distribution and how to build a fast alternative for the special case of small integer weights). Before the 3.6 release, I expect to expand this section to provide recipes for a MCMC application (using choices() with a passed-in CDF) and some other examples suggested by the design reviewers.

    The optimization hacks in the other patch don't seem worth it. The current code is readable and runs fast (the principal steps are all C functions).

    serhiy-storchaka commented 8 years ago

    Using a generator doesn't prevents state to be saved and restored.

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 39a4be5e003d by Raymond Hettinger in branch '3.6': Issue bpo-18844: Make the number of selections a keyword-only argument for random.choices(). https://hg.python.org/cpython/rev/39a4be5e003d

    rhettinger commented 8 years ago

    Equidistributed examples:

    choices(c.execute('SELECT name FROM Employees').fetchall(), k=20)
    choices(['hearts', 'diamonds', 'spades', 'clubs'], k=5)
    choices(list(product(card_facevalues, suits)), k=5)

    Weighted selection examples:

      Counter(choices(['red', 'black', 'green'], [18, 18, 2], k=3800))   # american roulette
      Counter(choices(['hit', 'miss'], [5, 1], k=600))                   # russian roulette
      choices(fetch('employees'), fetch('years_of_service'), k=100)      # tenure weighted
      choices(cohort, map(cancer_risk, map(risk_factors, cohort)), k=50) # risk weighted

    Star unpacking example:

       transpose = lambda s: zip(*s)
       craps = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6), (8, 5), (9, 4), (10, 3), (11, 2), (12, 1)]
       print(choices(*transpose(craps), k=10))

    Comparative APIs from other languages:

    http://www.mathworks.com/help/stats/randsample.html
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html
    https://stat.ethz.ch/R-manual/R-devel/library/base/html/sample.html
    https://reference.wolfram.com/language/ref/RandomChoice.html
    rhettinger commented 8 years ago

    ################################################################### # Flipping a biased coin

    from collections import Counter
    from random import choices

    print(Counter(choices(range(2), [0.9, 0.1], k=1000)))

    ################################################################### # Bootstrapping

    'From a small statistical sample infer a 90% confidence interval for the mean' # http://statistics.about.com/od/Applications/a/Example-Of-Bootstrapping.htm

    from statistics import mean
    from random import choices
    
    data = 1, 2, 4, 4, 10
    means = sorted(mean(choices(data, k=5)) for i in range(20))
    print('The sample mean of {:.1f} has a 90% confidence interval from {:.1f} to {:.1f}'.format(
      mean(data), means[1], means[-2]))
    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 433cff92d565 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Fix-up examples for random.choices(). Remove over-specified test. https://hg.python.org/cpython/rev/433cff92d565

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset d4e715e725ef by Raymond Hettinger in branch '3.6': Issue bpo-18844: Add more tests https://hg.python.org/cpython/rev/d4e715e725ef

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 32bfc81111b6 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Make the various ways for specifing weights produce the same results. https://hg.python.org/cpython/rev/32bfc81111b6

    1762cc99-3127-4a62-9baf-30c3d0f51ef7 commented 8 years ago

    New changeset 09a87b16d5e5 by Raymond Hettinger in branch '3.6': Issue bpo-18844: Strengthen tests to include a case with unequal weighting https://hg.python.org/cpython/rev/09a87b16d5e5