python / cpython

The Python programming language
https://www.python.org/
Other
60.01k stars 29.05k forks source link

Patch of itertools.{combinations,permutations} for empty combinations #49066

Closed f24ef518-0963-4e81-90b1-85bb1f5e2b60 closed 15 years ago

f24ef518-0963-4e81-90b1-85bb1f5e2b60 commented 15 years ago
BPO 4816
Nosy @rhettinger, @mdickinson
Files
  • itertools-empty-combinations.diff
  • combperm4.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['tests', 'type-bug', 'library'] title = 'Patch of itertools.{combinations,permutations} for empty combinations' updated_at = user = 'https://bugs.python.org/TFinley' ``` bugs.python.org fields: ```python activity = actor = 'rhettinger' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Library (Lib)', 'Tests'] creation = creator = 'TFinley' dependencies = [] files = ['12563', '12647'] hgrepos = [] issue_num = 4816 keywords = ['patch'] message_count = 22.0 messages = ['78943', '78948', '79111', '79116', '79308', '79366', '79367', '79369', '79370', '79371', '79372', '79373', '79374', '79379', '79380', '79381', '79382', '79383', '79386', '79393', '79394', '79399'] nosy_count = 4.0 nosy_names = ['rhettinger', 'mark.dickinson', 'LambertDW', 'TFinley'] pr_nums = [] priority = 'normal' resolution = 'fixed' stage = None status = 'closed' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue4816' versions = ['Python 2.6', 'Python 3.0', 'Python 3.1', 'Python 2.7'] ```

    f24ef518-0963-4e81-90b1-85bb1f5e2b60 commented 15 years ago

    This is a patch for the Python 3.1 build checked out from http://svn.python.org/projects/python/branches/py3k

    The current behavior of itertools.combinations(iterable,r) and itertools.permutations(iterable,r) is to throw a ValueError if iterable yields fewer than r items. This changes these functions so the generator instead yields 0 items.

    As for my argument for acceptance, while the original behavior is not a bug insofar as its implementation was deliberate, I'd argue this is undesirable on grounds of mathematical consistency and functionality.

    In mathematics the "choose" function is defined as "(n choose r)=0" for r>n, so itertools.combinations' behavior is inconsistent with its obvious combinatorial analogy. (Similar for permutations and the combinatorial "P" function.)

    For functionality I'll cite my own case as anecdote. In writing regression tests I wanted to ensure that a group of items obeyed a certain "mutually consistent" property between all triples. (Something akin to the triangle inequality.) So:

    all(triangle_respected(*triple) for triple in itertools.combinations(group, 3))

    If len(group)\<=2, that's fine, since with no triangles there is no inconsistency, and all(())==True. However, the code failed with a ValueError when groups were that small. Working around this was fairly awkward, since I wanted to continue to fail noisily if my triangle_respected function threw a ValueError, and I wasn't sure at all that it wouldn't if my items were

    The patch modifies Modules/itertoolsmodule.c slightly, changing combinationsobject_new and permutationsobject_new. (Deleting the error throwing code, and have one line changes in both structures to handle the n>r case. One could handle this special case more efficiently than I do by not bothering to allocate certain structures in this case, but I didn't want to tempt fate.) The patch also changes the appropriate tests in Lib/test/test_itertools.py .

    Obviously, this would break code that depends upon Python throwing ValueError in the event of the combination or permutation sequence being empty. However, I hope given that combinations and permutations are a relative novelty that their behavior in this corner case is not yet etched in stone.

    Sorry if this belongs in a PEP -- it seems quite minor, but I don't quite have a feel what the threshold is.

    mdickinson commented 15 years ago

    I agree that the proposed behaviour seems more correct: the collection of all subsets of size 4 of range(3) is perfectly valid and well-defined; it just happens to be empty. I've also encountered this in practice (in code that was enumerating partitions of a list into a fixed number of pieces), but in my case it was easier to work around.

    I guess the counterargument is that the current ValueError might catch bugs early; I don't know how often this is true in practice.

    In any case, this is Raymond's call.

    rhettinger commented 15 years ago

    Will spend a while mulling this over and taking it under advisement.

    Initially, I am disinclined for several reasons.

    1. For many use cases, r>n is an error condition that should not pass silently.

    2. While it's possible to make definitions of comb/perm that define the r>n case as having an empty result, many definitions do not. See the wikipedia article on permutations:

    """ In general the number of permutations is denoted by P(n, r), nPr, or sometimes P_n^r, where:

    * n  is the number of elements available for selection, and
    * r  is the number of elements to be selected (0 ≤ r ≤ n).

    For the case where r = n it has just been shown that P(n, r) = n!. The general case is given by the formula:

    P(n, r) = n! / (n-r)!.

    """

    That discussion is typical and I think it important the number of items returned matches the formula (which fails when r>n because (n-r)! would call for a negative factorial).

    Also see the wikipedia article on combinations (the "choose" function) which also expresses a factorial formula that fails for r>n. No mention is made for special handling for r>n.

    1. For cases where you need a more inclusive definition that assigns a zero length result to cases where r>n, the workaround is easy. Moreover, in such cases, there is some advantage to being explicit about those cases being handled. In the example provided by the OP, the explicit workaround is:

      all(triangle_respected(*triple) for triple in itertools.combinations(group, 3) if len(group)>=3)

    or if you factor-out the unvarying constant expression in the inner-loop:

    len(group)>=3 or all(triangle_respected(*triple) for triple in itertools.combinations(group, 3))

    For other cases, it may be preferable to write your own wrapper:

       def myperm(pool, r=None):
          'custom version of perm returning an empty iterator when r>n'
          if r is not None and r > len(pool):
              return iter([])
          return itertools.permutations(pool, r)

    I like this because it is explicit about its intended behavior but doesn't slow-down the common case where some work actually gets done.

    1. Don't want to break any existing code that relies on the ValueError being thrown. It's too late to change this for 2.6 and 3.0 and possibly for 2.7 and 3.1 without having a deprecation period. While this hasn't been out long, it will have been once 2.7 is released. Code that relies on the ValueError may be hard to spot because it is implicitly relying on the ValueError aborting the program for invalid input (invalid in the sense of how the result is being used).

    2. I personally find the r>n case to be weird and would have a hard time explaining it to statistics students who are counting permutations and relating the result back to the factorial formulas.

    On the flip side, I do like Mark's thought that the r>n case being empty doesn't muck-up the notion of combinations as returning all "r-length subsequences of elements from the input iterable." Also, I can see a parallel in the string processing world where substring searches using __contains__ simply return False instead of raising an exception when the substring is longer that the string being searched.

    Those are my initial thoughts. Will ponder it a bit more this week.

    rhettinger commented 15 years ago

    Quick notes so I don't forget:

    The two main pure python equivalents in the docs would need to be updated with the new special cases (currently they raise IndexError for r>n). The other two pure python equivalents would automatically handle the proposed new cases.

    The Mathematica definition of permutations does not discuss or allow for the r>n case, http://mathworld.wolfram.com/Permutation.html . One of the other tools that actually lists permutations and takes a second argument (most do not) is in Mathematica, http://reference.wolfram.com/mathematica/ref/Permutations.html . It is unclear from their docs whether the r>n case is supported as a zero-length output or as an error condition.

    f24ef518-0963-4e81-90b1-85bb1f5e2b60 commented 15 years ago

    Hi Raymond, thanks for your well reasoned and thorough reply. Just a couple small thoughts... I'll borrow your numbers if you don't mind.

    1. I'd ask you to discount this argument. There are many situations in the Python library where empty results are possible return values. There are likewise many places in mathematics where a set as defined winds up being empty. And sure, absolutely, sometimes this empty result will be an error condition. However, right off the top of my head, I can't really think of any other place in the Python library where an empty returned iterator/list is automatically assumed to necessarily be an error.

    The closest I can think of right now might be key/index errors on dicts and sequences... but that's a user explicitly asking for exactly one element, in which case non-existence would be bad.

    In a larger sense, if a result computed by a function is sensible and well defined in itself, it doesn't seem even possible for a function to anticipate what answers might be considered error conditions by calling code. Potentially any return value for any function might be an error for some user code. So... the question then becomes, is the empty result a sensible and well defined answer. And here we differ again I see. :D

    1. OK on perms. If you don't mind, I'm going to talk about combinations since that was my real motivation.

      Also see the wikipedia article on combinations (the "choose" function) which also expresses a factorial formula that fails for r>n. No mention is made for special handling for r>n.

    Hmmm. Are you looking at the definition of "choose function" under Wikipedia? http://en.wikipedia.org/wiki/Choose_function Notice the material after the "and," after the first part of the definition, in case you missed it in your first reading. So, this is the def I was taught through high school and undergrad, perhaps the same for Mark judging from his reply. You were taught it was undefined? It does sort of complicate some other defs that rely on binomial coefficients though, I'd think, but I guess you could make those work with little effort. It's not as though subtly different but incompatible definitions of basic terms are that uncommon in math... never knew the binomial coef was in that category though. You teach stats I take it? I suppose you'd know then... never heard of it, but that doesn't mean much. Oh well. (You can perhaps feel my pain at seeing the current behavior, given my understanding.)

    1. For your "replacement" myperm (let's say mycomb), if you insert a "pool=tuple(iterable)" into the beginning, you'll have my fix, statement for statement. You advance this as desirable, while to my eye it was nasty enough to motivate me to submit my first patch to Python. So, you can imagine we have a slight difference of opinion. ;) Let's see if I can't present my view effectively...

    Saying the fix is more explicit suggests there's something implicit about the concise solution. If the choose function is defined for those values, then itertools.combinations is just unnecessarily forcing code to treat something as a special case even when it's not. So, I'd be careful to draw a distinction between "more verbose" and "more explicit."

    1. Can't really speak to that. It's probably worth changing documentation in any event, even if you reject the patch. The provided equiv code and the actual behavior in this "error case" differ anyway as you noted in your second reply. Also, this error condition isn't evident from the documentation without reading and parsing the source code.

    A third and extraneous comment, if you do wind up changing the docs anyway, perhaps a shorter recursive solution would be easier to understand? Maybe it's just me since I'm sort of a functional type of guy, but I found the equivalent code a little hard to parse.

    1. Yeah. Actually, I'm pretty sure that's why the choose function is defined piecewise, since negative factorials are undefined as you say.
    rhettinger commented 15 years ago
    1. This is the most important issue. Is the r>n case typically an error in the user's program or thought process. If so, then the utility of a ValueError can be weighed against the utility of other cases where you want combs/perms to automatically handle the r>n cases with an empty output. These use cases are what I'm mulling over this week.

    2. I see your wikipedia link. The one I was looking at was: http://en.wikipedia.org/wiki/Combination

    3. I wasn't clear on the issue of explictness. The potential problem that it isn't immediately obvious to me what comb/perm does when r>n, so if I see code that isn't explictly handling that case, I have to lookup what comb/perm does do to make sure the code works.

    In the math module, it is a virtue that math.sqrt(-1) raises a ValueError. In the cmath module, it is a virtue that it does not. The latter case is distinguished because the programmer has explicitly requested a routine that can handle complex numbers -- that is a good clue that the surrounding code was designed to handle the complex result.

    1. Not too worried about this one. Essentially the thought is that code that wasn't designed with the r>n case in mind probably benefits from having a ValueError raised when that condition is encountered.

    2. This one bugs me a bit. It is nice to have all the factorial
      formulas just work and not have a piecewise definition.

    BTW, we don't really have a difference of opinion. My mind is open. I aspire to pick the option that is the best for most users including students and novice users. The trick in language design is to anticipate use cases and to understand that people have differing world views (i.e. it is a perfectly reasonable point-of-view that taking n things r at a time makes no sense when r is greater than n).

    In choosing, there is some bias toward sticking the API as it was released. Changing horses in midstream always causes a headache for some user somewhere.

    Am still thinking this one through and will let you know in a week or so. I also want to continue to research into how this is handled in other libraries and other languages.

    rhettinger commented 15 years ago

    Another thought before I forget: The piecewise definition of the choose function or for binomial coefficients suggests that supporting the r>n case should be accompanied by supporting the r\<0 case.

    mdickinson commented 15 years ago
    1. This one bugs me a bit. It is nice to have all the factorial formulas just work and not have a piecewise definition.

    IIRC, in the 'Concrete Mathematics' book, Knuth and co use something like

    (n choose k) = (n-to-the-k-falling)/k!

    to get around this: this definition works for all k >= 0 and all integers (or even real numbers) n. Here x-to-the-k-falling means x (x-1) ... * (x-k+1). See:

    http://mathworld.wolfram.com/FallingFactorial.html

    Another thought before I forget: The piecewise definition of the choose function or for binomial coefficients suggests that supporting the r>n case should be accompanied by supporting the r\<0 case.

    I'd say not: in the context of sets of combinations/permutations, (rather than just defining the nCr or nPr symbols) r has a definite meaning as a cardinality: "enumerate all subsets of iter of cardinality r" and so it makes sense to me to restrict to the case where r is nonnegative.

    (If we were talking about the integer-valued function nCr, then I'd agree.)

    My own guess would be that having combinations(range(n), r) give the empty iterator for r > n is the right thing in the vast majority of situations, even (especially?) where the programmer hasn't anticipated the possibility of r > n. I have yet to see a case where it's useful to raise ValueError.

    I have access to the Magma algebra package at work; I'll check tomorrow or Friday to see what that does.

    78a6bf11-a01d-46d5-ba37-0e5218983f8f commented 15 years ago

    Mathematica returns an empty list.

    In[1]:= Permutations[{1,2},{1}]

    Out[1]= {{1}, {2}}

    In[2]:= Permutations[{1,2},{4}]

    Out[2]= {}

    In[3]:=

    rhettinger commented 15 years ago

    David, thanks for the data point. What does it return for

    In[1]:= Permutations[{a, b, c}, {-1}]

    78a6bf11-a01d-46d5-ba37-0e5218983f8f commented 15 years ago

    Mathematica indicates for the user to define it later. An error.

    In[3]:= Permutations[{1,2},{-2}]

    Permutations::nninfseq: Position 2 of Permutations[{1, 2}, {-2}] must be All, Infinity, a non-negative integer, or a List whose first element (required) is a non-negative integer, second element (optional) is a non-negative integer or Infinity, and third element (optional) is a nonzero integer.

    Out[4]= Permutations[{1, 2}, {-2}]

    mdickinson commented 15 years ago

    Results from Magma, Sage and Gap:

    Magma provides functions "Subsets(S, k)" and "Permutations(S, k)". From the documentation:

    Subsets(S, k) : The set of subsets of the set S of size k. If k is larger than the cardinality of S then the result will be empty.

    Permutations(S, k) : The set of permutations (stored as sequences) of each of the subsets of the set S of cardinality k.

    GAP has the same behaviour: even if you've never encountered GAP before, it's fairly Pythonesque, so the following extract from a GAP 4 REPL means exactly what you think it does:

    gap> Combinations([1,2,3,4], 3); [ [ 1, 2, 3 ], [ 1, 2, 4 ], [ 1, 3, 4 ], [ 2, 3, 4 ] ] gap> Combinations([1,2,3,4], 5); [ ]

    Permutations work the same way (but the function is called Arrangements):

    gap> Arrangements([1,2,3,4], 3); [ [ 1, 2, 3 ], [ 1, 2, 4 ], [ 1, 3, 2 ], [ 1, 3, 4 ], [ 1, 4, 2 ], [ 1, 4, 3 ], [ 2, 1, 3 ], [ 2, 1, 4 ], [ 2, 3, 1 ], [ 2, 3, 4 ], [ 2, 4, 1 ], [ 2, 4, 3 ], [ 3, 1, 2 ], [ 3, 1, 4 ], [ 3, 2, 1 ], [ 3, 2, 4 ], [ 3, 4, 1 ], [ 3, 4, 2 ], [ 4, 1, 2 ], [ 4, 1, 3 ], [ 4, 2, 1 ], [ 4, 2, 3 ], [ 4, 3, 1 ], [ 4, 3, 2 ] ] gap> Arrangements([1,2,3,4], 5); [ ]

    Combinations([1,2,3,4], -1) gives an error. Interestingly, Arrangements([1,2,3,4], -1) returns the empty list.

    GAP also has a NrCombinations function returning the number of combinations:

    gap> NrCombinations([1,2,3,4], 5); 0 gap> NrCombinations([1,2,3,4], -1); 0

    My Sage installation currently seems to be broken, but from the documentation it looks as though it just steals the functions from GAP.

    mdickinson commented 15 years ago

    Got Sage working again. It also returns an empty list for r > n. For r negative, Combinations returns an empty list and Permutation gives an error.

    sage: Combinations(range(4), 6) Combinations of [0, 1, 2, 3] of length 6 sage: Combinations(range(4), 6).list() [] sage: Permutations(range(4), 6).list() [] sage: Combinations(range(4), -1).list() [] sage: Permutations(range(4), -1).list() ---------------------------------------------------------------------------

    IndexError                                Traceback (most recent call last)
    [... traceback snipped ...]
    rhettinger commented 15 years ago

    Data point: Microsoft Excel's PERMUT() and COMBIN() return #NUM! for r>n or r\<0.

    78a6bf11-a01d-46d5-ba37-0e5218983f8f commented 15 years ago

    I try to "not know" excel. Does it have any other means to represent an empty set?

    rhettinger commented 15 years ago

    Excel is returning the count, not the set itself. So, it could have chosen to return zero.

    The HP32SII also has nCr and nPr functions that return INVALID DATA for r>n or r\<0.

    The Combinatorics section of CRC Standard Mathematical Formulae (30th edition) defines nCr and nPr only for 0 \<= r \<= n. The same is true in Harris and Stocker's Handbook of Mathematics and Computational science.

    rhettinger commented 15 years ago

    Summary of functions/definitions expected to return a number:

    r>n r\<0 Source --------- ---------- ------------------------------ error error MS Excel PERMUT() and COMBIN() error error HP32SII nPr and nCr
    undefined undefined CRC Combinatoric definitions
    undefined undefined Handbook of Mathematics and Computational Science
    undefined undefined Wolfram:
    http://mathworld.wolfram.com/Permutation.html zero zero http://en.wikipedia.org/wiki/Choose_function
    undefined undefined http://en.wikipedia.org/wiki/Combination
    undefined undefined http://en.wikipedia.org/wiki/Permutation
    zero undefined Knuth's choose function in Concrete Mathematics
    zero zero GAP's nrCombinations()

    Summary for tools that return the individual subsets:

    r>n r\<0 Source --------- ---------- ------------------------------ emptylist error Mathematica emptylist unknown Magma emptylist error GAP Combinations emptylist emptylist GAP Arrangements emptylist emptylist Sage Combinations emptylist error Sage Permutations

    rhettinger commented 15 years ago

    David, I know the OPs position and Mark's opinion. What is your recommendation for the r>n case and the r\<0 case?

    rhettinger commented 15 years ago

    Attached an updated patch which expands the tests and docs. Handles the r>n case with an empty result. Leaves the r\<0 case as a ValueError.

    Am thinking that if we do this, it should be treated as a bug and posted to 2.6 and 3.0.

    78a6bf11-a01d-46d5-ba37-0e5218983f8f commented 15 years ago

    I had thought highly of the "mull it over for a week" plan. After a week we'd decide to follow Stephen Wolfram's lead, which seems to be the current patch. I haven't yet used the python permutations iterator, although I used to have a script that solved word JUMBLE puzzles with a mathematica | spell pipeline. Now I look up words using a sorted anagram dictionary.

    rhettinger commented 15 years ago

    Thanks David. Looks like we're all in agreement. Turns out, I did a weeks worth of mulling today ;-) Accepting the proposal with the following rationale:

    1. There exist some valid use cases where r>n.
    2. Returning an empty list is the expected outcome with two of the current pure python versions (the ones that filter results from product() based on their length, uniqueness and sort).
    3. It is consist with the notion of 'all subsets of length r ...'
    4. It corresponds to the piecewise definitions of the choose function and binomial coefficient.
    5. It matches the battle-tested APIs of all of the major mathematics tools we looked at (Mathematica, Magma, GAP, and Sage.
    6. The recipe for combinations_with_replacement() is well defined and useful in cases where r>n -- it is desirable that all the variants of this function keep their APIs in-sync.
    7. The three other posters here all agree on it :-)

    Weighing against it:

    1. The ValueError version has already been released.
    2. There may be use cases where r>n represents an error so that raising a ValueError would be helpful.
    3. The length of the comb/perm doesn't correspond to the non-piecewise, factorial based functions that are so ubiquitous (MS Excel, HP32SII, CRC handbook, Handbook of Mathematics, wikipedia's entries for Combination and Permutation, and GAP's nrCombinations() function).
    rhettinger commented 15 years ago

    Fixed in r68394 Will forward port in a day or two.