python / cpython

The Python programming language
https://www.python.org

itertools.product with infinite iterator cause MemoryError. #54318

Closed 6a7c622c-37fb-42e8-a2de-3e3fa98e85ed closed 14 years ago

6a7c622c-37fb-42e8-a2de-3e3fa98e85ed commented 14 years ago
BPO 10109
Nosy @tim-one, @rhettinger, @terryjreedy, @merwok
Files
  • product.py: alt. implementation of cartesian product generator
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/rhettinger'
    closed_at =
    created_at =
    labels = ['invalid', 'type-bug', 'library']
    title = 'itertools.product with infinite iterator cause MemoryError.'
    updated_at =
    user = 'https://bugs.python.org/falsetru'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'terry.reedy'
    assignee = 'rhettinger'
    closed = True
    closed_date =
    closer = 'terry.reedy'
    components = ['Library (Lib)']
    creation =
    creator = 'falsetru'
    dependencies = []
    files = ['24270']
    hgrepos = []
    issue_num = 10109
    keywords = []
    message_count = 15.0
    messages = ['118735', '118831', '151533', '151575', '151579', '151605', '228151', '266774', '276836', '276843', '276853', '276855', '276857', '276860', '276872']
    nosy_count = 9.0
    nosy_names = ['tim.peters', 'rhettinger', 'terry.reedy', 'falsetru', 'nneonneo', 'eric.araujo', 'Sumudu.Fernando', 'yegle', 'Lucas Wiman']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue10109'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']
    ```

    6a7c622c-37fb-42e8-a2de-3e3fa98e85ed commented 14 years ago

    According to the documentation, itertools.product is equivalent to nested for-loops in a generator expression. But itertools.product(itertools.count(2010)) is not.

    >>> import itertools
    >>> (year for year in itertools.count(2010))
    <generator object <genexpr> at 0x026367D8>
    >>> itertools.product(itertools.count(2010))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError

    terryjreedy commented 14 years ago

    The input to itertools.product must be a finite sequence of finite iterables. itertools.count(startvalue) produces an infinite sequence of ints (which are not iterable). Passing count() to product() causes product() to consume it until memory is exhausted. You should get the same result with tuple(itertools.count()), which is essentially what happens inside itertools.product.

    Your generator example does not show the same problem because it is not nested and never runs.

    If you are unfamiliar with Python and its stdlib, please ask questions on python-list (mirror on comp.lang.python and gmane.comp.python.general) before reporting what you think might be an error.
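
The point that product() materializes each argument up front, at construction time, can be demonstrated without exhausting memory by using a small generator that records when it is consumed (the tracked() helper here is purely illustrative):

```python
import itertools

consumed = []

def tracked():
    # record each value as it is pulled from this generator
    for i in range(5):
        consumed.append(i)
        yield i

p = itertools.product(tracked())
# The whole input was drawn when product() was constructed,
# before any output tuple was requested:
assert consumed == [0, 1, 2, 3, 4]
assert next(p) == (0,)
```

With an unbounded input such as count(), that same construction-time consumption is what runs until memory is exhausted.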

    99242df2-e0ea-4177-9535-06e0dd935f5e commented 12 years ago

    I don't agree with the response to this.

    It is true that, as implemented (at least in 2.7; I don't have 3.x handy to check), itertools.product requires finite iterables. However, this seems to be simply a consequence of the implementation and not part of the "spirit" of the function, which, as falsetru pointed out, is stated to be "equivalent to nested for-loops in a generator expression".

    Indeed, implementing product in Python (in a recursive way) doesn't have this problem.

    Here is perhaps a more convincing set of test cases to show why this could be considered a problem:

    >>> import itertools
    >>> itertools.product(xrange(100))
    <itertools.product object at 0xb7ed334c>
    >>> itertools.product(xrange(1000000))
    <itertools.product object at 0xb7ed620c>
    >>> itertools.product(xrange(1000000000))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError

    Note that I'm not even using an infinite iterable, just a really big one. The issue is that creating the iterator fails with a MemoryError, before I've even asked for any values. Consider the following:

    for (i, v) in enumerate(itertools.product(a, b, c)):
        if i < 1000:
            print v
        else:
            break

    When a, b, and c are relatively small, finite iterables, this code works fine. However, if *any* of them are too large (or infinite), we see a MemoryError before the loop even starts, even though only 1000 elements are required. I think it's conceivable that we might want something like "a = itertools.cycle(xrange(5))", and even that will break this loop.

    That said, in all such cases I could think of, we can always either truncate big iterators before passing them to product, or use zip/comprehensions to add their values into the tuple (or some combination of those). So maybe it isn't a huge deal.

    I've attached my implementation of product, which deals with infinite iterators by leveraging enumerate and itertools.cycle, and is pretty much a direct translation of the "odometer" idea. This doesn't support the "repeat" parameter (but probably could using itertools.tee). One thing that should be changed: itertools.cycle shouldn't be called (and doesn't need to be called) on infinite iterators, but I couldn't figure out how to do that. (Maybe there is some way to handle it in the C implementation?)

    In summary: the attached implementation of product can accept any mix of infinite / finite iterators, returning a generator intended for partial consumption. The existing itertools.product doesn't work in this case.
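
The truncation workaround mentioned above can be sketched in a couple of lines (Python 3 syntax): islice caps the unbounded iterator before product() ever sees it.

```python
import itertools

# Cap the unbounded iterator before handing it to product()
pairs = itertools.product(itertools.islice(itertools.count(2010), 3), range(2))
assert list(pairs) == [
    (2010, 0), (2010, 1),
    (2011, 0), (2011, 1),
    (2012, 0), (2012, 1),
]
```

This only helps when an upper bound on how many values are needed from each input is known in advance.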

    terryjreedy commented 12 years ago

    Proposing an expansion of the definition of product() is a *completely* different issue from the validity of count() as an input. I answered correctly given the current definition of product(): it is not valid input. It is also not valid input to your proposed revision:
    >>> tuple(itertools.cycle(enumerate(it)) for it in itertools.count())
    ...
    TypeError: 'int' object is not iterable
    -- just as I said.

    If you want to propose an enhancement, either open a new enhancement issue or post to python-ideas. Since new features can only go in 3.3+, post 3.x code, not 2.x. And please do not quibble about the difference between 'infinite' and 'too large to fit in memory'.

    99242df2-e0ea-4177-9535-06e0dd935f5e commented 12 years ago
    >>> tuple(itertools.cycle(enumerate(it)) for it in itertools.count())
      ...
      TypeError: 'int' object is not iterable

    That is not what happens in the function, though! That would correspond to doing product(itertools.count(2010)), but if you try that you won't even get past argument expansion (obviously). Doing product(xrange(10)) gives the error you're talking about, for example.

    product(itertools.count(2010)) works perfectly well with the version I posted, though it is a bit silly to do it that way since it produces the same values as count itself (which is what "cartesian product" should do), while saving extra bookkeeping along the way.

    Anyway, I'm pretty new to Python and I don't think this is quite relevant enough to warrant opening a new ticket. I'm happy to leave it here for the education of the next neophyte who stumbles across this idiosyncrasy of itertools.product.

    terryjreedy commented 12 years ago

    A relatively simple change would be to allow the first iterable to be 'infinite', when repeat==1, by not calling tuple() on it. The reason for turning the iterables into concrete sequences is because they might not be reiterable. (cycle() does the same for the same reason.) But since the first iterable is only iterated once, this does not apply to it.

        if repeat == 1:
            pools = list(args[0:1]) + [tuple(pool) for pool in args[1:]]
        else:
            pools = [tuple(pool) for pool in args] * repeat

    The counter argument to this or any generalized proposal is that one can expand the product() into enough loops to avoid infinite (or very large) args. For example, the following produces '1AA', '1AB', ..., '1EE', '2AA', ... indefinitely.

    naa=(''.join((str(n),)+s) for n in itertools.count(1)
         for s in itertools.product(string.ascii_uppercase[0:5], repeat=2))

    RAYMOND: Do you think the doc should specify that each iterable must be finite, and that explicit loops are the alternative if not?
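
In Python 3 terms, that special case could be sketched as a wrapper around the existing function (product_lazy_first is a hypothetical name, not part of itertools):

```python
import itertools

def product_lazy_first(*args, repeat=1):
    """Sketch of the special case above: when repeat == 1, the first
    iterable is consumed lazily, so it may be unbounded."""
    if repeat == 1 and args:
        first = args[0]
        rest = [tuple(pool) for pool in args[1:]]  # materialize the rest
        for x in first:
            for tail in itertools.product(*rest):
                yield (x,) + tail
    else:
        # fall back to the normal, fully materializing behavior
        yield from itertools.product(*args, repeat=repeat)

# Works with an infinite first iterable:
g = product_lazy_first(itertools.count(1), 'ab')
assert [next(g) for _ in range(3)] == [(1, 'a'), (1, 'b'), (2, 'a')]
```

The output order is unchanged for finite inputs, since only the first pool's materialization is skipped.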

    cd65c590-98e9-41d1-9e7a-7754bceafdd7 commented 10 years ago

    Found another example that shows horrible performance when using itertools.product:

    def gen():
      l = [itertools.permutations(range(10)) for _ in range(10)]
      g = itertools.product(*l)
      for i in g:
        yield i

    A simple next() to this generator takes 16 seconds on my desktop.

    I use this recursive product() instead and the performance is acceptable:

    def product(*args):
        if len(args) == 1:
            for i in args[0]:
                yield [i]
        else:
            for i in args[0]:
                for j in product(*args[1:]):
                    j.append(i)
                    yield j

    55a20bbb-80c5-428f-a834-b3838349251b commented 8 years ago

    I realize this is an old (and closed!) thread, but here are a few notes from running into this issue recently, working on a search problem with infinite iterators.

    First, note that yegle's recursive solution is not correct, since it exhausts the iterators:
    >>> def product(*args):
    ...     if len(args) == 1:
    ...         for i in args[0]:
    ...             yield [i]
    ...     else:
    ...         for i in args[0]:
    ...             for j in product(*args[1:]):
    ...                 j.append(i)
    ...                 yield j
    ... 
    >>> 
    >>> assert len(list(itertools.product(*[iter(range(2)) for _ in range(10)]))) == 2 ** 10
    >>> assert len(list(product(*[iter(range(2)) for _ in range(10)]))) == 2 ** 10
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    AssertionError

    This is fairly tricky to get right, and even a correct fully-lazy solution requires each iterator to eventually be fully stored in memory. For that reason, Sumudu's MemoryError is not something that can be easily escaped, though it can be delayed.

    For infinite iterators, the naive "lexicographic" solution to product(count(0), count(0)) would never get to 1 on the first iterator, yielding (0, 0), (0, 1), (0, 2), ... To fully explore the space, the user needs to think about how to iterate over that space, so IMO keeping infinite iterators as invalid arguments to itertools.product makes sense.
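
For reference, the zig-zag idea can be sketched for two infinite iterators (diagonal_product is a hypothetical helper, and as written it assumes both inputs are infinite):

```python
import itertools

def diagonal_product(a, b):
    # Cache values as they are drawn, then walk anti-diagonals so every
    # pair is eventually reached -- at the cost of lexicographic order.
    a, b = iter(a), iter(b)
    seen_a, seen_b = [], []
    for total in itertools.count():
        seen_a.append(next(a))
        seen_b.append(next(b))
        for i in range(total + 1):
            yield (seen_a[i], seen_b[total - i])

first = list(itertools.islice(
    diagonal_product(itertools.count(), itertools.count()), 6))
assert first == [(0, 0), (0, 1), (1, 0), (0, 2), (1, 1), (2, 0)]
```

Note that (1, 0) and (2, 0) are reached after finitely many steps, which the lexicographic order never achieves.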

    4b4bffee-c0a5-4c36-b1d7-404ef2ef29b0 commented 8 years ago

    I think this _is_ a bug. Most of the itertools are quite thrifty with memory, and exceptions are explicitly documented.

    The itertools generally only require memory proportional to the number of iterations consumed. However, itertools.product requires an enormous up-front memory cost even if only a few iterations are performed. There are good reasons to extract only a few iterations from a product - for example, a search procedure where only the first item in the product is infinite:

        for year, day in product(count(2010), range(365)):
            if rare_event(year, day):
                break

    Or, in an application I recently tried, to count occurrences of something up to a limit:

        prod = product(count(), range(n))
        filtered = ifilter(pred, prod)
        want = list(islice(filtered, 101))
        if len(want) > 100:
            print "Too many matches"

    Having product greedily consume all of its input iterators is rather antithetical to the notion of itertools being lazy generators, and it breaks efficient composability with the other itertools.

    The fix should be fairly straightforward: like tee, maintain lists of items already generated as the input iterators are exhausted, and add a note that "product"'s memory usage may slowly grow over time (as opposed to growing to maximum size immediately).
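
The tee-like scheme proposed above might look roughly like this (lazy_product is a hypothetical sketch, not the itertools implementation; memory still grows to hold every value drawn so far):

```python
import itertools

def lazy_product(*iterables):
    iterators = [iter(it) for it in iterables]
    caches = [[] for _ in iterables]

    def pool(k):
        # replay values already drawn from input k, then pull
        # (and cache) new ones only as needed
        yield from caches[k]
        for x in iterators[k]:
            caches[k].append(x)
            yield x

    def rec(k):
        if k == len(iterables):
            yield ()
            return
        for x in pool(k):
            for tail in rec(k + 1):
                yield (x,) + tail

    return rec(0)

# Same output as itertools.product, even for one-shot iterators:
assert list(lazy_product(iter(range(2)), iter('ab'))) == [
    (0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]

# The first tuple appears after drawing a single value from the
# first input, so an unbounded first input is fine:
g = lazy_product(itertools.count(), 'ab')
assert next(g) == (0, 'a')
```

As Lucas notes below, an unbounded non-first input still never terminates its inner loop, exactly as nested for-loops would.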

    tim-one commented 8 years ago

    I see nothing wrong with combinatorial generators materializing their inputs before generation. Perhaps it should be documented clearly. It's certainly not limited to product(). For example,

    >>> for i in itertools.combinations(itertools.count(), 2):
    ...     print(i)

    dies with MemoryError too. But I expected that ;-)

    4b4bffee-c0a5-4c36-b1d7-404ef2ef29b0 commented 8 years ago

    It wouldn't be that complicated to have the combinatorial functions be lazy too - I feel like there's some value to be had there.

    What's the major objection to making all of the itertools functions "maximally lazy"?

    rhettinger commented 8 years ago

    What's the major objection to making all of the itertools functions "maximally lazy"?

    It complicates the code and provides nearly zero benefit in practical use cases of the combinatoric functions.

    55a20bbb-80c5-428f-a834-b3838349251b commented 8 years ago

    It is quite thrifty with memory compared to the size of the search space: O(n*k) memory for a search space of size O(n**k).

    I think a reasonable expectation for itertools.product is that it should _eventually_ reach any element in the search space. The only way to maintain that expectation for infinite iterators would be to use a Cantor-style zig-zag enumeration. Even for your example of searching over a single infinite iterator, you could end up exploring dates in the order (2010, 0), (2011, 0), (2012, 0), ... by switching the ordering of inputs. You'd never hit the rare date unless it happened on the first day of a year.

    There's no way to have special behavior _only_ for infinite iterators, since an infinite iterator is indistinguishable from a long-finite one due to the halting problem, so this would require changing the behavior for finite iterators too. While the output ordering is not explicitly documented, the current search order is probably depended on by plenty of existing programs, and has less surprise than a Cantor enumeration.

    So the only use case where the lazy enumeration matters is if:

    1. You have one or more _very_ large iterators.
    2. You _do not care_ what order they are searched in, or even if all possible tuples appear in the output.
    3. Despite not caring about the output ordering, you still only want a few examples.

    I'm struggling to think of cases where that would come up.

    tim-one commented 8 years ago

    Lucas, I largely agree, but it *is* documented that the various combinatorial generators emit items in a particular lexicographic order, and programs definitely rely on that.

    That's why, in an earlier comment, Terry suggested that perhaps product() could make a special case of its (and only its) first argument (and only when repeat=1). Each element of the first iterable is needed only once (although it may be copied into any number of outputs), so there's no actual need to make a tuple of it first. The implementation is just simpler and clearer by treating all arguments alike.

    Which is a good enough reason for me - and the "use cases" for an unbounded first argument look exceptionally weak to me.

    terryjreedy commented 8 years ago

    Recipes for handling an infinite first iterator for product, or an infinite iterator for combinations (they could be similar), that use and build on the current functions, without the output order constraint, might be candidates for the recipe section. I will probably write at least one of them for my own purposes someday.
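
One such recipe, for combinations of a possibly infinite iterable, could enumerate by the largest element used (combinations_unbounded is a hypothetical name; note the output order differs from itertools.combinations, and the cache of seen values grows without bound):

```python
import itertools

def combinations_unbounded(iterable, r):
    # yield every r-combination whose largest element is the newest
    # value drawn, so each combination appears after finitely many steps
    seen = []
    for x in iterable:
        for head in itertools.combinations(seen, r - 1):
            yield head + (x,)
        seen.append(x)

first = list(itertools.islice(
    combinations_unbounded(itertools.count(), 2), 4))
assert first == [(0, 1), (0, 2), (1, 2), (0, 3)]
```

For finite inputs it produces the same set of combinations as itertools.combinations, just in a different order.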