python / cpython

The Python programming language
https://www.python.org
Other
63.15k stars 30.24k forks source link

Use range in itertools roundrobin recipe #76280

Closed terryjreedy closed 6 years ago

terryjreedy commented 6 years ago
BPO 32099
Nosy @tim-one, @rhettinger, @terryjreedy, @serhiy-storchaka, @dubslow
PRs
  • python/cpython#4486
  • python/cpython#4487
  • python/cpython#4497
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = 'https://github.com/rhettinger' closed_at = created_at = labels = ['type-feature', '3.7', 'docs'] title = 'Use range in itertools roundrobin recipe' updated_at = user = 'https://github.com/terryjreedy' ``` bugs.python.org fields: ```python activity = actor = 'rhettinger' assignee = 'rhettinger' closed = True closed_date = closer = 'rhettinger' components = ['Documentation'] creation = creator = 'terry.reedy' dependencies = [] files = [] hgrepos = [] issue_num = 32099 keywords = ['patch'] message_count = 15.0 messages = ['306605', '306608', '306611', '306612', '306613', '306623', '306624', '306628', '306630', '306631', '306670', '306673', '306680', '306681', '306860'] nosy_count = 6.0 nosy_names = ['tim.peters', 'rhettinger', 'terry.reedy', 'docs@python', 'serhiy.storchaka', 'Dubslow'] pr_nums = ['4486', '4487', '4497'] priority = 'normal' resolution = 'fixed' stage = 'resolved' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue32099' versions = ['Python 3.7'] ```

    terryjreedy commented 6 years ago

    The itertools roundrobin recipe has an outer loop executed a preset number of times. It is currently implemented with two assignments and a while loop. https://docs.python.org/3/library/itertools.html#itertools-recipes These can be replaced with a for loop using a reversed range.

    def roundrobin(*iterables):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        nexts = cycle(iter(it).__next__ for it in iterables)
        for current_len in reversed(range(1, len(iterables)+1)):
            try:
                for next in nexts:
                    yield next()
            except StopIteration:
                nexts = cycle(islice(nexts, current_len - 1))

    I think this is easier to understand. So do some other participants in the current python-ideas thread 'Rewriting the "roundrobin" recipe in the itertools documentation'.

    I changed 'pending' to 'current_len' because, to me, 'pending' should be the set of iter_nexts and not their number.

    I originally avoided the '-1' in the islice call by decrementing both range arguments by 1 and calling the loop variable 'reduced_len'. But having the loop variable be the size of the nexts iterable in the next outer iteration seemed confusing and not worth the trivial efficiency gain.

    An independent change would be to replace 'next' with 'iter' on the basis that reusing the builtin name is not good and because 'nexts' is awkward to pronounce.

    I will produce a PR if any version is preferred to the current one. ---

    The OP proposed, and some participants like, an accept-reject algorithm based on zip_longest.

    def roundrobin(*iters):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        # Perhaps "flat_zip_nofill" is a better name, or something similar
        sentinel = object()
        for tup in it.zip_longest(*iters, fillvalue=sentinel):
            yield from (x for x in tup if x is not sentinel)

    I dislike creating tuples we don't want, with values we don't want, with an arbitrarily small acceptance ratio. I also note that zip_longest is properly used in grouper, whereas roundrobin is the only recipe using cycle.

    7ce1a323-7d55-40f6-b97f-d80c3d15e7dc commented 6 years ago

    Note that this follows a bit of discussion on python-ideas, in two threads:

    https://mail.python.org/pipermail/python-ideas/2017-November/047920.html

    https://mail.python.org/pipermail/python-ideas/2017-November/047989.html

    I agree the zip_longest-derived solution is both easier to read/understand and also not really accurate, especially in extreme cases.

    But, and see the second thread, I think it might just be better to add this funcitonality itself to itertools, under the name zip_flat -- per the second thread, there's a lot of confusion around the topic beyond just the suitability of the current recipe (including such things as lack of a clear name, legibility, efficiency, and brevity -- a fair number of people are looking for one or two liners, even if that doesn't really exist). (One alternative to zip_flat would be to add a second keyword argument to zip_longest that *doesn't use a fillvalue, producing variable-length tuples when the supplied iters run out. Then the recipe and the entire conversation/topic could be replaced by "yield from zip_longest(iters, usefill=False)".)

    Despite that opinion, this suggested change is better than nothing.

    7ce1a323-7d55-40f6-b97f-d80c3d15e7dc commented 6 years ago

    Regarding the current bug more specifically, I think a few comments would go a long way to helping readers understand the code.

    And also, I do think the (1, +1, -1) is less readable, simply because it doesn't follow the most common usage patterns of range, where your first version using (0,0,0) (with implicit zeroes of course) is cleaner. It's much more apparent how long the loop is at first glance ("loop once per iter" vs "loop... from what to what? oh, once per iter"). Perhaps not using reversed() but instead setting step=-1 would be a better case to use the off-by-1 variant.

    Such a version with both proposed changes (which are independent and can be considered or accepted separately) looks like this:

    def roundrobin(*iterables):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        # This recipe is also called other names like "alternate", "interleave", or
        # "merge". "zip_flat" would also be an accurate name.
        # Algorithm: cycle around the __next__ methods of each iterable. when an
        # iter runs out, remove its __next__ from the cycle.
        nexts = cycle(iter(it).__next__ for it in iterables)
        for reduced_len in reversed(range(len(iterables))):
            try:
                for next in nexts:
                    yield next()
            except StopIteration:
                nexts = cycle(islice(nexts, reduced_len))
                # This skips one round of the cycle, starting the next round
                # without the current iter

    The last comment is probably the least valuable, though it does point out the slight quirk of how the last line works. I suppose this is the case for having the loop variable be named the actual length, but I think most Python users are accustomed to the last element of a list having index length-1. At any rate, I think the readability gain in the for statement is more than the readability loss in the islice().

    7ce1a323-7d55-40f6-b97f-d80c3d15e7dc commented 6 years ago

    Perhaps the loop variable could be renamed to "len_minus_1" or some such something which is more semantic to the algorithm.

    7ce1a323-7d55-40f6-b97f-d80c3d15e7dc commented 6 years ago

    Er, in my first message, make that "(yield from tup for tup in zip_longest(*iters, usefill=False))"

    rhettinger commented 6 years ago

    IMO, George Sakkis's original version is better than the TJR's proposed revision which has three separate adjustments by one and adds to extraneous functions which aren't central to the example. Dubslow's alternative at least avoids the three offsets by one; however, it has the disadvantage that meaning of the loop induction variable is somewhat indirect and creates its own source of confusion.

    I agree that some algorithmic comment should be added, probably just for the last line.

    Let's not add any of the listed alternative names to the comments. Each one is either confusing or wrong. The word "merge" already has an established, different meaning as in "sort/merge". For example, heapq.merge('ABC', 'D', 'EF') gives ['A', 'B', 'C', 'D', 'E', 'F']. The word "alternate" tends to mean "toggle back-and-forth" in common usage. The term "zip_flat" has no meaning to me, it has no hits on Google, and the closest match is a recipe on StackOverflow that does something different. And "interleave" is not suggestive of looping back over the iterables until each is exhausted. Also, we may yet use interleave() to mean something else.

    In contrast, "round robin" has a well established meaning that matches what this recipe does. Until now, not a single reader has ever claimed to not know what it means. https://en.wikipedia.org/wiki/Round-robin_scheduling

    FWIW, the recipe has several benefit. 1) Give a way to implement round robin iteration without re-inventing the wheel, 2) Demonstrate ways to use cycle() and islice(). 3) Demonstrate useful optimization technique of factoring the try/except out of the for-loop, 4) Demonstrate the useful optimization technique of calling bound methods, and 5) Be concise (I've left longer or more complex recipes for the ASPN cookbook or StackOverflow).

    Ideally, I would prefer to not add two extra builtin lookups (the recipe is sometime used on short inputs which would be affected by the extra overhead). Also, I prefer the visual weight to be on the central message of tight inner loops that exploit itertools rather than having the visual weight shift to the for-loop which is the least important part.

    Can you a suggest a concise single line comment that would make the last line clearer about what it is doing? Also, I'm open to changing the name of the "pending" variable but think "current_len" and "reduced_len" are not improvements.

    tim-one commented 6 years ago

    I agree the current recipe strikes a very nice balance among competing interests, and is educational on several counts.

    s/pending/numactive/

    # Remove the iterator we just exhausted from the cycle.
    numactive -= 1
    nexts = cycle(islice(nexts, numactive))
    serhiy-storchaka commented 6 years ago

    roundrobin() has quadratic computational complexity. For example list(roundrobin(([[1]]N))). Is there a way to make it with linear complexity?

    serhiy-storchaka commented 6 years ago

    I agree that using binary tree would be excessively here. I just wondering if there is a simple way to get rid of the quadratic complexity. In any case this does not matter much until you work with many hundreds or thousands of iterables.

    serhiy-storchaka commented 6 years ago

    Actually I tried to test if this implementation can cause a crash due to using deeply nested iterators (like in bpo-14010). For example in next(roundrobin(([[]]N + [[1]]))). But if there is such problem here, roundrobin() becomes unusable due to quadratic time for smaller N (tens of thousands).

    rhettinger commented 6 years ago

    While we're on the topic, I had some thought of also adding a similar recipe to https://docs.python.org/3/library/collections.html#deque-recipes . The alternative recipe is slower is common cases but doesn't degrade when there are a huge number of iterables:

        def roundrobin(*iterables):
            "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
            nexts = deque(iter(it).__next__ for it in iterables)
            while nexts:
                try:
                    while True:
                        yield nexts[0]()
                        nexts.rotate(-1)
                except StopIteration:
                    nexts.popleft()
    serhiy-storchaka commented 6 years ago

    Today I have published a similar recipe on Python-Ideas. It uses popleft/append instead of __getitem__/rotate.

    def roundrobin(*iterables):
        "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
        nexts = deque(iter(it).__next__ for it in iterables)
        popleft = nexts.popleft
        append = nexts.append
        while nexts:
            next = popleft()
            try:
                yield next()
            except StopIteration:
                pass
            else:
                append(next)

    It is faster (10-25%) in all microbenchmarks that I did (Steven's benchmarks for small number of iterables and my examples for large number of iterables).

    rhettinger commented 6 years ago

    Really, I don't give a whit about the micro-benchmarks -- that completely misses the point. The recipes are primarily intended to have educational value on ways to use itertools, deques, and whatnot.

    For itertools, I'm satisfied with new variable name and the additional comment. For collections, there is an open question about whether adding an extra example would make users better off. Beyond that, I have little interest is exploring all the little variants or wasting further time on this (that is what ASPN, StackOverflow, and blog posts are for).

    serhiy-storchaka commented 6 years ago

    But why use less efficient code? Is my code worse?

    rhettinger commented 6 years ago

    New changeset 0858495a50e19defde786a4ec050ec182e920f46 by Raymond Hettinger in branch 'master': bpo-32099 Add deque variant of roundrobin() recipe (bpo-4497) https://github.com/python/cpython/commit/0858495a50e19defde786a4ec050ec182e920f46