python / cpython

Possible performance improvement for heapq.merge() #83119

Closed: sweeneyde closed this issue 2 years ago

sweeneyde commented 4 years ago
BPO 38938
Nosy @tim-one, @rhettinger, @serhiy-storchaka, @bharel, @DimitrisJim, @bbayles, @sweeneyde
PRs
  • python/cpython#17422
  • python/cpython#17729
  • python/cpython#18427
  • python/cpython#20550
Files
  • new_merge.py: Iterative version for comparison
  • tournament_heap.py: Using a heap that stores each item only once; items move from leaves to root.
  • recursive_merge.py: Recursive 2-way-merging generators
  • losers.py: Using a "loser tree"
  • winners.py: Like new_merge.py, except it jumps straight to the leaf instead of traversing down to it
  • key_and_reverse.py: Similar to winners.py but handles key and reverse; also uses lists as node structs.
  • new_merge2.py: Tree formation is iterative and inlined
  • less_movement.py: Handles key and reverse; only the key and leaf attributes move toward the root, values stay in the leaves
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = 'https://github.com/rhettinger'
    closed_at = None
    created_at =
    labels = ['type-feature', 'library', '3.10']
    title = 'Possible performance improvement for heapq.merge()'
    updated_at =
    user = 'https://github.com/sweeneyde'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'bar.harel'
    assignee = 'rhettinger'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'Dennis Sweeney'
    dependencies = []
    files = ['49160', '49164', '49166', '49167', '49170', '49180', '49184', '49198']
    hgrepos = []
    issue_num = 38938
    keywords = ['patch']
    message_count = 23.0
    messages = ['357631', '357632', '357664', '357665', '358968', '364160', '364198', '364201', '364205', '364260', '364279', '364297', '368934', '368989', '369012', '369064', '369115', '369145', '369198', '369559', '369667', '370176', '370413']
    nosy_count = 7.0
    nosy_names = ['tim.peters', 'rhettinger', 'serhiy.storchaka', 'bar.harel', 'Jim Fasarakis-Hilliard', 'bbayles', 'Dennis Sweeney']
    pr_nums = ['17422', '17729', '18427', '20550']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue38938'
    versions = ['Python 3.10']
    ```

    sweeneyde commented 4 years ago

    Although the implementation of the heapq.merge function uses an underlying heap structure, its behavior centers on iterators. For this reason, I believe there should either be an alias to this function in the itertools module or at least a recipe in the itertools docs describing the use of heapq.merge.

    Furthermore, I have an implementation (attached) of heapq.merge that is twice as fast for two iterables (as used in mergesort and other common cases), and is faster for up to 6 or 7 iterables. I think it would be nice to special-case this for small numbers of iterables for this significant speedup.

    sweeneyde commented 4 years ago

    The following seems like a short, readable recipe for itertools.

    sweeneyde commented 4 years ago

    Disregard merge_recipe.py: it would skip over a value that had already been retrieved from the iterator when the loop finished.
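    For illustration, a minimal sketch (assumed, not the attached merge_recipe.py) of that failure mode: a naive two-way merge that drops the value it has already pulled from one iterator once the other is exhausted.

        from heapq import merge

        def naive_merge(a, b):
            # Hypothetical naive recipe: merge two sorted iterables.
            a, b = iter(a), iter(b)
            try:
                x = next(a)
                y = next(b)
                while True:
                    if y < x:
                        yield y
                        y = next(b)   # StopIteration here loses the pending x
                    else:
                        yield x
                        x = next(a)   # StopIteration here loses the pending y
            except StopIteration:
                pass
            yield from a
            yield from b

        print(list(naive_merge([1, 3], [2])))  # [1, 2] -- the pending 3 is lost
        print(list(merge([1, 3], [2])))        # [1, 2, 3]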

    rhettinger commented 4 years ago

    Several thoughts:

    sweeneyde commented 4 years ago

    PR 17729 is a C implementation of a non-recursive "flattening" of the recursive lazy-mergesort algorithm into a tournament whose state is a tree of losers of comparisons.

    serhiy-storchaka commented 4 years ago

    This is a large change, and making heapq.merge a class looks unrelated to the declared goal. I am afraid it may add significant overhead.

    Since this is an optimization, could you please provide any benchmarks which we can use to check the performance boost?

    sweeneyde commented 4 years ago

    First, as I posted at https://github.com/python/cpython/pull/17729#issuecomment-571864662, there is a theoretical advantage of fewer comparisons in all cases, and the new algorithm would be especially dominant when one iterable keeps winning. (I'm given to understand that "lists very often do have exploitable partial order in real life" ;-)

    making heapq.merge a class looks unrelated to the declared goal.

    Is there a better way to implement a C accelerator for a Python function that returns a generator? I figured it would be better to have the pure-python implementation match the C implementation.

    As for the timing, I'm running Windows and pyperf gives a "WARNING: unable to increase process priority", and a "WARNING: no operation available for your platform" when I run "pyperf system tune", but I'm still able to get the following results:

    .\python.bat -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Master: Mean +- std dev: 88.0 ms +- 10.3 ms
    This PR: Mean +- std dev: 28.2 ms +- 1.0 ms

    The above I'll call "merging 20 of size 10_000." For "merging 2 of size 100_000", I get:

    Master: Mean +- std dev: 34.4 ms +- 3.2 ms
    This PR: Mean +- std dev: 10.6 ms +- 0.7 ms

    Merging 10_000 of size 100:

    Master: Mean +- std dev: 1.56 sec +- 0.11 sec
    This PR: Mean +- std dev: 628 ms +- 22 ms
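    (Presumably the variant runs above change only the setup; e.g. "merging 2 of size 100_000" would use something like the following.)

        iters = [sorted(random() for j in range(100_000)) for i in range(2)]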

    serhiy-storchaka commented 4 years ago

    Sorry, I did not notice that there is a C implementation in PR 18427. The changes in the Python implementation are so large that I thought they were the goal of the PR.

    Often the clearest and most efficient way to implement an iterator in Python is to write a generator function. In C you need to write a class with a __next__ method, but Python has a better way.
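    As a generic illustration of the two styles (a trivial iterator rather than merge itself):

        # Generator function: the idiomatic pure-Python iterator.
        def countdown(n):
            while n > 0:
                yield n
                n -= 1

        # Class-based iterator: what a C implementation has to mirror.
        class Countdown:
            def __init__(self, n):
                self.n = n
            def __iter__(self):
                return self
            def __next__(self):
                if self.n <= 0:
                    raise StopIteration
                self.n -= 1
                return self.n + 1

        assert list(countdown(3)) == list(Countdown(3)) == [3, 2, 1]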

    I have tested your first example with the Python implementation and got 93.9 msec on master vs 314 msec with PR 18427 applied. It is expected that the C implementation is faster than the Python implementation, but was there a need to make the Python implementation 3 times slower?

    sweeneyde commented 4 years ago

    The existing Python implementation is benefiting from the C accelerators for heapify and heapreplace. When forcing pure python using test.support, I get these results:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(1_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Master: Mean +- std dev: 73.1 ms +- 8.4 ms
    PR: Mean +- std dev: 63.7 ms +- 7.8 ms

    I think I remember getting similarly slightly-better results out of PyPy, but I will double-check.

    So while the existing "mixed C-and-Python" implementation would be deleted, the "true pure-Python" implementation would get faster, and a new pure-C implementation would be added that is much faster still. Is that an acceptable trade-off?

    Regarding using a generator for the Python implementation, is it okay if type(heapq.merge) gives <class 'function'> for the Python implementation but <class 'type'> for the C implementation? How much seemingly-irrelevant variation like this is acceptable between the APIs?

    sweeneyde commented 4 years ago

    My suspicion about PyPy was confirmed (my PyPy here is Python 3.6.1 (784b254d6699, Apr 16 2019, 12:10:48) [PyPy 7.1.1-beta0 with MSC v.1910 32 bit] on win32). In what follows, "heapq2.py" contained exactly the class-based merge Python implementation from PR 18427.

    pypy -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 191 ms +- 3 ms

    pypy -m pyperf timeit -s "from random import random; from heapq2 import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 69.3 ms +- 2.2 ms

    Another test: PyPy merging 2 iterables of size 100_000:

    heapq: Mean +- std dev: 91.4 ms +- 2.4 ms
    heapq2: Mean +- std dev: 52.4 ms +- 2.1 ms

    PyPy merging 10_000 iterables of size 100:

    heapq: Mean +- std dev: 2.74 sec +- 0.07 sec
    heapq2: Mean +- std dev: 781 ms +- 19 ms

    rhettinger commented 4 years ago

    Serhiy, I've got this one. It will still take a little while to get to it, but I've already invested work in it. The code idea is plausible but I want to think it through for both CPython and PyPy.

    serhiy-storchaka commented 4 years ago

    Are there any algorithmic optimizations in the Python version, or was it just rewritten from a generator function to a class? If there are, how hard would it be to keep the functional design?

    sweeneyde commented 4 years ago

    As Serhiy suggested, keeping the algorithm but making the Python implementation a generator again (as I recently changed in PR 18427) gives another performance boost (although this unrolling is many lines of code).

    Timing the C implementation:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"
    
        On Master: Mean +- std dev: 66.9 ms +- 0.6 ms
        With PR:   Mean +- std dev: 17.3 ms +- 0.9 ms

    Timing the Python Implementation in CPython:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"
    
        On Master: Mean +- std dev: 385 ms +- 3 ms
        With PR:   Mean +- std dev: 250 ms +- 2 ms

    Timing PyPy:

    pypy -m pyperf timeit -s "... from heapq import merge; ..." ...
    versus
    pypy -m pyperf timeit -s "... from heapq2 import merge; ..." ...
    
        Testing on PyPy 7.1.1: Mean +- std dev: 142 ms +- 2 ms
        This implementation:   Mean +- std dev: 38.0 ms +- 1.2 ms

    =============================================================================

    Another positive for this proposal I just discovered is that object.__eq__ no longer needs to be called at each comparison:

        from heapq import merge
        from collections import deque
    
        class Int(int):
            lt = eq = 0
            def __lt__(self, other):
                __class__.lt += 1
                return int.__lt__(self, other)
    
            def __eq__(self, other):
                __class__.eq += 1
                return int.__eq__(self, other)
    
        def comparisons(iterables):
            Int.lt = Int.eq = 0
            deque(merge(*iterables), maxlen=0)
            return Int.lt, Int.eq
    
        no_overlap = comparisons(
            # (0..999), (1_000..1_999), (2_000..2_999), ...
            map(Int, range(x, x+1_000))
            for x in range(0, 16_000, 1_000)
        )
    
        interleaved = comparisons(
            # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
            map(Int, range(x, 16_000, 16))
            for x in range(16)
        )
    
        print("No overlap: {:,} lt; {:,} eq".format(*no_overlap))
        print("Interleaved: {:,} lt; {:,} eq".format(*interleaved))

    Before:

    No overlap: 65,004 lt; 65,004 eq
    Interleaved: 64,004 lt; 64,004 eq

    After:

    No overlap: 32,000 lt; 0 eq
    Interleaved: 63,968 lt; 0 eq

    This comes from the way that tuples are compared by scanning item-wise for equality before comparing the first discrepancy. Using the positional information in the tree with the logic

    yield (right if right < left else left)

    requires only one rich comparison, while breaking ties with the stored index and the logic

    yield (b if (b, b_index) < (a, a_index) else a)

    requires two arbitrary rich comparisons (a != b, then a < b). This can be somewhat fixed with the logic

        if a_index < b_index:
            yield (b if b < a else a)
        else:
            yield (a if a < b else b)

    ...but this is another branch in the innermost loop.
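    A small illustration (assumed, not taken from any of the patches) of that double test in tuple comparison:

        class Loud(str):
            # Report every comparison so the double test is visible.
            def __eq__(self, other):
                print('eq:', self, other)
                return str.__eq__(self, other)
            def __lt__(self, other):
                print('lt:', self, other)
                return str.__lt__(self, other)
            __hash__ = str.__hash__

        # Tuple "<" first scans for the leading discrepancy with ==,
        # then compares that pair again with <.
        (Loud('a'), 1) < (Loud('b'), 2)   # prints "eq: a b" then "lt: a b"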

    ==============================================================================

    I also played with a "Tree of Losers" method[1] here[2], and on the same benchmarks I got:

    CPython (C): Mean +- std dev: 22.7 ms +- 0.2 ms
        (slower than in [PR 18427](https://github.com/python/cpython/pull/18427))
    CPython (Pure Python): Mean +- std dev: 197 ms +- 9 ms
        (faster -- likely because of fewer int operations)
    PyPy: Mean +- std dev: 38.8 ms +- 0.8 ms
        (about the same)

    Maybe some more optimizations could be made to that code.

    The two approaches should be "isomorphic" in some sense: both should do the same number of comparisons on the same data. But I'll emphasize the differences:

    Tree of Losers:

    PR 18427 "Tree of winners":

    Personally, I find the Tree of Winners approach more intuitive, but the Tree of Losers approach seems to be the one that comes up more in the literature.

    [1] https://en.wikipedia.org/wiki/K-way_merge_algorithm
    [2] https://github.com/sweeneyde/cpython/tree/replacement-selection

    rhettinger commented 4 years ago

    The nested generators approach looks promising. I hope you keep working on that :-)

    I'm not too keen on PR 18427 because it is a massive explosion in code volume and complexity.

    With some continued effort, I expect we'll get to something much simpler and more maintainable. The existing code already performs well for typical applications, so there is no need to "go gonzo" and throw a massive wall of code at the problem. The ensuing maintenance costs would be too high.

    To save time and effort, please hold off on a C implementation until we've agreed to a pure Python version of the algorithm.

    For timings, try using strings, tuples, or dataclass instances. Timing integer inputs isn't representative of how merge() is used and it tends to exaggerate improvements. For interesting merges, the dominant cost is the comparison step.
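    For example, an assumed string-based variant of the earlier benchmark (not a command from this thread) might look like:

        .\python.bat -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(format(random(), '.12f') for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"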

    By switching to a binary tree arrangement, the payoff we're aiming for is to avoid the double comparison in the tuple inequality logic — "('a',1)<('b',2)" tests both 'a'=='b' and 'a'<'b'. I suggest aiming for the simplest possible code that avoids the double test.

    sweeneyde commented 4 years ago

    The attached recursive_merge.py should be much less ugly and still somewhat performant.

    It should be the same algorithm as that PR, just written recursively rather than iteratively.
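    As a rough sketch of the idea (hypothetical code, not the attached recursive_merge.py, and without the key/reverse handling), the iterables can be combined pairwise into a balanced tree of two-way merging generators:

        def _merge2(a, b):
            # Merge two sorted iterators; ties go to a, which keeps the merge stable.
            try:
                x = next(a)
            except StopIteration:
                yield from b
                return
            for y in b:
                while not (y < x):
                    yield x
                    try:
                        x = next(a)
                    except StopIteration:
                        yield y
                        yield from b
                        return
                yield y
            yield x
            yield from a

        def merge(*iterables):
            # Pair the iterables into a roughly balanced tree of _merge2 generators.
            iters = list(map(iter, iterables))
            if not iters:
                return iter(())
            while len(iters) > 1:
                merged = [_merge2(a, b) for a, b in zip(iters[::2], iters[1::2])]
                if len(iters) % 2:
                    merged.append(iters[-1])   # odd one out advances unmerged
                iters = merged
            return iters[0]

        print(list(merge([1, 4], [2, 3], [5])))  # [1, 2, 3, 4, 5]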

    I got some text files from http://www.gwicks.net/dictionaries.htm and tried merging them line-by-line:

    py -3.9 -m pyperf timeit -s "from heapq import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    Mean +- std dev: 391 ms +- 9 ms

    py -3.9 -m pyperf timeit -s "from recursive_merge import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    Mean +- std dev: 262 ms +- 9 ms

    Perhaps that's a more real-world benchmark.

    rhettinger commented 4 years ago

    For comparison, I'm attaching an iterative version that manipulates the tree nodes with all local variables. Try it out and see what you think.

    sweeneyde commented 4 years ago

    It seems to me that the code sprawl mostly comes from the separate handling of the four keyed/unkeyed and forward/reverse cases, which as far as I can tell requires a branch in the innermost loop if not unrolled into separate cases. I think recursive_merge.py is about as concise as possible while still efficiently handling all four cases.

    Is there any issue with recursion in this application? Even if there are 2**64-1 iterables, this would only mean 64 nested next() calls, which should be within the stack on most machines, no?
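    (A quick check of that depth estimate, assuming the balanced pairwise nesting sketched above: each round halves the number of generators, so the nesting depth for k iterables is ceil(log2(k)).)

        def merge_depth(k):
            # Nesting depth of a balanced pairwise-merge tree over k >= 1 iterables.
            return (k - 1).bit_length()   # equals ceil(log2(k))

        print(merge_depth(2**64 - 1))  # 64
        print(merge_depth(5))          # 3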

    I did the following benchmark:

    py -3.9 -m pyperf timeit -s "from [FILENAME] import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    new_merge.py:          Mean +- std dev: 773 ms +- 16 ms
    tournament_heap.py:    Mean +- std dev: 665 ms +- 42 ms
    losers.py:             Mean +- std dev: 470 ms +- 31 ms
    Existing heapq:        Mean +- std dev: 313 ms +- 2 ms
    recursive_merge.py:    Mean +- std dev: 266 ms +- 3 ms

    I can look at some more diverse benchmarks (types, iterable length imbalance, lengths of an iterator's win-streak, etc.) soon.

    rhettinger commented 4 years ago

    Am leaning toward the iterative approach in new_merge.py because it most directly implements the core concept of storing the data in a binary tree.

    Merits: no recursion, no nested generators, avoids the clutter of left-empty and right-empty checks, easy to understand, simple and fast loops, all local variables or instance variables, no calls to the memory allocator after the initial tree construction, runs well on PyPy, and easily Cythonized.

    sweeneyde commented 4 years ago

    I mostly like new_merge.py too, especially the dynamic reduction of the tree.

    However, it looks like list(merge([2],[1],[1])) currently fails, and I think what's missing is the following in the sibling-promotion:

    + if sibling.left is not None:
    +     sibling.left.parent = sibling.right.parent = parent

    Also for what it's worth, I think both appearances of "c1 if c1.value < c2.value else c2" should become "c2 if c2.value < c1.value else c1" for stability.

    I've attached winners.py, which is similar to new_merge.py, but replaces the "while node.left is not None: node = node.source" traversal with a single "node = node.leaf" call and adds a fast yield-from for the last iterator. I don't think it's much more complex either.

    sweeneyde commented 4 years ago

    key_and_reverse.py employs the same strategy as winners.py, but uses lists as the nodes of the tree rather than using Node instances. It also eliminates the recursion of treeify, and adds (with neither much of a performance hit nor much code duplication) support for key=some_func and/or reverse=True keyword arguments, so it now passes the test_heapq suite.

    rhettinger commented 4 years ago

    Thanks. I'll look at this shortly. We're getting much closer.

    Just so that I don't lose other work I've done, am uploading new_merge2.py with inline, iterative construction of the initial tree.

    sweeneyde commented 4 years ago

    less_movement.py is my favorite so far. It still handles key and reverse, but uses instance attributes instead of the list indices I tried before. It does this by advancing only the "key" and "leaf" attributes up toward the root (where the key is just the value if key is None), while the value is stored only in the leaf. Since the "value" attribute is no longer used except at the leaves, we can pack a leaf's value into its left slot and reduce the number of slots.

    This seems to run very well on PyPy, and it looks like it would be pretty receptive to a C implementation, which I would be happy to work on.

    sweeneyde commented 4 years ago

    PR 20550 uses a linked structure like what we've been talking about.