python / cpython

Possible performance improvement for heapq.merge() #83119

Closed: sweeneyde closed this issue 2 years ago

sweeneyde commented 4 years ago
BPO 38938
Nosy @tim-one, @rhettinger, @serhiy-storchaka, @bharel, @DimitrisJim, @bbayles, @sweeneyde
PRs
  • python/cpython#17422
  • python/cpython#17729
  • python/cpython#18427
  • python/cpython#20550
Files
  • new_merge.py: Iterative version for comparison
  • tournament_heap.py: Using a heap that stores each item only once; items move from leaves to root.
  • recursive_merge.py: Recursive 2-way-merging generators
  • losers.py: Using a "loser tree"
  • winners.py: Like new_merge.py, except it jumps straight to the leaf instead of traversing down to it
  • key_and_reverse.py: Similar to winners.py but handles key and reverse; also uses lists as node structs.
  • new_merge2.py: Tree formation is iterative and inlined
  • less_movement.py: Handles key and reverse; only the key and leaf attributes move toward the root, values stay in the leaves
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = 'https://github.com/rhettinger'
    closed_at = None
    created_at =
    labels = ['type-feature', 'library', '3.10']
    title = 'Possible performance improvement for heapq.merge()'
    updated_at =
    user = 'https://github.com/sweeneyde'
    ```
    bugs.python.org fields:
    ```python
    activity =
    actor = 'bar.harel'
    assignee = 'rhettinger'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'Dennis Sweeney'
    dependencies = []
    files = ['49160', '49164', '49166', '49167', '49170', '49180', '49184', '49198']
    hgrepos = []
    issue_num = 38938
    keywords = ['patch']
    message_count = 23.0
    messages = ['357631', '357632', '357664', '357665', '358968', '364160', '364198', '364201', '364205', '364260', '364279', '364297', '368934', '368989', '369012', '369064', '369115', '369145', '369198', '369559', '369667', '370176', '370413']
    nosy_count = 7.0
    nosy_names = ['tim.peters', 'rhettinger', 'serhiy.storchaka', 'bar.harel', 'Jim Fasarakis-Hilliard', 'bbayles', 'Dennis Sweeney']
    pr_nums = ['17422', '17729', '18427', '20550']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue38938'
    versions = ['Python 3.10']
    ```

    sweeneyde commented 4 years ago

    Although the implementation of the heapq.merge function uses an underlying heap structure, its behavior centers on iterators. For this reason, I believe there should either be an alias to this function in the itertools module or at least a recipe in the itertools docs describing the use of heapq.merge.

    Furthermore, I have an implementation (attached) of heapq.merge that is twice as fast for two iterables (as used in mergesort and other common cases), and is faster for up to 6 or 7 iterables. I think it would be nice to special-case this for small numbers of iterables for this significant speedup.

    sweeneyde commented 4 years ago

    The following seems like a short, readable recipe for itertools.

    sweeneyde commented 4 years ago

    Disregard merge_recipe.py: it would skip over a value that had already been retrieved from the iterator when the loop finished.
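    For illustration, a minimal sketch (assumed, not the attached merge_recipe.py) of that failure mode: a naive two-way merge that drops the value it has already pulled from one iterator once the other is exhausted.

        from heapq import merge

        def naive_merge(a, b):
            # Hypothetical naive recipe: merge two sorted iterables.
            a, b = iter(a), iter(b)
            try:
                x = next(a)
                y = next(b)
                while True:
                    if y < x:
                        yield y
                        y = next(b)   # StopIteration here loses the pending x
                    else:
                        yield x
                        x = next(a)   # StopIteration here loses the pending y
            except StopIteration:
                pass
            yield from a
            yield from b

        print(list(naive_merge([1, 3], [2])))  # [1, 2] -- the pending 3 is lost
        print(list(merge([1, 3], [2])))        # [1, 2, 3]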

    rhettinger commented 4 years ago

    Several thoughts:

    sweeneyde commented 4 years ago

    PR 17729 is a C implementation of a non-recursive "flattening" of the recursive lazy-mergesort algorithm into a tournament whose state is a tree of losers of comparisons.

    serhiy-storchaka commented 4 years ago

    This is a large change, and making heapq.merge a class looks unrelated to the declared goal. I am afraid it may add significant overhead.

    Since this is an optimization, could you please provide any benchmarks which we can use to check the performance boost?

    sweeneyde commented 4 years ago

    First, as I posted at https://github.com/python/cpython/pull/17729#issuecomment-571864662, there is a theoretical advantage of fewer comparisons in all cases, and the new algorithm would be especially dominant when one iterable keeps winning. (I'm given to understand that "lists very often do have exploitable partial order in real life" ;-)

    making heapq.merge a class looks unrelated to the declared goal.

    Is there a better way to implement a C accelerator for a Python function that returns a generator? I figured it would be better to have the pure-python implementation match the C implementation.

    As for the timing, I'm running Windows and pyperf gives a "WARNING: unable to increase process priority", and a "WARNING: no operation available for your platform" when I run "pyperf system tune", but I'm still able to get the following results:

    .\python.bat -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Master: Mean +- std dev: 88.0 ms +- 10.3 ms
    This PR: Mean +- std dev: 28.2 ms +- 1.0 ms

    The above I'll call "merging 20 of size 10_000." For "merging 2 of size 100_000", I get:

    Master: Mean +- std dev: 34.4 ms +- 3.2 ms
    This PR: Mean +- std dev: 10.6 ms +- 0.7 ms

    Merging 10_000 of size 100:

    Master: Mean +- std dev: 1.56 sec +- 0.11 sec
    This PR: Mean +- std dev: 628 ms +- 22 ms
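    (Presumably the variant runs above change only the setup; e.g. "merging 2 of size 100_000" would use something like the following.)

        iters = [sorted(random() for j in range(100_000)) for i in range(2)]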

    serhiy-storchaka commented 4 years ago

    Sorry, I did not notice that there is a C implementation in PR 18427. The changes in the Python implementation are so large that I thought they were the goal of the PR.

    Often the clearest and most efficient way to implement an iterator in Python is to write a generator function. In C you need to write a class with a __next__ method, but Python has a better way.
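    As a generic illustration of the two styles (a trivial iterator rather than merge itself):

        # Generator function: the idiomatic pure-Python iterator.
        def countdown(n):
            while n > 0:
                yield n
                n -= 1

        # Class-based iterator: what a C implementation has to mirror.
        class Countdown:
            def __init__(self, n):
                self.n = n
            def __iter__(self):
                return self
            def __next__(self):
                if self.n <= 0:
                    raise StopIteration
                self.n -= 1
                return self.n + 1

        assert list(countdown(3)) == list(Countdown(3)) == [3, 2, 1]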

    I have tested your first example with the Python implementation and got 93.9 msec on master vs 314 msec with PR 18427 applied. It is expected that the C implementation is faster than the Python implementation, but was there a need to make the Python implementation 3 times slower?

    sweeneyde commented 4 years ago

    The existing Python implementation is benefiting from the C accelerators for heapify and heapreplace. When forcing pure python using test.support, I get these results:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(1_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Master: Mean +- std dev: 73.1 ms +- 8.4 ms
    PR: Mean +- std dev: 63.7 ms +- 7.8 ms

    I think I remember getting similarly slightly-better results out of PyPy, but I will double-check.

    So while the existing "mixed C-and-Python" implementation would be deleted, the "true pure-Python" implementation would get faster, and a new pure-C implementation would be added that is much faster still. Is that an acceptable trade-off?

    Regarding using a generator for the Python implementation, is it okay if type(heapq.merge) gives <class 'function'> for the Python implementation but <class 'type'> for the C implementation? How much seemingly-irrelevant variation like this is acceptable between the APIs?

    sweeneyde commented 4 years ago

    My suspicion about PyPy was confirmed (my PyPy here is Python 3.6.1 (784b254d6699, Apr 16 2019, 12:10:48) [PyPy 7.1.1-beta0 with MSC v.1910 32 bit] on win32). In what follows, "heapq2.py" contained exactly the class-based merge Python implementation from PR 18427.

    pypy -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 191 ms +- 3 ms

    pypy -m pyperf timeit -s "from random import random; from heapq2 import merge; from collections import deque; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"

    Mean +- std dev: 69.3 ms +- 2.2 ms

    Another test: PyPy merging 2 iterables of size 100_000:

    heapq: Mean +- std dev: 91.4 ms +- 2.4 ms
    heapq2: Mean +- std dev: 52.4 ms +- 2.1 ms

    PyPy merging 10_000 iterables of size 100:

    heapq: Mean +- std dev: 2.74 sec +- 0.07 sec
    heapq2: Mean +- std dev: 781 ms +- 19 ms

    rhettinger commented 4 years ago

    Serhiy, I've got this one. It will still take a little while to get to it, but I've already invested work in it. The code idea is plausible but I want to think it through for both CPython and PyPy.

    serhiy-storchaka commented 4 years ago

    Are there any algorithmic optimizations in the Python version, or was it just rewritten from a generator function to a class? If there are, how hard would it be to keep the functional design?

    sweeneyde commented 4 years ago

    As Serhiy suggested, keeping the algorithm but making the Python implementation a generator again (as I recently changed in PR 18427) gives another performance boost (although this unrolling is many lines of code).

    Timing the C implementation:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from heapq import merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"
    
        On Master: Mean +- std dev: 66.9 ms +- 0.6 ms
        With PR:   Mean +- std dev: 17.3 ms +- 0.9 ms

    Timing the Python Implementation in CPython:

    .\python.bat -m pyperf timeit -s "from random import random; from collections import deque; from test import support; merge = support.import_fresh_module('heapq', blocked=['_heapq']).merge; iters = [sorted(random() for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"
    
        On Master: Mean +- std dev: 385 ms +- 3 ms
        With PR:   Mean +- std dev: 250 ms +- 2 ms

    Timing PyPy:

    pypy -m pyperf timeit -s "... from heapq import merge; ..." ...
    versus
    pypy -m pyperf timeit -s "... from heapq2 import merge; ..." ...
    
        Testing on PyPy 7.1.1: Mean +- std dev: 142 ms +- 2 ms
        This implementation:   Mean +- std dev: 38.0 ms +- 1.2 ms

    =============================================================================

    Another positive for this proposal I just discovered is that object.__eq__ no longer needs to be called at each comparison:

        from heapq import merge
        from collections import deque
    
        class Int(int):
            lt = eq = 0
            def __lt__(self, other):
                __class__.lt += 1
                return int.__lt__(self, other)
    
            def __eq__(self, other):
                __class__.eq += 1
                return int.__eq__(self, other)
    
        def comparisons(iterables):
            Int.lt = Int.eq = 0
            deque(merge(*iterables), maxlen=0)
            return Int.lt, Int.eq
    
        no_overlap = comparisons(
            # (0..999), (1_000..1_999), (2_000..2_999), ...
            map(Int, range(x, x+1_000))
            for x in range(0, 16_000, 1_000)
        )
    
        interleaved = comparisons(
            # (0,16,32,...), (1,17,33,...), (2,18,34,...), ...
            map(Int, range(x, 16_000, 16))
            for x in range(16)
        )
    
        print("No overlap: {:,} lt; {:,} eq".format(*no_overlap))
        print("Interleaved: {:,} lt; {:,} eq".format(*interleaved))

    Before:

    No overlap: 65,004 lt; 65,004 eq
    Interleaved: 64,004 lt; 64,004 eq

    After:

    No overlap: 32,000 lt; 0 eq
    Interleaved: 63,968 lt; 0 eq

    This comes from the way that tuples are compared by scanning item-wise for equality before comparing the first discrepancy. Using the positional information in the tree with the logic

    yield (right if right < left else left)

    requires only one rich comparison, while breaking ties with the stored index and the logic

    yield (b if (b, b_index) < (a, a_index) else a)

    requires two arbitrary rich comparisons (a != b, then a < b). This can be somewhat fixed with the logic

        if a_index < b_index:
            yield (b if b < a else a)
        else:
            yield (a if a < b else b)

    ...but this is another branch in the innermost loop.
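    A small illustration (assumed, not taken from any of the patches) of that double test in tuple comparison:

        class Loud(str):
            # Report every comparison so the double test is visible.
            def __eq__(self, other):
                print('eq:', self, other)
                return str.__eq__(self, other)
            def __lt__(self, other):
                print('lt:', self, other)
                return str.__lt__(self, other)
            __hash__ = str.__hash__

        # Tuple "<" first scans for the leading discrepancy with ==,
        # then compares that pair again with <.
        (Loud('a'), 1) < (Loud('b'), 2)   # prints "eq: a b" then "lt: a b"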

    ==============================================================================

    I also played with a "Tree of Losers" method[1] here[2], and on the same benchmarks I got:

    CPython (C): Mean +- std dev: 22.7 ms +- 0.2 ms
        (slower than in [PR 18427](https://github.com/python/cpython/pull/18427))
    CPython (Pure Python): Mean +- std dev: 197 ms +- 9 ms
        (faster -- likely because of fewer int operations)
    PyPy: Mean +- std dev: 38.8 ms +- 0.8 ms
        (about the same)

    Maybe some more optimizations could be made to that code.

    The two approaches should be "isomorphic" in some sense: both should do the same number of comparisons on the same data. But I'll emphasize the differences:

    Tree of Losers:

    PR 18427 "Tree of winners":

    Personally, I find the Tree of Winners approach more intuitive, but the Tree of Losers approach seems to be the one that comes up more in the literature.

    [1] https://en.wikipedia.org/wiki/K-way_merge_algorithm
    [2] https://github.com/sweeneyde/cpython/tree/replacement-selection

    rhettinger commented 4 years ago

    The nested generators approach looks promising. I hope you keep working on that :-)

    I'm not too keen on PR 18427 because it is a massive explosion in code volume and complexity.

    With some continued effort, I expect we'll get to something much simpler and more maintainable. The existing code already performs well for typical applications, so there is no need to "go gonzo" and throw a massive wall of code at the problem. The ensuing maintenance costs would be too high.

    To save time and effort, please hold off on a C implementation until we've agreed to a pure Python version of the algorithm.

    For timings, try using strings, tuples, or dataclass instances. Timing integer inputs isn't representative of how merge() is used and it tends to exaggerate improvements. For interesting merges, the dominant cost is the comparison step.
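    For example, an assumed string-based variant of the earlier benchmark (not a command from this thread) might look like:

        .\python.bat -m pyperf timeit -s "from random import random; from heapq import merge; from collections import deque; iters = [sorted(format(random(), '.12f') for j in range(10_000)) for i in range(20)]" "deque(merge(*iters), maxlen=0)"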

    By switching to a binary tree arrangement, the payoff we're aiming for is to avoid the double comparison in the tuple inequality logic — "('a',1)<('b',2)" tests both 'a'=='b' and 'a'<'b'. I suggest aiming for the simplest possible code that avoids the double test.

    sweeneyde commented 4 years ago

    The attached recursive_merge.py should be much less ugly and still somewhat performant.

    It should be the same algorithm as that PR, just written recursively rather than iteratively.
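    As a rough sketch of the idea (hypothetical code, not the attached recursive_merge.py, and without the key/reverse handling), the iterables can be combined pairwise into a balanced tree of two-way merging generators:

        def _merge2(a, b):
            # Merge two sorted iterators; ties go to a, which keeps the merge stable.
            try:
                x = next(a)
            except StopIteration:
                yield from b
                return
            for y in b:
                while not (y < x):
                    yield x
                    try:
                        x = next(a)
                    except StopIteration:
                        yield y
                        yield from b
                        return
                yield y
            yield x
            yield from a

        def merge(*iterables):
            # Pair the iterables into a roughly balanced tree of _merge2 generators.
            iters = list(map(iter, iterables))
            if not iters:
                return iter(())
            while len(iters) > 1:
                merged = [_merge2(a, b) for a, b in zip(iters[::2], iters[1::2])]
                if len(iters) % 2:
                    merged.append(iters[-1])   # odd one out advances unmerged
                iters = merged
            return iters[0]

        print(list(merge([1, 4], [2, 3], [5])))  # [1, 2, 3, 4, 5]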

    I got some text files from http://www.gwicks.net/dictionaries.htm and tried merging them line-by-line:

    py -3.9 -m pyperf timeit -s "from heapq import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    Mean +- std dev: 391 ms +- 9 ms

    py -3.9 -m pyperf timeit -s "from recursive_merge import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    Mean +- std dev: 262 ms +- 9 ms

    Perhaps that's a more real-world benchmark.

    rhettinger commented 4 years ago

    For comparison, I'm attaching an iterative version that manipulates the tree nodes with all local variables. Try it out and see what you think.

    sweeneyde commented 4 years ago

    It seems to me that the code sprawl mostly comes from the separate handling of the four keyed/unkeyed and forward/reverse cases, which as far as I can tell requires a branch in the innermost loop if not unrolled into separate cases. I think recursive_merge.py is about as concise as possible while still efficiently handling all four cases.

    Is there any issue with recursion in this application? Even if there are 2**64-1 iterables, this would only mean 64 nested next() calls, which should be within the stack on most machines, no?
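    (A quick check of that depth estimate, assuming the balanced pairwise nesting sketched above: each round halves the number of generators, so the nesting depth for k iterables is ceil(log2(k)).)

        def merge_depth(k):
            # Nesting depth of a balanced pairwise-merge tree over k >= 1 iterables.
            return (k - 1).bit_length()   # equals ceil(log2(k))

        print(merge_depth(2**64 - 1))  # 64
        print(merge_depth(5))          # 3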

    I did the following benchmark:

    py -3.9 -m pyperf timeit -s "from [FILENAME] import merge; from collections import deque" "deque(merge(open('english.txt'), open('dutch.txt'), open('french.txt'), open('german.txt'), open('italian.txt')), maxlen=0)"

    new_merge.py:          Mean +- std dev: 773 ms +- 16 ms
    tournament_heap.py:    Mean +- std dev: 665 ms +- 42 ms
    losers.py:             Mean +- std dev: 470 ms +- 31 ms
    Existing heapq:        Mean +- std dev: 313 ms +- 2 ms
    recursive_merge.py:    Mean +- std dev: 266 ms +- 3 ms

    I can look at some more diverse benchmarks (types, iterable length imbalance, lengths of an iterator's win-streak, etc.) soon.

    rhettinger commented 4 years ago

    Am leaning toward the iterative approach in new_merge.py because it most directly implements the core concept of storing the data in a binary tree.

    Merits: no recursion, no nested generators, avoids the clutter of left-empty and right-empty checks, easy to understand, simple and fast loops, all local variables or instance variables, no calls to the memory allocator after the initial tree construction, runs well on PyPy, and easily Cythonized.

    sweeneyde commented 4 years ago

    I mostly like new_merge.py too, especially the dynamic reduction of the tree.

    However, it looks like list(merge([2],[1],[1])) currently fails, and I think what's missing is the following in the sibling-promotion:

    + if sibling.left is not None:
    +     sibling.left.parent = sibling.right.parent = parent

    Also for what it's worth, I think both appearances of "c1 if c1.value < c2.value else c2" should become "c2 if c2.value < c1.value else c1" for stability.

    I've attached winners.py, which is similar to new_merge.py, but replaces the "while node.left is not None: node = node.source" traversal with a single "node = node.leaf" call and adds a fast yield-from for the last iterator. I don't think it's much more complex either.

    sweeneyde commented 4 years ago

    key_and_reverse.py employs the same strategy as winners.py, but uses lists as the nodes of the tree rather than using Node instances. It also eliminates the recursion of treeify, and adds (with neither much of a performance hit nor much code duplication) support for key=some_func and/or reverse=True keyword arguments, so it now passes the test_heapq suite.

    rhettinger commented 4 years ago

    Thanks. I'll look at this shortly. We're getting much closer.

    Just so that I don't lose other work I've done, am uploading new_merge2.py with inline, iterative construction of the initial tree.

    sweeneyde commented 4 years ago

    less_movement.py is my favorite so far. It still handles key and reverse, but uses instance attributes instead of the list indices I tried before. It does this by advancing only the "key" and "leaf" attributes up toward the root (where the key is just the value if key is None), while the value is stored only in the leaf. Since the "value" attribute is no longer used except at the leaves, we can pack a leaf's value into its left slot and reduce the number of slots.

    This seems to run very well on PyPy, and it looks like it would be pretty receptive to a C implementation, which I would be happy to work on.

    sweeneyde commented 4 years ago

    PR 20550 uses a linked structure like what we've been talking about.