python / cpython


heapq module could be more efficient #102061

Closed msoxzw closed 1 year ago

msoxzw commented 1 year ago

Both the Python heap implementation and the C heap implementation are roughly based on bottom-up heapsort, although the function names are the reverse of the usual convention. Compared to the original scheme described on Wikipedia, the CPython implementation performs unnecessary additional item swaps in both "siftup" and "siftdown" instead of only in "siftup". GCC adopts the same approach.

Furthermore, CPython implements heap-based partial sorting, i.e. nsmallest, with O(n log k) time complexity rather than O(n + k log n), where in practice k is much smaller than n. Likewise, GCC takes the same approach.
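For reference, the textbook O(n + k log n) strategy can be sketched with the public heapq API roughly like this (my own sketch, not CPython's code):

```python
import heapq

def nsmallest_heapify_all(k, iterable):
    """Textbook partial sort: heapify all n items in O(n), then pop the
    k smallest in O(k log n).  Requires all n items in memory at once."""
    data = list(iterable)
    heapq.heapify(data)                                              # O(n)
    return [heapq.heappop(data) for _ in range(min(k, len(data)))]  # O(k log n)
```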

So I am curious why both CPython and GCC implement heaps in this seemingly inefficient way.

terryjreedy commented 1 year ago

@rhettinger

pochmann commented 1 year ago

Both _siftup and nsmallest have a detailed explanation and analysis right above them, at the spots you linked to, explaining why it's more efficient that way.

rhettinger commented 1 year ago

As Stefan Pochmann observed, the explanatory comments are in the source.

In particular, having a single siftdown at the end saves exit tests in the inner loop. The comment cites testing that provides empirical evidence that this saves a substantial number of comparisons. Also, note that in Python the object comparisons dominate the running time.
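Schematically, the difference between the two strategies looks like this (a simplified sketch with illustrative names, not the actual heapq code):

```python
def sift_hole_down_topdown(heap, pos):
    # Classic top-down approach: two comparisons per level -- one to pick
    # the smaller child and one exit test against the item being sifted.
    item = heap[pos]
    end = len(heap)
    child = 2 * pos + 1
    while child < end:
        right = child + 1
        if right < end and heap[right] < heap[child]:
            child = right                     # pick the smaller child
        if not heap[child] < item:            # exit test on every iteration
            break
        heap[pos] = heap[child]
        pos = child
        child = 2 * pos + 1
    heap[pos] = item

def sift_hole_down_bottomup(heap, pos):
    # heapq-style bottom-up approach: only one comparison per level on the
    # way down (choosing the smaller child), then the displaced item is
    # bubbled back up, which usually takes just a couple of comparisons
    # because an item taken from a leaf tends to belong near the bottom.
    start = pos
    item = heap[pos]
    end = len(heap)
    child = 2 * pos + 1
    while child < end:
        right = child + 1
        if right < end and heap[right] < heap[child]:
            child = right
        heap[pos] = heap[child]               # move the smaller child up
        pos = child
        child = 2 * pos + 1
    while pos > start:                        # bubble the item back up
        parent = (pos - 1) >> 1
        if item < heap[parent]:
            heap[pos] = heap[parent]
            pos = parent
        else:
            break
    heap[pos] = item
```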

The nsmallest and nlargest algorithms are designed to run in a single pass, keeping no more than k items in memory and making the fewest comparisons in the common case (small k and large n). When the data is in random order, they perform much better than the O(n log k) worst case; for random data, the number of comparisons approaches that of min().
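The idea behind the single-pass scheme, sketched with the public API for the nlargest direction (nsmallest mirrors it with a max-heap); this is only an illustration, not the module's actual code:

```python
import heapq
from itertools import islice

def nlargest_sketch(k, iterable):
    """One pass over the data, never holding more than k items: keep a
    min-heap of the k largest seen so far and only touch the heap when a
    new element beats the smallest kept one."""
    it = iter(iterable)
    result = list(islice(it, k))      # the first k items seed the heap
    if not result:
        return result
    heapq.heapify(result)             # result[0] is the smallest kept item
    top = result[0]
    for elem in it:
        if top < elem:                          # cheap guard, one comparison
            heapq.heapreplace(result, elem)     # O(log k) only when it passes
            top = result[0]
    result.sort(reverse=True)
    return result
```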

To explore the question further, please try StackOverflow. This isn't the right forum for "I'm curious why ..." questions.

msoxzw commented 1 year ago

Thanks a lot, and I will try StackOverflow.

After carefully rereading those explanatory comments, my understanding is that CPython's siftup is a variant of bottom-up heapsort with an equal number of comparisons but more item swaps, and that the comments reiterate its advantage over the top-down version.

As for the nsmallest and nlargest algorithms, they make perfect sense if keeping no more than k items in memory is a hard requirement. By the way, O(n log k) is not only the worst case, but also the best case and the average case.

pochmann commented 1 year ago

By the way, O(n log k) is not only the worst case, but also the best case

No, you're overlooking the if elem < top: guard that commonly (and in the best case always) avoids the heap operation:

https://github.com/python/cpython/blob/60bbed7f174e481d3fc69984430a4667951678b3/Lib/heapq.py#L497-L499

msoxzw commented 1 year ago

By the way, O(n log k) is not only the worst case, but also the best case

No, you're overlooking the if elem < top: guard that commonly (and in the best case always) avoids the heap operation:

https://github.com/python/cpython/blob/60bbed7f174e481d3fc69984430a4667951678b3/Lib/heapq.py#L497-L499

So the best case is O(k+n).
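A quick way to check this is to count comparisons directly (a rough sketch; exact counts depend on heapq internals):

```python
import heapq

class Counted:
    """Wrapper whose __lt__ counts how many comparisons heapq performs."""
    lt_calls = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.lt_calls += 1
        return self.value < other.value

def comparisons(data, k):
    Counted.lt_calls = 0
    heapq.nsmallest(k, [Counted(v) for v in data])
    return Counted.lt_calls

n, k = 100_000, 10
print(comparisons(range(n), k))         # ascending input: the guard rejects almost
                                        # every element, roughly n comparisons
print(comparisons(range(n, 0, -1), k))  # descending input: every element enters
                                        # the heap, several times more comparisons
```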