Open sweeneyde opened 3 years ago
Anecdotally, a few people I've talked to have expected that match-case statements would improve performance in the same way that switch-cases sometimes do in C-like languages. While this is impossible in general without changing semantics (since, for example, `__eq__` can have side effects), it would be nice if a jump table were implemented in certain safe cases.
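To make the hazard concrete, here is a minimal sketch (the `Chatty` class is hypothetical, purely for illustration) of why dispatching on the subject's hash can't replace the equality checks in the general case:

```python
# A subject whose __eq__ has an observable side effect: a jump table
# that dispatched purely on hash(x) would silently skip the print call.
class Chatty(int):
    def __eq__(self, other):
        print("comparing!")
        return int.__eq__(self, other)
    __hash__ = int.__hash__  # defining __eq__ would otherwise unset this

match Chatty(10):
    case 10:  # the general machinery must still invoke Chatty.__eq__
        print("matched")
```

Guarding the fast path with exact-type checks (as proposed below) sidesteps this, since exact int/str/None comparisons are side-effect free.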
My idea was to implement this whenever the list of cases begins with 2 or more of the following in a row:
(1) ints
(2) strings
(3) None
(4) OR (`|`) patterns of any of the above
(5?) potentially later add support for tuples
Example:
```python
match x:
    case 10:  # safe
        print("A")
    case 20:  # safe
        print("B")
    case 30 | 31:  # safe
        print("C")
    case 40:  # safe
        print("D")
    case MyClass(10, 20):  # unsafe
        print("E")
    case []:  # unsafe
        print("F")
    case whatever:  # unsafe
        print("G")
```
This would compile to something like:

```
LOAD_FAST (x)
LOAD_CONST 16 ({10: 16, 20: 38, 30: 80, 31: 80, 40: 102})
ATTEMPT_SAFE_MATCH 110
... all the rest of the code is the same ...
```
where the new opcode would be something like:
```c
case TARGET(ATTEMPT_SAFE_MATCH): {
    PyObject *jump_dict = POP();
    PyObject *x = TOP();
    if (PyLong_CheckExact(x) ||
        PyUnicode_CheckExact(x) ||
        Py_Is(x, Py_None))
    {
        // Borrowed reference; no decref of the result needed.
        PyObject *boxed = PyDict_GetItemWithError(jump_dict, x);
        Py_DECREF(jump_dict);
        if (boxed == NULL) {
            if (PyErr_Occurred()) {
                goto error;
            }
            // No entry in the table: jump past all of the "safe" cases.
            JUMPBY(oparg);
            DISPATCH();
        }
        assert(PyLong_CheckExact(boxed));
        int target = (int)PyLong_AsLong(boxed);
        JUMPBY(target);
        DISPATCH();
    }
    else {
        // fall back to general matching code
        Py_DECREF(jump_dict);
        DISPATCH();
    }
}
```
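In pure-Python terms, the opcode would behave roughly like this (a sketch with illustrative names, not part of the proposal):

```python
# Rough Python equivalent of ATTEMPT_SAFE_MATCH: returns a jump target,
# or None to signal "fall through to the general matching code".
def attempt_safe_match(x, jump_dict, no_match_target):
    # Exact-type checks, mirroring PyLong_CheckExact/PyUnicode_CheckExact
    # (note that bool instances are deliberately excluded).
    if type(x) in (int, str) or x is None:
        return jump_dict.get(x, no_match_target)
    return None
```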
I can work on an implementation in the next couple of weeks.
Big +1 from me. This would make use of match-case more compelling.
Seems like a good idea as long as we're careful about the implementation. I've just jotted down a few notes here:
We should probably break the table upon encountering a guard.
We probably can't get away with storing a dictionary in the code object's constants, since the entire code object needs to be hashable. We could store a tuple of key-value pairs and either (a) build the dictionary each time we enter the block, or (b) stash it somewhere. If we stash it, the open question is where (more on that below).
We should probably have some heuristic perhaps more restrictive than "two or more of the following in a row". I'm not entirely sure that loading/building a dictionary, hashing an integer/string/None, looking it up in a dictionary, and then either (a) recovering from the error, or (b) converting the Python integer value into a C long and jumping on it is faster than the current implementation for the first two cases. I know it's really fast, but I'm not *sure* it will be an obvious win yet. Ideally, the best-case scenario (match on first case) would be just as fast as it is currently. I'm probably willing to speed up the average case at the expense of the best case, though.
I'm not sure we need to limit this to leading cases. What's stopping us from having one of these jump tables in the middle of a match block (or even having more than one scattered throughout)? Unless I'm missing something, this should work anytime we observe contiguous "simple" cases.
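For instance, in a hypothetical match block like this (reusing MyClass from the example above), only the middle run of simple cases would get a table:

```python
match x:
    case [a, b]:        # general matching machinery
        ...
    case 1 | 2:         # contiguous "simple" cases: one jump
        ...             # table with keys 1, 2, and 3
    case 3:
        ...
    case MyClass():     # back to general matching
        ...
```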
At first I thought only of matches with int/str/None cases, and then I saw the easy generalization; yeah, each contiguous block seems better.
One solution to the apparent dilemma between code-object hashability and lookup speed: writing the equivalent of this in C:
```python
class HashableDict(dict):
    def __new__(cls, pairs):
        return dict.__new__(cls, pairs)
    def __eq__(self, other):
        return tuple(self.items()) == tuple(other.items())
    def __hash__(self):
        return CONSTANT ^ hash(tuple(self.items()))
    def __getnewargs__(self):
        return tuple(self.items())
    # forbid all mutating methods
    __setitem__ = __delitem__ = update = __ior__ = None  # etc.
```
(Or maybe using composition rather than inheritance.)
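Hypothetical usage, if the compiler stored one of these in co_consts (CONSTANT is given an arbitrary value here just so the sketch runs):

```python
CONSTANT = 0x9E3779B9  # arbitrary salt, for illustration only

table = HashableDict({10: 16, 20: 38, 30: 80, 31: 80, 40: 102})
assert table[30] == table[31] == 80  # OR patterns share one jump target
hash(table)  # hashable, unlike a plain dict, so co_consts stays hashable
```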
Maybe using the HAMT in /Python/hamt.c would suffice instead.
Hm. I’m not sure if the HAMT makes much sense. We don’t really care about efficient mutation or copies... constant-time lookups seem to be overwhelmingly the most valuable factor here.
I personally think that creating some sort of hashable dict and teaching marshal how to serialize/deserialize it could be a promising path forward. I imagine its actual design could mirror dict (sort of like set/frozenset).
That might be a rather drastic step though. Perhaps it’s better to first prove that building a mapping at runtime is too slow.
I agree that we should benchmark to find what the value of "two" should be ;). Maybe only match statements with >=20 consecutive simple cases end up being worth it. I would assume that if "case 1, case 2, ..., case 20" are all in a row, then "case 1" wouldn't be *that* much more likely than "case 20", so a slowdown on case 1 wouldn't matter too much if the average gets better.
I'm under the impression that match-case statements would frequently be used only once per function call. If I understand your proposal, that would mean that calling a function with an N-case constant-pattern match would require N hashes and N insertions into a dict (including memory allocation), followed by an O(1) lookup. Wouldn't that eliminate any hope of making such a function O(1)? Unless there's a way to cache the populated dict somewhere else?
I suggested HAMT because it's a well-tested pre-existing fast immutable mapping that already implements __hash__, so it would be safe to store in co_consts already populated. Re-implementing all of the dict logic seems like an unnecessary pain, which is why I was suggesting either the HAMT or a thin wrapper around dict, not a re-implementation of a new hash table. I was hoping to do the minimum amount of disruption possible to get reasonable O(1) performance, and then seeing where that leaves us.
If I understand your proposal, that would mean that calling a function with an N-case constant-pattern match would require N hashes and N insertions into a dict (including memory allocation), followed by an O(1) lookup. Wouldn't that eliminate any hope of making such a function O(1)? Unless there's a way to cache the populated dict somewhere else?
Yeah, that was the idea sketched out above. Basically, if we go with a vanilla dictionary here, we're going to need to build it at runtime. It only really makes sense to do that once, the first time we need it. Then we stash it away... uh... somewhere. As you note, it doesn't make much sense to spend linear time building a constant-time jump table if it's only going to be used once. :)
Maybe somebody familiar with the constantly-evolving opcaches could chime in if this is the sort of thing that could benefit from that infrastructure. Basically, we would need a separate cache per bytecode offset, per code object. My understanding is that we don't have anything like that yet.
(A quick-and-dirty prototype could probably store them at "hidden" local names like ".table0", ".table1", ".table2", etc. I know comprehensions do something like that.)
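As a rough illustration of that hidden-name trick (a sketch; the names `_table0` and `handle` are made up here, and real generated code would use an unspeakable name like ".table0"):

```python
_table0 = None  # stand-in for a hidden per-code-object slot

def handle(x):
    global _table0
    if _table0 is None:  # first call: build the jump table once
        _table0 = {10: "A", 20: "B", 30: "C"}
    return _table0.get(x, "default")
```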
Re-implementing all of the dict logic seems like an unnecessary pain, which is why I was suggesting either the HAMT or a thin wrapper around dict, not a re-implementation of a new hash table.
Well, I don't think we'd need to do any of that. I believe set and frozenset share the same core design and routines, but differ only in the interfaces provided by the objects themselves. I imagine we could do something similar with a hypothetical _PyFrozenDict_Type... copy most of the type definition, dropping all of the methods except mp_subscript, tp_hash, and a few other members. That would probably be all we needed to get this design up and running for a proof-of-concept.
A lot of work goes into making dicts fast (especially for things like strings), so it would be nice for a new type to be able to benefit from those incremental improvements.
I was hoping to do the minimum amount of disruption possible to get reasonable O(1) performance, and then seeing where that leaves us.
Agreed, but the HAMT mapping has logarithmic time complexity for lookups, right? Maybe for realistic cases the coefficients make it basically equivalent, but on the surface it seems more promising to try reusing the dict guts instead.
FWIW PEP-603 includes a graph (Figure 2) of dict.get versus hamt.get performance as the mappings grow, and it seems the hamt is roughly 1.4x slower on inputs of sizes between 10 and 1000.
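For a rough way to reproduce that sort of comparison locally, one could use the third-party immutables package (an assumption on my part; it implements the same HAMT design as Python/hamt.c):

```python
import timeit
from immutables import Map  # pip install immutables

d = {i: i for i in range(1000)}
m = Map(d)
# Time a single successful lookup in each mapping.
print("dict:", timeit.timeit(lambda: d[500], number=1_000_000))
print("hamt:", timeit.timeit(lambda: m[500], number=1_000_000))
```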
For anyone curious, I had some free time today and took a stab at creating a minimal _frozendict type (sharing as much implementation with dict as possible) to see how difficult it would be.
It was actually much simpler than I thought... just a few dozen lines of code, including marshal support. If you'd like to see it, you can check out my "frozendict" branch here:
https://github.com/python/cpython/compare/main...brandtbucher:frozendict
For testing purposes, I've exposed it in the _testcapi module. It has the same constructor signature as dict, and basically only supports lookups:
```
>>> from _testcapi import _frozendict
>>> fd = _frozendict({1: 2, "3": 4, None: 5})
>>> fd
<_frozendict object at 0x7f4e127e4ef0>
>>> fd[1]
2
>>> fd[3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 3
>>> import marshal
>>> marshal.loads(marshal.dumps(fd)) == fd
True
```
I'm not gonna lie... I really like it. It sidesteps any runtime construction/caching issues that using a normal dict would have, but with all of the same performance characteristics. It also seems like it would not introduce much of a maintenance burden, since it has very little of its own code.
Anyways, food for thought. It was a fun experiment at the very least.
Very interesting -- that is shorter than I thought also!
If we really wanted to optimize the tar out of this, we could probably find a way to re-use just the PyDictKeysObject to find the index into a C-array of integer jump targets and not have to worry about unboxing.
I'm hoping we can get something close to that for free by just waiting... the faster-cpython folks are working on a tagged-pointer implementation of integers as we speak.
I've been looking for a new project, so I'd love to help work on this issue (if you're open to it). The pattern compiler is a bit of a beast, and I like to think I have a better-than-average understanding of it. ;)
Hi Brandt,
I would welcome your collaboration/mentorship in whatever capacity makes sense. I sent an email to brandt@python.org from sweeney.dennis650@gmail.com.
Are you sure you want to do this?
This optimisation is not applicable if the matched values are given symbolic names. You would be encouraging people to write bad code with lots of literals, for speed.
Are you sure you want to do this?
No, not yet. But there has been some positive feedback here, and it seems like an interesting project to prototype. :)
This optimisation is not applicable if the matched values are given symbolic names. You would be encouraging people to write bad code with lots of literals, for speed.
I had the same concern initially, but I'm honestly not losing too much sleep over it. Literals already get lots of optimizations that symbolic constants don't, and to me it seems a bit silly to avoid optimizing literals here when it is quite straightforward (and powerful) to do so. There are also many cases where using symbolic constants for certain literals doesn't make much sense.
I'd be okay not loudly publicizing this change if it lands. After all, I'm the guy who spent the better part of a year trying to convince people that this is Not A Switch Statement. :)
(Although, on the flip side, maybe it will help appease the people who have been asking for one all these years.)
Here are some benchmarks:
```
PS C:\Users\sween\Source\Repos\cpython2\cpython> .\python.bat -m pyperf compare_to .\main.json .\PR-26697.json
Running Release|x64 interpreter...
1: Mean +- std dev: [main] 117 us +- 4 us -> [PR-26697] 122 us +- 3 us: 1.04x slower
2: Mean +- std dev: [main] 202 us +- 11 us -> [PR-26697] 180 us +- 7 us: 1.12x faster
3: Mean +- std dev: [main] 320 us +- 11 us -> [PR-26697] 243 us +- 13 us: 1.32x faster
4: Mean +- std dev: [main] 474 us +- 15 us -> [PR-26697] 306 us +- 15 us: 1.55x faster
5: Mean +- std dev: [main] 625 us +- 27 us -> [PR-26697] 341 us +- 6 us: 1.83x faster
6: Mean +- std dev: [main] 806 us +- 24 us -> [PR-26697] 404 us +- 20 us: 1.99x faster
7: Mean +- std dev: [main] 1.01 ms +- 0.04 ms -> [PR-26697] 461 us +- 19 us: 2.19x faster
8: Mean +- std dev: [main] 1.22 ms +- 0.03 ms -> [PR-26697] 538 us +- 20 us: 2.27x faster
9: Mean +- std dev: [main] 1.47 ms +- 0.05 ms -> [PR-26697] 607 us +- 27 us: 2.42x faster
10: Mean +- std dev: [main] 1.75 ms +- 0.07 ms -> [PR-26697] 666 us +- 36 us: 2.62x faster
```
How common are match-case statements with literal integers? I expect that in most cases such magic numbers are not used directly, but rather as named constants.
How common are match-case statements with literal integers?
Nothing is common yet because the feature hasn't been released ;-)
I suspect that if match-case were optimized for integer constants, they would be used. In discussions about case statements, there are two groups, one looking for a prettier if-elif-else chain and the other looking for a fast vectored jump. This optimization is aimed at the latter group -- they would be happy to have this opportunity to make faster code, even if it means using integer constants.
This is going in the wrong direction.
Rather than add more complex instructions for use only by pattern matching, we need to simplify the individual operations and re-use existing instructions. That way pattern matching can benefit from the general performance improvements that we are making.
If you are serious about improving the performance of pattern matching, you need to do the following in the compiler:
E.g.
```python
match x:
    case [1]:
        A
    case [2]:
        B
```
The initial decision tree looks like:
```python
if is_sequence(x):
    if len(x) == 1:
        if x[0] == 1:
            A
if is_sequence(x):
    if len(x) == 1:
        if x[0] == 2:
            B
```
which can be improved to:
```python
if is_sequence(x):
    if len(x) == 1:
        if x[0] == 1:
            A
        elif x[0] == 2:
            B
```
For a sequence of integer constants, introducing the test `type(x) == int` at the start would allow you to convert the linear sequence of tests into a tree, reducing `n` tests to `ln(n) + 1` tests.
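To sketch what that tree might look like for integer cases 1 through 8 (illustrative pseudocode, not actual compiler output):

```python
# Once type(x) is known to be exactly int, a chain of eight equality
# tests can become a balanced tree of range checks: a logarithmic
# number of tests instead of a linear scan.
if type(x) is int:
    if x <= 4:
        if x <= 2:
            ...  # test x == 1, then x == 2
        else:
            ...  # test x == 3, then x == 4
    else:
        if x <= 6:
            ...  # test x == 5, then x == 6
        else:
            ...  # test x == 7, then x == 8
```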
Sorry, I’ve been out on vacation this weekend. I didn’t realize that there was already a PR for this… I’m honestly not sure that it’s totally ready yet.
While I absolutely agree that compiling efficient decision trees lies in our future, it certainly seems to me that using *both* strategies (decision trees and jump tables) would create the most efficient branching structure. In effect, our control flow would be a rose tree.
In your example, Mark, we could perform the checks for the integers 1 and 2 simultaneously, right?
Or consider a slightly more complicated example (simplified from PEP-636):
match …: case ["quit"]: … case ["look"]: … case ["get", obj]: … case ["go", direction]: …
We could compile this to something like:
…which looks ideal to me.
I’m also confident that the jumping ops could also be broken up into fairly primitive instructions. A multi-way branch would just look something like:
```
(subject on TOS)
LOAD_CONST (<some hashable defaultdict>)
ROT_TWO
BINARY_SUBSCR
JUMP_FORWARD_TOS
```
…where only JUMP_FORWARD_TOS is new.
For a sequence of integer constants, introducing the test `type(x) == int` at the start would allow you to convert the linear sequence of tests into a tree. Reducing `n` tests to `ln(n) + 1` tests.
Sorry, I’m having a bit of difficulty following this part. Is it something like the strategy I just described?
Also (because some of the people who might be interested are on the nosy list for this issue), I’ve been working a bit on general performance benchmarks for our pattern-matching implementation:
https://github.com/brandtbucher/patmaperformance
I still need something that hits sequence patterns hard, but I think these will help give us a good baseline for measuring future perf work in this area.
(There’s also the perf script at the bottom of Lib/test/test_patma.py, but the test cases it runs on are probably much less representative of “real” code.)
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['interpreter-core', '3.11', 'performance']
title = 'Add jump table for certain safe match-case statements'
updated_at =
user = 'https://github.com/sweeneyde'
```
bugs.python.org fields:
```python
activity =
actor = 'brandtbucher'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation =
creator = 'Dennis Sweeney'
dependencies = []
files = ['50106']
hgrepos = []
issue_num = 44283
keywords = ['patch']
message_count = 20.0
messages = ['394874', '394878', '394882', '394884', '394887', '394889', '394891', '394892', '394975', '394977', '395019', '395069', '395265', '395278', '395717', '395732', '395738', '395797', '395809', '395813']
nosy_count = 9.0
nosy_names = ['rhettinger', 'Mark.Shannon', 'serhiy.storchaka', 'corona10', 'brandtbucher', 'Dennis Sweeney', 'AndersMunch', 'kj', 'Kshitiz17']
pr_nums = ['26697']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue44283'
versions = ['Python 3.11']
```