gvanrossum opened this issue 1 year ago
Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice.
UPDATE: See gh-109039.
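The scheme described above can be sketched in a few lines (a toy model, not CPython's actual implementation; the 16-bit register starts at the alternating pattern 0x5555):

```python
# Sketch of the 16-bit branch-history idea (toy model, not CPython code):
# shift in 1 for taken, 0 for not taken, then count bits to estimate bias.

INIT_PATTERN = 0x5555  # alternating ones and zeros: maximum uncertainty

def record(history: int, taken: bool) -> int:
    """Shift the branch outcome into a 16-bit history register."""
    return ((history << 1) | int(taken)) & 0xFFFF

def is_likely_taken(history: int) -> bool:
    """The jump is 'likely' if more than half of the 16 recorded bits are ones."""
    return bin(history).count("1") > 8

history = INIT_PATTERN
for _ in range(16):          # observe 16 taken branches in a row
    history = record(history, True)
assert is_likely_taken(history)        # all ones -> clearly likely

for _ in range(16):          # now 16 not-taken branches
    history = record(history, False)
assert not is_likely_taken(history)    # all zeros -> clearly unlikely
```

Note that the initial 0x5555 pattern has exactly 8 one bits, the "tough choice" half-half case.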
Q. In Tier 2, should we have POP_JUMP_IF_XXX, which (like its Tier 1 counterpart) combines a pop and a branch, or should we emit separate POP_TOP instructions?
Emitting separate POP_TOP uops seems perhaps simpler, but it's less performant to interpret, since the explicit POP_TOP will call Py_DECREF() on the stack top, whereas POP_JUMP_IF_TRUE/FALSE won't need to do that at all (True and False are immortal), and POP_JUMP_IF_[NOT_]NONE can skip it in the None case.
I don't have much intuition yet about which will be easier for the optimizer to handle. It's easy to change.
This also reminds me of a question for @brandtbucher: assuming that at some point the copy-and-patch machinery will use a Tier 2 uop sequence as input, how would you want to handle hand-written uops like JUMP_IF_FALSE or SAVE_IP? (If the answer is long or complicated, we should take this to the faster-cpython/ideas tracker.)
Similarly, in his original guidance @markshannon suggests that POP_JUMP_IF_NONE be translated into LOAD_CONST None; IS_OP; POP_JUMP_IF_TRUE (and similarly for POP_JUMP_IF_NOT_NONE). I think that if we decide to do that, it should be done in compile.c when generating Tier 1 bytecode, since in the optimizer it would be hard to ensure there is a co_consts entry for None. In the short term it might be simpler to just add additional uops JUMP_IF_[NOT_]NONE with hand-written implementations and translations. If we're doing it in compile.c I'd like to enroll @iritkatriel's help. But it definitely feels like a pessimization to replace one bytecode instruction with three (and the same for uops in the Tier 2 interpreter).
UPDATE: Never mind, Mark did this in GH-106599 using a macro in bytecodes.c.
POP_JUMP_IF_[TRUE|FALSE] seems not only faster than JUMP_IF_[TRUE|FALSE], but more consistent with other operations. The VM is a stack machine; it is normal for an operation to consume its argument(s).
markshannon suggests that POP_JUMP_IF_NONE be translated into LOAD_CONST None; IS_OP; POP_JUMP_IF_TRUE

The point of my original suggestion was that using additional micro-ops can help us to reduce the number of branch micro-ops. As you point out, None might not be present in co_consts, so we would want to translate POP_JUMP_IF_NONE as IS_NONE; POP_JUMP_IF_TRUE.
We can add as many micro-ops as we want. We just want to minimize the number of micro-ops that need special treatment, like branches, jumps and calls.
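As a toy illustration (not real CPython code; the helper names are made up), the split preserves behavior: the fused micro-op and the two-op sequence make the same branch decision and leave the same stack behind:

```python
# Toy stack-machine model showing POP_JUMP_IF_NONE == IS_NONE; POP_JUMP_IF_TRUE.
# Helper names are hypothetical; stacks are plain Python lists.

def pop_jump_if_none(stack):
    """Fused micro-op: pop TOS, report whether to take the branch."""
    return stack.pop() is None

def is_none(stack):
    """Replace TOS with the boolean 'TOS is None'."""
    stack.append(stack.pop() is None)

def pop_jump_if_true(stack):
    """Pop TOS, report whether to take the branch."""
    return stack.pop() is True

# Both sequences agree for None and non-None values alike.
for value in (None, 0, "x"):
    s1, s2 = [value], [value]
    taken1 = pop_jump_if_none(s1)
    is_none(s2)
    taken2 = pop_jump_if_true(s2)
    assert taken1 == taken2 and s1 == s2
```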
how would you want to handle hand-written uops like... SAVE_IP...
All micro-ops are hand written, so SAVE_IP
would be handled like any other simple micro-op.
Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice.
How is this different from a counter, where you add for a jump taken and subtract for a jump not taken? I understand it records some recency information, but if that isn't used, isn't it better to get a more accurate number (with a memory larger than 16 entries)?
A counter doesn't tell you anything about the recent history. A branch that goes one way 50 times and then the other way 50 times is different from one that alternates. I suspect it won't matter much in practice, though, as long as we have some hint as to the bias of the branch.
An alternative is to count the length and direction of the last two runs, but that's more complex and slower.
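The difference can be made concrete with a small comparison (schematic, not CPython code): a plain up/down counter versus the 16-bit history register described above, each fed the same outcomes.

```python
# A saturating-style counter and a 16-bit history register observe the
# same branch outcomes. The counter captures only the overall bias; the
# history register also captures the ordering of recent outcomes.

def run(outcomes):
    counter, history = 0, 0
    for taken in outcomes:
        counter += 1 if taken else -1
        history = ((history << 1) | int(taken)) & 0xFFFF
    return counter, history

alternating = [True, False] * 8          # strictly alternating branch
fifty_fifty = [True] * 8 + [False] * 8   # one way 8 times, then the other

c1, h1 = run(alternating)
c2, h2 = run(fifty_fifty)
assert c1 == c2 == 0        # the counter cannot tell these apart
assert h1 != h2             # the history register can
assert h1 == 0xAAAA         # ...and even exposes the alternation
```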
All micro-ops are hand written, so SAVE_IP would be handled like any other simple micro-op.
There seems to be some terminological confusion here. I've been thinking of uops as anything that the Tier 2 interpreter can interpret. Most of those (over 100 of them) are generated by the generator. There's only a handful of hand-written ones (i.e., explicitly written down in the switch in _PyUopExecute) -- the OGs (EXIT_TRACE and SAVE_IP) and now the POP_JUMP_IF_XXX crowd.
But it doesn't matter, we seem to be in agreement.
To clarify, my concern about the few hand-written uops is that IIUC @brandtbucher's copy-and-patch tooling uses as its input either bytecodes.c or one of its .c.h outputs in order to generate its templates. For the hand-written uops it will need a way to use the hand-written code as template source. Maybe we need to add them to bytecodes.c after all, marked as uops.
Off-line we discussed the FOR_ITER specializations (gh-106542). We have to rethink this, splitting them up into micro-ops, one of which is a clean jump to an exit stub that pops stuff off the stack and then exits to the successor of the END_FOR bytecode.
The first one to land is likely POP_JUMP_IF_{TRUE,FALSE} (gh-106551). After that, JUMP_BACKWARD is a quick win (gh-106543).
Note that FOR_ITER is a conditional jump. The jump target is always an END_FOR instruction which pops two items off the stack. Even the unspecialized FOR_ITER instruction normally jumps past this END_FOR; it only exists for the FOR_ITER_GEN specialization, which inlines a generator call and relies on END_FOR to clean up when the generator exits.
FOR_ITER starts with an iterator on top of the stack. In the normal case (iterator not exhausted) it pushes the next value from the iterator onto the stack and continues to the next instruction, without jumping.
If the iterator is exhausted, it logically pushes a dummy value on top of the stack and jumps to the END_FOR instruction, which will pop the dummy and the iterator off the stack. In practice, it usually (the exception being FOR_ITER_GEN) pops the iterator off the stack and jumps to the instruction immediately following the END_FOR.
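A rough Python model of these semantics (a hypothetical helper for illustration, not CPython source):

```python
# Rough model of FOR_ITER's usual behavior: with the iterator on top of
# the stack, either push the next value and fall through, or pop the
# iterator and jump past END_FOR.

def for_iter(stack):
    """Return 'next' to fall through, or 'jump' to skip past END_FOR."""
    try:
        stack.append(next(stack[-1]))
        return "next"
    except StopIteration:
        stack.pop()      # in practice the iterator is popped here...
        return "jump"    # ...and execution lands just after END_FOR

stack = [iter([10])]
assert for_iter(stack) == "next" and stack[-1] == 10
stack.pop()              # STORE_FAST in the loop body would consume this
assert for_iter(stack) == "jump" and stack == []
```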
Take FOR_ITER_LIST. It roughly does the following:
1. Deopt if the top of the stack is not a PyListIter. This should be rare (only if we repeatedly call for x in a where a is sometimes a list and sometimes not).
2. If the iterator is exhausted, pop it off the stack and jump to the instruction right after the target END_FOR opcode.
3. If the iterator is not exhausted, get the next value, updating the iterator state, and push the next value onto the stack.
FOR_ITER_TUPLE and FOR_ITER_RANGE do the same thing, with slight variations.
Turning each of these three steps into a separate micro-op would cause a 3x3 explosion, since the details of each depend on the type (list, tuple or range). Perhaps a somewhat better approach would be to have one new micro-op for each type that pushes either the next value or NULL onto the stack. This micro-op would combine a guard and an action. We would then follow this with a micro-op JUMP_IF_NULL which inspects the top of the stack (but doesn't pop) and, if it's NULL, jumps to a stub. The stub would have POP_NULL; POP_TOP; SAVE_IP; EXIT_TRACE (pointing the IP to the bytecode instruction following the END_FOR).
So we could possibly do with 5 new micro-ops:
ITER_NEXT_LIST
ITER_NEXT_TUPLE
ITER_NEXT_RANGE
JUMP_IF_NULL
POP_NULL
We can't use POP_TOP for the latter, since it doesn't handle NULL, and probably shouldn't be made to, since that would require an extra branch at the CPU level for all uses of POP_TOP.
If we wanted to split the first micro-op into separate guard and action micro-ops, we'd still be slightly ahead: 8 new micro-ops instead of 9. But there's probably not much point in separating the guard from the action; it's unlikely that any optimization pass would be able to do much guard elimination, given how FOR_ITER is typically used.
Here's a sample implementation of FOR_ITER_RANGE:
op(_ITER_NEXT_RANGE, (iter -- iter, next)) {
    _PyRangeIterObject *r = (_PyRangeIterObject *)iter;
    DEOPT_IF(Py_TYPE(r) != &PyRangeIter_Type, FOR_ITER);
    STAT_INC(FOR_ITER, hit);
    if (r->len <= 0) {
        next = NULL;
    }
    else {
        long value = r->start;
        r->start = value + r->step;
        r->len--;
        next = PyLong_FromLong(value);
        ERROR_IF(next == NULL, error);
    }
}

op(_FOR_ITER_END, (iter, next -- iter, next)) {
    if (next == NULL) {
        Py_DECREF(iter);
        STACK_SHRINK(1);
        SKIP_OVER(INLINE_CACHE_ENTRIES_FOR_ITER);
        // Jump over END_FOR instruction.
        JUMPBY(oparg + 1);
        DISPATCH();
    }
}
macro(FOR_ITER_RANGE) = unused/1 + _ITER_NEXT_RANGE + _FOR_ITER_END;
This appears to work in the Tier 1 interpreter. For Tier 2 I have to finagle a few more things, like hand-written versions of _FOR_ITER_END and POP_NULL, and a hand-written translation for FOR_ITER_RANGE.
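For reference, the state update in the _ITER_NEXT_RANGE sketch above can be mirrored in pure Python (a model for checking the logic, not actual CPython code; NULL becomes None here):

```python
# Pure-Python model of the _ITER_NEXT_RANGE state update: the iterator
# keeps (start, step, len); each step yields start, then advances start
# by step and decrements len. An exhausted iterator yields None (NULL).

class RangeIter:
    def __init__(self, r: range):
        self.start, self.step, self.len = r.start, r.step, len(r)

def iter_next_range(it: RangeIter):
    if it.len <= 0:
        return None          # stands in for pushing NULL
    value = it.start
    it.start = value + it.step
    it.len -= 1
    return value

it = RangeIter(range(2, 8, 3))
assert [iter_next_range(it) for _ in range(3)] == [2, 5, None]
```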
I don't see why 9 micro-ops is an "explosion", but 5 is fine. Regularity and clarity are important. The number of micro-ops is not.
This micro-op would combine a guard and an action
This will prevent the optimizer from removing guards. Please keep guards and actions separate.
For all instructions, we want to keep them efficient for the tier 1 interpreter, while producing optimizable micro-ops for tier 2. For most instructions we can do that.
The problem with instructions like FOR_ITER_RANGE is that they incorporate a branch within the instruction.
Loosely, the behavior of FOR_ITER_RANGE is:
DEOPT_IF(Py_TYPE(iter) != &PyRangeIter_Type);
if (!is_exhausted(iter)) {
    PUSH_RANGE_ITER_NEXT_VALUE;
}
else {
    POP_AND_CLEANUP_RANGE_ITERATOR;
    JUMPBY(oparg + 1); // The +1 is for the END_FOR
}
We want a definition of FOR_ITER_RANGE that allows the tier 1 interpreter generator to generate the code it does now, and the tier 2 superblock generator to emit either the common case (not exhausted):
    CHECK_RANGE_ITERATOR
    IS_RANGE_ITERATOR_NOT_EXHAUSTED
    JUMP_IF_FALSE exit_stub
    PUSH_RANGE_ITER_NEXT_VALUE
    ... # following instruction
or, for the uncommon case (exhausted):
    CHECK_RANGE_ITERATOR
    IS_RANGE_ITERATOR_NOT_EXHAUSTED
    JUMP_IF_TRUE exit_stub
    POP_AND_CLEANUP_RANGE_ITERATOR
    ... # instruction following the END_FOR
The question is, how to express this? We want to limit the internal CFG of instructions to a tree, so we could use the ternary operator in C.
Something like:
macro(FOR_ITER_RANGE) =
    CHECK_ITER_RANGE +
    (IS_RANGE_ITERATOR_NOT_EXHAUSTED ?
        PUSH_RANGE_ITER_NEXT_VALUE :
        (POP_AND_CLEANUP_RANGE_ITERATOR + JUMPBY(oparg+1))
    );
I'm glossing over a bunch of awkward details here, I realize. This could be quite fiddly to implement, but it should be doable.
Hmm... Implementing the ternary op in the macro syntax looks fairly costly. And it looks like this will only ever be used for 3-4 instructions (basically just the FOR_ITER specializations, plus perhaps unspecialized FOR_ITER, if we end up wanting to support e.g. iteration over numpy arrays without special-casing them).
The parser changes are straightforward enough, but we'd also need to be able to handle that in the Tier 1 code generation (combining ops efficiently is already some of the most complex code in the generator, and that will become more complicated with the conditional), as well as in the Tier 2 code generation. And then the Tier translation will need to become aware of it as well, and the Tier 2 interpreter.
Separately, it looks like you are proposing a syntax addition where in a macro you'd be able to write JUMPBY(<expr>), where <expr> is a simple expression using oparg. That would be another complication to thread through all of these. For this, I would definitely propose something simpler -- a new uop that gets passed oparg and jumps oparg+1, e.g.
op(JUMP_FORWARD_ADD1, (--)) {
    JUMPBY(oparg + 1);
}
which gets special-cased in the superblock generator so it just emits a SAVE_IP instruction with the right target. This would only be used in a stub anyway. It would be straightforward (the treatment would be nearly identical to the treatment for JUMP_FORWARD).
I'm still thinking about how best to do the ternary expression. I'll keep you posted.
So here's a possible way to deal with the FOR_ITER specializations (notably _LIST, _TUPLE, _RANGE) without adding ternary expression support to macro expansion.
My primary goals here are:
I'm willing to sacrifice some complexity in the translation -- in particular, I propose to have three hand-written cases in the translator (one for each FOR_ITER specialization). Also, I'm not concerned yet with FOR_ITER_GEN -- we have CALL to deal with before we get to that.
Taking FOR_ITER_RANGE as an example, I propose the following split:
macro(FOR_ITER_RANGE) = unused/1 +
_ITER_CHECK_RANGE + _ITER_JUMP_RANGE + _ITER_NEXT_RANGE;
where
- _ITER_CHECK_RANGE is the guard
- _ITER_JUMP_RANGE checks if the iterator is exhausted, and if so, pops the iterator off the stack and jumps oparg + 1 units (bypassing END_FOR); but in Tier 2 this will be done differently
- _ITER_NEXT_RANGE pushes the next iterator value onto the stack and increments the iterator's index

In Tier 2, the expansion will be slightly different. We reuse _ITER_CHECK_RANGE and _ITER_NEXT_RANGE, but instead of _ITER_JUMP_RANGE we emit a different (Tier 2 only) micro-op, let's say _ITER_EXIT_RANGE, which also checks for exhaustion, but instead of popping and jumping in Tier 1 bytecode address space, it just executes a Tier 2 jump to a stub containing POP_TOP; SAVE_IP; EXIT_TRACE, where the IP saved is the instruction following the END_FOR.
I'll whip up a demo.
In Tier 2, the expansion will be slightly different. We reuse _ITER_CHECK_RANGE and _ITER_NEXT_RANGE, but instead of _ITER_JUMP_RANGE we emit a different (Tier 2 only) micro-op, let's say _ITER_EXIT_RANGE, which also checks for exhaustion, but instead of popping and jumping in Tier 1 bytecode address space, it just executes a Tier 2 jump to a stub containing POP_TOP; SAVE_IP; EXIT_TRACE, where the IP saved is the instruction following the END_FOR.
Correction. Following Mark's suggestion, the Tier 2 translation (which I propose to do manually) should be
_ITER_CHECK_RANGE (iter -- iter)
_ITER_EXHAUSTED_RANGE (iter -- iter, exhausted: bool)
_POP_JUMP_IF_TRUE (exhausted: bool --)
_ITER_NEXT_RANGE (iter -- iter, next)
together with a stub doing
POP_TOP (value --)
SAVE_IP target // target following END_FOR
EXIT_TRACE
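A toy walk-through (not CPython code; names are made up) of how that four-uop sequence plus stub behaves for a range iterator: _ITER_EXHAUSTED_RANGE pushes a boolean, _POP_JUMP_IF_TRUE consumes it, and the stub pops the iterator and exits the trace.

```python
# Toy model of the Tier 2 loop sequence above. Each comment names the
# micro-op it stands in for; the type guard is elided in this sketch.

class FakeRangeIter:
    def __init__(self, r: range):
        self.start, self.step, self.len = r.start, r.step, len(r)

def run_loop_body(it):
    """Run the projected loop until the exhausted branch fires."""
    stack, seen = [it], []
    while True:
        # _ITER_CHECK_RANGE: type guard, assumed to pass here
        stack.append(it.len <= 0)       # _ITER_EXHAUSTED_RANGE
        if stack.pop():                 # _POP_JUMP_IF_TRUE -> stub
            stack.pop()                 # stub: POP_TOP (the iterator)
            return seen                 # stub: SAVE_IP + EXIT_TRACE
        value = it.start                # _ITER_NEXT_RANGE
        it.start += it.step
        it.len -= 1
        seen.append(value)              # stands in for the loop body

assert run_loop_body(FakeRangeIter(range(3))) == [0, 1, 2]
```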
PR: gh-106638
The manual Tier 2 translation code is here.
Demo function (from test_misc.py):
def testfunc(n):
    total = 0
    for i in range(n):
        total += i
    return total
Tier 1 disassembly (by dis):
2588 0 RESUME 0
2589 2 LOAD_CONST 1 (0)
4 STORE_FAST 1 (total)
2590 6 LOAD_GLOBAL 1 (NULL + range)
16 LOAD_FAST 0 (n)
18 CALL 1
26 GET_ITER
>> 28 FOR_ITER 7 (to 46)
32 STORE_FAST 2 (i)
2591 34 LOAD_FAST_LOAD_FAST 18 (total, i)
36 BINARY_OP 13 (+=)
40 STORE_FAST 1 (total)
42 JUMP_BACKWARD 9 (to 28)
2590 >> 46 END_FOR
2592 48 LOAD_FAST 1 (total)
50 RETURN_VALUE
Tier 2 disassembly (output from commented-out code in the test, annotated):
0: SAVE_IP 14
1: _ITER_CHECK_RANGE 0 -- FOR_ITER_RANGE expansion starts here
2: _ITER_EXHAUSTED_RANGE 0
3: _POP_JUMP_IF_TRUE 18
4: _ITER_NEXT_RANGE 0 -- ends here
5: SAVE_IP 16
6: STORE_FAST 2
7: SAVE_IP 17
8: LOAD_FAST 1
9: LOAD_FAST 2
10: SAVE_IP 18
11: _GUARD_BOTH_INT 13
12: _BINARY_OP_ADD_INT 13
13: SAVE_IP 20
14: STORE_FAST 1
15: SAVE_IP 21
16: JUMP_TO_TOP 0
17: EXIT_TRACE 0
18: POP_TOP 0 -- Stub starts here
19: SAVE_IP 24
20: EXIT_TRACE 0
When you say "do manually", I assume that means "special case these instructions in the code generator".
I see two problems with doing this:
It becomes harder for others to modify FOR_ITER, or add new bytecodes, without understanding the internals of the code generator. We want to keep the barriers to contribution as low as possible, so it should be possible to add new bytecodes without understanding the internals of the code generator.
If you think that it is worth taking on this technical debt to unblock other work, then I accept that, but we will need to schedule implementing a more principled approach.
Okay, let's go with this for now. I tried to come up with a way to encode translations like this in the "macro expansion" data structure and it would become very hacky. If we end up wanting to add new FOR_ITER specializations regularly we can iterate.
I will land the FOR_ITER_RANGE PR and then work on FOR_ITER_LIST/TUPLE. After that I'll return to CALL specialization.
The design for POP_JUMP_IF_FALSE etc. has changed again since gh-112045 -- there are now no longer any "branching" ops in Tier 2; everything is done through deoptimization.
However, we have another case: unspecialized FOR_ITER. According to the stats this is now by far the most frequent untranslated opcode, so we would like to translate it to Tier 2. FOR_ITER is already macro(FOR_ITER) = _SPECIALIZE_FOR_ITER + _FOR_ITER.
Alas, _FOR_ITER (the non-specializing part of the macro) is not straightforward to translate -- it does a bunch of work (calling tp_iternext) and if this returns NULL without setting an exception, it jumps over the corresponding END_FOR instruction. The best I can think of for Tier 2 is to make it deopt in that case to the END_FOR opcode (which will handle the stack pops). But this feels like a bit of a hack. @markshannon?
The best I can think of for Tier 2 is to make it deopt in that case to the END_FOR opcode (which will handle the stack pops).

That seems to work, except the deopt code must be duplicated (since we can't use target as the deopt destination), and we must deopt to the instruction past the END_FOR, since that instruction pops two stack items but we only have one (the iterator). See gh-112134.
This issue is part of the larger epic of gh-104584. In PR gh-106393 I tried to implement branching, but it was premature. Here's a better design, following @markshannon's guidance.
We have the following jump instructions (not counting the instrumented versions):
Unconditional jumps:
Branches, a.k.a. conditional jumps:
[x] POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE, ~POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE~
[x] FOR_ITER's specializations:
[ ] FOR_ITER_GEN
[ ] SEND
[ ] Add counters to POP_JUMP_IF_{FALSE,TRUE} to determine likeliness
The translation strategies could be as follows:
Unconditional jumps
- JUMP_BACKWARD
- JUMP_BACKWARD_NO_INTERRUPT
- JUMP_FORWARD (set instr to the destination of the jump).
Conditional jumps (branches)
POP_JUMP_IF_FALSE and friends
Consider the following Python code:
This translates roughly to the following Tier 1 bytecode (using B1, B2, ... to label Tier 1 instructions, and <cond>, <block> etc. to represent code blocks that evaluate or execute the corresponding Python fragments).
I propose the following translation into Tier 2 uops, assuming the branch is "unlikely":
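A sketch of the example being described, inferred from the surrounding text (the labels B1..B4 and the exact stub layout are illustrative):

```
# Python source:
#     if cond:
#         block
#     rest

# Tier 1 bytecode (schematic):
B1: <cond>
B2: POP_JUMP_IF_FALSE B4
B3: <block>
B4: <rest>

# Tier 2 uops, branch assumed "unlikely" (schematic):
    <cond>
    JUMP_IF_FALSE stub
    POP_TOP
    <block>
    <rest>
    ...
stub:
    POP_TOP
    SAVE_IP B4
    EXIT_TRACE
```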
Where JUMP_IF_FALSE inspects the top of stack but doesn't pop it, and has an argument that executes a jump in the Tier 2 uop instruction sequence.
If the branch is "likely", we do this instead:
Note how in this case, <rest> is projected as part of the trace, while <block> is not, since the likely case is that we jump over <block> to <rest>.
For the other simple conditional jumps (POP_JUMP_IF_TRUE, ~POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE~) we do the same: if the jump is unlikely, emit a JUMP_IF_XXX uop and a stub; if the jump is likely, emit the inverse JUMP_IF_NOT_XXX uop and a different stub, and continue projecting at the destination of the original jump bytecode.
I propose to have hand-written cases both in the superblock generator and in the Tier 2 interpreter for these, since the translations are too irregular to fit easily in the macro expansion data structure. The stub generation will require additional logic and data structures in translate_bytecode_to_trace() to keep track of the stubs required so far, the available space for those, and the back-patching required to set the operands for the JUMP_IF_XXX uops.
FOR_ITER and (especially) its specializations
The common case for these is not to jump. The bytecode definitions are too complex to duplicate in hand-written Tier 2 uops. My proposal is to change these in bytecodes.c so that, instead of using the JUMPBY(n) macro, they use JUMPBY_POP_DISPATCH(n), which in Tier 1 translates into just JUMPBY(n), but in Tier 2 translates into (roughly) a pop followed by a trace exit, thereby exiting the trace when the corresponding for-loop terminates.
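By analogy with the stub pattern used elsewhere in this thread (POP_TOP; SAVE_IP; EXIT_TRACE), the Tier 2 side of JUMPBY_POP_DISPATCH(n) would look roughly like the following (illustrative, not the actual definition):

```
POP_TOP          # discard the item the Tier 1 jump would have cleaned up
SAVE_IP target   # target = the instruction following the END_FOR
EXIT_TRACE
```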
I am assuming here that most loops have several iterations. I don't think it's worth special-casing the occasional for-loop that almost always immediately terminates.
SEND
Possibly we could treat this the same as FOR_ITER. But right now I propose to just punt here, and when we encounter it, stop projecting, as we do with any other unsupported bytecode instruction.