gvanrossum opened this issue 1 year ago
Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice.
UPDATE: See gh-109039.
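The scheme described above can be sketched in a few lines (a toy model, not CPython's actual implementation; the 16-bit register starts at the alternating pattern 0x5555):

```python
# Sketch of the 16-bit branch-history idea (toy model, not CPython code):
# shift in 1 for taken, 0 for not taken, then count bits to estimate bias.

INIT_PATTERN = 0x5555  # alternating ones and zeros: maximum uncertainty

def record(history: int, taken: bool) -> int:
    """Shift the branch outcome into a 16-bit history register."""
    return ((history << 1) | int(taken)) & 0xFFFF

def is_likely_taken(history: int) -> bool:
    """The jump is 'likely' if more than half of the 16 recorded bits are ones."""
    return bin(history).count("1") > 8

history = INIT_PATTERN
for _ in range(16):          # observe 16 taken branches in a row
    history = record(history, True)
assert is_likely_taken(history)        # all ones -> clearly likely

for _ in range(16):          # now 16 not-taken branches
    history = record(history, False)
assert not is_likely_taken(history)    # all zeros -> clearly unlikely
```

Note that the initial 0x5555 pattern has exactly 8 one bits, the "tough choice" half-half case.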
Q. In Tier 2, should we have POP_JUMP_IF_XXX, which (like its Tier 1 counterpart) combines a pop and a branch, or should we emit separate POP_TOP instructions?
Emitting separate POP_TOP uops seems perhaps simpler, but it's less performant to interpret, since the explicit POP_TOP will call Py_DECREF() on the stack top, whereas POP_JUMP_IF_TRUE/FALSE won't need to do that at all (True and False are immortal), and POP_JUMP_IF_[NOT_]NONE can skip it in the None case.
I don't have much intuition yet about which will be easier for the optimizer to handle. It's easy to change.
This also reminds me of a question for @brandtbucher: assuming that at some point the copy-and-patch machinery will use a Tier 2 uop sequence as input, how would you want to handle hand-written uops like JUMP_IF_FALSE or SAVE_IP? (If the answer is long or complicated, we should take this to the faster-cpython/ideas tracker.)
Similarly, in his original guidance @markshannon suggests that POP_JUMP_IF_NONE be translated into LOAD_CONST None; IS_OP; POP_JUMP_IF_TRUE (and similarly for POP_JUMP_IF_NOT_NONE). I think that if we decide to do that, it should be done in compile.c when generating Tier 1 bytecode, since in the optimizer it would be hard to ensure there is a co_consts entry for None. In the short term it might be simpler to just add additional uops JUMP_IF_[NOT_]NONE with hand-written implementations and translations. If we're doing it in compile.c I'd like to enroll @iritkatriel's help. But it definitely feels like a pessimization to replace one bytecode instruction with three (and the same for uops in the Tier 2 interpreter).
UPDATE: Never mind, Mark did this in GH-106599 using a macro in bytecodes.c.
POP_JUMP_IF_[TRUE|FALSE] seems not only faster than JUMP_IF_[TRUE|FALSE], but more consistent with other operations. The VM is a stack machine; it is normal for an operation to consume its argument(s).
markshannon suggests that POP_JUMP_IF_NONE be translated into LOAD_CONST None; IS_OP; POP_JUMP_IF_TRUE

The point of my original suggestion was that using additional micro-ops can help us to reduce the number of branch micro-ops. As you point out, None might not be present in co_consts, so we would want to translate POP_JUMP_IF_NONE as IS_NONE; POP_JUMP_IF_TRUE.
We can add as many micro-ops as we want. We just want to minimize the number of micro-ops that need special treatment, like branches, jumps and calls.
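As a toy illustration (not real CPython code; the helper names are made up), the split preserves behavior: the fused micro-op and the two-op sequence make the same branch decision and leave the same stack behind:

```python
# Toy stack-machine model showing POP_JUMP_IF_NONE == IS_NONE; POP_JUMP_IF_TRUE.
# Helper names are hypothetical; stacks are plain Python lists.

def pop_jump_if_none(stack):
    """Fused micro-op: pop TOS, report whether to take the branch."""
    return stack.pop() is None

def is_none(stack):
    """Replace TOS with the boolean 'TOS is None'."""
    stack.append(stack.pop() is None)

def pop_jump_if_true(stack):
    """Pop TOS, report whether to take the branch."""
    return stack.pop() is True

# Both sequences agree for None and non-None values alike.
for value in (None, 0, "x"):
    s1, s2 = [value], [value]
    taken1 = pop_jump_if_none(s1)
    is_none(s2)
    taken2 = pop_jump_if_true(s2)
    assert taken1 == taken2 and s1 == s2
```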
how would you want to handle hand-written uops like... SAVE_IP...
All micro-ops are hand written, so SAVE_IP
would be handled like any other simple micro-op.
Separate from all this is the design for the counters to be used to determine whether a particular POP_JUMP_IF_XXX is more likely to jump or not. @markshannon's suggestion is to add a 16 bit cache entry to those bytecode instructions, initialized with a pattern of alternating ones and zeros. Each time we execute the instruction we shift the value left by 1, shifting in a 1 for a jump taken or a 0 for a jump not taken. When the question is asked, "is this jump likely", we simply count the bits, and if it's mostly ones we assume it is, if it's mostly zeros we assume it isn't. If it's half-half, well, we have a tough choice.
How is this different from a counter, where you add for a jump taken and subtract for a jump not taken? I understand it records some recency information, but if that isn't used, isn't it better to get a more accurate number (with a memory larger than 16 entries)?
A counter doesn't tell you anything about the recent history. A branch that goes one way 50 times and then the other way 50 times is different from one that alternates. I suspect it won't matter much in practice, though, as long as we have some hint as to the bias of the branch.
An alternative is to count the length and direction of the last two runs, but that's more complex and slower.
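The difference can be made concrete with a small comparison (schematic, not CPython code): a plain up/down counter versus the 16-bit history register described above, each fed the same outcomes.

```python
# A saturating-style counter and a 16-bit history register observe the
# same branch outcomes. The counter captures only the overall bias; the
# history register also captures the ordering of recent outcomes.

def run(outcomes):
    counter, history = 0, 0
    for taken in outcomes:
        counter += 1 if taken else -1
        history = ((history << 1) | int(taken)) & 0xFFFF
    return counter, history

alternating = [True, False] * 8          # strictly alternating branch
fifty_fifty = [True] * 8 + [False] * 8   # one way 8 times, then the other

c1, h1 = run(alternating)
c2, h2 = run(fifty_fifty)
assert c1 == c2 == 0        # the counter cannot tell these apart
assert h1 != h2             # the history register can
assert h1 == 0xAAAA         # ...and even exposes the alternation
```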
All micro-ops are hand written, so SAVE_IP would be handled like any other simple micro-op.
There seems to be some terminological confusion here. I've been thinking of uops as anything that the Tier 2 interpreter can interpret. Most of those (over 100 of them) are generated by the generator. There's only a handful of hand-written ones (i.e., explicitly written down in the switch in _PyUopExecute) -- the OGs (EXIT_TRACE and SAVE_IP) and now the POP_JUMP_IF_XXX crowd.
But it doesn't matter, we seem to be in agreement.
To clarify, my concern about the few hand-written uops is that IIUC @brandtbucher's copy-and-patch tooling uses as its input either bytecodes.c or one of its .c.h outputs in order to generate its templates. For the hand-written uops it will need a way to use the hand-written code as template source. Maybe we need to add them to bytecodes.c after all, marked as uops.
Off-line we discussed the FOR_ITER specializations (gh-106542). We have to rethink this, splitting them up into micro-ops, one of which is a clean jump to an exit stub that pops stuff off the stack and then exits to the successor of the END_FOR bytecode.
The first one to land is likely POP_JUMP_IF_{TRUE,FALSE} (gh-106551). After that, JUMP_BACKWARD is a quick win (gh-106543).
Note that FOR_ITER is a conditional jump. The jump target is always an END_FOR instruction which pops two items off the stack. Even the unspecialized FOR_ITER instruction normally jumps past this END_FOR; it only exists for the FOR_ITER_GEN specialization, which inlines a generator call and relies on END_FOR to clean up when the generator exits.
FOR_ITER starts with an iterator on top of the stack. In the normal case (iterator not exhausted) it pushes the next value from the iterator onto the stack and continues to the next instruction, without jumping.
If the iterator is exhausted, it logically pushes a dummy value on top of the stack and jumps to the END_FOR instruction, which will pop the dummy and the iterator off the stack. In practice, it usually (the exception being FOR_ITER_GEN) pops the iterator off the stack and jumps to the instruction immediately following the END_FOR.
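A rough Python model of these semantics (a hypothetical helper for illustration, not CPython source):

```python
# Rough model of FOR_ITER's usual behavior: with the iterator on top of
# the stack, either push the next value and fall through, or pop the
# iterator and jump past END_FOR.

def for_iter(stack):
    """Return 'next' to fall through, or 'jump' to skip past END_FOR."""
    try:
        stack.append(next(stack[-1]))
        return "next"
    except StopIteration:
        stack.pop()      # in practice the iterator is popped here...
        return "jump"    # ...and execution lands just after END_FOR

stack = [iter([10])]
assert for_iter(stack) == "next" and stack[-1] == 10
stack.pop()              # STORE_FAST in the loop body would consume this
assert for_iter(stack) == "jump" and stack == []
```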
Take FOR_ITER_LIST. It roughly does the following:
1. Deopt if the top of the stack is not a PyListIter. This should be rare (only if we repeatedly call for x in a where a is sometimes a list and sometimes not).
2. If the iterator is exhausted, pop it off the stack and jump to the instruction right after the target END_FOR opcode.
3. If the iterator is not exhausted, get the next value, updating the iterator state, and push the next value onto the stack.
FOR_ITER_TUPLE and FOR_ITER_RANGE do the same thing, with slight variations.
Turning each of these three steps into a separate micro-op would cause a 3x3 explosion, since the details of each depend on the type (list, tuple or range). Perhaps a somewhat better approach would be to have one new micro-op for each type that pushes either the next value or NULL onto the stack. This micro-op would combine a guard and an action. We would then follow this with a micro-op JUMP_IF_NULL which inspects the top of the stack (but doesn't pop) and, if it's NULL, jumps to a stub. The stub would have POP_NULL; POP_TOP; SAVE_IP; EXIT_TRACE (pointing the IP to the bytecode instruction following the END_FOR).
So we could possibly do with 5 new micro-ops:
ITER_NEXT_LIST
ITER_NEXT_TUPLE
ITER_NEXT_RANGE
JUMP_IF_NULL
POP_NULL
We can't use POP_TOP for the latter, since it doesn't handle NULL, and probably shouldn't be made to, since that would require an extra branch at the CPU level for all uses of POP_TOP.
If we wanted to split the first micro-op into separate guard and action micro-ops, we'd still be slightly ahead: 8 new micro-ops instead of 9. But there's probably not much point in separating the guard from the action; it's unlikely that any optimization pass would be able to do much guard elimination, given how FOR_ITER is typically used.
Here's a sample implementation of FOR_ITER_RANGE:
op(_ITER_NEXT_RANGE, (iter -- iter, next)) {
    _PyRangeIterObject *r = (_PyRangeIterObject *)iter;
    DEOPT_IF(Py_TYPE(r) != &PyRangeIter_Type, FOR_ITER);
    STAT_INC(FOR_ITER, hit);
    if (r->len <= 0) {
        next = NULL;
    }
    else {
        long value = r->start;
        r->start = value + r->step;
        r->len--;
        next = PyLong_FromLong(value);
        ERROR_IF(next == NULL, error);
    }
}

op(_FOR_ITER_END, (iter, next -- iter, next)) {
    if (next == NULL) {
        Py_DECREF(iter);
        STACK_SHRINK(1);
        SKIP_OVER(INLINE_CACHE_ENTRIES_FOR_ITER);
        // Jump over END_FOR instruction.
        JUMPBY(oparg + 1);
        DISPATCH();
    }
}
macro(FOR_ITER_RANGE) = unused/1 + _ITER_NEXT_RANGE + _FOR_ITER_END;
This appears to work in the Tier 1 interpreter. For Tier 2 I have to finagle a few more things, like hand-written versions of _FOR_ITER_END and POP_NULL, and a hand-written translation for FOR_ITER_RANGE.
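For reference, the state update in the _ITER_NEXT_RANGE sketch above can be mirrored in pure Python (a model for checking the logic, not actual CPython code; NULL becomes None here):

```python
# Pure-Python model of the _ITER_NEXT_RANGE state update: the iterator
# keeps (start, step, len); each step yields start, then advances start
# by step and decrements len. An exhausted iterator yields None (NULL).

class RangeIter:
    def __init__(self, r: range):
        self.start, self.step, self.len = r.start, r.step, len(r)

def iter_next_range(it: RangeIter):
    if it.len <= 0:
        return None          # stands in for pushing NULL
    value = it.start
    it.start = value + it.step
    it.len -= 1
    return value

it = RangeIter(range(2, 8, 3))
assert [iter_next_range(it) for _ in range(3)] == [2, 5, None]
```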
I don't see why 9 micro-ops is an "explosion", but 5 is fine. Regularity and clarity are important. The number of micro-ops is not.
This micro-op would combine a guard and an action
This will prevent the optimizer from removing guards. Please keep guards and actions separate.
For all instructions, we want to keep them efficient for the tier 1 interpreter, while producing optimizable micro-ops for tier 2. For most instructions we can do that.
The problem with instructions like FOR_ITER_RANGE is that they incorporate a branch within the instruction.
Loosely, the behavior of FOR_ITER_RANGE is:
DEOPT_IF(Py_TYPE(iter) != &PyRangeIter_Type);
if (!is_exhausted(iter)) {
    PUSH_RANGE_ITER_NEXT_VALUE;
}
else {
    POP_AND_CLEANUP_RANGE_ITERATOR;
    JUMPBY(oparg + 1); // The +1 is for the END_FOR
}
We want a definition of FOR_ITER_RANGE that allows the tier 1 interpreter generator to generate the code it does now, and the tier 2 superblock generator to emit either the common case (not exhausted):
    CHECK_RANGE_ITERATOR
    IS_RANGE_ITERATOR_NOT_EXHAUSTED
    JUMP_IF_FALSE exit_stub
    PUSH_RANGE_ITER_NEXT_VALUE
    ... # following instruction
or, for the uncommon case (exhausted):
    CHECK_RANGE_ITERATOR
    IS_RANGE_ITERATOR_NOT_EXHAUSTED
    JUMP_IF_TRUE exit_stub
    POP_AND_CLEANUP_RANGE_ITERATOR
    ... # instruction following the END_FOR
The question is, how to express this? We want to limit the internal CFG of instructions to a tree, so we could use the ternary operator in C.
Something like:
macro(FOR_ITER_RANGE) =
    CHECK_ITER_RANGE +
    (IS_RANGE_ITERATOR_NOT_EXHAUSTED ?
        PUSH_RANGE_ITER_NEXT_VALUE :
        (POP_AND_CLEANUP_RANGE_ITERATOR + JUMPBY(oparg+1))
    );
I'm glossing over a bunch of awkward details here, I realize. This could be quite fiddly to implement, but it should be doable.
Hmm... Implementing the ternary op in the macro syntax looks fairly costly. And it looks like this will only ever be used for 3-4 instructions (basically just the FOR_ITER specializations, plus perhaps unspecialized FOR_ITER, if we end up wanting to support e.g. iteration over numpy arrays without special-casing them).
The parser changes are straightforward enough, but we'd also need to be able to handle that in the Tier 1 code generation (combining ops efficiently is already some of the most complex code in the generator, and that will become more complicated with the conditional), as well as in the Tier 2 code generation. And then the Tier translation will need to become aware of it as well, and the Tier 2 interpreter.
Separately, it looks like you are proposing a syntax addition where in a macro you'd be able to write JUMPBY(<expr>), where <expr> is a simple expression using oparg. That would be another complication to thread through all of these. For this, I would definitely propose something simpler -- a new uop that gets passed oparg and jumps oparg+1, e.g.
op(JUMP_FORWARD_ADD1, (--)) {
    JUMPBY(oparg + 1);
}
which gets special-cased in the superblock generator so it just emits a SAVE_IP instruction with the right target. This would only be used in a stub anyway. It would be straightforward (the treatment would be nearly identical to the treatment for JUMP_FORWARD).
I'm still thinking about how best to do the ternary expression. I'll keep you posted.
So here's a possible way to deal with the FOR_ITER specializations (notably _LIST, _TUPLE, _RANGE) without adding ternary expression support to macro expansion.
My primary goals here are:
I'm willing to sacrifice some complexity in the translation -- in particular, I propose to have three hand-written cases in the translator (one for each FOR_ITER specialization). Also, I'm not concerned yet with FOR_ITER_GEN -- we have CALL to deal with before we get to that.
Taking FOR_ITER_RANGE as an example, I propose the following split:
macro(FOR_ITER_RANGE) = unused/1 +
_ITER_CHECK_RANGE + _ITER_JUMP_RANGE + _ITER_NEXT_RANGE;
where
- _ITER_CHECK_RANGE is the guard
- _ITER_JUMP_RANGE checks if the iterator is exhausted, and if so, pops the iterator off the stack and jumps oparg + 1 units (bypassing END_FOR); but in Tier 2 this will be done differently
- _ITER_NEXT_RANGE pushes the next iterator value onto the stack and increments the iterator's index

In Tier 2, the expansion will be slightly different. We reuse _ITER_CHECK_RANGE and _ITER_NEXT_RANGE, but instead of _ITER_JUMP_RANGE we emit a different (Tier 2 only) micro-op, let's say _ITER_EXIT_RANGE, which also checks for exhaustion, but instead of popping and jumping in Tier 1 bytecode address space, it just executes a Tier 2 jump to a stub containing POP_TOP; SAVE_IP; EXIT_TRACE, where the IP saved is the instruction following the END_FOR.
I'll whip up a demo.
In Tier 2, the expansion will be slightly different. We reuse _ITER_CHECK_RANGE and _ITER_NEXT_RANGE, but instead of _ITER_JUMP_RANGE we emit a different (Tier 2 only) micro-op, let's say _ITER_EXIT_RANGE, which also checks for exhaustion, but instead of popping and jumping in Tier 1 bytecode address space, it just executes a Tier 2 jump to a stub containing POP_TOP; SAVE_IP; EXIT_TRACE, where the IP saved is the instruction following the END_FOR.
Correction. Following Mark's suggestion, the Tier 2 translation (which I propose to do manually) should be
_ITER_CHECK_RANGE (iter -- iter)
_ITER_EXHAUSTED_RANGE (iter -- iter, exhausted: bool)
_POP_JUMP_IF_TRUE (exhausted: bool --)
_ITER_NEXT_RANGE (iter -- iter, next)
together with a stub doing
POP_TOP (value --)
SAVE_IP target // target following END_FOR
EXIT_TRACE
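A toy walk-through (not CPython code; names are made up) of how that four-uop sequence plus stub behaves for a range iterator: _ITER_EXHAUSTED_RANGE pushes a boolean, _POP_JUMP_IF_TRUE consumes it, and the stub pops the iterator and exits the trace.

```python
# Toy model of the Tier 2 loop sequence above. Each comment names the
# micro-op it stands in for; the type guard is elided in this sketch.

class FakeRangeIter:
    def __init__(self, r: range):
        self.start, self.step, self.len = r.start, r.step, len(r)

def run_loop_body(it):
    """Run the projected loop until the exhausted branch fires."""
    stack, seen = [it], []
    while True:
        # _ITER_CHECK_RANGE: type guard, assumed to pass here
        stack.append(it.len <= 0)       # _ITER_EXHAUSTED_RANGE
        if stack.pop():                 # _POP_JUMP_IF_TRUE -> stub
            stack.pop()                 # stub: POP_TOP (the iterator)
            return seen                 # stub: SAVE_IP + EXIT_TRACE
        value = it.start                # _ITER_NEXT_RANGE
        it.start += it.step
        it.len -= 1
        seen.append(value)              # stands in for the loop body

assert run_loop_body(FakeRangeIter(range(3))) == [0, 1, 2]
```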
PR: gh-106638
The manual Tier 2 translation code is here.
Demo function (from test_misc.py):
def testfunc(n):
    total = 0
    for i in range(n):
        total += i
    return total
Tier 1 disassembly (by dis):
2588 0 RESUME 0
2589 2 LOAD_CONST 1 (0)
4 STORE_FAST 1 (total)
2590 6 LOAD_GLOBAL 1 (NULL + range)
16 LOAD_FAST 0 (n)
18 CALL 1
26 GET_ITER
>> 28 FOR_ITER 7 (to 46)
32 STORE_FAST 2 (i)
2591 34 LOAD_FAST_LOAD_FAST 18 (total, i)
36 BINARY_OP 13 (+=)
40 STORE_FAST 1 (total)
42 JUMP_BACKWARD 9 (to 28)
2590 >> 46 END_FOR
2592 48 LOAD_FAST 1 (total)
50 RETURN_VALUE
Tier 2 disassembly (output from commented-out code in the test, annotated):
0: SAVE_IP 14
1: _ITER_CHECK_RANGE 0 -- FOR_ITER_RANGE expansion starts here
2: _ITER_EXHAUSTED_RANGE 0
3: _POP_JUMP_IF_TRUE 18
4: _ITER_NEXT_RANGE 0 -- ends here
5: SAVE_IP 16
6: STORE_FAST 2
7: SAVE_IP 17
8: LOAD_FAST 1
9: LOAD_FAST 2
10: SAVE_IP 18
11: _GUARD_BOTH_INT 13
12: _BINARY_OP_ADD_INT 13
13: SAVE_IP 20
14: STORE_FAST 1
15: SAVE_IP 21
16: JUMP_TO_TOP 0
17: EXIT_TRACE 0
18: POP_TOP 0 -- Stub starts here
19: SAVE_IP 24
20: EXIT_TRACE 0
When you say "do manually", I assume that means "special case these instructions in the code generator".
I see two problems with doing this:
It becomes harder for others to modify FOR_ITER, or add new bytecodes, without understanding the internals of the code generator. We want to keep the barriers to contribution as low as possible, so it should be possible to add new bytecodes without understanding the internals of the code generator.
If you think that it is worth taking on this technical debt to unblock other work, then I accept that, but we will need to schedule implementing a more principled approach.
Okay, let's go with this for now. I tried to come up with a way to encode translations like this in the "macro expansion" data structure and it would become very hacky. If we end up wanting to add new FOR_ITER specializations regularly we can iterate.
I will land the FOR_ITER_RANGE PR and then work on FOR_ITER_LIST/TUPLE. After that I'll return to CALL specialization.
The design for POP_JUMP_IF_FALSE etc. has changed again since gh-112045 -- there are now no longer any "branching" ops in Tier 2; everything is done through deoptimization.
However, we have another case: unspecialized FOR_ITER. According to the stats this is now by far the most frequent untranslated opcode, so we would like to translate it to Tier 2. FOR_ITER is already macro(FOR_ITER) = _SPECIALIZE_FOR_ITER + _FOR_ITER.
Alas, _FOR_ITER (the non-specializing part of the macro) is not straightforward to translate -- it does a bunch of work (calling tp_iternext) and if this returns NULL without setting an exception, it jumps over the corresponding END_FOR instruction. The best I can think of for Tier 2 is to make it deopt in that case to the END_FOR opcode (which will handle the stack pops). But this feels like a bit of a hack. @markshannon?
The best I can think of for Tier 2 is to make it deopt in that case to the END_FOR opcode (which will handle the stack pops).

That seems to work, except the deopt code must be duplicated (since we can't use target as the deopt destination), and we must deopt to the instruction past the END_FOR, since that instruction pops two stack items but we only have one (the iterator). See gh-112134.
This issue is part of the larger epic of gh-104584. In PR gh-106393 I tried to implement branching, but it was premature. Here's a better design, following @markshannon's guidance.
We have the following jump instructions (not counting the instrumented versions):
Unconditional jumps:
Branches, a.k.a. conditional jumps:
[x] POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE, ~POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE~
[x] FOR_ITER's specializations:
[ ] FOR_ITER_GEN
[ ] SEND
[ ] Add counters to POP_JUMP_IF_{FALSE,TRUE} to determine likeliness
The translation strategies could be as follows:
Unconditional jumps
- JUMP_BACKWARD
- JUMP_BACKWARD_NO_INTERRUPT
- JUMP_FORWARD (set instr to the destination of the jump).
Conditional jumps (branches)
POP_JUMP_IF_FALSE and friends
Consider the following Python code:
This translates roughly to the following Tier 1 bytecode (using B1, B2, ... to label Tier 1 instructions, and <cond>, <block> etc. to represent code blocks that evaluate or execute the corresponding Python fragments).
I propose the following translation into Tier 2 uops, assuming the branch is "unlikely":
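A sketch of the example being described, inferred from the surrounding text (the labels B1..B4 and the exact stub layout are illustrative):

```
# Python source:
#     if cond:
#         block
#     rest

# Tier 1 bytecode (schematic):
B1: <cond>
B2: POP_JUMP_IF_FALSE B4
B3: <block>
B4: <rest>

# Tier 2 uops, branch assumed "unlikely" (schematic):
    <cond>
    JUMP_IF_FALSE stub
    POP_TOP
    <block>
    <rest>
    ...
stub:
    POP_TOP
    SAVE_IP B4
    EXIT_TRACE
```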
Where JUMP_IF_FALSE inspects the top of stack but doesn't pop it, and has an argument that executes a jump in the Tier 2 uop instruction sequence.
If the branch is "likely", we do this instead:
Note how in this case, <rest> is projected as part of the trace, while <block> is not, since the likely case is that we jump over <block> to <rest>.
For the other simple conditional jumps (POP_JUMP_IF_TRUE, ~POP_JUMP_IF_NONE, POP_JUMP_IF_NOT_NONE~) we do the same: if the jump is unlikely, emit a JUMP_IF_XXX uop and a stub; if the jump is likely, emit the inverse JUMP_IF_NOT_XXX uop and a different stub, and continue projecting at the destination of the original jump bytecode.
I propose to have hand-written cases both in the superblock generator and in the Tier 2 interpreter for these, since the translations are too irregular to fit easily in the macro expansion data structure. The stub generation will require additional logic and data structures in translate_bytecode_to_trace() to keep track of the stubs required so far, the available space for those, and the back-patching required to set the operands for the JUMP_IF_XXX uops.
FOR_ITER and (especially) its specializations
The common case for these is not to jump. The bytecode definitions are too complex to duplicate in hand-written Tier 2 uops. My proposal is to change these in bytecodes.c so that, instead of using the JUMPBY(n) macro, they use JUMPBY_POP_DISPATCH(n), which in Tier 1 translates into just JUMPBY(n), but in Tier 2 translates into (roughly) a pop followed by a trace exit, thereby exiting the trace when the corresponding for-loop terminates.
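By analogy with the stub pattern used elsewhere in this thread (POP_TOP; SAVE_IP; EXIT_TRACE), the Tier 2 side of JUMPBY_POP_DISPATCH(n) would look roughly like the following (illustrative, not the actual definition):

```
POP_TOP          # discard the item the Tier 1 jump would have cleaned up
SAVE_IP target   # target = the instruction following the END_FOR
EXIT_TRACE
```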
I am assuming here that most loops have several iterations. I don't think it's worth special-casing the occasional for-loop that almost always immediately terminates.
SEND
Possibly we could treat this the same as FOR_ITER. But right now I propose to just punt here, and when we encounter it, stop projecting, as we do with any other unsupported bytecode instruction.