markshannon opened 8 months ago
Most uops do not change the frame, so the form

```c
stack_pointer = _UOP_func(tstate, frame, stack_pointer, [oparg]);
```

works. If the frame is changed, then we can use the form:

```c
frame = _UOP_func(tstate, frame, stack_pointer, [oparg]);
stack_pointer += ...;
```
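To make that concrete, here is a minimal sketch of what such a helper might look like for a frame-changing uop. This is illustrative only: the helper name is invented, and the exact CPython signatures may differ:

```c
/* Hypothetical externalized body of a frame-changing uop.  The caller applies
 * the stack-pointer adjustment itself, because the adjustment is a constant
 * that the code generator knows for each uop. */
_PyInterpreterFrame *
_EXAMPLE_UOP_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
                  PyObject **stack_pointer, int oparg)
{
    /* ... the uop body: read operands from stack_pointer[-1], etc.,
     * and possibly push a new frame ... */
    return frame;  /* the (possibly new) frame the caller continues with */
}
```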
Some experimentation notes:
Just moving _INIT_CALL_PY_EXACT_ARGS gives a 1% overall speedup. Trying to move more functions than that, I'm struggling to find a further improvement. Moving all uops greater than 350 bytes also has a net-zero effect. (A threshold of 400 would be equivalent to _INIT_CALL_PY_EXACT_ARGS alone, which we've already seen improves speed by 1%.)
Looking at both code size and execution counts together may be a better metric:

[scatter plot of code size vs. execution count per uop; image not preserved]

The dot on the far right is _INIT_CALL_PY_EXACT_ARGS.
As a first rough cut, I tried taking the uops that are in the top quartile by code size and the bottom quartile by frequency, which gives a set of 10 opcodes. This also has a net-zero effect on runtime.
I have a couple of theories about why none of this is working. Perhaps the code size has to be pretty large before the overhead of the function call becomes worth it (and _INIT_CALL_PY_EXACT_ARGS is a real outlier in code size).

Also, I wonder if part of what is "hurting" with these additional functions is that they have exits: they contain DEOPT_IF and/or EXIT_IF, unlike _INIT_CALL_PY_EXACT_ARGS. Since the function cannot just jump to a label in the caller (as would normally happen in a uop), I was handling this by returning 0, 1, or 2 from the function and handling the result in the caller:
```c
switch (result) {
    case 0:
        break;                  /* normal completion: continue with the next uop */
    case 1:
        JUMP_TO_JUMP_TARGET();  /* deopt / side exit */
    case 2:
        JUMP_TO_ERROR();        /* error */
}
```
(and only including cases that were actually used in each uop). This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow, let me know.
The "simple" (non-exiting) uops are marked in yellow here:
I thought I would try only including uops that don't have exits, without worrying as much about code size. If I take non-exiting uops with counts of less than 10^8 and code sizes larger than 100, I get: ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT']. That is 1% faster, and notably is faster on every interpreter-heavy benchmark.
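For concreteness, that selection rule is just a filter over per-uop statistics. Below is a self-contained sketch in C; the statistics table is invented for illustration, not measured data:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Invented per-uop statistics: code size in bytes, execution count, and
 * whether the body contains DEOPT_IF/EXIT_IF. */
typedef struct {
    const char *name;
    int size;
    long long count;
    bool has_exit;
} UopStats;

static const UopStats uops[] = {
    {"_INIT_CALL_PY_EXACT_ARGS", 450,   50000000LL, false},
    {"_COMPARE_OP_FLOAT",        180,   80000000LL, false},
    {"_GUARD_BOTH_INT",           60,  900000000LL, true },
    {"_POP_TOP",                  30, 2000000000LL, false},
};

int main(void)
{
    /* Keep non-exiting uops with count < 10^8 and code size > 100 bytes. */
    for (size_t i = 0; i < sizeof(uops) / sizeof(uops[0]); i++) {
        const UopStats *u = &uops[i];
        if (!u->has_exit && u->count < 100000000LL && u->size > 100) {
            printf("externalize %s\n", u->name);
        }
    }
    return 0;
}
```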
@gvanrossum observed that the main thing that bulks out _INIT_CALL_PY_EXACT_ARGS is the static inlining of _PyFrame_PushUnchecked. If all of this experimentation still points to "only externalizing _INIT_CALL_PY_EXACT_ARGS helps", it probably makes sense to just not inline _PyFrame_PushUnchecked in the context of compiling the JIT templates, which could probably be done with a couple of lines of C-preprocessor hackery. That would be much simpler than modifying the generator to generate these "external" uops. UPDATE: This is not noticeably faster.
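For reference, here is one self-contained shape that preprocessor hackery could take, using an invented stand-in function rather than the real _PyFrame_PushUnchecked (the real change would live in the JIT template build):

```c
#include <stdio.h>

/* Stand-in for a static-inline helper like _PyFrame_PushUnchecked. */
static inline int push_unchecked(int x)
{
    return x + 1;
}

/* One out-of-line copy of the helper.  __attribute__((noinline)) is the
 * GCC/Clang spelling; CPython's pyport.h provides Py_NO_INLINE. */
__attribute__((noinline)) static int push_unchecked_outofline(int x)
{
    return push_unchecked(x);
}

/* From here on, every use of the name compiles to a real call. */
#define push_unchecked(x) push_unchecked_outofline(x)

int main(void)
{
    printf("%d\n", push_unchecked(41));  /* calls push_unchecked_outofline */
    return 0;
}
```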
Conclusion: It seems like externalizing these uops makes things the fastest of anything I tried. (These are non-exiting uops, not too frequently used, and bigger than double the size of the code for making the function call.) ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT'].
There may be a trick to making uops-with-exits fast enough to externalize as well (@brandtbucher, thoughts?). That could also be left as follow-on work.
> This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow, let me know.
Haven't thought about it too much, but one option could be to pass _JIT_CONTINUE and _JIT_JUMP_TARGET/_JIT_ERROR_TARGET into the function, and have it return one of them, which is then jumped to:
```c
PyAPI_DATA(void) _JIT_CONTINUE;
PyAPI_DATA(void) _JIT_EXIT_TARGET;

/* The helper picks which continuation to take and returns its address... */
jit_func next = _Py_FOO_func(tstate, frame, stack_pointer, CURRENT_OPARG(),
                             &_JIT_CONTINUE, &_JIT_EXIT_TARGET);
stack_pointer += ...;
/* ...and the caller tail-calls it, so no switch is needed at the call site. */
__attribute__((musttail))
return ((jit_func)next)(frame, stack_pointer, tstate);
```
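A hypothetical callee for that scheme might look like the following; _Py_FOO_func echoes the snippet above, and the body and exit condition are placeholders:

```c
/* Instead of returning a 0/1/2 result code for a switch in the caller, the
 * helper returns the address the caller should tail-call next. */
jit_func
_Py_FOO_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
             PyObject **stack_pointer, int oparg,
             void *continue_target, void *exit_target)
{
    if (/* deopt/exit condition for this uop */ 0) {
        return (jit_func)exit_target;      /* jump straight to the exit stencil */
    }
    /* ... the uop body ... */
    return (jit_func)continue_target;      /* fall through to the next uop */
}
```

The caller then always performs the same musttail call; only the returned address differs, so the extra branch moves into the callee rather than being a switch at every call site.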
At @brandtbucher's suggestion, I instead tried large uops that are more frequent, giving the set _INIT_CALL_PY_EXACT_ARGS, _COMPARE_OP_FLOAT (there just aren't that many large ones without exits). This, unfortunately, made no net change.
I think the next thing to tackle will be a faster way to handle exits, which should make the set of uops that we can externalize larger.
An additional motivation for doing this, apart from the impact on the JIT, is the ability to analyze the micro-ops. Large micro-ops often have complex control flow, which makes it hard to analyze the code for stack spilling and top-of-stack caching.
Many of the micro-op bodies are quite large, and are likely to bloat jitted code, harming performance.
We should move these larger bodies into helper functions in the tier2 code generator.
For example, the generated code for _INIT_CALL_PY_EXACT_ARGS looks like this:

(code listing not preserved in this copy)

By moving the bulk of this into a helper function, we can generate the much shorter:

(code listing not preserved in this copy)

with the helper function:

(code listing not preserved in this copy)
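Since those listings are missing here, the following is an illustrative reconstruction of the intended shape, loosely paraphrasing CPython's tier-2 code for this uop; the helper name follows the convention above, and details may not match any exact CPython version:

```c
/* The generated case shrinks to a call plus the statically known
 * stack-pointer adjustment. */
case _INIT_CALL_PY_EXACT_ARGS: {
    oparg = CURRENT_OPARG();
    stack_pointer = _INIT_CALL_PY_EXACT_ARGS_func(tstate, frame, stack_pointer, oparg);
    break;
}
```

with a helper along these lines:

```c
/* Pushes a new frame for an exact-args Python call and leaves it on the
 * evaluation stack for a later _PUSH_FRAME.  A paraphrase, not an exact copy. */
PyObject **
_INIT_CALL_PY_EXACT_ARGS_func(PyThreadState *tstate, _PyInterpreterFrame *frame,
                              PyObject **stack_pointer, int oparg)
{
    PyObject **args = &stack_pointer[-oparg];
    PyObject *self_or_null = stack_pointer[-1 - oparg];
    PyObject *callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    _PyInterpreterFrame *new_frame =
        _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        new_frame->localsplus[has_self + i] = args[i];
    }
    (void)frame;  /* unused here; kept for the uniform helper signature */
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    return stack_pointer - 1 - oparg;
}
```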