python / cpython

The Python programming language
https://www.python.org
Other
62.42k stars 29.97k forks source link

Move large uop bodies into functions. #117224

Open markshannon opened 5 months ago

markshannon commented 5 months ago

Many of the micro-op bodies are quite large, and are likely to bloat jitted code, harming performance.

We should move these larger bodies into helper functions in the tier2 code generator.

For example, the generator code for _INIT_CALL_PY_EXACT_ARGS looks like this:

case _INIT_CALL_PY_EXACT_ARGS: {
    PyObject **args;
    PyObject *self_or_null;
    PyObject *callable;
    _PyInterpreterFrame *new_frame;
    oparg = CURRENT_OPARG();
    args = &stack_pointer[-oparg];
    self_or_null = stack_pointer[-1 - oparg];
    callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    STAT_INC(CALL, hit);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    PyObject **first_non_self_local = new_frame->localsplus + has_self;
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        first_non_self_local[i] = args[i];
    }
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    stack_pointer += -1 - oparg;
    break;
}

By moving the bulk of this into a helper function, we can generate the much shorter:

case _INIT_CALL_PY_EXACT_ARGS: {
    stack_pointer = _INIT_CALL_PY_EXACT_ARGS_func(tstate, frame, stack_pointer, oparg);
    break;
}

with the helper function:

PyObject ** _INIT_CALL_PY_EXACT_ARGS_func(PyThreadState *tstate, _PyInterpreterFrame *frame, PyObject ** stack_pointer, int oparg) {
    PyObject **args;
    PyObject *self_or_null;
    PyObject *callable;
    _PyInterpreterFrame *new_frame;
    args = &stack_pointer[-oparg];
    self_or_null = stack_pointer[-1 - oparg];
    callable = stack_pointer[-2 - oparg];
    int has_self = (self_or_null != NULL);
    STAT_INC(CALL, hit);
    PyFunctionObject *func = (PyFunctionObject *)callable;
    new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + has_self);
    PyObject **first_non_self_local = new_frame->localsplus + has_self;
    new_frame->localsplus[0] = self_or_null;
    for (int i = 0; i < oparg; i++) {
        first_non_self_local[i] = args[i];
    }
    stack_pointer[-2 - oparg] = (PyObject *)new_frame;
    stack_pointer += -1 - oparg;
    return stack_pointer;
}

Linked PRs

markshannon commented 5 months ago

Most uops do not change the frame, so the form stack_pointer = _UOP_func(tstate, frame, stack_pointer, [oparg]); works. If the frame is changed, then we can change the form to

    frame = _UOP_func(tstate, frame, stack_pointer, [oparg]);
    stack_pointer += ...
mdboom commented 5 months ago

Some experimentation notes:

Just moving _INIT_CALL_PY_EXACT_ARGS has a 1% overall speedup. Trying to move more functions than that, I'm struggling to make an improvement.

All uops greater than 200 bytes 39: ['_LIST_EXTEND', '_CONTAINS_OP_DICT', '_CALL_METHOD_DESCRIPTOR_NOARGS', '_INIT_CALL_PY_EXACT_ARGS_0', '_INIT_CALL_PY_EXACT_ARGS_1', '_LOAD_ATTR', '_STORE_SLICE', '_BUILD_SLICE', '_BINARY_SUBSCR_LIST_INT', '_STORE_SUBSCR_LIST_INT', '_CALL_BUILTIN_FAST', '_COMPARE_OP_INT', '_BINARY_SUBSCR_DICT', '_BINARY_SUBSCR_STR_INT', '_BINARY_SUBSCR_TUPLE_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CONTAINS_OP_SET', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_COMPARE_OP', '_FOR_ITER_TIER_TWO', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_2', '_INIT_CALL_PY_EXACT_ARGS_3', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_STORE_SUBSCR', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_BUILD_CONST_KEY_MAP', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS', '_GET_ANEXT', '_STORE_NAME', '_COMPARE_OP_FLOAT']
All uops greater than 300 bytes 16: ['_CALL_METHOD_DESCRIPTOR_NOARGS', '_BUILD_SLICE', '_CALL_BUILTIN_FAST', '_BINARY_SUBSCR_STR_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_LOAD_SUPER_ATTR_METHOD', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS']

All uops greater than 350 bytes also has a net zero effect.

(A threshold of 400 would be equivalent to _INIT_CALL_PY_EXACT_ARGS alone, which we've already seen improves speed by 1%).

Looking at both code size and execution counts together may be a better metric:

image

The dot on the far right is _INIT_CALL_PY_EXACT_ARGS.

As a first rough cut, I tried taking the uops that are in the top 25%-ile in code size and the bottom 25%-ile in frequency, which gives us a set of 10 opcodes. This also has a net-zero effect on runtime.

25%-ile in both size and execution counts ![image](https://github.com/python/cpython/assets/38294/3ebae7cf-0ee1-4ec9-93f7-525912b60b14) 10: ['_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_3', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_LOAD_ATTR_MODULE', '_BUILD_CONST_KEY_MAP', '_STORE_NAME']

I have a couple of theories about why none of this is working. Perhaps the code size has to be pretty large before the overhead of the function call becomes worth it (and _INIT_PY_CALL_EXACT_ARGS is a real outlier in code size).

Also, I wonder if part of what is "hurting" with these additional functions is that they have exits (they have DEOPT_IF and/or EXIT_IF (unlike _INIT_CALL_PY_EXACT_ARGS). Since the function can not just jump to a label in the caller (as would normally happen in a uop), I was handling this by returning 0, 1 or 2 from the function and handling the result in the caller:

switch (result) {
case 0:
  break;
case 1:
  JUMP_TO_JUMP_TARGET();
case 2:
  JUMP_TO_ERROR();
}

(and only including cases that were actually used in each uop). This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow let me know.

The "simple" (non-exiting) uops are marked in yellow here:

image

I thought I would try only including uops that don't have exits, but not worrying as much about code size If I take non-exiting UOps, with counts of less than 10^8, and code sizes larger than 100, I get: ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT']. That is 1% faster, and notably is faster for every interpreter-heavy benchmark.

Perhaps a good selection of non-exiting uops ![image](https://github.com/python/cpython/assets/38294/76706c0d-00cd-4283-9f22-c7d8cc0bb1ea)

@gvanrossum observed that the main thing that bulks out _INIT_CALL_PY_EXACT_ARGS is the static inlining of _PyFrame_PushUnchecked. If all of this experimentation still points to "only externalizing _INIT_CALL_PY_EXACT_ARGS helps", it probably makes sense to just not inline _PyFrame_PushUnchecked in the context of compiling the JIT templates, which could probably be done with a couple lines of C-preprocessor hackery. That would be much simpler than modifying the generator to generate these "external" uops. UPDATE: This is not noticeably faster.

mdboom commented 5 months ago

Conclusion: It seems like externalizing these uops make things the fastest of anything I tried. (This is non-exiting uops, not too frequently used, bigger than double the size of the code for making the function call). ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT'].

There may be a trick to making uops-with-exits fast enought to externalize as well (@brandtbucher, thoughts?). That could also be left as follow-on work.

brandtbucher commented 5 months ago

This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow let me know.

Haven't thought about it too much, but one option could be to pass _JIT_CONTINUE and _JIT_JUMP_TARGET/_JIT_ERROR_TARGET into the function, and it could return one, which is jumped to:

PyAPI_DATA(void) _JIT_CONTINUE;
PyAPI_DATA(void) _JIT_EXIT_TARGET;
jit_func next = _Py_FOO_func(tstate, frame, stack_pointer, CURRENT_OPARG(),
                             &_JIT_CONTINUE, &_JIT_EXIT_TARGET);
stack_pointer += ...;
__attribute__((musttail))
return ((jit_func)next)(frame, stack_pointer, tstate);
mdboom commented 5 months ago

At @brandtbucher's suggestion, I instead tried large uops that are more frequent, giving the set _INIT_CALL_PY_EXACT_ARGS, _COMPARE_OP_FLOAT (there's just not that many large ones without exits). This, unfortunately, made no net change.

I think the next thing to tackle will be a faster way to handle exits which should make the set of uops that we can externalize larger.

markshannon commented 1 month ago

An additional motivation for doing this, apart from the impact on the JIT, is the ability to analyze the micro-ops. Large micro-ops often have complex control flow, which makes it hard to analyze the code for stack spilling and top-of-stack caching.