python / cpython

The Python programming language
https://www.python.org/
Other
61.16k stars 29.52k forks source link

Set up tagged pointers in the evaluation stack #117139

Open Fidget-Spinner opened 3 months ago

Fidget-Spinner commented 3 months ago

Feature or enhancement

Proposal:

This issue mostly mirrors Mark's original one at https://github.com/faster-cpython/ideas/issues/632.

As part of PEP 703, tagged pointers will help achieve deferred reference counting.

Tagged pointers will also help the default GIL builds in 3.14 and future with unboxed ints.

Here's the initial design (open for comments):

We change the evaluation stack from PyObject * to PyTaggedObject. PyTaggedObject looks like this.

typedef union PyTaggedObject {
    PyObject *obj;
    uintptr_t bits;
} PyTaggedObject;

Note: I am limiting the tagged pointers to just the evaluation stack (this is up for debate as well). The main reason is to keep the change as minimally invasive as possible for 3.13 and not leak out any implementation details for now. For example, I don't want to support tagged pointers in C API functions. This may/will change in 3.14.

Thanks to the cases generator, we can then autogenerate all the proper accessors to said object. For example, a stack effect of PyObject * will generate the code effect.obj, while a stack effect of PyTaggedObject (the new default) will generate the code (PyObject *)(effect.bits & ~TAGS) or something similar.

Unboxed ints will also help both free-threaded and non-free-threaded builds by reducing refcounting even more for simple arithmetic operations (and making integer operations native). However, I am aiming for deferred tags in 3.13, and anything else is a plus.

Further note: due to wanting to respect immediate finalizers, we are not deferring that many objects. So this may not be a win on default builds. On free-threaded builds, I see a slight (10%) speedup on fibonacci microbenchmarks. Will benchmark on the default bulid as well and see the results.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

### Tasks
- [ ] https://github.com/python/cpython/issues/118926
- [ ] https://github.com/python/cpython/issues/118930
gvanrossum commented 3 months ago

So the mentions of unboxed ints are speculative and not part of the initial implementation, right?

And the tagged pointers just have their tag bit(s) removed when used in the "blob of C code" that's the instruction definition's body. For example,

        inst(STORE_FAST, (value --)) {
            SETLOCAL(oparg, value);
        }

would expand its body to

            SETLOCAL(oparg, (PyObject *)(value.bits & ~TAGS));

How do pointers receive their tag bit? Does every stack push just "or in" the tag bit(s)?

I recall trying to do this manually (before we had the generator) and it was a lot of work -- the code generator should make this much simpler. It also was quite a bit slower -- have you thought about this?

My biggest question is what does the tag bit mean? Is there a specific section of PEP 703 I should read to learn more about this?

colesbury commented 3 months ago

My biggest question is what does the tag bit mean?

PEP 703 doesn't go into detail about the tag bits. I'd expect it to match the description in Mark's issue, except with unboxed ints not part of the initial implementation:

Tag Meaning Encoding
00 Normal reference (uintptr)ptr
01 Deferred reference ((uintptr)ptr) + 1

01 includes both immortal objects and those whose reference count has been deferred (but are not immortal).

I recall trying to do this manually... It also was quite a bit slower

We should only enable this in the free-threaded build to start. We do not want to introduce any performance regressions in the default build. If it turns out to be a net win for the default build, then we can consider enabling it there. Otherwise, we can revisit that in 3.14, if we pursue tagged integers.

How do pointers receive their tag bit? Does every stack push just "or in" the tag bit(s)?

Yeah, most pushes just "or in" the tag bit, which is an immortal check. So for, example GET_ITER currently has:

https://github.com/python/cpython/blob/40d75c2b7f5c67e254d0a025e0f2e2c7ada7f69f/Python/generated_cases.c.h#L2922

That might be generated instead as:

#define PACK(obj) ((PyTaggedObject){.bits = (uintptr_t)obj | _Py_IsImmortal(obj)})

stack_pointer[-1] = PACK(iter);

For example, ... [STORE_FAST]

I think we want to use tagged pointers in the locals as well as the evaluation stack, but it may be easier to start with just using tagged pointers the evaluation stack. In that case STORE_FAST will need extra refcounting to convert a tagged pointer into a regular object reference. For example,

PyObject *obj = (PyObject *)(value.bits & ~TAGS);
if ((value.bits & IS_DEFERRED) {
  Py_INCREF(obj);
}
SETLOCAL(oparg, obj);

If we use tagged pointers for the locals as well, then we could skip the extra work in STORE_FAST.

Motivation

The primary motivation is to avoid reference count contention in multithreaded programs on things like functions/methods (in the free-threaded build). In particular, the benefit (from avoid reference counting contention) is going to be in LOAD_ATTR, LOAD_GLOBAL, and their variants.

For example, currently LOAD_ATTR_SLOT has code that looks like:

PyObject *attr = ...;
Py_INCREF(attr);
stack_pointer[-1] = attr;

We want the generated code to instead look something like the following, but probably refactored with helper macros/functions:

PyObject *attr = ...;
PyTaggedObject attr_tag;
if (_Py_IsImmortalOrDeferred(attr)) {
    attr_tag.bits = (uintptr_t)attr | IS_DEFERRED;
}
else {
    attr_tag.obj = attr;
}
stack_pointer[-1] = attr_tag;
gvanrossum commented 3 months ago

Okay...

So from this definition

#define PACK(obj) ((PyTaggedObject){.bits = (uintptr_t)obj | _Py_IsImmortal(obj)})

I gather that the tag bit is (by default) only set if the object is immortal. From your Motivation here (and also from the section on deferred refcounting in PEP 703) I take it that the tag bit would also be set for selected objects like functions and methods.

Mark's original issue describes variants of INCREF/DECREF to check for the tag bit; presumably these variants should only be used in the interpreter.

@Fidget-Spinner Good luck!

markshannon commented 1 month ago

One thing to note:

If we are going to use tagged ints, all plain "borrow" transformations will need to be removed, as it is impossible to borrow a PyObject * from a tagged int or vice-versa. Ultimately, we will want to extend the API that takes stack references to avoid having the double conversion ref -> PyObject * -> ref when calling common API functions that take PyObject *.

In cases where it is known that the stack ref is not an int we could add StackRefNotInt_ToObjectBorrow and the reverse, so avoid the cost of two conversions.