pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/

Compatibility with free-threaded build (PEP 703) #5112

Closed rostan-t closed 1 month ago

rostan-t commented 5 months ago

Required prerequisites

What version (or hash if on master) of pybind11 are you using?

2.12.0

Problem description

PEP 703 proposes making the GIL optional. The steering council accepted it, and its integration is ongoing within CPython. Is there a plan to make pybind11 compatible with free-threaded builds?

Some changes would probably involve using new references instead of borrowed references. Apart from this, could the global state stored in internals and local_internals pose an issue if accessed from different threads simultaneously, without the GIL to protect it?

Reproducible example code

No response

Is this a regression? Put the last known working version here if it is.

Not a regression

colesbury commented 4 months ago

I would love to get pybind11 working with the free-threaded build now that 3.13 beta 1 is out. I have some older patches that I can update for the latest version of pybind11.

The basic strategy is to lock around accesses to internals in the free-threaded build (but not in the default build). I found it helpful to use the following pattern, but we could use RAII instead:

template <typename F>
inline auto with_internals(const F &cb) -> decltype(cb(get_internals())) {
    auto &internals = get_internals();
    PYBIND11_LOCK_INTERNALS(internals);  // no-op in default build
    return cb(internals);
}

...

    with_internals([&](internals &internals) {
        // access or modify internals
    });

It's also helpful to have special handling for registered_instances because it can be a bottleneck in multi-threaded programs. Sharding the table and using a mutex per shard help scalability a lot.
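A minimal sketch of that sharding idea (type and member names are illustrative, not pybind11's actual implementation): each shard pairs a mutex with its own map, and the shard is chosen by hashing the C++ instance pointer, so threads touching different instances usually lock different mutexes.

```cpp
#include <array>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Illustrative sharded instance table: the key is the C++ instance
// pointer; Value stands in for the Python wrapper object.
template <typename Value, std::size_t NumShards = 8>
class sharded_instance_map {
    struct shard {
        std::mutex mutex;
        std::unordered_map<const void *, Value> map;
    };
    std::array<shard, NumShards> shards_;

    shard &shard_for(const void *key) {
        // Shift out the low bits so that nearby allocations
        // spread across shards.
        auto h = reinterpret_cast<std::uintptr_t>(key);
        return shards_[(h >> 4) % NumShards];
    }

public:
    void insert(const void *key, Value value) {
        shard &s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mutex);  // lock only this shard
        s.map.emplace(key, value);
    }

    bool find(const void *key, Value &out) {
        shard &s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mutex);
        auto it = s.map.find(key);
        if (it == s.map.end())
            return false;
        out = it->second;
        return true;
    }
};
```

The scalability win comes from threads no longer contending on a single mutex (and its cache line) for every instance lookup.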

wjakob commented 4 months ago

I'm interested in this discussion also in the context of nanobind. The idea of threads fighting over the cache line holding the internals mutex is a bit concerning though. A lot of those accesses are read-only, e.g. to fetch a type object.

Is std::mutex generally the right kind of lock to use? I think this will introduce atomic memory operations all over the place that are quite expensive even if they are not contended. I assume that's a question also for the CPython internals, where the equivalent would be pthread_mutex_lock—is this what you use to protect, say, a dictionary being concurrently mutated?

One option could be to "freeze" certain data structures after a module is loaded, after which they can only be accessed in read-only mode.
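One way to sketch that freeze idea (illustrative, not an existing pybind11 API): guard writes with a mutex during startup, then flip an atomic flag with release ordering so that readers who observe the frozen state also see every prior insert, and can skip locking entirely afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Illustrative "freeze after module load" registry: mutable (under a
// lock) during startup, lock-free and read-only afterwards.
class frozen_registry {
    std::unordered_map<std::string, int> map_;
    std::mutex write_mutex_;
    std::atomic<bool> frozen_{false};

public:
    void insert(const std::string &key, int value) {
        std::lock_guard<std::mutex> guard(write_mutex_);
        if (frozen_.load(std::memory_order_relaxed))
            throw std::logic_error("registry is frozen");
        map_.emplace(key, value);
    }

    void freeze() {
        std::lock_guard<std::mutex> guard(write_mutex_);
        // Release: readers that observe frozen_ == true also see
        // every insert made before the freeze.
        frozen_.store(true, std::memory_order_release);
    }

    // Only valid after freeze(); no lock needed because the map can
    // no longer change.
    const int *find(const std::string &key) const {
        assert(frozen_.load(std::memory_order_acquire));
        auto it = map_.find(key);
        return it == map_.end() ? nullptr : &it->second;
    }
};
```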

colesbury commented 4 months ago

Is std::mutex generally the right kind of lock to use?

I'd be inclined to start with std::mutex. In CPython, we use a custom lock PyMutex. PyMutex is a little bit faster than pthread_mutex_t because the fast-path can be inlined, but it's not a huge difference. It's also smaller (1 byte vs. 40 bytes for pthread_mutex_t typically), which matters for CPython because there's a mutex embedded in PyObject (in the free-threaded build), but wouldn't matter much for pybind11 or nanobind because you only need a few locks.

I think this will introduce atomic memory operations all over the place that are quite expensive even if they are not contended

I think "expensive" depends on the scale you are talking about. For example, in CPython a mutex acquisition is expensive compared to say a Py_INCREF, but not so bad compared to a PyDictObject modification.

To be clear, I think the mutex acquisitions should only happen when compiling for the "free-threaded" build and not for the "default" build of CPython with the GIL.
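That conditional compilation could be keyed off CPython's Py_GIL_DISABLED macro, which the headers define in free-threaded builds. The sketch below is self-contained (it defines the macro itself to exercise the free-threaded path, and the struct/macro names merely echo the pattern from the earlier comment); a real extension would get the macro from Python.h.

```cpp
#include <mutex>

// Py_GIL_DISABLED is normally defined by <Python.h> in free-threaded
// builds; defined here only so this standalone sketch compiles the
// free-threaded branch.
#ifndef Py_GIL_DISABLED
#define Py_GIL_DISABLED 1
#endif

struct internals {
#ifdef Py_GIL_DISABLED
    std::mutex mutex;          // only exists when the GIL is off
#endif
    int registered_types = 0;  // stand-in for the real tables
};

#ifdef Py_GIL_DISABLED
// Hold the lock for the rest of the enclosing scope.
#    define PYBIND11_LOCK_INTERNALS(i) \
        std::unique_lock<std::mutex> lock((i).mutex)
#else
// With the GIL, the interpreter already serializes access: no-op.
#    define PYBIND11_LOCK_INTERNALS(i)
#endif

template <typename F>
auto with_internals(internals &i, const F &cb) -> decltype(cb(i)) {
    PYBIND11_LOCK_INTERNALS(i);
    return cb(i);
}
```

In the default (GIL) build the macro expands to nothing, so the pattern costs nothing where the interpreter already provides mutual exclusion.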

A lot of those accesses are read-only, e.g. to fetch a type object

Can you point me at the relevant code? It's been a while since I looked at this. I remember the instance map being a bottleneck without special handling, but not fetching type objects.

One option could be to "freeze" certain data structures after a module is loaded

The more things that can be made effectively immutable the better.

wjakob commented 4 months ago

Can you point me at the relevant code?

In internals.h, there is a std::unordered_map named registered_types_py that (apart from the instances map you mentioned) is also queried very often. But it is really just manipulated when types are created, which is probably not happening on the hot path.

Two follow-up questions:

  1. Can PyMutex be used by extension libraries when compiling for a free-threaded Python version? If so, it probably makes sense to use the same locking mechanism.

  2. Is there a guideline on adapting Python extensions to the free-threaded build, or do you plan to create one? I'm thinking of something that explains previously innocuous design patterns in CPython extensions that are now considered broken in a free-threaded build, and how to fix them. I read in the PEP about double-locking patterns, and that made me worry that there are probably countless lingering issues that will cause difficult-to-track bugs (not just in pybind11/nanobind, but also in code that uses these libraries and their wrapper types without such multi-object locks).

colesbury commented 4 months ago

Can PyMutex be used by extension libraries when compiling for a free-threaded Python version?

It's not currently part of the public C API and the discussions about making it public have stalled. We're considering extracting PyMutex (and a few other synchronization primitives) into a separate library for use by NumPy, but that's mostly because NumPy is written in C and so doesn't have access to things like std::mutex or std::once_flag.

Is there a guideline on adapting Python extensions to the free-threaded build...?

Not yet, but we plan to write one. It will definitely happen before the 3.13 final release in October; hopefully much sooner. Right now our priority is the libraries at the "bottom of the stack" like pybind11, Cython, and NumPy. Some of the coordination work is tracked in https://github.com/Quansight-Labs/free-threaded-compatibility. I expect that the work on making these libraries thread-safe without the GIL will help make the "how to" guide better.

wjakob commented 4 months ago

It's not currently part of the public C API and the discussions about making it public have stalled.

That seems potentially bad if it's a central piece of infrastructure needed for efficient locking. If every extension comes up with its own half-baked locking scheme, that could turn into a bigger issue in the future. For example, debugging locking issues is probably much easier if everyone is generally using the same kind of lock.

There are still some fundamental questions that are unclear to me in the absence of a developer guide. Some are probably quite naïve.

Thank you in advance!

wjakob commented 4 months ago

Actually, here is one more: in Python extensions, it's quite common to create simple types that cannot form reference cycles and therefore omit the Py_TPFLAGS_HAVE_GC flag (partly because the flag adds quite a bit of per-object storage overhead that seems unnecessary for such types).

PEP 703 mentions that this now affects how reference counting works for such types. What are the practical implications, and are non-GC types an anti-pattern in free-threaded builds?

colesbury commented 4 months ago

Is it then safe to use the rest of the CPython API in the presence of concurrency?

Generally yes, aside from bugs and a few exceptions. I don't think pybind11 should do extra things to protect callers to the C API; if there are things we need to change, we should do it in CPython.

The public functions of, e.g., dict, list, and set all have internal synchronization. Most other objects are interacted with through the abstract API (e.g., PyObject_SetAttr). They should be thread-safe too.

some of the Python API is not meant to be used anymore (e.g. PyList_GetItem)...

It's still safe to use these functions in contexts where no other thread may be modifying the object. For example, extensions often use PyDict_GetItem for extracting keyword arguments from the kwargs dict. This is fine because the kwargs are effectively private to that function call. In my opinion, it's not worth changing those calls, but that's up to the individual extension authors.

However, the generic dict_getitem and dict_getitemstring functions in pybind11 should use the PyDict_GetItemRef and similar functions that return strong references.
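For reference, a non-runnable sketch of the calling-convention change (PyDict_GetItemRef is in CPython 3.13+; `dict` and `key` stand for whatever objects the caller already holds). Unlike PyDict_GetItem, which returns a borrowed reference that another thread could invalidate by mutating the dict, PyDict_GetItemRef stores a strong reference in its output parameter:

```c
/* PyDict_GetItemRef returns 1 (found), 0 (missing), or -1 (error,
 * with an exception set), and on success stores a strong reference
 * in *item that the caller owns. */
PyObject *item;
int rc = PyDict_GetItemRef(dict, key, &item);
if (rc < 0) {
    return NULL;          /* exception already set; propagate */
}
if (rc == 1) {
    /* ... use item ... */
    Py_DECREF(item);      /* release the reference we own */
}
```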

Does a Python extension module ever need to explicitly lock another Python object...?

Python extension modules shouldn't directly lock builtin objects like dict or list. If you need to synchronize some operation on these objects, the locking should be done externally.

Py_BEGIN_CRITICAL_SECTION

This isn't part of the public C API yet. I'd like to make it public, but I was hoping to get PyMutex resolved first.

There's a tradeoff between Py_BEGIN_CRITICAL_SECTION and simple mutexes. Py_BEGIN_CRITICAL_SECTION makes it easier to avoid deadlocks, but it's harder to reason about what invariant the critical section protects. This is analogous to the trade-offs between recursive mutexes and simple (non-recursive) mutexes, but perhaps more extreme.

In general, I think it's better to use simple mutexes in extensions unless there's a concrete reason not to.

What are the practical implications, and are non-GC types an anti-pattern in free-threaded builds?

I think you might be referring to this section of PEP 703.

I don't think there are any practical implications for extensions. Non-GC enabled objects are not an anti-pattern; you should continue to use GC or non-GC types as you do currently.

For CPython internals, the implication is that code objects are now tracked by the GC in the free-threaded build.

wjakob commented 4 months ago

Thank you for your responses Sam! Another point I wanted to discuss: I think that to safely expose free-threading in a library like pybind11, or nanobind, it would be desirable to add further annotations for arguments and classes.

For example, a ".lock()" annotation for function parameters:

m.def("modify_foo", [](Foo *foo) { .. }, "foo"_a.lock());

So that the binding layer can guarantee that there isn't concurrent access to the internals of an object. Unsafe concurrent access would in general not just cause race conditions but actually segfault the process, which would be very bad.

Similarly some classes may not permit concurrent access, and the binding layer could then synchronize with this at the method level like a "synchronized" class in Java, e.g.

py::class_<MyClass>(m, "MyClass", py::locked())
    .def("modify", &MyClass::modify);

But for all of this, it would be desirable to use the locking support of the underlying PyObject.

wjakob commented 4 months ago

(So it probably makes sense to wait until the PyMutex discussion has stabilized and there is an API for us to use)

colesbury commented 4 months ago

I think I understand py::locked() in the class-level example, but I'm a bit confused by the "foo"_a.lock() in the method-level example. Can you explain a bit more about how it would work and what the syntax means? I'm not familiar with the "foo"_a syntax.

wjakob commented 4 months ago

"foo"_a is syntax sugar for py::arg("foo") which specifies the Python keyword name of a C++ function argument.

You can specify default values, nullability, etc -- e.g. py::arg("foo").none() = Foo() means: the function has an argument "foo", which should be default-initialized with the C++ object Foo(), and the argument may also be equal to NULL / None, and that's okay (the binding layer shouldn't try to prevent such a call).

The .lock() suffix would say: lock this argument before dispatching the call to the C++ implementation, and unlock it afterwards. It would probably be limited to at most one function argument per call, otherwise it would seem difficult to avoid deadlocks. (Or perhaps the implementation could reorder the pointers before locking, which is IIRC also what free-threaded Python does).
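The pointer-reordering trick mentioned here can be sketched in plain C++ (names are illustrative): always acquire the two mutexes in a fixed global order, here by comparing addresses via std::less, so two threads locking the same pair of objects cannot deadlock in the classic AB/BA way.

```cpp
#include <functional>
#include <mutex>

// Lock two objects' mutexes in address order so every thread acquires
// them in the same global order. std::less on pointers gives a total
// order even for unrelated objects.
template <typename T>
void lock_pair(T &a, T &b) {
    if (&a == &b) {
        a.mutex.lock();  // same object passed twice: lock once
        return;
    }
    if (std::less<T *>()(&a, &b)) {
        a.mutex.lock();
        b.mutex.lock();
    } else {
        b.mutex.lock();
        a.mutex.lock();
    }
}

template <typename T>
void unlock_pair(T &a, T &b) {
    a.mutex.unlock();
    if (&a != &b)
        b.mutex.unlock();
}
```

std::scoped_lock achieves the same deadlock avoidance by a different (try-and-back-off) strategy; the address-ordering version is shown here because it matches the "reorder the pointers before locking" idea from the comment above.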

henryiii commented 4 months ago

The first, simple step, I'd assume, is to make sure we can compile against the free-threaded build while still holding the GIL. I've fixed the four tests that fail due to refcount issues when doing so against beta 1.

If I actually disable the GIL, two more tests fail, both involving py::implicitly_convertible:

FAILED ../../tests/test_thread.py::test_implicit_conversion - TypeError: test(): incompatible function arguments. The following argument types are supported:
FAILED ../../tests/test_thread.py::test_implicit_conversion_no_gil - TypeError: test_no_gil(): incompatible function arguments. The following argument types are supported:

We don't do much threading in the test suite, so I'd not expect too many failures. This seems to be an issue with registering type conversions.