pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/

Compatibility with free-threaded build (PEP 703) #5112

Closed rostan-t closed 1 month ago

rostan-t commented 5 months ago

Required prerequisites

What version (or hash if on master) of pybind11 are you using?

2.12.0

Problem description

PEP 703 proposes making the GIL optional. The steering council accepted it, and its integration is ongoing within CPython. Is there a plan to make pybind11 compatible with free-threaded builds?

Some changes would probably involve using new references instead of borrowed references. Apart from this, could the global state stored in internals and local_internals pose an issue if accessed from different threads simultaneously, without the GIL to protect it?

Reproducible example code

No response

Is this a regression? Put the last known working version here if it is.

Not a regression

colesbury commented 4 months ago

I would love to get pybind11 working with the free-threaded build now that 3.13 beta 1 is out. I have some older patches that I can update for the latest version of pybind11.

The basic strategy is to lock around accesses to internals in the free-threaded build (but not in the default build). I found it helpful to use the following pattern, but we could use RAII instead:

template <typename F>
inline auto with_internals(const F &cb) -> decltype(cb(get_internals())) {
    auto &internals = get_internals();
    PYBIND11_LOCK_INTERNALS(internals);  // no-op in default build
    return cb(internals);
}

...

    with_internals([&](internals &internals) {
        // access or modify internals
    });

It's also helpful to have special handling for registered_instances because it can be a bottleneck in multi-threaded programs. Sharding the table and using a mutex per shard help scalability a lot.
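A minimal sketch of that sharding idea (type and member names are illustrative, not pybind11's actual implementation): each shard pairs a mutex with its own map, and the shard is chosen by hashing the C++ instance pointer, so threads touching different instances usually lock different mutexes.

```cpp
#include <array>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Illustrative sharded instance table: the key is the C++ instance
// pointer; Value stands in for the Python wrapper object.
template <typename Value, std::size_t NumShards = 8>
class sharded_instance_map {
    struct shard {
        std::mutex mutex;
        std::unordered_map<const void *, Value> map;
    };
    std::array<shard, NumShards> shards_;

    shard &shard_for(const void *key) {
        // Shift out the low bits so that nearby allocations
        // spread across shards.
        auto h = reinterpret_cast<std::uintptr_t>(key);
        return shards_[(h >> 4) % NumShards];
    }

public:
    void insert(const void *key, Value value) {
        shard &s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mutex);  // lock only this shard
        s.map.emplace(key, value);
    }

    bool find(const void *key, Value &out) {
        shard &s = shard_for(key);
        std::lock_guard<std::mutex> guard(s.mutex);
        auto it = s.map.find(key);
        if (it == s.map.end())
            return false;
        out = it->second;
        return true;
    }
};
```

The scalability win comes from threads no longer contending on a single mutex (and its cache line) for every instance lookup.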

wjakob commented 4 months ago

I'm interested in this discussion also in the context of nanobind. The idea of threads fighting over the cache line holding the internals mutex is a bit concerning though. A lot of those accesses are read-only, e.g. to fetch a type object.

Is std::mutex generally the right kind of lock to use? I think this will introduce atomic memory operations all over the place that are quite expensive even if they are not contended. I assume that's a question also for the CPython internals, where the equivalent would be pthread_mutex_lock—is this what you use to protect, say, a dictionary being concurrently mutated?

One option could be to "freeze" certain data structures after a module is loaded, after which they can only be accessed in read-only mode.
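One way to sketch that freeze idea (illustrative, not an existing pybind11 API): guard writes with a mutex during startup, then flip an atomic flag with release ordering so that readers who observe the frozen state also see every prior insert, and can skip locking entirely afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Illustrative "freeze after module load" registry: mutable (under a
// lock) during startup, lock-free and read-only afterwards.
class frozen_registry {
    std::unordered_map<std::string, int> map_;
    std::mutex write_mutex_;
    std::atomic<bool> frozen_{false};

public:
    void insert(const std::string &key, int value) {
        std::lock_guard<std::mutex> guard(write_mutex_);
        if (frozen_.load(std::memory_order_relaxed))
            throw std::logic_error("registry is frozen");
        map_.emplace(key, value);
    }

    void freeze() {
        std::lock_guard<std::mutex> guard(write_mutex_);
        // Release: readers that observe frozen_ == true also see
        // every insert made before the freeze.
        frozen_.store(true, std::memory_order_release);
    }

    // Only valid after freeze(); no lock needed because the map can
    // no longer change.
    const int *find(const std::string &key) const {
        assert(frozen_.load(std::memory_order_acquire));
        auto it = map_.find(key);
        return it == map_.end() ? nullptr : &it->second;
    }
};
```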

colesbury commented 4 months ago

Is std::mutex generally the right kind of lock to use?

I'd be inclined to start with std::mutex. In CPython, we use a custom lock PyMutex. PyMutex is a little bit faster than pthread_mutex_t because the fast-path can be inlined, but it's not a huge difference. It's also smaller (1 byte vs. 40 bytes for pthread_mutex_t typically), which matters for CPython because there's a mutex embedded in PyObject (in the free-threaded build), but wouldn't matter much for pybind11 or nanobind because you only need a few locks.

I think this will introduce atomic memory operations all over the place that are quite expensive even if they are not contended

I think "expensive" depends on the scale you are talking about. For example, in CPython a mutex acquisition is expensive compared to say a Py_INCREF, but not so bad compared to a PyDictObject modification.

To be clear, I think the mutex acquisitions should only happen when compiling for the "free-threaded" build and not for the "default" build of CPython with the GIL.
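That conditional compilation could be keyed off CPython's Py_GIL_DISABLED macro, which the headers define in free-threaded builds. The sketch below is self-contained (it defines the macro itself to exercise the free-threaded path, and the struct/macro names merely echo the pattern from the earlier comment); a real extension would get the macro from Python.h.

```cpp
#include <mutex>

// Py_GIL_DISABLED is normally defined by <Python.h> in free-threaded
// builds; defined here only so this standalone sketch compiles the
// free-threaded branch.
#ifndef Py_GIL_DISABLED
#define Py_GIL_DISABLED 1
#endif

struct internals {
#ifdef Py_GIL_DISABLED
    std::mutex mutex;          // only exists when the GIL is off
#endif
    int registered_types = 0;  // stand-in for the real tables
};

#ifdef Py_GIL_DISABLED
// Hold the lock for the rest of the enclosing scope.
#    define PYBIND11_LOCK_INTERNALS(i) \
        std::unique_lock<std::mutex> lock((i).mutex)
#else
// With the GIL, the interpreter already serializes access: no-op.
#    define PYBIND11_LOCK_INTERNALS(i)
#endif

template <typename F>
auto with_internals(internals &i, const F &cb) -> decltype(cb(i)) {
    PYBIND11_LOCK_INTERNALS(i);
    return cb(i);
}
```

In the default (GIL) build the macro expands to nothing, so the pattern costs nothing where the interpreter already provides mutual exclusion.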

A lot of those accesses are read-only, e.g. to fetch a type object

Can you point me at the relevant code? It's been a while since I looked at this. I remember the instance map being a bottleneck without special handling, but not fetching type objects.

One option could be to "freeze" certain data structures after a module is loaded

The more things that can be made effectively immutable the better.

wjakob commented 4 months ago

Can you point me at the relevant code?

In internals.h, there is a std::unordered_map named registered_types_py that (apart from the instances map you mentioned) is also queried very often. But it is really just manipulated when types are created, which is probably not happening on the hot path.

Two follow-up questions:

  1. Can PyMutex be used by extension libraries when compiling for a free-threaded Python version? If so, it probably makes sense to use the same locking mechanism.

  2. Is there a guideline on adapting Python extensions to the free-threaded build, or do you plan to create one? I'm thinking of something that explains previously innocuous design patterns in CPython extensions that are now considered broken in a free-threaded build, and how to fix them. I read in the PEP about double-locking patterns, and that made me worry that there are probably countless lingering issues that will cause difficult-to-track bugs (not just in pybind11/nanobind, but also in code that uses these libraries and their wrapper types without such multi-object locks).

colesbury commented 4 months ago

Can PyMutex be used by extension libraries when compiling for a free-threaded Python version?

It's not currently part of the public C API and the discussions about making it public have stalled. We're considering extracting PyMutex (and a few other synchronization primitives) into a separate library for use by NumPy, but that's mostly because NumPy is written in C and so doesn't have access to things like std::mutex or std::once_flag.

Is there a guideline on adapting Python extensions to the free-threaded build...?

Not yet, but we plan to write one. It will definitely happen before the 3.13 final release in October; hopefully much sooner. Right now our priority is the libraries at the "bottom of the stack" like pybind11, Cython, and NumPy. Some of the coordination work is tracked in https://github.com/Quansight-Labs/free-threaded-compatibility. I expect that the work on making these libraries thread-safe without the GIL will help make the "how to" guide better.

wjakob commented 4 months ago

It's not currently part of the public C API and the discussions about making it public have stalled.

That seems potentially bad if it's a central piece of infrastructure needed for efficient locking. If every extension comes up with its own half-baked locking scheme, that could turn into a bigger issue in the future. For example, debugging locking issues is probably much easier if everyone is generally using the same kind of lock.

There are still some fundamental questions that are unclear to me in the absence of a developer guide. Some are probably quite naïve.

Thank you in advance!

wjakob commented 4 months ago

Actually, here is one more: in Python extensions, it's quite common to create simple types that cannot form reference cycles and therefore omit the Py_TPFLAGS_HAVE_GC flag (partly because the flag adds quite a bit of per-object storage overhead that seems unnecessary for such types).

PEP 703 mentions that this now affects how reference counting works for such types. What are the practical implications, and are non-GC types an anti-pattern in free-threaded builds?

colesbury commented 4 months ago

Is it then safe to use the rest of the CPython API in the presence of concurrency?

Generally yes, aside from bugs and a few exceptions. I don't think pybind11 should do extra things to protect callers to the C API; if there are things we need to change, we should do it in CPython.

The public functions of, e.g., dict, list, and set all have internal synchronization. Most other objects are interacted with through the abstract API (e.g., PyObject_SetAttr). They should be thread-safe too.

some of the Python API is not meant to be used anymore (e.g. PyList_GetItem)...

It's still safe to use these functions in contexts where no other thread may be modifying the object. For example, extensions often use PyDict_GetItem for extracting keyword arguments from the kwargs dict. This is fine because the kwargs are effectively private to that function call. In my opinion, it's not worth changing those calls, but that's up to the individual extension authors.

However, the generic dict_getitem and dict_getitemstring functions in pybind11 should use the PyDict_GetItemRef and similar functions that return strong references.
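For reference, a non-runnable sketch of the calling-convention change (PyDict_GetItemRef is in CPython 3.13+; `dict` and `key` stand for whatever objects the caller already holds). Unlike PyDict_GetItem, which returns a borrowed reference that another thread could invalidate by mutating the dict, PyDict_GetItemRef stores a strong reference in its output parameter:

```c
/* PyDict_GetItemRef returns 1 (found), 0 (missing), or -1 (error,
 * with an exception set), and on success stores a strong reference
 * in *item that the caller owns. */
PyObject *item;
int rc = PyDict_GetItemRef(dict, key, &item);
if (rc < 0) {
    return NULL;          /* exception already set; propagate */
}
if (rc == 1) {
    /* ... use item ... */
    Py_DECREF(item);      /* release the reference we own */
}
```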

Does a Python extension module ever need to explicitly lock another Python object...?

Python extension modules shouldn't directly lock builtin objects like dict or list. If you need to synchronize some operation on these objects, the locking should be done externally.

Py_BEGIN_CRITICAL_SECTION

This isn't part of the public C API yet. I'd like to make it public, but I was hoping to get PyMutex resolved first.

There's a tradeoff between Py_BEGIN_CRITICAL_SECTION and simple mutexes. Py_BEGIN_CRITICAL_SECTION makes it easier to avoid deadlocks, but it's harder to reason about what invariant the critical section protects. This is analogous to the trade-offs between recursive mutexes and simple (non-recursive) mutexes, but perhaps more extreme.

In general, I think it's better to use simple mutexes in extensions unless there's a concrete reason not to.

What are the practical implications, and are non-GC types an anti-pattern in free-threaded builds?

I think you might be referring to this section of PEP 703.

I don't think there are any practical implications for extensions. Non-GC enabled objects are not an anti-pattern; you should continue to use GC or non-GC types as you do currently.

For CPython internals, the implication is that code objects are now tracked by the GC in the free-threaded build.

wjakob commented 4 months ago

Thank you for your responses Sam! Another point I wanted to discuss: I think that to safely expose free-threading in a library like pybind11, or nanobind, it would be desirable to add further annotations for arguments and classes.

For example, a ".lock()" annotation for function parameters:

m.def("modify_foo", [](Foo *foo) { .. }, "foo"_a.lock());

So that the binding layer can guarantee that there isn't concurrent access to the internals of an object. Unsafe concurrent access would in general not just cause race conditions but actually segfault the process, which would be very bad.

Similarly some classes may not permit concurrent access, and the binding layer could then synchronize with this at the method level like a "synchronized" class in Java, e.g.

py::class_<MyClass>(m, "MyClass", py::locked())
    .def("modify", &MyClass::modify);

But for all of this, it would be desirable to use the locking support of the underlying PyObject.

wjakob commented 4 months ago

(So it probably makes sense to wait until the PyMutex discussion has stabilized and there is an API for us to use)

colesbury commented 4 months ago

I think I understand py::locked() in the class-level example, but I'm a bit confused by the "foo"_a.lock() in the method-level example. Can you explain a bit more about how it would work and what the syntax means? I'm not familiar with the "foo"_a syntax.

wjakob commented 4 months ago

"foo"_a is syntax sugar for py::arg("foo") which specifies the Python keyword name of a C++ function argument.

You can specify default values, nullability, etc -- e.g. py::arg("foo").none() = Foo() means: the function has an argument "foo", which should be default-initialized with the C++ object Foo(), and the argument may also be equal to NULL / None, and that's okay (the binding layer shouldn't try to prevent such a call).

The .lock() suffix would say: lock this argument before dispatching the call to the C++ implementation, and unlock it afterwards. It would probably be limited to at most one function argument per call, otherwise it would seem difficult to avoid deadlocks. (Or perhaps the implementation could reorder the pointers before locking, which is IIRC also what free-threaded Python does).
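The pointer-reordering trick mentioned here can be sketched in plain C++ (names are illustrative): always acquire the two mutexes in a fixed global order, here by comparing addresses via std::less, so two threads locking the same pair of objects cannot deadlock in the classic AB/BA way.

```cpp
#include <functional>
#include <mutex>

// Lock two objects' mutexes in address order so every thread acquires
// them in the same global order. std::less on pointers gives a total
// order even for unrelated objects.
template <typename T>
void lock_pair(T &a, T &b) {
    if (&a == &b) {
        a.mutex.lock();  // same object passed twice: lock once
        return;
    }
    if (std::less<T *>()(&a, &b)) {
        a.mutex.lock();
        b.mutex.lock();
    } else {
        b.mutex.lock();
        a.mutex.lock();
    }
}

template <typename T>
void unlock_pair(T &a, T &b) {
    a.mutex.unlock();
    if (&a != &b)
        b.mutex.unlock();
}
```

std::scoped_lock achieves the same deadlock avoidance by a different (try-and-back-off) strategy; the address-ordering version is shown here because it matches the "reorder the pointers before locking" idea from the comment above.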

henryiii commented 4 months ago

The first, simple step, I'd assume, is to make sure we can compile against the free-threaded build while still holding the GIL. I've fixed the four tests that fail due to refcount issues when doing so against beta 1.

If I actually disable the GIL, two more tests fail, both involving py::implicitly_convertible:

FAILED ../../tests/test_thread.py::test_implicit_conversion - TypeError: test(): incompatible function arguments. The following argument types are supported:
FAILED ../../tests/test_thread.py::test_implicit_conversion_no_gil - TypeError: test_no_gil(): incompatible function arguments. The following argument types are supported:

We don't do much threading in the test suite, so I'd not expect too many failures. This seems to be an issue with registering type conversions.