Double-nested gil_scoped_release in a pythread causes fatal interpreter error

pybind / pybind11

Seamless operability between C++11 and Python

https://pybind11.readthedocs.io/

Other

15.6k stars 2.09k forks source link

Double-nested gil_scoped_release in a pythread causes fatal interpreter error #1276

Open KyleFromKitware opened 6 years ago

KyleFromKitware commented 6 years ago

Assume the following scenario:

Create a new pythread
Call a C++ function from within that pythread
That C++ function calls some Python
That Python calls some MORE C++

Here is the minimum working example:

#include <pybind11/pybind11.h>
#include <pybind11/embed.h>

#include <iostream>
#include <thread>

static void thread()
{
  pybind11::gil_scoped_acquire acquire;
  {
    // Pretend this block is nested inside Python
    pybind11::gil_scoped_release release;
    std::cout << "Calling C++ code" << std::endl;
  }
}

static void init_threads()
{
  // NOTE: This is a workaround for #1273
  if (!PyEval_ThreadsInitialized())
  {
    {
      pybind11::gil_scoped_acquire acquire;
    }
    PyEval_SaveThread();
  }
}

static const char thread_code[] =
"import threading\n"
"import testmod\n"
"t = threading.Thread(target=testmod.func)\n"
"t.start()\n"
"t.join()\n";

PYBIND11_EMBEDDED_MODULE(testmod, m)
{
  m.def("func", thread, pybind11::call_guard<pybind11::gil_scoped_release>());
}

int main()
{
  pybind11::scoped_interpreter interp;
  init_threads();

  {
    pybind11::gil_scoped_acquire acquire;
    pybind11::exec(thread_code);
  }

  return 0;
}

When the inner nested gil_scoped_release is destructed, it causes the Python interpreter to throw a fatal error:

Calling C++ code
Fatal Python error: Invalid thread state for this thread
Aborted (core dumped)

because there are two different thread states for the same thread, one created by the pythread and one created by pybind.

KyleFromKitware commented 6 years ago

Note: the fatal error only happens if you are running a DEBUG build of the Python interpreter (VERY important).

KyleFromKitware commented 6 years ago

Here's an even more concrete and tangible example that doesn't require a debug build of Python:

#include <pybind11/pybind11.h>
#include <pybind11/embed.h>

#include <iostream>
#include <thread>

static void thread()
{
  pybind11::gil_scoped_acquire acquire;
  {
    // Call some Python code
    // ...
    // Python code calls a 3rd party library which does this
    PyGILState_STATE state = PyGILState_Ensure();
    std::cout << "Calling 3rd party library" << std::endl;
    PyGILState_Release(state);
  }
}

static const char thread_code[] =
"import threading\n"
"import testmod\n"
"t = threading.Thread(target=testmod.func)\n"
"t.start()\n"
"t.join()\n";

PYBIND11_EMBEDDED_MODULE(testmod, m)
{
  m.def("func", thread, pybind11::call_guard<pybind11::gil_scoped_release>());
}

int main()
{
  pybind11::scoped_interpreter interp;
  pybind11::gil_scoped_release release;

  {
    pybind11::gil_scoped_acquire acquire;
    pybind11::exec(thread_code);
  }

  return 0;
}

This one deadlocks as soon as it reaches the PyGILState_Ensure() call in the hypothetical 3rd party library.

Normally, the PyGILState API is smart enough to handle recursive GIL acquisition. However, because gil_scoped_acquire and gil_scoped_release have their own bastardized version of this API with its own internal data structures, the two are getting out of sync, and the Python interpreter doesn't know about the pybind11 structures, so it fails to recognize the recursive GIL acquisition.

The solution to this problem is to get rid of pybind's internal thread states and use the PyGILState APIs like it did in older versions. However, this breaks the thread dissociation feature of gil_scoped_release, which some projects still use. I propose adding a configuration option that uses the pybind method by default, but falls back to the PyGILState API when you set a flag in order to make this use case work. I am currently working on a PR with this change.

wjakob commented 6 years ago

Calling pybind11's implementation "bastardized" is quite inappropriate. It enables several useful applications that are impossible to realize with Python's restricted GIL state API. Naturally, the two are not supposed to be combined in this way.

wjakob commented 6 years ago

There is really nothing forcing you to use pybind121::gil_scoped_release by the way. You can declare your own RAII wrapper that is appropriate for your project.

KyleFromKitware commented 6 years ago

My apologies, I wasn't trying to be inappropriate. In English, we sometimes use "bastardized" as another word for "hacked" or "modified" - perhaps one of those would have been more appropriate. In any case, I do understand what this is trying to accomplish, but the behavior was not what I was expecting.

My biggest concern is that the PYBIND11_OVERLOAD() macros use gil_scoped_acquire, which uses this version of the API, though I suppose I could rewrite these macros too.

Erotemic commented 6 years ago

@wjakob continuing our conversation, let summarize the main technical points and respond to some issues here, and perhaps @jagerman can chime in with his thoughts.

There are currently two options for interfacing with the GIL 1) CPython's PyGILState API 2) pybind11's gil_scoped_acquire / gil_scoped_release API

These methods are incompatible and cannot be nested because they each maintain there own state and do not synchronize with each other. Unfortunately, as noted in pybind11.h, the standard PyGILState API is limited and does not support some reasonable (but advanced) operations. Fortunately, the nesting that occurs will happen fairly infrequently because most python projects are small scripts and not large long-running systems.

In my opinion, because this incompatibility exists and most programs will not use the advanced features of the current interface, I believe that the pybind11 API should be changed to be compatible by default and expose a simple way to use the current interface that allows for advanced thread control. This way it would be the responsibility of the developer explicitly using the advanced features to ensure that nested calls were safe (or at least provide some warning or a note in the docs).

However, as @wjakob points out

Simply switching to PyGILState is likely a show-stopper: it will break compatibility for software that uses any of the advanced features and thus will require a new major version (i.e. pybind11 3.x).

This is a valid point, breaking compatibility does require a major version change. I hope to convince you that its worthwhile to do so. My main points of my argument are:

Most projects will not use the advanced threading features
Most projects will not use nested calls
However, I think it is more likely (especially as Python grows and becomes used in larger and larger systems) for nested calls to occur. Kitware's KWIVER project is an example of one such system. Perhaps, I'm biased, but I don't view KWIVER as a project with special requirements. I view it as an example of a larger system that is emerging.
Pybind11 has become widely adopted by projects such as caffe2 and pytorch and is quickly becoming a standard in the community.
I believe standards should be compatible by default with the current overarching CPython standard.

Some of personal projects actually critically depend on the improved GIL state API, so my excitement for such a change is very limited.

I very much understand this. Ideally we can come up with a good solution that makes upgrading to a new API simple. Of course this means that the new version must have an API to behave like the current version. (Alternatively we can petition for Python 4 to integrate the improved GIL API or we can wait for @larryhastings to finish his gilectomy).

Hopefully, I've laid out a reasonable argument as to why changing the current behavior of pybind11 is desirable. As for discussing the specifics of those changes (maybe even with ones that don't require a major version bump) I'll make a post in #1276

KyleFromKitware commented 6 years ago

@Erotemic's comments on this matter are spot on. Here are some of my additional thoughts.

We could simply write our own basic_gil_scoped_acquire/release RAII classes that use the "basic" PyGILState_* API, but the biggest problem we're having is that there are places in pybind that use the advanced API, and these can't be changed without changing pybind. Most notable of these is the PYBIND11_OVERLOAD() macros and error_already_set. I was able to write "basic" versions of the PYBIND11_OVERLOAD() macros, but because error_already_set is used in so many places and is thrown any time Python code throws an exception, trying to make a new version of error_already_set would involve redefining so many classes and functions that it would be equivalent to rewriting pybind. At that point, it would be a lot simpler to just fork the project entirely. In order to solve this issue, we will need to make some kind of change to pybind - simply redefining the problematic classes and macros won't be enough.

In short, there are two issues here:

Come up with a way to select the desired GIL behavior for projects (such as ours) that don't want to use the advanced API and are negatively impacted by its use. (This is the higher priority concern.)
What should the default be? Basic or advanced? Personally, I agree with @Erotemic that the pybind default should be compatible with the CPython standard, but changing this default would require a pybind major version update, so it is understandable why there would be resistance to such a change. When I wrote #1286, I designed it with the specific intention of leaving the default behavior alone (use the pybind11 GIL API) and only use the PyGILState_* API if the developer explicitly selects it, in order to avoid breaking compatibility.

@Erotemic forwarded me this from @wjakob:

Right now the solution looks unappealing to me due to the introduction of extra ‘options’ (which seems very heavy-handed). An optional define like PYBIND11_USE_PYTHON_GILSTATE that switches over to an alternative gil_scoped_acquire_basic / gil_scoped_release_basic RAII wapper may be more suitable.

Can you explain your thoughts here? What I'm thinking is that, at least to the end user, there isn't a great deal of difference between my proposal and adding a #define. Both would require code changes in the project using pybind to select what they want to do. There are only two major differences: my option adds a rather large amount of code whereas the #define would only be a few lines long (an understandable concern in any project), and the #define happens at compile time whereas my proposal with an option would happen at runtime. (@Erotemic pointed out to me that the runtime option would have a slight performance penalty since there would be a few extra branching instructions, but given that the GIL - and mutexes in general - already use expensive kernel system calls for lock contention, this doesn't seem like a major bottleneck to me.)

One concern that @jagerman had is that having a global runtime option could change the value for other libraries (if library A uses the proposed "basic" PyGILState_* API and library B uses the advanced API, one of them could clobber the other.) What I found during my testing is that this doesn't happen (in fact, since we have multiple shared objects in our project, I had to insert my option in more than one place.) This would make sense to me, because pybind11 is a header-only library, so each executable/shared library object would get its own copy of the static option variable, though relying on this behavior is not necessarily a good idea.

@Erotemic also came up with the idea of having both a runtime and compile-time option, where the compile-time option sets the default for the runtime option, but the runtime option can still be changed.

Another option: instead of simply having an either/or between the basic and advanced APIs, what if we created a template function to allow the user to select any class they want? Example:

pybind11::options options;
options.set_gil_scoped_acquire<basic_gil_scoped_acquire>();

Internally, this would use a virtual class with the selected class as a templated member.

We could also do something similar as a compile time option. Example:

#define PYBIND11_GIL_SCOPED_ACQUIRE basic_gil_scoped_acquire

And then inside pybind:

#ifndef PYBIND11_GIL_SCOPED_ACQUIRE
#define PYBIND11_GIL_SCOPED_ACQUIRE ::pybind11::gil_scoped_acquire
#endif

One thing is for sure: as pybind becomes more widespread, and larger and more complicated projects start using it, they too are going to encounter this issue. It is certainly having a major impact on KWIVER, so I will take it upon myself to implement any proposal we can come up with. I hope that we can find a solution that meets everyone's needs.

jagerman commented 6 years ago

Some food for thought:

having a global runtime option could change the value for other libraries (if library A uses the proposed "basic" PyGILState_* API and library B uses the advanced API, one of them could clobber the other.) What I found during my testing is that this doesn't happen ...

Yeah, I was thinking ahead a bit with that concern: It doesn't happen because, right now, the options class state stack is via a static variable, as you found. I don't think, however, as a runtime option that this is the right approach—the options class was really meant to control how pybind handles bind definitions; its main purpose is to allow you to do something like { options local_opts; local_opts.enable_function_signatures(); /* ... module definitions ... */; } to control how those definitions are applied. (It should probably have been named definition_options in retrospect). With an options-based approach you'd run into trouble if, for example, one module has a options local; to set some unrelated options, but loads your module in its definition code: the localized scope gil choice option would be reset at the end of the first module's definition scope. I don't see that sort of scoped interface helping at all here (instead it just gets in the way).

A much more natural place for this sort of global runtime control is in internals. That data is shared across modules (at least for modules with the same layout, which typically means the same major/minor version). It might even be worthwhile storing similarly to how we store internals but outside internals itself so that it can apply even across modules compiled with different pybind versions.

But back to the core issue. The biggest issue, as I see it, is that we are forcing the use of pybind11::gil_scoped_release: most notably in the OVERLOAD macros, though I also see another use in functional.h. We should fix that.

As to what that fix looks like: I don't think that either the PYBIND11_GIL_SCOPED_ACQUIRE or set_gil_scoped_acquire approachs are particularly nice or that they entirely solve the issue: you can still get a deadlock and/or interpreter error fairly easily if two modules do different things w.r.t the GIL type to use. Your module may set the define, but if another module doesn't you're right back to the same issue.

I don't personally have much experience with the GIL intricacies, so maybe there's a reason this wouldn't work, but we ought to be able to detect whether a Python "simple" GIL or a pybind "advanced" GIL is already acquired, and if so, we could have a default pybind11::new_gil_scoped_acquire (a new class—and just a placeholder name) which just extends whichever approach is currently in effect. That would, I think, solve the issue for OVERLOAD and functional. As @Erotemic points out, most modules don't really care about the implementation, they just want a GIL, and that's what this new class would do.

We could then rename gil_scoped_acquire to pybind11::advanced_gil_acquire (or something along those lines) with a typedef for backwards compatibility. A new pybind11::basic_gil_acquire, as described earlier in the issue, would provide the CPython-compatibile implementation. We might still want something like pybind11::require_gil<basic_scoped_acquire>() or pybind11::require_gil<gil_scoped_acquire>() set the default—and also to double as a runtime-check against two modules requiring incompatible gil implementations.

KyleFromKitware commented 6 years ago

but we ought to be able to detect whether a Python "simple" GIL or a pybind "advanced" GIL is already acquired, and if so, we could have a default pybind11::new_gil_scoped_acquire (a new class—and just a placeholder name) which just extends whichever approach is currently in effect.

This certainly seems doable. It would just be a matter of calling PyGILState_GetThisThreadState() for the basic API and PyThread_get_key_value(internals.tstate) for the advanced API - though if we are going to generalize it with templates, we will also need to add some sort of static method to advanced_gil_scoped_acquire and basic_gil_scoped_acquire - something like basic_gil_scoped_acquire::thread_has_gil(), and then at that point we would need some sort of function like pybind11::register_gil<basic_gil_scoped_acquire>(). I'm imagining an internal list of GIL implementations which has basic_gil_scoped_acquire and advanced_gil_scoped_acquire by default, and then a single implementation from that list is selected as the default. In fact, since the "release" and "acquire" classes need to be grouped together, they should be typedefs within a larger overarching "implementation" class, something like basic_gil_impl and advanced_gil_impl, so I'm going to refer to this impl class from here on out. The acquire class would be basic_gil_impl::acquire and the release class would be basic_gil_impl::release. These can of course be typedef-ed into basic_gil_scoped_acquire for simplicity.

If we're going to have something like new_gil_scoped_acquire (I'm imagining it as auto_gil_impl::acquire, and I think it's a good idea), then we also need a way to tell it what to do if no GIL state has been created for the thread - something like auto_gil_impl<basic_gil_impl>::acquire would select the basic API if nothing existing can be found.

Actually... as I'm thinking about this some more, maybe auto_gil_impl doesn't need to be parameterized, it can just use the "default" implementation that was selected. And I do like the idea of require_gil<advanced_gil_impl>() doubling as a runtime check and throwing an error for conflicting selections.

It might even be worthwhile storing similarly to how we store internals but outside internals itself so that it can apply even across modules compiled with different pybind versions.

I like this idea too. Unfortunately, it won't help for projects that have been compiled with a version of pybind prior to the implementation of this proposal, but it should at least help going forward. If we make some sort of new builtin object, like __pybind11_gil_internals__, the first field would need to be something like unsigned int gil_internals_version, and we would need to take care to maintain backwards compatibility here. I would also prefer to store this in a way that doesn't require us to acquire the GIL to get at it (PyEval_GetBuiltins() requires the GIL to be held!) I'm not sure what to do here - if anyone has any ideas they would be appreciated.

KyleFromKitware commented 6 years ago

I have opened #1322. It's not finished, but it should be enough to get a discussion going.

wjakob commented 6 years ago

Hi Kyle,

I took a look at PR #1322. For me, it is too complicated. At its core, the GIL is something straightforward (a reference-counted mutex), and I'd be hesitant to introduce such complex machinery to deal with it in pybind11. (I am interested in resolving this issue though, so let's try to figure out what can be done.)

It's been a long time since I wrote the GIL handling in pybind11, and I just took a moment to go through it again. This made me realize that CPython's GILState and pybind11's "advanced GIL" are really supposed to be compatible. In fact, the reason why the code is a bit complicated is because pybind11 has to jump through some hoops to get access to the CPython internal variables, which it then tries to update in exactly the same way.

You can take a look e.g. at https://github.com/python/cpython/blob/master/Python/pystate.c#L1040. The "tss key" corresponds exactly to internals.tstate in pybind11's implementation. The gilstate_counter in both codebases also refers to the same place in memory.

So rather than cooking up a completely new subsystem, I think the way to go is to ensure that whatever pybind11 does is compatible with CPython. It sounds to me like you have potentially run into a corner case where the handling slightly differs.

Now that said, I don't really "get" the code you posted before. Why can't you write the following? This works perfectly on my machine:

#include <pybind11/pybind11.h>
#include <pybind11/embed.h>

#include <iostream>
#include <thread>

static void thread()
{
  pybind11::gil_scoped_acquire acquire;
  {
    // Pretend this block is nested inside Python
    pybind11::gil_scoped_release release;
    std::cout << "Calling C++ code" << std::endl;
  }
}

static const char thread_code[] =
"import threading\n"
"import testmod\n"
"t = threading.Thread(target=testmod.func)\n"
"t.start()\n"
"t.join()\n";

PYBIND11_EMBEDDED_MODULE(testmod, m)
{
  m.def("func", thread, pybind11::call_guard<pybind11::gil_scoped_release>());
}

int main()
{
  pybind11::scoped_interpreter interp;
  pybind11::exec(thread_code);

  return 0;
}

Best, Wenzel

KyleFromKitware commented 6 years ago

The crux of this entire issue is that "tss key". In Python 2.7, it's called autoTLSkey (https://github.com/python/cpython/blob/baca85fcc7cdf70af4a76ea0966d4842c173de1a/Python/pystate.c#L599), and it's declared static inside pystate.c (https://github.com/python/cpython/blob/baca85fcc7cdf70af4a76ea0966d4842c173de1a/Python/pystate.c#L40), so external software can't get at it and know its value. (In practice, it is almost certainly going to be 0, but we can't rely on this behavior.) The internals.tstate is a separate key. The two APIs don't know about the other one's key, and so they end up creating two separate thread states for the same thread. Python has internal assert()s which check for that specific condition (they only trigger if you have debug build of Python.)

The first example I gave was when I was in the earlier stages of tracking down this issue. It gets triggered because Python internally uses the PyGILState_* API when it creates a pythread. The second example with PyGILState_* more accurately illustrates our use case. The software that uses PyGILState_* is 3rd party, so we don't have a good way of changing it. Our software is large and complex, and has long-running operations in C++ land, so it always releases the GIL when it leaves Python land and reacquires it when it reenters, hence the release followed by acquire that I had in the main thread in my original example. That release and acquire, coupled with the debug build in Python, was crucial to trigger the assert() error.

But back to my original point: none of this would be an issue if we had access to autoTLSkey. Before #1286, I did experiment with trying to fix the advanced API. I made it first check PyGILState_GetThisThreadState(). The problem is that this breaks the thread disassociation feature of the advanced API: since we don't have access to autoTLSkey, we can't cut the relationship between the thread and the thread state the way the advanced API does. At that point, I decided not to try to fix the advanced API and to instead just give the user the option to switch between the advanced API and PyGILState_*, because our team doesn't need any of the features of the advanced API, and we would prefer to just use PyGILState_*. So that's where we're at now.

If we can find a way to get the advanced API to use whatever thread state was created with autoTLSkey and also be able to cut that relationship to preserve the thread disassociation feature, we can put this whole thing to bed. I currently don't have any ideas on how to do that other than hard-coding a value of 0 for the predicted value of autoTLSkey.

KyleFromKitware commented 6 years ago

Any more ideas on what we can do here?

scdub commented 1 year ago

I think the work on PYBIND11_SIMPLE_GIL_MANAGEMENT and its eventual movement to become the default gil_scoped classes supercedes this work. For our use cases, setting that value successfully eliminates the crashes caused by the conflicts with the Python C API GIL calls.

rwgk commented 1 year ago

I think the work on PYBIND11_SIMPLE_GIL_MANAGEMENT and its eventual movement to become the default gil_scoped classes supercedes this work.

I couldn't find evidence that PYBIND11_SIMPLE_GIL_MANAGEMENT actually resolves any problem, even though I tried very hard (the massively expanded testing under #4216). I've paid attention to DEADLOCK flakes for a few months since, I've seen them with both PYBIND11_SIMPLE_GIL_MANAGEMENT defined or undefined, macOS only: #4373

For our use cases, setting that value successfully eliminates the crashes caused by the conflicts with the Python C API GIL calls.

That's interesting. Do you have a reliable reproducer?

Every time I took a closer look before, such observations turned out to be chance.

scdub commented 1 year ago

Unfortunately I don't have an isolated test case to provide for this. We've observed crashes in a large application with an embedded Python interpreter, which already manages GIL state using the same Python C API primitives as PYBIND11_SIMPLE_GIL_MANAGEMENT. When we include third party dependencies which include pybind11 using the more complex gil_scoped classes, we've seen crashes. The change hasn't been installed long enough to be thoroughly tested, but I will update this with any further findings once I have more data.