zachariahreed opened this issue 3 years ago
@zachariahreed, that's interesting. Do you have concrete code that manages to show this, or how did you run into this?
I was about to shout "But of course you should hold the GIL", but then you already thought of that. However, by the time the interpreter stops running and hands back control to pybind11, I'd expect the GIL to be held again? Do you mind explaining slightly more in depth how the GIL can be released without pybind11 knowing about it?
The GIL is held when you get control back. The problem comes when another thread with pybind11-wrapped code gets scheduled during the time when you don't have control. Then you end up in a situation like:
```
Thread1         Thread2
--------------  --------------
Push            ...
Cast            ...
ByteCode        ...
Yield           ...
...             Push
...             ByteCode
...             Yield
Pop -> Boom     ...
```
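The failure mode above can be modeled in a few lines of Python (a deliberately simplified analogue of the shared stack, not pybind11 code):

```python
# Toy model of the shared loader_patient_stack: frames pushed by
# different threads interleave on one global list, so a thread can
# pop a frame it never pushed. The interleaving from the diagram is
# replayed here sequentially to make it deterministic.
stack = []

# Thread 1 enters a bound function and pushes its frame marker.
stack.append("frame-T1")
# While Thread 1's bytecode runs (or the GIL is released), Thread 2
# gets scheduled and pushes its own frame.
stack.append("frame-T2")
# Thread 1 resumes and pops the top of the stack -- but that is
# Thread 2's frame, so Thread 1 destroys Thread 2's temporaries.
popped = stack.pop()
print(popped)  # frame-T2 -- the "Boom" in the diagram
```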
The specific code that triggered this was use of `pybind11::module::import` inside of a custom caster. Much of the import machinery is written in pure Python these days, and it's complex enough that it can certainly run longer than the multiplexing period.
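This is easy to confirm from the Python side: the import system executes Python bytecode, and CPython may offer the GIL to another thread after each switch interval (a quick check, nothing pybind11-specific):

```python
import sys

# Much of the import system (importlib) is written in Python, so a
# py::module::import call from C++ ends up executing bytecode. After
# each "switch interval" of bytecode execution, the interpreter may
# hand the GIL to another thread. The CPython default is 5 ms:
print(sys.getswitchinterval())
```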
Argh, right, yes. So rather than `pop`ping, we should remember which pointer was added?
I don't think that's sufficient. During the time between when thread 1 gets control back and before the pop, if additional temporaries are created they'll end up in thread 2's "arena". The flow of control is not linear, so I don't think there's any way a `vector`-based strategy can work. I looked very briefly at moving this into thread-local storage, but I was a bit concerned about performance, as this seems to be a hot path.
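For intuition, the thread-local idea can be sketched in Python, with `threading.local` standing in for C++ `thread_local` (an analogy only, not the proposed pybind11 change):

```python
import threading

tls = threading.local()  # stand-in for a C++ thread_local variable

def loader_patient_stack():
    # Each thread lazily gets its own private stack, so interleaved
    # scheduling can no longer mix frames between threads.
    if not hasattr(tls, "stack"):
        tls.stack = []
    return tls.stack

results = {}

def worker(name):
    stack = loader_patient_stack()
    stack.append(f"frame-{name}")
    results[name] = stack.pop()  # always this thread's own frame

threads = [threading.Thread(target=worker, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # each thread popped exactly the frame it pushed
```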
You're probably right... To the best of your knowledge right now (not asking you to dive into pybind11), is it just `loader_patient_stack` (which could get a reasonably quick fix, if we have a way of trying this out), or should we do a full investigation of what else can go wrong?
I don't fully understand the details of the casting implementation / when temporaries are created. But, as far as I can tell, this seems to be the only piece of global / singleton data mutated. I think it's probably just `loader_patient_stack`.
So, does this mean I cannot use OpenMP or Intel TBB?
I believe I am experiencing the same issue, but via a slightly different mechanism. In my case, I am observing this issue via temporaries stored in the `loader_patient_stack` via the `argument_loader`. Specifically, it affects types for which `py::implicitly_convertible` is used, and where arguments of those types are passed by reference to the bound function. Argument objects for which an implicit conversion is performed are stored in this stack to keep them alive.

If calling such a function concurrently from multiple threads, where that function releases the GIL, I am intermittently observing the argument reference becoming "invalidated", i.e. showing garbage at some point after the GIL is released. The way I was able to observe this was by copying the argument into a local variable prior to releasing the GIL, then comparing the copy to the argument after the GIL release. I'm able to reproduce this frequently with two threads calling the same function concurrently within a loop. Capturing and inspecting items from the `loader_patient_stack` before and after the GIL release shows that the current thread is basically observing the pushing/popping of the stack by other threads. The upshot is that if we have an argument passed by reference that is backed by an item in the `loader_patient_stack`, that memory can be wiped out by another thread while the GIL is released, rendering the passed-by-reference argument invalid.

So, in summary, it appears not to be thread-safe to pass arguments by reference where those arguments have implicit conversions specified, due to the aforementioned thread-safety issues of `loader_patient_stack`.
OK, so here is a small example that reproduces the above issue. I used the `cmake_example` project as a starting point, hence the naming.

pybind11 module:
```cpp
#include <pybind11/pybind11.h>

#include <chrono>
#include <stdexcept>
#include <thread>

namespace py = pybind11;
using namespace std::chrono_literals;

struct Bar
{
    int value;
};

void foo(const Bar& bar)
{
    const auto copy = bar;
    {
        py::gil_scoped_release release;
        // Simulate long running task.
        std::this_thread::sleep_for(100ms);
    }
    if (bar.value != copy.value)
    {
        // std::runtime_error (std::exception has no string constructor in
        // standard C++); pybind11 maps this to Python's RuntimeError.
        throw std::runtime_error("Reference changed!!");
    }
}

PYBIND11_MODULE(cmake_example, m)
{
    py::class_<Bar>(m, "Bar")
        .def(py::init([](const int i) { return Bar{i}; }));
    py::implicitly_convertible<int, Bar>();

    m.def("foo", &foo);
}
```
Python test:

```python
import cmake_example as c
import pytest
from concurrent.futures.thread import ThreadPoolExecutor


@pytest.mark.parametrize('repeat', range(10))
def test(repeat):
    thread_count = 2
    call_count = 10
    values = [i for i in range(call_count)]
    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        results = [executor.submit(c.foo, i) for i in values]
        # Materialise the results - an exception will be thrown from 'foo'
        # when the reference is invalid.
        results = list(r.result() for r in results)
```
Example results:

```
========================================================================= short test summary info =========================================================================
FAILED main.py::test[1] - RuntimeError: Reference changed!!
FAILED main.py::test[3] - RuntimeError: Reference changed!!
FAILED main.py::test[4] - RuntimeError: Reference changed!!
FAILED main.py::test[5] - RuntimeError: Reference changed!!
FAILED main.py::test[7] - RuntimeError: Reference changed!!
FAILED main.py::test[8] - RuntimeError: Reference changed!!
======================================================================= 6 failed, 4 passed in 5.27s =======================================================================
```
> I looked very briefly at moving this into thread-local storage, but I was a bit concerned about performance as this seems to be a hot path.

I can confirm that using thread-local storage (TLS) is indeed a solution that will solve this problem. @zachariahreed, what specifically around performance were you concerned about? As I understand it, there is some concern around TLS overhead when used in a dynamic module, but I'm wondering whether this is anything to be particularly concerned about here. Some benchmarks (see https://testbit.eu/2015/thread-local-storage-benchmark) suggest very minimal overhead on modern platforms. Obviously it would be prudent to run benchmarks for this particular case, but I'm curious what others' thoughts may be on this.
From my brief look into this, using a `thread_local` variable to replace the `loader_patient_stack` in the "internals" would probably involve the least amount of change, e.g. something like the following in the `loader_life_support` class:

```cpp
static std::vector<PyObject*>& loader_patient_stack()
{
    thread_local std::vector<PyObject*> stack;
    return stack;
}
```
https://github.com/pybind/pybind11/compare/v2.7.1...daniel-smith:bugfix/loader_life_support
Another approach might be to have an instance of a `PyObject*` list per call, wrapped in some kind of "context" that gets passed through with each function call. That would avoid the use of TLS, but it would also involve an arguably unfeasible amount of refactoring, as the instance would also need to make its way through to the type casters. Not sure if anyone else has any opinions on either of these approaches, or any alternative approaches. I personally feel that TLS seems appropriate, but it would be good to see what others think.
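A minimal sketch of this per-call context idea (all names here -- `CallContext`, `convert`, `dispatch` -- are hypothetical, invented for illustration; this is not pybind11 code):

```python
class CallContext:
    """Hypothetical per-call context: owns the patients for one call."""
    def __init__(self):
        self.patients = []

def convert(arg, ctx):
    # Stand-in for a type caster: an "implicit conversion" creates a
    # temporary that must outlive the call, so it is parked on the
    # per-call context instead of a global stack.
    temporary = str(arg)
    ctx.patients.append(temporary)
    return temporary

def dispatch(func, *args):
    ctx = CallContext()  # one context per call -- no shared state at all
    converted = [convert(a, ctx) for a in args]
    try:
        return func(*converted)
    finally:
        ctx.patients.clear()  # temporaries die when the call returns

print(dispatch(lambda a, b: a + b, 1, 2))  # prints 12 (string concatenation)
```

The design appeal is that correctness no longer depends on threads at all; the cost, as noted above, is plumbing the context through every caster.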
@daniel-smith Would welcome a PR on this front (with relevant tests). We do use TLS for the `get_internals()` struct, so it wouldn't be out of the question. It also wouldn't impact single-threaded performance as long as it's guarded by `WITH_THREAD`, like the other TLS used for `get_internals()`.

There has also been some refactoring wanted on the patients containers, to use `unordered_map`s or such for faster access in the case of duplicates (see #1253), and this might be a good time to do both.
@Skylion007 - ok thanks, I'll look into this.
Given that the `dispatcher` function that uses `loader_life_support` looks to be quite expensive already (lots of memory allocation, Python API calls, etc.), not to mention the cost of all of the individual type casters, I would be extremely surprised if a single TLS lookup adds any noticeable overhead at all. This benchmark suggests a TLS lookup may be ~12 ns, for example:

http://david-grs.github.io/tls_performance_overhead_cost_linux/
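As a rough analogy only (Python's `threading.local` is far heavier than C++ `thread_local`, so the absolute numbers say nothing about pybind11), one can compare a global lookup against a TLS-style lookup directly:

```python
import threading
import timeit

# Both lookups are constant, small per-access costs; the point of the
# argument above is that either is dwarfed by the rest of the dispatch
# machinery. Absolute timings here are Python-level and illustrative only.
tls = threading.local()
tls.stack = []
global_stack = []

t_global = timeit.timeit(lambda: len(global_stack), number=100_000)
t_tls = timeit.timeit(lambda: len(tls.stack), number=100_000)
print(f"global: {t_global:.4f}s  tls: {t_tls:.4f}s for 100k lookups")
```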
I don't know much about threading in general, and this problem in particular, but if the current behavior is demonstrably incorrect (https://github.com/pybind/pybind11/issues/2765#issuecomment-907572642), it makes no sense at all to me to even think about relative runtime comparisons. The only meaningful comparison is between two correct implementations.
My thinking:

- Rename `internals::loader_patient_stack` to `deprecated_threading_incompatible_loader_patient_stack` (or similar). But do not delete it, to not break ABI compatibility just for this. (We can remove it later when we're breaking compatibility anyway for other reasons.)
- Add the `thread_local` function, maybe with the name `thread_local_loader_patient_stack` for clarity.

@rwgk I think that sounds like a sensible approach. Does anyone see any reason to use Python TLS here rather than C++ `thread_local`? I note that the internals struct is currently using Python TLS for `tstate`, which is why I ask.
UPDATE: Depending on the answer to this question, I'm thinking that the function can be a static method on the `loader_life_support` class that also takes `WITH_THREAD` into account, e.g.

```cpp
private:
    static std::vector<PyObject *> &loader_patient_stack() {
#if defined(WITH_THREAD)
        thread_local auto stack = std::vector<PyObject *>{};
#else
        static auto stack = std::vector<PyObject *>{};
#endif
        return stack;
    }
```
The realization of `thread_local` is complex, generates awful code (try looking at this on godbolt.org), and has suffered from poor support on some platforms. Our reliance on Python TLS so far basically punts on the issue and relies on what is surely a working implementation underlying Python. It is also highly inadvisable to use `static` variables (especially when they have a C++ constructor) within methods -- same story, look at godbolt.org to see what happens.

It is much better to move initialization of this data structure to some static/global context (e.g. pybind11's `internals` data structure), thus implicitly relying on the GIL for exclusive access, and being able to sidestep the expensive locking / TLS prelude and initialization checks.
I tried a similar thing in my workspace, and there were some cases where the Python / C++ thread mapping appeared to be incorrect. I ended up with this for now:

```diff
diff -u -r a/include/pybind11/detail/internals.h b/include/pybind11/detail/internals.h
--- a/include/pybind11/detail/internals.h
+++ b/include/pybind11/detail/internals.h
@@ -107,13 +107,15 @@
     std::unordered_map<const PyObject *, std::vector<PyObject *>> patients;
     std::forward_list<ExceptionTranslator> registered_exception_translators;
     std::unordered_map<std::string, void *> shared_data; // Custom data to be shared across extensions
-    std::vector<PyObject *> loader_patient_stack; // Used by `loader_life_support`
     std::forward_list<std::string> static_strings; // Stores the std::strings backing detail::c_str()
     PyTypeObject *static_property_type;
     PyTypeObject *default_metaclass;
     PyObject *instance_base;
-#if defined(WITH_THREAD)
+#if !defined(WITH_THREAD)
+    std::vector<PyObject *> loader_patient_stack;
+#else // defined(WITH_THREAD)
     PYBIND11_TLS_KEY_INIT(tstate);
+    PYBIND11_TLS_KEY_INIT(patient_stack_key);
     PyInterpreterState *istate = nullptr;
     ~internals() {
         // This destructor is called *after* Py_Finalize() in finalize_interpreter().
@@ -124,6 +126,7 @@
         // of those have anything to do with CPython internals.
         // PyMem_RawFree *requires* that the `tstate` be allocated with the CPython allocator.
         PYBIND11_TLS_FREE(tstate);
+        PYBIND11_TLS_FREE(patient_stack_key);
     }
 #endif
 };
@@ -301,6 +304,7 @@
 #if PY_VERSION_HEX < 0x03090000
         PyEval_InitThreads();
 #endif
+
         PyThreadState *tstate = PyThreadState_Get();
 #if PY_VERSION_HEX >= 0x03070000
         internals_ptr->tstate = PyThread_tss_alloc();
@@ -314,7 +318,27 @@
         PyThread_set_key_value(internals_ptr->tstate, tstate);
 #endif
         internals_ptr->istate = tstate->interp;
+
+        // begin local modification:
+#if PY_VERSION_HEX >= 0x03070000
+        internals_ptr->patient_stack_key = PyThread_tss_alloc();
+        if (!internals_ptr->patient_stack_key ||
+            (PyThread_tss_create(internals_ptr->patient_stack_key) != 0))
+            pybind11_fail(
+                "get_internals: could not successfully initialize the "
+                "patient_stack TLS key!");
+#else
+        internals_ptr->patient_stack_key = PyThread_create_key();
+        if (internals_ptr->patient_stack_key == -1) {
+            pybind11_fail(
+                "get_internals: could not successfully initialize the "
+                "patient_stack TLS key!");
+        }
 #endif
+        // end local modification
+
+#endif // defined(WITH_THREAD)
+
         builtins[id] = capsule(internals_pp);
         internals_ptr->registered_exception_translators.push_front(&translate_exception);
         internals_ptr->static_property_type = make_static_property_type();
@@ -342,6 +366,28 @@
     return locals;
 }

+// begin local modification:
+inline std::vector<PyObject *> &get_loader_patient_stack() {
+#if !defined(WITH_THREAD)
+    return get_internals().loader_patient_stack;
+#else
+    auto &internals = get_internals();
+    assert(internals.patient_stack_key != 0);
+    auto *patient_stack = PYBIND11_TLS_GET_VALUE(internals.patient_stack_key);
+    if (patient_stack) {
+        return *static_cast<std::vector<PyObject *> *>(patient_stack);
+    }
+    // This leaks a vector allocation for each thread; is there a cleanup mechanism?
+    auto *stack = new std::vector<PyObject *>();
+#if PY_VERSION_HEX >= 0x03070000
+    PyThread_tss_set(internals.patient_stack_key, stack);
+#else
+    PyThread_set_key_value(internals.patient_stack_key, stack);
+#endif
+    return *stack;
+#endif
+}
+// end local modification
+
 /// Constructs a std::string with the given arguments, stores it in `internals`, and returns its
 /// `c_str()`. Such strings objects have a long storage duration -- the internal strings are only
diff -u -r a/include/pybind11/detail/type_caster_base.h b/include/pybind11/detail/type_caster_base.h
--- a/include/pybind11/detail/type_caster_base.h
+++ b/include/pybind11/detail/type_caster_base.h
@@ -34,12 +34,12 @@
 public:
     /// A new patient frame is created when a function is entered
     loader_life_support() {
-        get_internals().loader_patient_stack.push_back(nullptr);
+        get_loader_patient_stack().push_back(nullptr);
     }

     /// ... and destroyed after it returns
     ~loader_life_support() {
-        auto &stack = get_internals().loader_patient_stack;
+        auto &stack = get_loader_patient_stack();
         if (stack.empty())
             pybind11_fail("loader_life_support: internal error");

@@ -55,7 +55,7 @@
     /// This can only be used inside a pybind11-bound function, either by `argument_loader`
     /// at argument preparation time or by `py::cast()` at execution time.
     PYBIND11_NOINLINE static void add_patient(handle h) {
-        auto &stack = get_internals().loader_patient_stack;
+        auto &stack = get_loader_patient_stack();
         if (stack.empty())
             throw cast_error("When called outside a bound function, py::cast() cannot "
                              "do Python -> C++ conversions which require the creation "
```
I think that the thing to do is something more along the lines of a little internally linked list:

- `loader_life_support::loader_life_support()` would allocate the link on the thread-local and save the parent pointer.
- `loader_life_support::~loader_life_support()` would deallocate the link.
- `loader_life_support::add_patient` would get the current head of the linked list and add a PyObject to the keep-alive set.
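Here is one way this linked-list idea might look, modeled in Python (an assumption-laden sketch with invented names; a real implementation would be C++ with a `thread_local` head pointer):

```python
import threading

tls = threading.local()  # stand-in for a C++ thread_local head pointer

class LoaderLifeSupport:
    """Hypothetical frame in a per-thread intrusive linked list."""
    def __init__(self):
        self.parent = getattr(tls, "head", None)  # save the previous head
        self.keep_alive = set()
        tls.head = self                            # become the new head

    def close(self):
        # Plays the role of ~loader_life_support(): pop this frame.
        tls.head = self.parent

    @staticmethod
    def add_patient(obj):
        frame = getattr(tls, "head", None)
        if frame is None:
            raise RuntimeError("add_patient called outside a bound function")
        frame.keep_alive.add(obj)  # patient lives as long as this frame

# Nested frames on one thread: patients land in the innermost frame,
# and unwinding restores the parent -- no shared vector to corrupt.
outer = LoaderLifeSupport()
LoaderLifeSupport.add_patient("outer-temp")
inner = LoaderLifeSupport()
LoaderLifeSupport.add_patient("inner-temp")
print(inner.keep_alive)   # {'inner-temp'}
inner.close()
print(tls.head is outer)  # True
outer.close()
```

Because each frame only touches its own thread's head pointer, interleaved scheduling across threads can no longer pop another thread's frame.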
> there were some cases where the python / c++ thread mapping appeared to be incorrect

@laramiel - I wonder if you might be able to expand on what you mean by this. It would be good to understand cases where a `thread_local` implementation would actually fail. I also had something very similar to what you posted above (using Python TLS), and the leaked vector allocation per thread was one of the reasons why I considered `thread_local` to be the simpler solution (i.e. rely on the compiler). Your linked-list idea sounds interesting - do you have a small example of what this might look like? Are you saying that the linked list would live in the internals struct?
@wjakob - thanks for your comments on this. If there are platforms where measurable inefficiencies have been observed, then avoiding `thread_local` makes sense. My thinking was along the same lines as @jbms (see https://github.com/pybind/pybind11/issues/2765#issuecomment-911006437), i.e. given everything else that is going on in the dispatcher, any inefficiencies are, in practice, not likely to make a real difference. I also feel that if people have something truly performance-critical, where compiler-generated code matters, then my guess is that Python interop with C++ probably wouldn't be the first choice. Also, as @rwgk mentioned, the current implementation is not thread-safe, so we have nothing to compare any thread-safe implementation against at the moment. All of that being said, I'm not a pybind11 expert and I'm happy to go with the consensus on this.
@daniel-smith See my solution in this commit:
https://github.com/laramiel/pybind11/commit/53b391900d67763efccda2990bdc492aa05022cd
> The realization of thread_local is complex, generates awful code (try looking at this on godbolt.org).

I disagree. The code is pretty concise when compared with the Python thread API initialization, etc. I think that compiler writers have some motivation to improve it as well, so in my opinion it's preferable.
In particular, use of `loader_life_support` seems problematic. This class keeps state in the global `loader_patient_stack`: https://github.com/pybind/pybind11/blob/b7dfe5cc84ca1891b50e728e83c9b5b393cf5272/include/pybind11/detail/internals.h#L105 In the case where a cast can release the GIL, two threads' pushes/pops of `loader_patient_stack` can become interleaved.

While releasing the GIL in a conversion seems like a bad idea, it can be surprisingly hard to avoid. Any cast that results in the bytecode interpreter being run (either directly or through something like a `__del__` method) can cause Python itself to release the GIL (see `sys.setswitchinterval`).
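One practical consequence: shrinking the switch interval makes the interpreter offer the GIL between bytecodes far more often, which makes races like this much easier to reproduce (a debugging aid, not a fix):

```python
import sys

# CPython yields the GIL to waiting threads at most once per "switch
# interval" (default 5 ms). Shrinking it makes thread interleavings --
# and therefore races like the loader_patient_stack one -- far more
# frequent, so intermittent failures become reliably reproducible.
old = sys.getswitchinterval()
sys.setswitchinterval(1e-6)
try:
    pass  # run the racy workload here, e.g. the ThreadPoolExecutor test above
finally:
    sys.setswitchinterval(old)  # always restore the default behaviour
```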