pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/
Other
15.39k stars 2.07k forks source link

[BUG]: 2.11.0 regression on PyPy3: `pybind11::handle::dec_ref() is being called while the GIL is either not held or invalid.` while running tests #4748

Closed mgorny closed 1 year ago

mgorny commented 1 year ago

Required prerequisites

What version (or hash if on master) of pybind11 are you using?

2.11.0

Problem description

At the very end of pybind11's test suite I'm getting the following exception:

pybind11::handle::dec_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
The failing pybind11::handle::dec_ref() call was triggered on a type object.
terminate called after throwing an instance of 'std::runtime_error'
  what():  pybind11::handle::dec_ref() PyGILState_Check() failure.

This is with PyPy3.10 7.3.12. In pybind11 2.10.4, the test suite passes without any exceptions.

Full build/test log: dev-python:pybind11-2.11.0:20230715-024418.log

I have to leave now,, I can investigate further later.

Reproducible example code

mkdir build
cd build
pypy3 -m venv .venv
. .venv/bin/activate
pip install -r ../tests/requirements.txt
cmake -DPYBIND11_LTO_CXX_FLAGS= -DPYBIND11_TEST=ON -DCMAKE_BUILD_TYPE=Debug -G Ninja ..
ninja
ninja check

Is this a regression? Put the last known working version here if it is.

2.10.4

mgorny commented 1 year ago

bisect blames the following commit:

commit 65374c8e62488c557dccf70c9276e3f9fe04b007 Author: Ralf W. Grosse-Kunstleve rwgk@google.com Date: Thu Dec 8 22:06:51 2022 -0800

`pybind11::handle` `inc_ref()` & `dec_ref()` `PyGILState_Check()` **excluding** `nullptr` (#4246)

* pybind11/pytypes.h `inc_ref()`, `dec_ref()` `PyGILState_Check()` **excluding** `nullptr`

Guarded by `PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF`

* Disable `PYBIND11_ASSERT_GIL_HELD_INCREF_DECREF` for PyPy under Windows.

* Add reference to PR #4268 (PyPy Windows)

include/pybind11/detail/common.h | 9 +++++++++ include/pybind11/pytypes.h | 10 ++++++++++ 2 files changed, 19 insertions(+)

mgorny commented 1 year ago

FWICS you don't have to actually run any tests to trigger it — it's sufficient to import pybind11_tests extension:

$ PYTHONPATH=tests pypy3 -c 'import pybind11_tests'
pybind11::handle::dec_ref() is being called while the GIL is either not held or invalid. Please see https://pybind11.readthedocs.io/en/stable/advanced/misc.html#common-sources-of-global-interpreter-lock-errors for debugging advice.
The failing pybind11::handle::dec_ref() call was triggered on a type object.
terminate called after throwing an instance of 'std::runtime_error'
  what():  pybind11::handle::dec_ref() PyGILState_Check() failure.
Aborted (core dumped)

Apparently the problem occurs when destructors are called while the interpreter is exiting:

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f4c84a8f00f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007f4c84a3fef2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f4c84a294ad in __GI_abort () at abort.c:79
#4  0x00007f4c828a0c53 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/13/libstdc++.so.6
#5  0x00007f4c828b3306 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/13/libstdc++.so.6
#6  0x00007f4c828b2289 in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/13/libstdc++.so.6
#7  0x00007f4c828b2a42 in __gxx_personality_v0 () from /usr/lib/gcc/x86_64-pc-linux-gnu/13/libstdc++.so.6
#8  0x00007f4c84c04b6b in ?? () from /usr/lib/gcc/x86_64-pc-linux-gnu/13/libgcc_s.so.1
#9  0x00007f4c84c05464 in _Unwind_Resume () from /usr/lib/gcc/x86_64-pc-linux-gnu/13/libgcc_s.so.1
#10 0x00007f4c82d4b913 in std::allocator<char>::~allocator (this=<optimized out>, __in_chrg=<optimized out>)
    at /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/allocator.h:184
#11 pybind11::handle::dec_ref() const & (this=0x7f4c83afce88 <pybind11::detail::get_exception_object<MyException5_1>()::ex>)
    at /tmp/pybind11/include/pybind11/detail/../detail/../pytypes.h:270
#12 0x00007f4c82d4bb6e in pybind11::object::~object (
    this=0x7f4c83afce88 <pybind11::detail::get_exception_object<MyException5_1>()::ex>, __in_chrg=<optimized out>)
    at /tmp/pybind11/include/pybind11/detail/../detail/../pytypes.h:356
#13 0x00007f4c830127c4 in pybind11::exception<MyException5_1>::~exception (
    this=0x7f4c83afce88 <pybind11::detail::get_exception_object<MyException5_1>()::ex>, __in_chrg=<optimized out>)
    at /tmp/pybind11/include/pybind11/pybind11.h:2536
#14 0x00007f4c84a425a5 in __run_exit_handlers (status=0, listp=0x7f4c84bdc840 <__exit_funcs>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#15 0x00007f4c84a426fa in __GI_exit (status=<optimized out>) at exit.c:141
#16 0x00007f4c84a2a991 in __libc_start_call_main (main=main@entry=0x55dc073144d0 <main>, argc=argc@entry=3, 
    argv=argv@entry=0x7ffe2ee89aa8) at ../sysdeps/nptl/libc_start_call_main.h:74
#17 0x00007f4c84a2aa45 in __libc_start_main_impl (main=0x55dc073144d0 <main>, argc=3, argv=0x7ffe2ee89aa8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe2ee89a98) at ../csu/libc-start.c:360
#18 0x000055dc07314501 in _start ()
mgorny commented 1 year ago

I have also reproduced this with PyPy3.9 (7.3.12 also).

FindDefinition commented 1 year ago

This bug also exists in cpython 3.8 with pybind 2.11.1, with following bind functions:

void BoxOps::transform_pc(pybind11::array_t<float> out_points, pybind11::array_t<float> points, pybind11::array_t<float> mat)   {

  using DType = float;
  auto N = points.shape(0);
  auto out_points_data = out_points.mutable_data();
  auto points_data = points.mutable_data();
  auto stride = points.shape(1);
  DType* out_ptr, *ptr;
  DType x, y, z;
  DType m00 = mat.at(0, 0);
  DType m01 = mat.at(0, 1);
  DType m02 = mat.at(0, 2);
  DType m10 = mat.at(1, 0);
  DType m11 = mat.at(1, 1);
  DType m12 = mat.at(1, 2);
  DType m20 = mat.at(2, 0);
  DType m21 = mat.at(2, 1);
  DType m22 = mat.at(2, 2);
  if (mat.shape(0) == 3 && mat.shape(1) == 3){
      for (size_t i = 0; i < N; ++i){
          out_ptr = out_points_data + i * stride;
          ptr = points_data + i * stride;
          x = ptr[0];
          y = ptr[1];
          z = ptr[2];
          out_ptr[0] = m00 * x + m01 * y + m02 * z;
          out_ptr[1] = m10 * x + m11 * y + m12 * z;
          out_ptr[2] = m20 * x + m21 * y + m22 * z;
      }
  }else{
      DType m03 = mat.at(0, 3);
      DType m13 = mat.at(1, 3);
      DType m23 = mat.at(2, 3);
      for (size_t i = 0; i < N; ++i){
          out_ptr = out_points_data + i * stride;
          ptr = points_data + i * stride;
          x = ptr[0];
          y = ptr[1];
          z = ptr[2];
          out_ptr[0] = m00 * x + m01 * y + m02 * z + m03;
          out_ptr[1] = m10 * x + m11 * y + m12 * z + m13;
          out_ptr[2] = m20 * x + m21 * y + m22 * z + m23;
      }
  }
}
  module_BoxOps.def_static("transform_pc", &csrc::boxops::BoxOps::transform_pc, pybind11::arg("out_points"), pybind11::arg("points"), pybind11::arg("mat"), pybind11::return_value_policy::automatic, pybind11::call_guard<pybind11::gil_scoped_release>());

@rwgk Is this a expected behavior? We can't release GIL with all functions include array_t argument?

rwgk commented 1 year ago

Is this a expected behavior?

Yes, almost certainly.

We can't release GIL with all functions include array_t argument?

Not like your doing it right now (assuming you don't want undefined behavior). You have to be more careful: you have to ensure you're holding the GIL again before the destructors run. This is a classical trap that is very easy to fall into (I had a couple myself before I added the guards). The fix is probably subtle. Maybe (!), don't release the GIL with pybind11::call_guard<>, but divide the body of your transform_pc() function into a first part that constructs the temporary py::object (of some sort), then a subscope (simply { ... }) with py::gil_scoped_release release; at the top. That way, when the subscope goes of of scope, the py::object destructors don't run before the GIL is acquired again.

Inside the subscope you cannot have any code calling back into the Python C API. The simple guards we have just cover the most common oversights.