pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/
Other
15.39k stars 2.07k forks source link

fix: a long-standing bug in the handling of Python multiple inheritance #4762

Closed rwgk closed 10 months ago

rwgk commented 1 year ago

Description

The equivalent of this PR was merged as https://github.com/google/pywrapcc/pull/30056 on September 12, 2023.

COPY of the google/pywrapcc#30056 PR description (for easy reference):

The core of the added test is this:

struct CppBase {};
struct CppDrvd : CppBase {};
class PC(m.CppBase):
    pass

class PPCC(PC, m.CppDrvd):
    pass

Without this PR, the runtime behavior depends on the order in which PC or PPCC are first instantiated, which can be highly surprising and potentially very difficult to debug. See https://github.com/pybind/pybind11/pull/4762#issuecomment-1654558387 for a full analysis.

However, use cases appear to be surprisingly rare (https://github.com/pybind/pybind11/pull/4762#issuecomment-1671731319). The only known use cases are through PyCLIF.

More-or-less as a side-effect, this PR also:


Original PR description:

The core of the test is this:

struct CppBase {};
struct CppDrvd : CppBase {};
class PC(m.CppBase):
    pass

class PPCC(PC, m.CppDrvd):
    pass

pybind11 forces adding an __init__ override (which happens to not disturb PyCLIF or Boost.Python).

A wrapped PPPC instance evidently has two independent C++ objects (CppBase and CppDerived) with pybind11, while it has only one with PyCLIF and Boost.Python (CppDerived).

The nanobind metaclass refuses to create the PPCC subtype.

Suggested changelog entry:

A long-standing bug in the handling of Python multiple inheritance was fixed. See PR #4762 for the rather complex details.
rwgk commented 1 year ago

Logging a very surprising observation:

-    assert d.get_base_value() == 11
+    assert d.get_base_value() == 12
     assert d.get_base_value_from_drvd() == 12
     d.reset_base_value(20)
     assert d.get_base_value() == 20
-    assert d.get_base_value_from_drvd() == 12
+    assert d.get_base_value_from_drvd() == 20
     d.reset_base_value_from_drvd(30)
-    assert d.get_base_value() == 20
+    assert d.get_base_value() == 30

pytest -n 2:

( cd /usr/local/google/home/rwgk/forked/pybind11/tests && PYTHONPATH=/usr/local/google/home/rwgk/forked/build_gcc/lib /usr/bin/python3 -m pytest -n 2 test_python_multiple_inheritance.py )

=========================================================== test session starts ============================================================
platform linux -- Python 3.11.4, pytest-7.2.1, pluggy-1.0.0+repack
C++ Info: 12.2.0 C++17 __pybind11_internals_v4_gcc_libstdcpp_cxxabi1017__ PYBIND11_SIMPLE_GIL_MANAGEMENT=False
rootdir: /usr/local/google/home/rwgk/forked/pybind11/tests, configfile: pytest.ini
plugins: xdist-3.3.1
2 workers [2 items]
..                                                                                                                                   [100%]
============================================================ 2 passed in 0.32s =============================================================

pytest -n 1:

( cd /usr/local/google/home/rwgk/forked/pybind11/tests && PYTHONPATH=/usr/local/google/home/rwgk/forked/build_gcc/lib /usr/bin/python3 -m pytest -n 1 test_python_multiple_inheritance.py )

=========================================================== test session starts ============================================================
platform linux -- Python 3.11.4, pytest-7.2.1, pluggy-1.0.0+repack
C++ Info: 12.2.0 C++17 __pybind11_internals_v4_gcc_libstdcpp_cxxabi1017__ PYBIND11_SIMPLE_GIL_MANAGEMENT=False
rootdir: /usr/local/google/home/rwgk/forked/pybind11/tests, configfile: pytest.ini
plugins: xdist-3.3.1
1 worker [2 items]
.F                                                                                                                                   [100%]
================================================================= FAILURES =================================================================
______________________________________________________________ test_PPCCInit _______________________________________________________________
[gw0] linux -- Python 3.11.4 /usr/bin/python3

    def test_PPCCInit():
        d = PPCCInit(11)
        assert d.get_drvd_value() == 36
        d.reset_drvd_value(55)
        assert d.get_drvd_value() == 55

        # CppBase is initialized and used when CppBase methods are called, but
        # CppDrvd is used when CppDrvd methods are called.
>       assert d.get_base_value() == 12
E       assert 11 == 12
E        +  where 11 = <bound method PyCapsule.get_base_value of <test_python_multiple_inheritance.PPCCInit object at 0x7f3cfad79490>>()
E        +    where <bound method PyCapsule.get_base_value of <test_python_multiple_inheritance.PPCCInit object at 0x7f3cfad79490>> = <test_python_multiple_inheritance.PPCCInit object at 0x7f3cfad79490>.get_base_value

d          = <test_python_multiple_inheritance.PPCCInit object at 0x7f3cfad79490>

test_python_multiple_inheritance.py:32: AssertionError
========================================================= short test summary info ==========================================================
FAILED test_python_multiple_inheritance.py::test_PPCCInit - assert 11 == 12
======================================================= 1 failed, 1 passed in 0.34s ========================================================

ERROR: completed_process.returncode=1
rwgk commented 1 year ago

Follow-on to the previous comment:

The test also passes with pytest -n 1 (or just pytest) if I move def test_PC(): after def test_PPCCInit():.

The source of the order dependence is the algorithm here:

With Python code in this order

    PC(0)
    PPCC(0)

the result of the algorithm is:

registered_types_py = {PC: {CppBase}, PPCC: {CppBase, CppDrvd}}

With Python code in this order

    PPCC(0)
    PC(0)

the result of the algorithm is:

registered_types_py = {PPCC: {CppDrvd, CppBase}, PC: {CppBase}}

For completeness, code to reproduce: d37b5409d4e5b835620ccbb321a4e1ba89af315c

Note that the type of registered_types_py is std::unordered_map<PyTypeObject *, std::vector<type_info *>>:

I.e. the order of the keys is undetermined, which is totally fine, it doesn't matter.

However, the order of the bases for a given key is completely deterministic in both cases above.

This is a subtle but very serious flaw:

Assume some complex user code currently works as expected. It happens to call PPCC(i) first, PC(j) second. Then one day someone makes a harmless looking addition somewhere so that the order becomes PC(k), PPCC(i), PC(j). All the sudden the existing code behaves differently. This will be extremely difficult to troubleshoot.

rwgk commented 1 year ago

Tracking result of a simple experiment with the diff below.

Google global testing did not find any use cases involving manually written pybind11 extensions tripping over the added exceptions.

Internal test ID: OCL:554689491:BASE:555172881:1691595490960:83406d90

Note that this testing did NOT use PyCLIF-pybind11 (auto-generated pybind11 extensions), which brings in at least one use case, which led to this PR. The number of auto-generated pybind11 extensions is roughly 5x the number of manually written extensions.

--- a/include/pybind11/detail/type_caster_base.h
+++ b/include/pybind11/detail/type_caster_base.h
@@ -109,14 +109,22 @@ inline void all_type_info_add_base_most_
         type_info *existing_base = *it;
         if (PyType_IsSubtype(addl_base->type, existing_base->type) != 0) {
             bases.insert(it, addl_base);
+            throw std::runtime_error("all_type_info_add_base_most_derived_first:BANDAID");
             return;
         }
     }
     bases.push_back(addl_base);
+    for (std::size_t i = 0; i < bases.size(); i++) {
+        for (std::size_t j = i + 1; j < bases.size(); j++) {
+            if (PyType_IsSubtype(bases[i]->type, bases[j]->type) != 0) {
+                throw std::runtime_error("all_type_info_add_base_most_derived_first:LUCKY");
+            }
+        }
+    }
 }

 // Populates a just-created cache entry.
-PYBIND11_NOINLINE void all_type_info_populate(PyTypeObject *t, std::vector<type_info *> &bases) {
+PYBIND11_NOINLINE void all_type_info_populate(PyTypeObject *t, std::vector<type_info *> &bases) noexcept {
     assert(bases.empty());
     std::vector<PyTypeObject *> check;
     for (handle parent : reinterpret_borrow<tuple>(t->tp_bases)) {
rwgk commented 11 months ago

Hi @jagerman, is there a chance that you could help review this PR? — @wjakob said this is code that you rewrote at some point.

rainwoodman commented 10 months ago

Interesting diamand case.

struct CppBase {int base;};
struct CppDrvd1 : CppBase {};
struct CppDrvd2 : CppBase {};

class PCB(m.CppBase):
    pass

class PC1(m.CppDrvd1):
    pass
class PC2(m.CppDrvd2):
    pass

class PCD(PC1, PC2):   # consider banning this.
   pass

PC2().set_base

py::class<CppBase>().method('set_base')

py::class<CppDrvd1, CppBase>().method('get_base')

py::class<CppDrvd2, CppBase>().method('get_base')
rwgk commented 10 months ago

@rainwoodman FYI

After our meeting reviewing this PR I decided it's worth the effort catching all ill-behaved situations (not just fix the particular situation I stumbled over).

0a9599f775bfd3ca196c5e23a3fcf2890cbf6e82 captures some of them, but not all.

To get all, we need to bring in the C++ bases as well. I'll try to work on that asap.

Intermediate results: I ran Google global testing with 0a9599f775bfd3ca196c5e23a3fcf2890cbf6e82:

  1. just changing pybind11: the "diverging" error message did not appear anywhere.

  2. PyCLIF-pybind11: again, he "diverging" error message did not appear anywhere.

So plugging this hole could turn out to be relatively high effort for very little practical gain, but we can only be sure after putting in the effort. :-/


For my own reference:

Global testing just changing pybind11: OCL:579584090:BASE:579672844:1699221331111:c625089a

Global testing changing PyCLIF-pybind11: OCL:579673181:BASE:579726990:1699249978254:6ab60a42

rwgk commented 10 months ago

Regarding 5f5fd6a68e3cff1726628f6dea8e1c0754636a23:

This bug goes pretty deep: registered_types_py does not enumerate all C++ bases for pybind11-wrapped types. Therefore I had to go into type_caster_generic::load_impl<> to detect divergent bases.

The current state of this PR is definitely not meant to be for merging. It's only meant to enable global testing, to see if there are hidden problems in the wild. For that I have to move this work to the pywrapcc branch/repo, because I have to apply a similar patch in smart_holder_type_casters.h.

rwgk commented 10 months ago

I have to apply a similar patch in smart_holder_type_casters.h.

That's under https://github.com/google/pywrapcc/pull/30077 now.

Unit testing with all sanitizers in the Google toolchain passes.

Global testing initiated. Results will probably only be available very late today.

Note: the divergence check in type_caster_generic::load_impl<> only catches problems when a Python object is passed for a C++ base type in a function call. It would be much better to catch all divergence issues in all_type_info_populate(), but currently I don't see a way to achieve that easily.

rwgk commented 10 months ago

Global testing initiated. Results will probably only be available very late today.

No failures.

It would be much better to catch all divergence issues in all_type_info_populate(), but currently I don't see a way to achieve that easily.

That's the only situation we're still not sure about.

rwgk commented 10 months ago

Conclusion from a text chat with @rainwoodman:

I will revert the last 3 commits here (not counting merge commits), to get back to the state reviewed in a meeting with @rainwoodman last week (Oct 31), then merge.

But also: Create a new PR bringing back the 3 commits (0a9599f775bfd3ca196c5e23a3fcf2890cbf6e82, 5f5fd6a68e3cff1726628f6dea8e1c0754636a23, df27188dc6d6cf333145755543f56b2f6657aa5e), and do follow-on work separately, later.

Rationale: I think it's fair to defer work on catching "diverging" (best term I could think of) inheritance situations because I could not find any problems in the wild (Google global testing). The new PR will serve as documentation and reminder.

rainwoodman commented 10 months ago

I agree with your assessment: we can revisit the divergence case when there is evidence that someone actually wants to use it.

rwgk commented 10 months ago

The follow-on PR is #4928.