pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/

fix: Different MSVC versions may be ABI incompatible, guard with _MSC_VER (fix #2898) #4779

Closed pwuertz closed 11 months ago

pwuertz commented 11 months ago

Description

By default, pybind11 currently shares C++ data structures across DLLs built with different MSVC versions, even though there is no documented guarantee of ABI compatibility between those versions. This has caused difficult-to-track crashes on several occasions for quite some time now (#1562, #1618, #2898, onnx/onnx#3493, pytorch/pytorch#62090, pytorch/pytorch#84457, NVIDIA/TensorRT#2853, IntelRealSense/librealsense#8570).

This PR aims to mitigate the risk of an ABI mismatch in pybind11-module interaction by guarding against different MSVC compiler/lib/toolchain versions, without causing significant changes for pybind11 (see also discussion in #2898).

Suggested changelog entry:

* Guard against crashes/corruptions caused by modules built with different MSVC versions
rwgk commented 11 months ago

@wjakob I think the same fix needs to be applied here: https://github.com/wjakob/nanobind/blob/1f65061db485dd8428a62842eb1dd0201db2dbd9/src/nb_internals.cpp#L58-L63

EthanSteinberg commented 11 months ago

@rwgk Won't this break ABI between already compiled extensions and new extensions compiled under this change?

rwgk commented 11 months ago

@rwgk Won't this break ABI between already compiled extensions and new extensions compiled under this change?

Yes, but I believe it's the lesser of two evils.

Based on @pwuertz's input, I have come to believe the current situation is very bad. Peter listed 8 bugs. I assume those are only a tiny fraction of the cases where the ABI incompatibility caused crashes or confusion. The sooner we fix it, the better. I think it's unavoidable that the next release might initially break some workflows, but once the change is baked in, everybody will be better off.

rwgk commented 11 months ago

Oh! We probably should up the internals version for Windows at the same time.

pwuertz commented 11 months ago

Won't this break ABI between already compiled extensions and new extensions compiled under this change?

No, this PR doesn't change the ABI of pybind11 modules at all. It just changes the namespace in which modules communicate.

Or is your concern that any two independently compiled modules are now ending up in different namespaces and thus unexpectedly won't be sharing struct internals? Note that pybind11 never gave you a guarantee for this in the first place. There is always a chance for two independently authored, compiled and distributed modules to unknowingly or accidentally end up in different namespaces due to a change in someone's build infrastructure.

Pybind11 currently doesn't provide a way of declaring and asserting that, for example, some_module_extra is critically dependent on some_module's internals. It shouldn't be that difficult to implement, though. some_module_extra could use something like a PYBIND11_DEPENDS_ON("some_module") macro, which causes some_module_extra to import some_module when loaded and check for matching PYBIND11_INTERNALS_IDs, or raise otherwise.

We probably should up the internals version for Windows at the same time.

If I understood its intent correctly, the PYBIND11_INTERNALS_VERSION section of the PYBIND11_INTERNALS_ID is a guarantee for the struct internals {} definition being the same for two modules. We didn't touch struct internals {} in this PR, so there should be no need for changing PYBIND11_INTERNALS_VERSION?

rwgk commented 11 months ago

To answer the simple question first:

If I understood its intent correctly, the PYBIND11_INTERNALS_VERSION section of the PYBIND11_INTERNALS_ID is a guarantee for the struct internals {} definition being the same for two modules. We didn't touch struct internals {} in this PR, so there should be no need for changing PYBIND11_INTERNALS_VERSION?

That's correct, but it's kind of the other way around:

More later.

rwgk commented 11 months ago

Or is your concern that any two independently compiled modules are now ending up in different namespaces and thus unexpectedly won't be sharing struct internals?

Yes. That's what I had in mind with "lesser of two evils." Currently we have a many-to-many mapping (MSVC version X to Y). A subset of those pairings are "good"; the rest are "bad". I don't think we have a reasonable way to tell them apart. The only choice we have is how quickly we transition from many-to-many to one-to-one.

Everything below is beyond the scope of this PR for sure.

Pybind11 currently doesn't provide a way of declaring and asserting that for example a some_module_extra is critically dependent on some_module's internals.

Hm ... it doesn't tell you what you need, only that you need something. In the case of stl.h conversions it tries to be more helpful:

E       TypeError: missing_header_arg(): incompatible function arguments. The following argument types are supported:
E           1. (arg0: std::vector<float, std::allocator<float> >) -> None
E
E       Invoked with: [1.0, 2.0, 3.0]
E
E       Did you forget to `#include <pybind11/stl.h>`? Or <pybind11/complex.h>,
E       <pybind11/functional.h>, <pybind11/chrono.h>, etc. Some automatic
E       conversions are optional and require extra headers to be included
E       when compiling your pybind11 module.

We could probably go a long way by pointing to a new doc section in that message, explaining the common situations. That would have no compile-time or runtime overhead.

Another approach might be to keep track of what types have registered conversions anywhere in the process, e.g. based on a simple python dict with key typeid(T) for all class_-wrapped types, value a Python list of all modules with converters. When the error above is generated, we know we don't have the conversions in that extension, but if we have a match in the dict we could append that information to the error message. That has compile-time and runtime overhead, but I think it would be worth it.

Since we're hitting on PYBIND11_INTERNALS_VERSION here, I want to add another thought: I believe we could do much better by using Python capsules to pass pointers between any pair of binding systems, not only different pybind11 versions, but also nanobind or even SWIG. The idea is super simple: we need a standardized key for the C++ ABI, and a raw pointer, std::shared_ptr, or std::unique_ptr. If the ABI key matches we know we can pass them around, no matter how they are wrapped. (We've been using capsules like that for years for pybind11, SWIG, PyCLIF interop, although we don't require ABI keys because we build everything from sources Google-internally.) I think unshackling us from the PYBIND11_INTERNALS_VERSION will open up a lot of creative potential, without dragging the community into excessive copy-pasting of Python bindings.

rwgk commented 11 months ago

Thanks @henryiii and @wjakob for supporting/adopting this change. I'll merge this here, then make a separate PR to bump the PYBIND11_INTERNALS_VERSION for MSVC (see https://github.com/pybind/pybind11/pull/4779#issuecomment-1666572021).

isuruf commented 7 months ago

This is wrong. I looked at most of the issues mentioned, and they are about mixing MSVC modules built with /MT with modules built with /MD. Please add a guard checking the _MT macro. See https://learn.microsoft.com/en-us/cpp/build/reference/md-mt-ld-use-run-time-library?view=msvc-170

rwgk commented 7 months ago

This is wrong.

Well, sorry. What we need is constructive feedback, concretely a PR showing how to do it right. pybind11 is a community project.

Also, maybe you meant: this isn't complete enough yet?

rwgk commented 7 months ago

This is wrong.

Well, sorry. What we need is constructive feedback, concretely a PR showing how to do it right. pybind11 is a community project.

Also, maybe you meant: this isn't complete enough yet?

Oh you did: #4953

Looking.