pybind / pybind11

Seamless operability between C++11 and Python
https://pybind11.readthedocs.io/
Other
15.66k stars 2.1k forks source link

[BUG]: finer grained ABI compatibility #3793

Open beckermr opened 2 years ago

beckermr commented 2 years ago

Required prerequisites

Problem description

Based on the extensive discussion in conda-forge plus offline discussion with @isuruf, we think there may be a better approach to ABI compatibility with respect to compilers and c++ runtimes.

Currently, pybind11 uses a combination of various things. The one of interest here is the __GXX_ABI_VERSION. This appears to be too strict, especially for forward-compatible libraries commonly in use.

The suggestion is to replace __GXX_ABI_VERSION with

These capture the fine-grained, core ABI changes without being overly prescriptive. The last two are related to exception handling. There may be more macros to use to capture dwarf2 exceptions explicitly.

Thoughts @henryiii @isuruf?

Reproducible example code

No response

allanleal commented 2 years ago

I'm wondering if what you guys want to solve here is related to the issue I'm facing for quite some time already.

I have two C++ libraries (autodiff and Reaktoro) for which pybind11 is used to generate their Python bindings. When one is compiled with a different version than the other, I start to have issues like the following:

image

The ChemicalState::temperature method is expecting autodiff::real. However, this type is constructible from arithmetic types, so state.temperature(100, "celsius") works fine ONLY WHEN autodiff and Reaktoro are compiled with exact same compilers. Otherwise I get these errors as shown in the snapshot.

I have not been able to resolve this from my side; have tried already multiple solutions, additional conversion overloads, with no avail. I would appreciate to know if this is an unresolved issue in pybind11 or if there is a solution to this already.

beckermr commented 2 years ago

Is this different compilers or different C++ stdlib ABIs?

allanleal commented 2 years ago

The issue happens with both cases:

For example, this happened today when a new conda-forge package for Reaktoro was built with gcc 10.3, but the autodiff package available in conda-forge was built days ago with gcc 9.4.

beckermr commented 2 years ago

So I have a PR open to handle the conda-forge issues via the pybind11-abi package: https://github.com/conda-forge/pybind11-feedstock/pull/79

The first item is out of my wheel house.

tdegeus commented 1 year ago

I just want to state my support for make the ABI compatibly finer. I use classes from different modules and ship them using conda-forge. The current behaviour is just a nightmare on Linux: all involved modules have to be compiled with the same GCC major version. This means that when the compiler of conda-forge is updated:

  1. All modules have to be reshipped with the latest compiler.
  2. Even then, things might not work, as not all my systems would have easy access to that compiler.
wjakob commented 11 months ago

The key question is: when is the ABI compatibility criterion in pybind11 too strict, and what can be done to make it more relaxed without running a risk of breaking things? (The same also applies to nanobind, which uses essentially the same kind of ABI tag)

Let me add one correction: simply using different versions of GCC alone is not enough to cause issues -- pybind11 queries the ABI compatibility tag of libstdc++ (__GXX_ABI_VERSION). And that's where this whole thing gets kind of tricky: when pybind11 isolates different extensions, it's does so intentionally because a change in this ABI version.

At that point, it's basically your word against that of the libstdc++ authors: they are saying that there was an ABI break, and you wish to disable it because it is inconvenient.

These ABI tags haven't been introduced just for fun -- it's because things were crashing badly and in weird ways, and it was really complicated to track down what was going on. So any change here would need a very careful justification on why the coarser-grained flags are 100% guaranteed to avoid such issues, now and in the future. The alternative proposal @beckermr made at the top will cause less things from being isolated from each other, but it wasn't clear to me why it would provide such a guarantee. Note that this isn't just about pybind11 usage of libstdc++ (which uses only relatively simple and presumably stable components like std::string, std::unordered_map, etc.). It also includes the extensions, which could use pybind11 to exchange arbitrary C++ instances and segfault during this process because of an ABI break.

Taking a step back, this seems to me to not be a pybind11 issue but a complaint about the libstdc++ authors changing __GXX_ABI_VERSION relatively often. And that is just not an issue we can fix here..

One thing I have never tried, but which may be worth a shot, is to provide -fabi-version=.., which is a GCC parameter that supposedly forces GCC to a specific ABI version. Maybe that's a feasible route to pin the ABI version of multiple extensions so that they aren't isolated from each other. Barring that, building with the same GCC major version doesn't strike me as the end of the world (we're talking about major, not minor versions, right?)

isuruf commented 11 months ago

GXX_ABI_VERSION is too broad. Yes, there might have been an ABI break because of name mangling issues, but there is a compatibility scheme that adds a aliased symbol. See -fabi-compat-version in https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/C_002b_002b-Dialect-Options.html

Barring that, building with the same GCC major version doesn't strike me as the end of the world (we're talking about major, not minor versions, right?)

It is when you have hundreds of downstream packages.

Another issue with this system is that the current system breaks things silently whereas we haven't had an issue with the libstdc++ version for years. That's because of the -fabi-compat-version workaround.

wjakob commented 11 months ago

It is when you have hundreds of downstream packages.

Extensions which need to be part of the same "pybind11 domain" for C++ interop with custom types are sort of "buddies" with each other. Typical pybind11 extensions bind their own custom types but don't need such an intimate relationship with another extension. So that statement about hundreds of packages being affected seems high to me.

I have a few follow-up questions:

  1. Can you explain how -fabi-compat-version=.. is used to create the workaround you mentioned in CondaForge?
  2. Does this cause GCC to set a #define internally so that it could be detected?
  3. Do you know the difference between -fabi-version=.. and -fabi-compat-version=.. in the linked GCC docs. Those two seem very related (I'm confused).
isuruf commented 11 months ago
  1. Can you explain how -fabi-compat-version=.. is used to create the workaround you mentioned in CondaForge?

It's automatically done by GCC. Nothing to do with conda-forge.

  1. No

  2. The only changes to the G++ ABI between -fabi-version=13 and -fabi-version=18 are name mangling differences. If a function foo had name mangling foo_mangle1 in -fabi-version=13 and foo_mangle2 in -fabi-version=18, then GCC would generate a symbol called foo_mangle2 and make an alias for foo_mangle1. When two packages are compiled with -fabi-version=13, they generate the symbols foo_mangle1, but when only the upstream is updated to compile with -fabi-version=18, GCC generates both foo_mangle1 and foo_mangle2.

So, you have to consider that -fabi-version={13,14,15,16,17,18} are all interchangeable on Linux and other systems with strong aliasing. (GCC 8.2 - GCC 13.2)

wjakob commented 11 months ago

It's automatically done by GCC. Nothing to do with conda-forge.

What I meant to ask here is: does CondaForge automatically specify -fabi-compat-version= or -fabi-version= to all compiled packages? If so, what's the method to choose this argument? Still not sure I get the difference between those two.

Would it be a reasonable solution that the user -- when also specifying -fabi-compat-version= or -fabi-version= can set a hypothetical definition -DPYBIND11_GXX_ABI_VERSION=.. with the same value, and that this is then used to construct the capsule key that pybind11 uses to isolate itself from ABI-incompatible compilers and C++ libraries?

isuruf commented 11 months ago

What I meant to ask here is: does CondaForge automatically specify -fabi-compat-version= or -fabi-version= to all compiled packages? If so, what's the method to choose this argument? Still not sure I get the difference between those two.

No. We use the compiler default.

Would it be a reasonable solution that the user -- when also specifying -fabi-compat-version= or -fabi-version= can set a hypothetical definition -DPYBIND11_GXX_ABI_VERSION=.. with the same value, and that this is then used to construct the capsule key that pybind11 uses to isolate itself from ABI-incompatible compilers and C++ libraries?

Yes, that's what https://github.com/conda-forge/pybind11-feedstock/pull/79 does.

wjakob commented 11 months ago

Yes, that's what https://github.com/conda-forge/pybind11-feedstock/pull/79 does.

If I understood correctly, this commit completely disables ABI tagging (it unsets both PYBIND11_BUILD_ABI and PYBIND11_COMPILER_TYPE). That's a lot more severe than what I was suggesting. Besides potential incompatibilities related to libstdc++, there are parts in libc++ that aren't mutually ABI-compatible with libstdc++ (and libc++ is often used with Clang).

My suggestion would be to find a more balanced version of this and make it part of either pybind11 proper or the CMake build system.

PS: Let me just mention this comment about ABI compatibility from another project here since it nicely makes a point about pitfalls: (https://github.com/wjakob/nanobind/discussions/374#discussioncomment-7693236)

It appears to have been a compiler problem where the compiler for the C++ test & the compiler for the whole codebase with nanobind wrapping wasn't the same. Turns out calling VTK methods compiled with llvm from a gcc code leads to completely random behaviors.

beckermr commented 11 months ago

I'm happy to not mix compiler types personally.

isuruf commented 11 months ago

Compilers like clang and gcc are usually okay to mix whereas the C++ standard library is not. Clang on Linux defaults to libstdc++ and it's okay as long as they both use libstdc++.

I agree that we should check the C++ standard library, but the compiler itself is not a huge issue.

PS: Let me just mention this comment about ABI compatibility from another project here since it nicely makes a point about pitfalls: (https://github.com/wjakob/nanobind/discussions/374#discussioncomment-7693236)

Isn't that precisely because of NB_COMPILER_TYPE or PYBIND11_COMPILER_TYPE specifically?

wjakob commented 10 months ago

Hi @isuruf, @beckermr, and @rwgk,

I had a bit of time look a bit into C++ ABI stability in the last few days. I think that I am too paranoid after having been repeatedly bitten by this some years ago. It seems that basically all the three main compiler tools (GCC, Clang, MSVC) have started taking ABI stability seriously at some point in the last years and there haven't been major breaks since then.

So I'm open to making a change here that allows interoperability between more extensions built with different compilers. But let's get it right and make something that we hopefully won't have to tweak retroactively because the criterion turns out to be too loose and, e.g., and ancient GCC/Clang version turns out to not be ABI-compatible after all.

I agree that mixing compilers should be okay, and that it's rather the C++ library (libstdc++ vs libc++) that is a potential issue.

One important think to keep in mind (@rwgk) is that a change here is itself an ABI break, of pybind11. Modifying the PYBIND11_BUILD_ABI string means that the extension is now isolated from all other extensions previously built with a different PYBIND11_BUILD_ABI string. So we should put this change into a larger release that bundles accumulated ABI-breaking commits.

(I also plan to add whatever we converge upon to nanobind so that these two projects are consistent.)

wjakob commented 10 months ago

One more thing: this commit identifies a few discrete points where a __GXX_ABI_VERSION change was associated with breakage, like the transition to C++11, or how exceptions are represented and propagated at the ABI level.

But this is all very GCC specific. What about Clang? If we change the PYBIND11_BUILD_ABI, we should get this 100% right and the solve issue once and for all -- and that means thinking about all compilers used to compile this project.

rwgk commented 10 months ago

One important think to keep in mind (@rwgk) is that a change here is itself an ABI break, of pybind11.

With respect to #4779 (merged already) & #4953 (pending):

Before we merged #4779 we discussed the ABI break aspect, and accepted that as a necessary step to get to safety.

We have not made a release yet after #4779 was merged, so we still have an opportunity to refine the change under #4953.

If we change the PYBIND11_BUILD_ABI, we should get this 100% right and the solve issue once and for all

The major platforms (MSVC, g++, clang) are independent (AFAICT), I believe it might be more practical to focus on one platform at a time (unless we get more people involved). The released state for g++ and clang is pretty good, while it is very unsafe for MSVC.