pybind / pybind11_abseil

Pybind11 bindings for the Abseil C++ Common Libraries
Other
24 stars 14 forks source link

ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined #12

Open ethanluoyc opened 9 months ago

ethanluoyc commented 9 months ago

Hi,

I am trying to figure out some issues with using the pybind11_abseil.status usage. TensorFlow as well as some other Google projects are using the pybind11_abseil modules. I found that as of TensorFlow 2.14.0 (2.13.1 is fine), it looks like that using TensorFlow in conjunction with packages that use pybind11_abseil will fail with the error

ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

This happens for example in https://github.com/google-deepmind/launchpad/issues/44 and https://github.com/tensorflow/tflite-support/issues/954. The error happens when a pybind11 extension that uses py::google::ImportStatusModule() is called.

For example

import tensorflow
import launchpad # Fails when importing an extension that uses ImportStatusModule.

Reversing the order we will see the same issues this time coming from TensorFlow trying to import a module that uses ImportStatusModule.

I noticed that there are some changes to the pybind11_abseil version used in TensorFlow 2.14.0, which incoporates some changes around the import_status_module, but I don't have a really good clue about what that change entails, it would be quite useful if you can provide some help.

Thanks!

rwgk commented 9 months ago

This error

ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined

is raised by pybind11 when enum_<absl::StatusCode>() is called a second time. I looked around for 10 minutes but still have no clue how that could happen.

We have this in pybind11_abseil (RegisterStatusBindings()):

https://github.com/pybind/pybind11_abseil/blob/f6ff92551862273d368a758b5b75703dde1f0b9d/pybind11_abseil/register_status_bindings.cc#L161

That should only get called the pybind11_abseil.status module is imported, and the standard Python import-only-once mechanisms are expected to ensure RegisterStatusBindings() is only called once.

I'd use a debugger to find out from where this line of code is reached:

Swap the import order to get a backtrace for both locations.

If using a debugger is impractical, I often simply insert long *BAD = nullptr; *BAD = 101; to trigger a segfault and then just look at the backtrace (assuming the toolchain generates a backtrace by default).

ethanluoyc commented 9 months ago

I manage to get some stacktrace for the two places. I modifed the launchpad pybind dependencies so that it aligns with tensorflow@2.14.0

pybind11_abseil 2c4932ed6f6204f1656e245838f4f5eae69d2e29 this version is before the bypass_regular_import is deprecated and defaults to true, I tried to set this to false and go through the import path but not luck.

pybind11 2.10.4

Then I set a breakpoint on initialize.

Importing TensorFlow first gets us

import tensorflow
from courier.python import client  # pytype: disable=import-error
#0  pybind11::pybind11_fail (reason=<error: Cannot access memory at address 0x626f206e61203a22>)
    at external/pybind11/include/pybind11/detail/../detail/common.h:1008
#1  0x00007fff5961a892 in pybind11::detail::generic_type::initialize (this=0x7fffffffb2d0, rec=...)
    at external/pybind11/include/pybind11/pybind11.h:1301
#2  0x00007fff59eaace9 in pybind11::class_<absl::lts_20230125::StatusCode>::class_<>(pybind11::handle, char const*) (
    this=0x7fffffffb2d0, scope=..., name=0x7fff59fccff0 "StatusCode")
    at external/pybind11/include/pybind11/pybind11.h:1546
#3  0x00007fff59ea44fe in pybind11::enum_<absl::lts_20230125::StatusCode>::enum_<>(pybind11::handle const&, char const*) (this=0x7fffffffb2d0, scope=..., name=0x7fff59fccff0 "StatusCode")
    at external/pybind11/include/pybind11/pybind11.h:2178
#4  0x00007fff59e9a741 in pybind11::google::internal::RegisterStatusBindings (m=...)
    at external/pybind11_abseil/pybind11_abseil/register_status_bindings.cc:102
#5  0x00007fff59e99a75 in pybind11::google::ImportStatusModule (bypass_regular_import=true)
    at external/pybind11_abseil/pybind11_abseil/import_status_module.cc:20
#6  0x00007fff59609fe8 in courier::(anonymous namespace)::pybind11_init_py_client (m=...)
    at courier/python/py_client.cc:113
#7  0x00007fff59609eb5 in courier::(anonymous namespace)::PyInit_py_client () at courier/python/py_client.cc:111
#8  0x00007ffff7def7c7 in _PyImport_LoadDynamicModuleWithSpec (spec=spec@entry=0x7fff5a40a5c0, fp=fp@entry=0x0)
    at ./Python/importdl.c:167
#9  0x00007ffff7debf5d in _imp_create_dynamic_impl (module=<optimised out>, file=<optimised out>,

Import Launchpad first

from courier.python import client  # pytype: disable=import-error
import tensorflow
0x00007fff620d0734 in pybind11::pybind11_fail(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
(gdb) backtrace
#0  0x00007fff620d0734 in pybind11::pybind11_fail(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#1  0x00007fff620de4a7 in pybind11::detail::generic_type::initialize(pybind11::detail::type_record const&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#2  0x00007fff620f06ac in pybind11::class_<absl::lts_20230125::StatusCode>::class_<>(pybind11::handle, char const*)
    () from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#3  0x00007fff620eed38 in pybind11::enum_<absl::lts_20230125::StatusCode>::enum_<>(pybind11::handle const&, char const*) () from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#4  0x00007fff620edcca in pybind11::google::internal::RegisterStatusBindings(pybind11::module_) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#5  0x00007fff620edba9 in pybind11::google::ImportStatusModule(bool) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#6  0x00007fff620ccba9 in tensorflow::pybind11_init__tf_stack(pybind11::module_&) ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#7  0x00007fff620cc34f in PyInit__tf_stack ()
   from /home/yicheng/projects/launchpad/.venv/lib/python3.10/site-packages/tensorflow/python/util/_tf_stack.so
#8  0x00007ffff7def7c7 in _PyImport_LoadDynamicModuleWithSpec (spec=spec@entry=0x7fff624f8790, fp=fp@entry=0x0)
    at ./Python/importdl.c:167
#9  0x00007ffff7debf5d in _imp_create_dynamic_impl (module=<optimised out>, file=<optimised out>,
    spec=0x7fff624f8790) at Python/import.c:2049

Some other observations

  1. It does look like that with TensorFlow imported first, On the launchpad side I got internal::IsStatusModuleImported() == false so it's not picking up that the status module is imported and went ahead to register the binding -> error.
rwgk commented 9 months ago

On the launchpad side I got internal::IsStatusModuleImported() == false

Could it be that std::type_index(typeid(absl::Status)) is different?

Could you try to print that out?

std::cout << std::type_index(typeid(absl::Status)).hash_code() << std::endl;

Reason for asking the question:

https://github.com/pybind/pybind11/blob/7969049de4c11f1d4955ababd97f340d82d2bf5a/include/pybind11/detail/type_caster_base.h#L216-L232

You could also add prints there.

ethanluoyc commented 9 months ago
inline bool IsStatusModuleImported() {
  std::cout << "Type index" << std::type_index(typeid(absl::Status)).hash_code() << std::endl;
  std::cout << "Local type info" << detail::get_local_type_info(typeid(absl::Status)) << std::endl;
  std::cout << "Global type info "<< detail::get_global_type_info(typeid(absl::Status)) << std::endl;
  return detail::get_type_info(typeid(absl::Status));
}
Type index 11412274694001596587
Local type info0
Global type info 0

I won't be able to compile TF from source unfortunately.. I made a repro of the issue above here https://github.com/ethanluoyc/pybind11_abseil_gh12_repro.

If possible, would you mind taking a look? I tried to extract the issue from the original repo as much as possible.

rwgk commented 9 months ago

I won't be able to compile TF from source unfortunately.

pip install -r requirements.txt

Oh boy ... binary incompatibility I'd guess then, but how?

I'm really swamped, and digging into a situation that brings in prebuilt binaries could go anywhere.

Could you please try more at your end?

What puzzles me: if there is indeed a binary incompatibility, I'd expect the TF modules and launchpad to be isolated from each other more or less completely. A key piece of the mechanism behind that is here:

https://github.com/pybind/pybind11/blob/7969049de4c11f1d4955ababd97f340d82d2bf5a/include/pybind11/detail/internals.h#L314-L322

Could you please try to build launchpad with -DPYBIND11_INTERNALS_KIND=SOMETHINGRANDOM1697842496?

Then that will live in its own universe for sure.

It might break interop you need, but try to ignore that. We only want to know: do the two imports work in any order?

Also, I'd definitely insert prints in

https://github.com/pybind/pybind11/blob/7969049de4c11f1d4955ababd97f340d82d2bf5a/include/pybind11/detail/type_caster_base.h#L216-L232

(same link as before) to get a better idea what's happening. I'd try to print out the typeid name somehow.

ethanluoyc commented 9 months ago

I will give it a try. Thanks!

Maybe another data point on this. I just tried using latest version of pybind11 + absl combo found from this repo and that actually makes the import error go away (at least if I import tensorflow first). I couldn't get the reverse to work out as it complains about not finding pybind11_abseil module, but I suppose that is because I didn't put the status.so in my data dependency.

ethanluoyc commented 9 months ago

Sorry my C++ and Bazel is bad, but is it not calling with

bazel run -c dbg --copt '-DPYBIND11_INTERNALS_KIND=RAAANDOM' //courier:debug_client_test
external/pybind11/include/pybind11/detail/../detail/type_caster_base.h: In member function 'bool pybind11::detail::type_caster_generic::try_load_foreign_module_local(pybind11::handle)':
<command-line>: error: expected ',' or ';' before 'RAAANDOM'
external/pybind11/include/pybind11/pybind11.h: In member function 'void pybind11::detail::generic_type::initialize(const pybind11::detail::type_record&)':
<command-line>: error: expected ')' before 'RAAANDOM'
In file included from external/pybind11_abseil/pybind11_abseil/import_status_module.h:4,
                 from external/pybind11_abseil/pybind11_abseil/import_status_module.cc:1:
external/pybind11/include/pybind11/pybind11.h:1353:20: note: to match this '('
 1353 |             setattr(m_ptr, PYBIND11_MODULE_LOCAL_ID, capsule(tinfo));
      |                    ^

Update

bazel run -c dbg --cxxopt '-DPYBIND11_INTERNALS_KIND="SOMETHINGRANDOM1697842496"' //courier:debug_client_test
INFO: Analyzed target //courier:debug_client_test (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //courier:debug_client_test up-to-date:
  bazel-bin/courier/debug_client_test
INFO: Elapsed time: 0.052s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/courier/debug_client_test
2023-10-21 00:26:16.435708: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/a8240993b3009bd370b3605a74432aad/execroot/debug-pybind11/bazel-out/k8-dbg/bin/courier/debug_client_test.runfiles/debug-pybind11/courier/debug_client_test.py", line 2, in <module>
    from courier import debug_client
ImportError: generic_type: cannot initialize type "StatusCode": an object with that name is already defined
ethanluoyc commented 9 months ago

Got the typeid name, it's N4absl12lts_202301256StatusE

rwgk commented 9 months ago

Ouch.

This is beginning to look very tricky.

Maybe it's something simple, but I don't see it.

I'm beginning to wonder: could there be two pybind11_abseil.Status modules? With incompatible PYBIND11_INTERNALS_ID?

That's just a wild guess, can you look? I think look for Status.so, how many do you find?

Alternatively, import only tensorflow, then look in sys.modules, import only launchpad, again look in sys.modules. Something like sys.modules["pybind11_abseil.Status"].__file__.

rwgk commented 9 months ago

Got the typeid name, it's N4absl12lts_202301256StatusE

Oh, that's another potential issue: maybe the two packages use different abseil-cpp versions?

ethanluoyc commented 9 months ago

I got the absl-cpp version for TF 2.14.0 from https://github.com/tensorflow/tensorflow/blob/v2.14.0/third_party/absl/workspace.bzl

I don't see any pybind11_abseil.status module in either TF or launchpad. Looking into my sitepackages I don't see those either. I think both launchpad and TF goes through the old bypass_regular_import=true path, but I don't know exactly what that means.

rwgk commented 9 months ago

I think both launchpad and TF goes through the old bypass_regular_import=true path, but I don't know exactly what that means.

https://github.com/pybind/pybind11_abseil/blob/4b883e48ae749ff984c220484d54fdeb0cb4302c/pybind11_abseil/import_status_module.cc#L13-L16

I disabled that a more than year ago, in September 2022:

https://github.com/pybind/pybind11_abseil/commit/c18f6323c22edd4f95bef62623c77973f8cfd3a1

What pybind11_abseil sources where used when TF was built?

What pybind11_abseil sources does launchpad use?

I never really understood the bypass mechanism, therefore cannot really tell what could go wrong.

I was glad to get rid of it. I thought.

ethanluoyc commented 9 months ago

What pybind11_abseil sources where used when TF was built?

2c4932ed6f6204f1656e245838f4f5eae69d2e29 (ref https://github.com/tensorflow/tensorflow/blob/v2.14.0/third_party/pybind11_abseil/workspace.bzl#L10-L13)

What pybind11_abseil sources does launchpad use?

1bb411eb1b13440d5af61660e70e8c5b5b2998a1 (ref https://github.com/google-deepmind/launchpad/blob/3b28eaed02c4294197b9ca2b8988cf68d8b5d868/WORKSPACE#L31-L34)

I think the inconsistency in the pybind11_abseil does not matter as I can reproduce the issue with the TF version in my repro repo. The commit they use also predates the deprecation of the value (and a switch of the default).

Looking closer at the diff of https://github.com/tensorflow/tensorflow/blob/v2.14.0/tensorflow/python/util/tf_stack.cc This is the import that gives ImportError if we import TF after

Between v2.14.0 and v2.13.1, the difference is that 2.14.0 starts using pybind11_abseil. There is usage of that in v2.13.1, but those usage never seems to be part of imports in the Python package, I think we can say that this issue has been there for a while :)

Looking at your test cases in this repo, I see that to test depends on pybind11_abseil:status.so. I understand this is required because we need to go through the regular python imports so the status module need to be in sys.path. However, neither TensorFlow nor launchpad distributes status.so and I think both is depending on the branch which is the default in 1bb411eb1b13440d5af61660e70e8c5b5b2998a1

  if (bypass_regular_import) {
    auto m = reinterpret_borrow<module_>(PyImport_AddModule(
        PYBIND11_TOSTRING(PYBIND11_ABSEIL_STATUS_MODULE_PATH)));
    if (!internal::IsStatusModuleImported()) {
      internal::RegisterStatusBindings(m);
    }
    // else no-op because bindings are already loaded.
    return m;
  }
rwgk commented 9 months ago

Away from keyboard

Qq

Can you build TF also from sources?

I believe: even if that takes you a few hours, it'll still be faster than debugging a frankenbuild with outdated sources.

On Fri, Oct 20, 2023 at 17:17 Yicheng Luo @.***> wrote:

What pybind11_abseil sources where used when TF was built?

2c4932e https://github.com/pybind/pybind11_abseil/commit/2c4932ed6f6204f1656e245838f4f5eae69d2e29 (ref https://github.com/tensorflow/tensorflow/blob/v2.14.0/third_party/pybind11_abseil/workspace.bzl#L10-L13 )

What pybind11_abseil sources does launchpad use?

1bb411e https://github.com/pybind/pybind11_abseil/commit/1bb411eb1b13440d5af61660e70e8c5b5b2998a1 (ref https://github.com/google-deepmind/launchpad/blob/3b28eaed02c4294197b9ca2b8988cf68d8b5d868/WORKSPACE#L31-L34 )

I think the inconsistency in the pybind11_abseil does not matter as I can reproduce the issue with the TF version in my repro repo. The commit they use also predates the deprecation of the value (and a switch of the default).

Looking closer at the diff of https://github.com/tensorflow/tensorflow/blob/v2.14.0/tensorflow/python/util/tf_stack.cc This is the import that gives ImportError if we import TF after

Between v2.14.0 and v2.13.1, the difference is that 2.14.0 starts using pybind11_abseil.

Looking at your test cases in this repo, I see that to test depends on pybind11_abseil:status.so. I understand this is required because we need to go through the regular python imports so the status module need to be in sys.path. However, neither TensorFlow nor launchpad distributes status.so and I think both is depending on the branch which is the default in 1bb411e https://github.com/pybind/pybind11_abseil/commit/1bb411eb1b13440d5af61660e70e8c5b5b2998a1

if (bypass_regular_import) { auto m = reinterpretborrow<module>(PyImport_AddModule( PYBIND11_TOSTRING(PYBIND11_ABSEIL_STATUS_MODULE_PATH))); if (!internal::IsStatusModuleImported()) { internal::RegisterStatusBindings(m); } // else no-op because bindings are already loaded. return m; }

— Reply to this email directly, view it on GitHub https://github.com/pybind/pybind11_abseil/issues/12#issuecomment-1773548290, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFUZAGIN5A7VZ52GTSAMYLYAMIBDAVCNFSM6AAAAAA6ICKLBOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZTGU2DQMRZGA . You are receiving this because you commented.Message ID: @.***>

ethanluoyc commented 9 months ago

I will dig around :) Enjoy your weekend

ethanluoyc commented 9 months ago

This is a rabbit hole..

Built from source TF v2.14.0, I now get imports to work (so we do not throw ImportError in my repro repo).

However, I tried integrating my build into the launchpad repo. If I import launchpad before TensorFlow then things would work ok. But if I import TF first and then Launchpad I can't pass the unit test and it's saying:

Traceback (most recent call last):
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/83e93ea5b191337ae544f4d037b4b862/execroot/launchpad/bazel-out/k8-dbg/bin/courier/python/client_test.runfiles/launchpad/courier/python/client_test.py", line 102, in testCallRebind
    result = self._client.rebind()
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/83e93ea5b191337ae544f4d037b4b862/execroot/launchpad/bazel-out/k8-dbg/bin/courier/python/client_test.runfiles/launchpad/courier/python/client.py", line 56, in inner_function
    raise translate_status(e.status) from e
  File "/home/yicheng/.cache/bazel/_bazel_yicheng/83e93ea5b191337ae544f4d037b4b862/execroot/launchpad/bazel-out/k8-dbg/bin/courier/python/client_test.runfiles/launchpad/courier/python/client.py", line 43, in translate_status
    exc = StatusNotOk(s.message())
  File "<string>", line 6, in __init__
AttributeError: 'str' object has no attribute 'ok'.

The type_name does match though in either case. Building from source I got N4absl12lts_202301256StatusE.

rwgk commented 9 months ago
    exc = StatusNotOk(s.message())
  File "<string>", line 6, in __init__
AttributeError: 'str' object has no attribute 'ok'.

That looks like something I missed in a super-sized series of changes about a year ago. Could you please try this change:

-    exc = StatusNotOk(s.message())
+    exc = StatusNotOk(s)

If I import launchpad before TensorFlow then things would work ok.

Hm... that sounds like you still have two versions of StatusNotOk. The pybind11_abseil commits I see when following the two WORKSPACE links under https://github.com/pybind/pybind11_abseil/issues/12#issuecomment-1773548290 point to two different pybind11_abseil versions:

These are completely different.

Did you update one or the other to use the same version?

Really, pybind11_abseil should only be built once. I'm not familiar at all with how TF or Launchpad use bazel, and how to make both depend on the same sources + build artifacts.

rwgk commented 9 months ago

I pointed the owners of courier/python/client.py here. I'm hoping they'll update to a more modern version of pybind11_abseil. I'm also hoping they'll find a way to ensure TF and Launchpad are using the exact same version.

ethanluoyc commented 9 months ago

Thanks!. Yeah updating the call and fix version fixes the error above.

I guess the confusing part now is why things work if I built TF 2.14.0 from the source but not with their pre-built wheels. For testing, I actually built TF with the exact checkout from their tag and got it working (since I use the same version for launchpad testing anyway).

ethanluoyc commented 8 months ago

I suspect this is a compiler thing. I just tried setting CC=clang and it seems that the error goes away.

rwgk commented 8 months ago

I'm glad you found a path that works.

I suspect this is a compiler thing. I just tried setting CC=clang and it seems that the error goes away.

That makes me suspect that the bypass import mechanism somehow subverts the ABI isolation mechanisms I pointed out before (pybind11/detail/internals.h).

I wouldn't want to root-cause that. IMO the only good use of our time is to work towards

  1. Using a modern version of pybind11_abseil, or at least using bypass_regular_import=false
  2. Work on the bazel BUILD files so that there is no way that more than one pybind11_abseil version is used when building TF and Launchpad from sources.
ethanluoyc commented 8 months ago

Thanks.

I think I will probably want to fix this upstream in launchpad or tensorflow. Building and maintaining bazel project that depends on TensorFlow and is manylinux is quite some work so I hope this gets fixed upstream :)

  1. If fixing in TF, I think upgrading to a more recent version of pybind11_abseil would be useful. Then I can build downstream with the existing toolchain and the problem may go away.
  2. Fixing in launchpad to use a consistent toolchain as TF. The launchpad build is still using a gcc toolchain and I think maybe upgrading to clang as used in TF would probably work as well.

I worked a bit on trying to get the later to work but apparently, it's quite difficult to set up a CC toolchain as used by TF. Maybe I will seek some help from the Launchpad authors or some other repos that use pre-built tensorflow as a dependency to create a clang toolchain to work with this.

rwgk commented 8 months ago

I ran this by @tkoeppe, he pointed out:

sounds a bit like an ODR violation?

It sounds like you have two "fat DSO"s for the TF and LP Python modules and both link in a copy of pybind11_abseil? That seems like a well-known failure mode (of Bazel)?

I think that's very consistent with our observations here.

ethanluoyc commented 8 months ago

Yeah, unfortunately keeping in sync with TF's build dependency is just a lot for many downstream projects.

I don't know how to best fix this either.

Lucky for me, I hijacked a copy of TF's manylinux build toolchain configuration with clang and that seems to allow me to build manylinux wheels successfully. I am using that to work around issues with LaunchPad. There is a sibling project Reverb https://github.com/google-deepmind/reverb/tree/master/reverb which will likely have the same issue and I have created a PR with the fix.

https://github.com/google-deepmind/reverb/pull/136

tkoeppe commented 8 months ago

A proper "fix" would require (re)architecting a build system that provides a correct implementation of Python native extensions. This is not an easy problem.

A slightly cleaner "hack" for the meantime might be to just control symbol visibility for Python module DSOs better. If the (duplicately linked) Abseil dependencies in each module are not exported, there's less chance of a mismatch at runtime.