protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev
Other
65.11k stars 15.43k forks source link

[cmake] v27.* break the shared library support compare to v26.1 #17465

Open Mizux opened 1 month ago

Mizux commented 1 month ago

For google/or-tools we tried to upgrade from v26.1 (working) to v27.2 (broken)

note: release v27.0 and v27.1 are also broken...

Language: Python (native C++ module) i.e. pybind11 wrappers....

What operating system (Linux, Windows, ...) and version? Linux and MacOS are impacted...

What runtime / compiler are you using (e.g., python version or gcc version) From Python 3.9 to Python 3.12 gcc ~[10,14] see default version from almalinux:latest, rockylinux:9, fedora:latest, debian:latest, opensuse, ubuntu:latest .... ref: https://repology.org/project/gcc/badges. note: we are targeting C++17 (with agregate initialisation) or C++20 (MSVC)

What did you do? Please read this issue for a full detailed bug report:

What did you expect to see all or-tools tests pass as before...

What did you see instead? all python tests using pybind11 and protobuf failed with error:

 F0000 00:00:1720999376.464679    2304 generated_message_reflection.cc:3620] Check failed: file != nullptr 

Make sure you include information that can help us debug (full error message, exception listing, stack trace, logs). Again, Please read the issue for a full detailed bug report:

Anything else we should know about your project / environment using kind of bissect on protobuf tag use in our build, I figure out it is these commits (split in two parts -_-) are at fault:

Protobuf Version Test Date
v27.0 FAIL 2024/05/23
... ... ...
49c83ab799977592744f50df93957ddec1e12e70 FAIL 2024/01/30 (newer)
abc9bae7c19d9fff4b0be0ffd3a01f947d3b487a NA(0) 2024/01/30 (need @49c83ab)
58baeb4c3b664f8918d24cef5151083d9da9767c PASS 2024/01/30 (older)
... ... ...
v26.1 PASS 2024/03/27
  1. Can't test abc9bae directly since it miss the next commit which generate some stuff...

For Googlers: it is cl/602725967

note: git revert both commit on top of v27.2 tag was not working (too many conflict to make it...)

Mizux commented 1 month ago

MRE: bazel-proto

Some update: This weekend I was able to reproduce it with a pet project without involving pybind11 etc... EDIT: same issue with "v28.0-rc1" too

It seems the issue comes from the shared library nature...

here the project layout:

 graph TD;
      A[bin/App\nbinary]-->FB[lib/libFooBar.so\nlibrary];
      FB-->F[lib/libFoo.so\nlibrary];
      FB-->B[lib/libBar.so\nlibrary];
      F-.->Pc[C.proto];
      Pc-- "import" -->Pa[A.proto];
      Pc-- "import" -->Pb[B.proto];

ref: https://github.com/mizux/bazel-proto

Trace:

% cmake -S. -Bbuild
% cmake --build build -j
% ./build/bin/App
...
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1721036258.637555 1559662 generated_message_reflection.cc:3619] Check failed: file != nullptr 
*** Check failure stack trace: ***
zsh: IOT instruction  ./build/bin/App

Workaround

if I apply this patch to make SHARED libraries to be STATIC:

diff --git a/foo/CMakeLists.txt b/foo/CMakeLists.txt
index aee9266..c7476a1 100644
--- a/foo/CMakeLists.txt
+++ b/foo/CMakeLists.txt
@@ -1,4 +1,4 @@
-add_library(Foo) # shared implicit
+add_library(Foo STATIC "")

 get_cpp_proto(PROTO_HDRS PROTO_SRCS)
 target_sources(Foo
diff --git a/foobar/CMakeLists.txt b/foobar/CMakeLists.txt
index 724f2e1..d4fe85a 100644
--- a/foobar/CMakeLists.txt
+++ b/foobar/CMakeLists.txt
@@ -1,4 +1,4 @@
-add_library(FooBar "") # shared implicit
+add_library(FooBar STATIC "")
 target_sources(FooBar PRIVATE foobar.h foobar.cc)
 target_include_directories(FooBar
   PUBLIC
% cmake --build build -j
% ./build/bin/App
...
(exit 0)

i.e. App embedded the libFoo.a then test pass, aka bin/App run without error... So I suspect the Protobuf commit to break the shared library mode.

Mizux commented 1 month ago

Back to google/or-tools when generating our wheel package we have the following layout:

ortools
├── __init__.py
├── .libs
│   └── libortools.so.10 (48Mo)
├── algorithms
│   ├── __init__.py
│   └── python
│       ├── __init__.py
│       ├── knapsack_solver.cpython-312-x86_64-linux-gnu.so
│       └── knapsack_solver.pyi
├── graph
│   ├── __init__.py
│   └── python
│       ├── __init__.py
│       ├── max_flow.cpython-312-x86_64-linux-gnu.so
│       ├── max_flow.pyi
│       ├── min_cost_flow.cpython-312-x86_64-linux-gnu.so
│       ├── min_cost_flow.pyi
│       ├── linear_sum_assignment.cpython-312-x86_64-linux-gnu.so
│       └── linear_sum_assignment.pyi
├── routing
│   ├── enums_pb2.py
│   ├── enums_pb2.pyi
│   ├── ils_pb2.py
│   ├── ils_pb2.pyi
│   ├── __init__.py
│   └── python
│       ├── __init__.py
│       ├── model.cpython-312-x86_64-linux-gnu.so (2Mo)
│       ├── model.pyi
├── sat
...
 graph TD;
      D_O[ortools];
      D_O-->D_L[.libs];
      D_L-->OR[(libortools.so\nortools/.lib)];
      D_O-->D_A[algorithms];
      D_A-->D_AP[python];
      D_AP-->A[(algorithms.cpython-312-x86_64-linux-gnu.so\nortools/algorithms/python)];
      A-. link .->OR;

      D_O-->D_G[graph];
      D_G-->D_GP[python];
      D_GP-->G1[(max_flow.cpython-312-x86_64-linux-gnu.so\nortools/graph/python)];
      D_GP-->G2[(min_cost_flow.cpython-312-x86_64-linux-gnu.so\nortools/graph/python)];
      G1-. link .->OR;
      G2-. link .->OR;

      D_O-->D_R[routing];
      D_R-->D_RP[python];
      D_RP-->R[(model.cpython-312-x86_64-linux-gnu.so\nortools/routing/python)];
      R-. link .->OR;

      D_O-->D_X[component];
      D_X-->D_XP[python];      
      D_XP-->X[(component.cpython-312-x86_64-linux-gnu.so\nortools/<component>/python)];
      X-. link .->OR;

note: we are using RUNPATH for each *.cpython.so lib to find the libortools.so.

objdump -p ortools/algorithms/python/knapsack_solver.cpython-312-x86_64-linux-gnu.so | grep RUNPATH
  RUNPATH              $ORIGIN:$ORIGIN/../../../ortools/.libs

using a shared lib libortools.so avoid to have to duplicate symbols in each python "sub" module. note: libortools.so it's 48Mo, since we have a dozen of python modules so 48Mo * 12 is a no go (pypi limit is 200Mo per wheel packages...)

Mizux commented 1 month ago

Dev Note playing with http://github.com/mizux/bazel-proto and using a hackish version of protobuf 58baeb4 vs 49c83ab

Adding patch:

diff --git a/src/google/protobuf/generated_message_reflection.cc b/src/google/protobuf/generated_message_reflection.cc
index d0543d134..450cf589d 100644
--- a/src/google/protobuf/generated_message_reflection.cc
+++ b/src/google/protobuf/generated_message_reflection.cc
@@ -3642,6 +3642,7 @@ void AssignDescriptorsImpl(const DescriptorTable* table, bool eager) {
   }

   // Fill the arrays with pointers to descriptors and reflection classes.
+  std::cerr << "filename: '" << table->filename << "'\n";
   const FileDescriptor* file =
       DescriptorPool::internal_generated_pool()->FindFileByName(
           table->filename);

Using 58baeb4 :

./build/bin/App
...
"FooBar":{"Bar":{"int":13,"int64":31},"Foo":{"int":17,"int64":42}}
filename: 'google/protobuf/descriptor.proto'
filename: 'google/protobuf/cpp_features.proto'
filename: 'foo/C.proto'
filename: 'foo/A.proto'
filename: 'foo/B.proto'
a {
  name: "42"
}
b {
  value: 42
}

Using 49c83ab :

./build/bin/App
...
"FooBar":{"Bar":{"int":13,"int64":31},"Foo":{"int":17,"int64":42}}
filename: 'google/protobuf/descriptor.proto'
filename: 'google/protobuf/cpp_features.proto'
filename: 'foo/C.proto'
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1721046248.069280 1828260 generated_message_reflection.cc:3623] Check failed: file != nullptr 
*** Check failure stack trace: ***
zsh: IOT instruction  ./build/bin/App
Mizux commented 1 month ago

Small update, buiding Protobuf in SHARED make the bazel-proto tests (e.g. foo_test) pass...

diff --git a/cmake/dependencies/CMakeLists.txt b/cmake/dependencies/CMakeLists.txt
index 7bf92be..3b712ba 100644
--- a/cmake/dependencies/CMakeLists.txt
+++ b/cmake/dependencies/CMakeLists.txt
@@ -52,13 +52,15 @@ if(BUILD_Protobuf)
   message(CHECK_START "Fetching Protobuf")
   list(APPEND CMAKE_MESSAGE_INDENT "  ")
   set(protobuf_BUILD_TESTS OFF)
-  set(protobuf_BUILD_SHARED_LIBS OFF)
+  set(protobuf_BUILD_SHARED_LIBS ON)
   set(protobuf_BUILD_EXPORT OFF)
   set(protobuf_MSVC_STATIC_RUNTIME OFF)
   #set(protobuf_BUILD_LIBUPB ON)

i.e. BAD: bin/foo_test -> lib/libFoo.so --static link--> libprotobuf.a GOOD: bin/foo_test -> lib/libFoo.so --ldd--> libprotobuf.so

Mizux commented 1 month ago

Adding some log (thx to @sbenzaquen for the suggestion). Seems we have an ODR violation when using a static version of Protobuf

The patch in Protobuf to have some insight...

git diff HEAD~1
diff --git a/src/google/protobuf/generated_message_util.cc b/src/google/protobuf/generated_message_util.cc
index e33d22fc4..7b9663d43 100644
--- a/src/google/protobuf/generated_message_util.cc
+++ b/src/google/protobuf/generated_message_util.cc
@@ -88,18 +88,23 @@ void InitWeakDefaults() {}

 PROTOBUF_CONSTINIT std::atomic<bool> init_protobuf_defaults_state{false};
 static bool InitProtobufDefaultsImpl() {
+  std::cerr << __func__ << ":" << __LINE__ << " call...\n";
   fixed_address_empty_string.DefaultConstruct();
   OnShutdownDestroyString(fixed_address_empty_string.get_mutable());
   InitWeakDefaults();

   init_protobuf_defaults_state.store(true, std::memory_order_release);
+  std::cerr << __func__ << ":" << __LINE__ << " done.\n";
   return true;
 }

 void InitProtobufDefaultsSlow() {
+  std::cerr << __func__ << ":" << __LINE__ << " call...\n";
   static bool is_inited = InitProtobufDefaultsImpl();
   (void)is_inited;
+  std::cerr << "is_inited: " << &is_inited << ".\n";
+  std::cerr << __func__ << ":" << __LINE__ << "done.\n";
 }
 // Force the initialization of the empty string.
 // Normally, registration would do it, but we don't have any guarantee that

The trace of build/bin/foo_test when using libprotobuf.a patched and with the commit @49c83ab

 ./build/bin/foo_test 
InitProtobufDefaultsSlow:103 call...
InitProtobufDefaultsImpl:91 call...
InitProtobufDefaultsImpl:98 done.
is_inited: 0x7ff270270e08.
InitProtobufDefaultsSlow:107done.
InitProtobufDefaultsSlow:103 call...
InitProtobufDefaultsImpl:91 call...
InitProtobufDefaultsImpl:98 done.
is_inited: 0x5641f8051b48.
InitProtobufDefaultsSlow:107done.
Running main() from /usr/local/google/home/corentinl/dev/repo/bazel-proto/build/_deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from FooTest
[ RUN      ] FooTest.ProtoFunction
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
F0000 00:00:1721369740.493581  753437 generated_message_reflection.cc:3622] Check failed: file != nullptr 
*** Check failure stack trace: ***
zsh: IOT instruction  ./build/bin/foo_test

note: same ODR violation prior to the patch (i.e. protobuf v26.1) but without the assert fail side effect thus it was unnoticed until now...