ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0

Composable node runtime error (undefined symbol) but normal Node has no issue #2391

Open maxime-clem opened 11 months ago

maxime-clem commented 11 months ago

Bug report

Required Info:

I am trying to call the Python interpreter from a ComposableNode. A simple print() works without issue, but if I try to import torch, the program crashes with an undefined symbol error.

terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  ImportError: /usr/lib/python3.10/lib-dynload/_ctypes.cpython-310-x86_64-linux-gnu.so: undefined symbol: PyTuple_Type

At:
  /usr/lib/python3.10/ctypes/__init__.py(8): <module>
  <frozen importlib._bootstrap>(241): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(883): exec_module
  <frozen importlib._bootstrap>(703): _load_unlocked
  <frozen importlib._bootstrap>(1006): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1027): _find_and_load
  /home/mclement/.local/lib/python3.10/site-packages/torch/__init__.py(17): <module>
  <frozen importlib._bootstrap>(241): _call_with_frames_removed
  <frozen importlib._bootstrap_external>(883): exec_module
  <frozen importlib._bootstrap>(703): _load_unlocked
  <frozen importlib._bootstrap>(1006): _find_and_load_unlocked
  <frozen importlib._bootstrap>(1027): _find_and_load

There is no issue when doing this with a normal Node.

Steps to reproduce issue

I made a minimal example showcasing the issue in the following repository: https://github.com/maxime-clem/ros2_composable_node_bug

# requires python3-dev and pybind11-dev
git clone git@github.com:maxime-clem/ros2_composable_node_bug.git
cd ros2_composable_node_bug
colcon build
source install/setup.sh
ros2 run test_exe test_exe  # No issue
ros2 run test_node test_node  # No issue
ros2 run test_composable_node test_composable_node_exe  # undefined symbol error

Expected behavior

A composable node can use the Python library without issue, just like a normal Node.

Actual behavior

The composable node crashes with the error "undefined symbol: PyTuple_Type".

Additional information

I have confirmed the issue with another user, so it does not appear to be an environment issue. The only workaround found so far is to call dlopen("libpython3.10.so", RTLD_GLOBAL | RTLD_NOW) in the code of the ComposableNode.
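
A minimal sketch of that workaround, for illustration only. The helper name preload_globally is hypothetical and is not part of rclcpp or pybind11; the library name and flags are the ones quoted above. The sketch assumes a POSIX system that provides <dlfcn.h>.

```cpp
#include <dlfcn.h>

#include <iostream>
#include <string>

// Hypothetical helper (not an rclcpp or pybind11 API): loads `library_name`
// with RTLD_GLOBAL so that its symbols can be used to resolve references in
// shared objects loaded afterwards, such as Python C extension modules.
// Returns true on success. The handle is intentionally never closed, so the
// library stays loaded for the lifetime of the process.
bool preload_globally(const std::string & library_name)
{
  void * handle = dlopen(library_name.c_str(), RTLD_GLOBAL | RTLD_NOW);
  if (handle == nullptr) {
    const char * reason = dlerror();
    std::cerr << "dlopen failed: " << (reason ? reason : "unknown error") << std::endl;
    return false;
  }
  return true;
}
```

In the component, this would presumably be called as preload_globally("libpython3.10.so") before the interpreter is started or any Python module is imported, for example at the top of the node's constructor.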

mjcarroll commented 11 months ago

Interesting, because as far as I know, the composable node containers should be purely C++ and shouldn't have any interactions with pybind11. Does this happen on the first pybind11 node that you load, or on subsequent ones?

maxime-clem commented 11 months ago

The issue happens even without using a composable node container (it can be reproduced by directly running the node executable). Since the issue can be solved by using dlopen, it seems to be a linker issue, but I do not see any reason why the link to the Python library would be different between a Node and a ComposableNode.

clalancette commented 11 months ago

I'm not 100% sure of this, but the situation seems similar to https://github.com/PyO3/pyo3/issues/2000#issuecomment-979479111 , which leads to https://bugs.python.org/issue21536 . There, they discuss some of the ins and outs of loading things dynamically with Python. In particular, I'll point to this comment where they say:

"IHMO it's a bad usage of dlopen(): libpython must always be loaded with RTLD_GLOBAL."

I then took a look at how we loaded libraries, and saw this: https://github.com/ros2/rcutils/blob/d3fed35f2d8e19dede7f6dfd5f3b862c40ac7809/src/shared_library.c#L97

Indeed, locally if I switch that to RTLD_LAZY | RTLD_GLOBAL, the example that @maxime-clem provided works.
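
To make the difference concrete, here is a rough sketch of the two flag combinations. It is not the rcutils code: the function names dlopen_flags and open_library are hypothetical, and it assumes, as the change described above implies, that the existing call passes only RTLD_LAZY.

```cpp
#include <dlfcn.h>

// Hypothetical helper: returns the dlopen() mode for the two cases discussed.
// Without RTLD_GLOBAL the loaded library's symbols are not guaranteed to be
// available for resolving references in libraries loaded later; with it,
// they are.
int dlopen_flags(bool export_symbols)
{
  return export_symbols ? (RTLD_LAZY | RTLD_GLOBAL) : RTLD_LAZY;
}

// Hypothetical wrapper: opens the shared library at `path` with the chosen
// mode. Returns nullptr on failure; dlerror() then describes the reason.
void * open_library(const char * path, bool export_symbols)
{
  return dlopen(path, dlopen_flags(export_symbols));
}
```

With export_symbols set to true, a component library that links against libpython would have its dependencies' symbols made globally visible as well, which would let _ctypes resolve PyTuple_Type. It would also expose every other symbol of every loaded component, which is where symbol clashes between components could come from.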

So the question is: should we add in RTLD_GLOBAL? It fixes the issue, but I'm slightly concerned about other side-effects it might have. @mjcarroll thoughts?