ros2 / rmw_zenoh

RMW for ROS 2 using Zenoh as the middleware
Apache License 2.0
175 stars 34 forks source link

Sporadic failure on system_tests/test_communication #275

Open YuanYuYuan opened 1 week ago

YuanYuYuan commented 1 week ago

rmw_zenoh_cpp sometimes fails to pass system_tests/test_communication with the following error.

Traceback (most recent call last):
  File "/ws/src/system_tests/test_communication/test/action_server_py.py", line 199, in <module>
    rclpy.spin(node)
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/__init__.py", line 310, in spin
    executor.spin_once()
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 853, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 837, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 807, in wait_for_ready_callbacks
    return next(self._cb_iter)
           ^^^^^^^^^^^^^^^^^^^
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/executors.py", line 711, in _wait_for_ready_callbacks
    wait_set.wait(timeout_nsec)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ws/src/system_tests/test_communication/test/action_server_py.py", line 184, in <module>
    with rclpy.init(args=[]):
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/__init__.py", line 112, in __exit__
    try_shutdown(context=shutdown_context, uninstall_handlers=self.installed_signal_handlers)
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/__init__.py", line 200, in try_shutdown
    _try_shutdown(context=context)
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/utilities.py", line 104, in try_shutdown
    g_default_context.try_shutdown()
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/context.py", line 174, in try_shutdown
    self._cleanup()
  File "/ws/install/rclpy/lib/python3.12/site-packages/rclpy/context.py", line 156, in _cleanup
    self.__context.shutdown()
rclpy._rclpy_pybind11.RCLError: failed to shutdown: rcl_shutdown already called on the given context, at /ws/src/rcl/rcl/src/rcl/init.c:290

To reproduce

colcon build --packages-select test_communication --cmake-args -DSKIP_MULTI_RMW_TESTS=On
/usr/bin/python3 /ws/src/system_tests/test_communication/test/action_server_py.py NestedMessage /test_time_12_40_58
(Press Ctrl-C)

Note: The root cause is zenoh possibly takes a bit longer to close the session in rmw_shutdown. One can easily reproduce this by adding a sleep in rmw_shutdown.

YuanYuYuan commented 1 week ago

The analysis has been posted in rclpy https://github.com/ros2/rclpy/issues/1352

Yadunund commented 1 week ago

@YuanYuYuan thanks for the detailed analysis! Curious if this issue only arises with rclpy or also rclcpp? Also, would this be related to https://github.com/ros2/rmw_zenoh/issues/170?

I still think shutting down the Zenoh session within rmw_shutdown is the right thing to do. If downstream packages such as rclpy and rclcpp are not equipped to handle cases where this function takes a while to execute, then we should fix expectations in those downstream implementations to handle these cases.

YuanYuYuan commented 1 week ago

Hi @Yadunund, I believe this issue is mainly caused by rclpy and the fix is coming https://github.com/ros2/rclpy/pull/1353.

Also, would this be related to https://github.com/ros2/rmw_zenoh/issues/170?

Actually, it would be a possible consequence once we fix rclpy. I think there has already been many debates in Rust community about this problem. See here for more details. In short, Rust suggests not doing this way.