ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0
117 stars 90 forks source link

Failure to run certain rclpy code with rmw_cyclonedds_cpp #494

Closed clalancette closed 4 months ago

clalancette commented 6 months ago

Bug report

This is split out of https://github.com/ros2/examples/pull/382 (and was ultimately found by https://github.com/osrf/ros2_test_cases/issues/1259 as part of the Jazzy tutorial testing party).

Consider the following code:

import sys

from example_interfaces.srv import AddTwoInts

import rclpy
from rclpy.callback_groups import ReentrantCallbackGroup
from rclpy.executors import ExternalShutdownException

def main(args=None):
    rclpy.init(args=args)

    try:
        node = rclpy.create_node('minimal_client')

        cb_group = ReentrantCallbackGroup()
        cli = node.create_client(AddTwoInts, 'add_two_ints', callback_group=cb_group)
        did_run = False
        did_get_result = False
        async def call_service():
            nonlocal cli, node, did_run, did_get_result
            did_run = True
            try:
                req = AddTwoInts.Request()
                req.a = 41
                req.b = 1
                future = cli.call_async(req)
                result = await future
                node.get_logger().info(
                    'Result of add_two_ints: for %d + %d = %d' %
                    (req.a, req.b, result.sum))
            finally:
                did_get_result = True
        while not cli.wait_for_service(timeout_sec=1.0):
            node.get_logger().info('service not available, waiting again...')
        timer = node.create_timer(0.5, call_service, callback_group=cb_group)

        while rclpy.ok() and not did_run:
            rclpy.spin_once(node)

        if did_run:
            # call timer callback only once
            timer.cancel()

        while rclpy.ok() and not did_get_result:
            rclpy.spin_once(node)
    except KeyboardInterrupt:
        pass
    except ExternalShutdownException:
        sys.exit(1)

if __name__ == '__main__':
    main()

If you run in one terminal:

ros2 run examples_rclpy_minimal_service service

And the code above in a second terminal with RMW_IMPLEMENTATION=rmw_cyclonedds_cpp, it hangs forever and never completes.

Required Info:

Steps to reproduce issue

See the initial description.

Expected behavior

Example code succeeds and prints a response from the service.

Actual behavior

Example code hangs forever.

Additional information

During debugging of this, I found that part of the problem seems to be in the rclpy.spin_once call. In particular, what happens is that for every call there the node is added to the executor, which constitutes an "event". At that point, it seems like because this is spin_once, that event, and that event alone, is continually returned. We never make it to handling the other events, including the client response event. This is somewhat inherent in the contract of spin_once, but it is odd that this works for rmw_fastrtps_cpp while it fails for rmw_cyclonedds_cpp

clalancette commented 6 months ago

@mjcarroll pointed out that the ordering in the executors is different in rclpy than rclcpp.

eboasson commented 5 months ago

I don't know how the RMW layer interacts with Python async code, but this looks to me like a single threaded process with one node, and one outstanding request. So even if you only intermittently "spin" that node, you would expect a result, right?

In other words, I have some difficulty seeing how this

During debugging of this, I found that part of the problem seems to be in the rclpy.spin_once call. In particular, what happens is that for every call there the node is added to the executor, which constitutes an "event". At that point, it seems like because this is spin_once, that event, and that event alone, is continually returned. We never make it to handling the other events, including the client response event.

relates to the code. Maybe someone can help clear up that confusion for me?

Given that this is backlog, it is probably not the highest priority.

rubenanapu commented 5 months ago

Not an answer to this question, but I also had problems with CycloneDDS on Jazzy.

In my case, the problems I had were related to buffer overflow:

 ros2 run demo_nodes_cpp talker
*** buffer overflow detected ***: terminated

WHen listing topics:

ros2 topic list 
*** buffer overflow detected ***: terminated
Aborted (core dumped)

My solution was to compile CycloneDDS from source. Just for reference, the steps I used to compile it from source are described at:

https://github.com/eclipse-cyclonedds/cyclonedds/issues/2043

ros-discourse commented 4 months ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/preparing-for-rolling-sync-2024-07-11/38526/2