ros2 / rmw_fastrtps

Implementation of the ROS Middleware (rmw) Interface using eProsima's Fast RTPS.
Apache License 2.0

what(): failed to send response: rmw_response.cpp:153 service.c:314 #706

Closed KKKiwiXU closed 1 year ago

KKKiwiXU commented 1 year ago

Bug report

Hello everyone. I think I may have run into a bug.

Steps to reproduce issue

I'm working on a project with over 20 processes and more than 40 nodes. My process is built on a one-process, multi-node framework and contains more than 12 nodes. All of these nodes inherit from rclcpp::LifecycleNode and use the lifecycle node (LCN) callbacks (on_configure, on_activate, etc.) to control their lifecycle. All my nodes use intra-process communication with intra_process_comm=True, publishing ::UniquePtr messages via publisher->publish(std::move(pub_message)) to get zero-copy intra-process communication. If I run only my own process, everything goes well: LCN control works without errors, and all node subscriptions and publications are correct. But if I run other nodes in other containers on my workstation, use ros2 bag play to play back a ros2 bag with a lot of sensor data, and control my nodes through LCN, things go wrong. It raises this exception:

terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what(): failed to send response: client will not receive response, at /root/ros2_humble/src/ros2/rmw_fastrtps/rmw_fastrtps_shared_cpp/src/rmw_response.cpp:153, at /root/ros2_humble/src/ros2/rcl/rcl/src/rcl/service.c:314

This error occurred while I was activating my 12th node; I don't know whether that is related. If I trigger the state transitions with no message playback, the error may not occur. I use a Python script that calls os.system to control my nodes:

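import os
from typing import List

def on_configure(node_names: List[str]):
    # Transition ID 1 is "configure" (lifecycle_msgs/msg/Transition).
    for name in node_names:
        print(name)
        os.system(f"ros2 lifecycle set {name} 1")

def on_activate(node_names: List[str]):
    # Transition ID 3 is "activate".
    for name in node_names:
        print(name)
        os.system(f"ros2 lifecycle set {name} 3")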

All nodes in my process are using the same qos profile:

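#include "rclcpp/qos.hpp"
#include "rmw/types.h"

// KEEP_LAST history with a depth of 1.
const rclcpp::QoSInitialization qos_init_1(
    rmw_qos_history_policy_t::RMW_QOS_POLICY_HISTORY_KEEP_LAST, 1);

const rmw_qos_profile_t _1_profile{
    RMW_QOS_POLICY_HISTORY_KEEP_LAST,          // history
    1,                                         // depth
    RMW_QOS_POLICY_RELIABILITY_RELIABLE,       // reliability
    RMW_QOS_POLICY_DURABILITY_VOLATILE,        // durability
    { 0LL, 0LL },                              // deadline
    { 0LL, 0LL },                              // lifespan
    RMW_QOS_POLICY_LIVELINESS_SYSTEM_DEFAULT,  // liveliness
    { 0LL, 0LL },                              // liveliness_lease_duration
    false };                                   // avoid_ros_namespace_conventions

const rclcpp::QoS qos_1(qos_init_1, _1_profile);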

I set the history depth to 1 because more than one of my nodes subscribes to the same node, and I found that a large history may cause message drops.

I have tried the changes from https://github.com/ros2/rmw_fastrtps/pull/704 on the source code, and I confirmed by printing it that max_blocking_time was changed. This doesn't help. Thanks.

fujitatomoya commented 1 year ago

os.system(f"ros2 lifecycle set {name} 1")

I would check whether this shell call returned successfully. It actually sends a service request to the lifecycle node, which probably means the 12th client's service response path does not exist yet when the server replies to the client.
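A minimal sketch of that check, assuming `ros2 lifecycle set` signals failure through a non-zero exit status (worth verifying on your installation). The helper names `set_lifecycle_state` and `run_transition` and the injectable `run` parameter are hypothetical, introduced only for illustration; `subprocess.run` stands in for `os.system` so the exit code and stderr are easy to inspect.

```python
import subprocess
from typing import Callable, List, Sequence


def set_lifecycle_state(name: str, transition: str,
                        run: Callable = subprocess.run) -> bool:
    """Invoke `ros2 lifecycle set` once and report whether it exited cleanly."""
    cmd = ["ros2", "lifecycle", "set", name, str(transition)]
    completed = run(cmd, capture_output=True, text=True)
    if completed.returncode != 0:
        # Assumption: a failed transition shows up as a non-zero exit code.
        print(f"{name}: transition {transition} failed "
              f"(exit code {completed.returncode})")
        print(completed.stderr.strip())
        return False
    return True


def run_transition(node_names: Sequence[str], transition: str,
                   run: Callable = subprocess.run) -> List[str]:
    """Apply one transition to every node; return the names that failed."""
    return [name for name in node_names
            if not set_lifecycle_state(name, transition, run)]
```

Calling `run_transition(node_names, "1")` and then `run_transition(node_names, "3")` would mirror the original script while returning the nodes whose transition did not succeed, so they can be logged or retried.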

All my nodes are using intra-process-communication

The system call above will not use intra-process communication. I would use the rclpy API in the application code instead of os.system.
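One possible reading of that suggestion, as a rough sketch rather than a verified fix: keep a single long-lived rclpy node and call each lifecycle node's `change_state` service directly. It assumes the target nodes expose the standard `<node_name>/change_state` service of type `lifecycle_msgs/srv/ChangeState`; the helper names `change_state_service` and `change_state` and the 10-second timeout are hypothetical.

```python
def change_state_service(node_name: str) -> str:
    """Build the name of a lifecycle node's change_state service."""
    return f"{node_name.rstrip('/')}/change_state"


def change_state(helper_node, target: str, transition_id: int,
                 timeout_sec: float = 10.0) -> bool:
    """Request one lifecycle transition and report whether it succeeded."""
    # ROS imports are kept local so this module can be imported
    # even where no ROS environment is sourced.
    import rclpy
    from lifecycle_msgs.srv import ChangeState

    client = helper_node.create_client(ChangeState, change_state_service(target))
    try:
        # Wait until the server is visible before sending the request.
        if not client.wait_for_service(timeout_sec=timeout_sec):
            return False
        request = ChangeState.Request()
        request.transition.id = transition_id
        future = client.call_async(request)
        rclpy.spin_until_future_complete(helper_node, future,
                                         timeout_sec=timeout_sec)
        if not future.done():
            return False  # no response arrived within the timeout
        result = future.result()
        return result is not None and result.success
    finally:
        helper_node.destroy_client(client)
```

A caller would run `rclpy.init()`, create one helper node with `rclpy.create_node(...)`, and pass `Transition.TRANSITION_CONFIGURE` or `Transition.TRANSITION_ACTIVATE` from `lifecycle_msgs.msg` as `transition_id` for each target node, then call `rclpy.shutdown()` at the end.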

https://github.com/ros2/ros2/issues/1253 could be a similar problem.

KKKiwiXU commented 1 year ago

Thanks for answering. I found that the timeout error occurs randomly on any node, not only on the 12th. I also wrote a C++ service to invoke the transition function, but these attempts didn't work. I then removed all the intra-process communication and applied the #704 change. This combination alleviates the problem, but it is not enough. We now have more than 50 topics, and there will be even more in the future. I would appreciate it if you could tell me whether there is any way to strictly specify the publisher and the subscription, replacing the search-and-match stage when a publisher is created. That could solve this problem in a better way. Thanks.

clalancette commented 1 year ago

We think that this may be solved by https://github.com/ros2/rclcpp/pull/2280 . If you can try that one out and see if it improves the situation for you, that would be very helpful. Thanks.

fujitatomoya commented 1 year ago

@KKKiwiXU https://github.com/ros2/rclcpp/pull/2280 has been merged to humble, which should fix the problem. I will go ahead and close this; if you still hit the problem, please feel free to reopen. Thanks for posting the issue.

alexleel commented 1 year ago

I use the latest rclcpp; however, I still face the issue:

terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what(): failed to send response: client will not receive response, at ./src/rmw_response.cpp:154, at ./src/rcl/service.c:314

What can I do to debug it? As far as I know, the exception is thrown by service.hpp in rclcpp, am I right?

fujitatomoya commented 1 year ago

I use the latest rclcpp; however, I still face the issue:

Can you share your distribution and rclcpp version?

alexleel commented 1 year ago

apt-show-versions | grep rclcpp
ros-humble-rclcpp:amd64/jammy 16.0.6-1jammy.20230919.213531 uptodate

and the command is as follows:

ros2 run demo_nodes_cpp add_two_ints_server

I wonder if I'm missing something?

fujitatomoya commented 1 year ago

https://github.com/ros2/rclcpp/pull/2280 is available in rclcpp version 16.0.6-1jammy.20230919.213531, so I believe you are using the correct version. The only thing I can think of is that it returns some error other than RCL_RET_TIMEOUT. I would recommend creating another issue for that.