ros2 / rmw

The ROS Middleware (rmw) Interface.
Apache License 2.0
95 stars 67 forks source link

Service clients freeze on multi-client cases. #372

Open kjjpc opened 2 months ago

kjjpc commented 2 months ago

Bug report

Steps to reproduce issue

I run 1 service server and 3 service clients which call service on high frequency as a stress test. Attached are python codes, but same result on c++ codes.

Service Server

import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

class MinimalService(Node):
    def __init__(self):
        super().__init__('minimal_service')
        self.srv = self.create_service(AddTwoInts, 'add_two_ints', self.add_two_ints_callback)

    def add_two_ints_callback(self, request, response):
        response.sum = request.a + request.b
        return response

def main(args=None):
    rclpy.init(args=args)
    minimal_service = MinimalService()
    rclpy.spin(minimal_service)
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Service Client (run on 3 terminals)

import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

class MinimalClientAsync(Node):
    def __init__(self):
        super().__init__('minimal_client_async')
        self.cli = self.create_client(AddTwoInts, 'add_two_ints')
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info('service not available, waiting again...')

    def send_request(self, a, b):
        self.req = AddTwoInts.Request()
        self.req.a = a
        self.req.b = b
        self.future = self.cli.call_async(self.req)
        rclpy.spin_until_future_complete(self, self.future)
        return self.future.result()

def main(args=None):
    rclpy.init(args=args)
    minimal_client = MinimalClientAsync()
    for i in range(1000000000):
        response = minimal_client.send_request(1, 2)
        if i%10000 == 0:
            minimal_client.get_logger().info(
                'Result of add_two_ints: for %d + %d = %d' %
                (1, 2, response.sum))

    minimal_client.destroy_node()
    rclpy.shutdown()

if __name__ == '__main__':
    main()

Expected behavior

Continue to display results from service server repeatedly

Actual behavior

The service client sometimes freezes in spin_until_future_complete. The service client doesn't progress at all when it freezes. Even though 2 of 3 clients are terminated, the freeze client doesn't continue the computation loop. The freeze occurs not only high frequency cases but also low frequency multi-client cases.

Additional information

This issue is reported on https://github.com/ros2/rcl/issues/1163. The freeze is resolved by changing qos parameter like following.

        qos = qos_profile_services_default
        qos.history = HistoryPolicy.KEEP_ALL
        self.cli = self.create_client(AddTwoInts, 'add_two_ints', qos_profile = qos)

The default qos parameter of service is keep last 10 data, and it is not enough for multi-client cases.

In ROS2, all service response from service server is published to all clients, it means the buffer can be filled up by the responses for other callers and the response for the caller can be lost.

The default qos parameter of service client is not suitable for multi-client cases, which are often the case in ROS2. The default qos parameters should be changed from KEEP_LAST to KEEP_ALL. The parameter is defined in qos_profiles.h. https://github.com/ros2/rmw/blob/cf6b0ddba46b182b39852cf9679ef4c0350cb6a1/rmw/include/rmw/qos_profiles.h#L66

kjjpc commented 2 months ago

Additional information (Thanks @wuisky)

The issue of sending response to all clients. https://github.com/ros2/rclcpp/issues/2397

@fujitatomoya https://github.com/ros2/rclcpp/issues/2397 causes the freeze of service client. How should we fix this issue? The proper solution is to resolve the issue https://github.com/ros2/rclcpp/issues/2397.

Easy solution is to change default qos parameters from KEEP_LAST to KEEP_ALL. or Change the depth of default qos parameters from 10 to 100 or 1000. https://github.com/ros2/rmw/blob/cf6b0ddba46b182b39852cf9679ef4c0350cb6a1/rmw/include/rmw/qos_profiles.h#L67

fujitatomoya commented 4 days ago

I tried with your reproducible examples both service server and client, and no process freeze are observed with my environment.

# service server
root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 run prover_rclpy rmw_372_server --ros-args --log-level debug
... <keeps logging>

# service client (print once in a thousand)
root@tomoyafujita:~/ros2_ws/colcon_ws# ros2 run prover_rclpy rmw_372_client
[INFO] [1725490218.437069421] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490244.581709735] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490324.383530299] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490461.426458776] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490658.204459264] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3
[INFO] [1725490919.450157012] [minimal_client_async]: Result of add_two_ints: for 1 + 2 = 3

the above data tells me that it increases the response time on service server significantly as time goes. i am not sure why this is happening, CPU utilization is not even 100% for service server, i think this is rclpy executor efficiency problem.

The service client sometimes freezes in spin_until_future_complete

Can you rephrase freeze here? Is this supposed to mean process gets hung up?

The default qos parameter of service is keep last 10 data, and it is not enough for multi-client cases.

I do not think so.

Because your sample client calls spin_until_future_complete every time after the send request, that means it waits on the single request until it receives the response. (no async send/receive manner is implemented in the sample.) Which also means that as service server aspect, if we have 3 clients running at the same time, it can expect only 3 requests at the same time at most. IMO default QoS keep last 10 can be really enough for this test case. I am not sure why changing QoS depth affects the behavior for this test...

ros2/rclcpp#2397 causes the freeze of service client.

Could you tell us why you think ros2/rclcpp#2397 is the root cause of this?

As mentioned above, server deals with 3 request at most. And server sends the response to all client under the hidden topic, but client checks the identification that tells the message is for itself or not. If message identification is not for the client, client drops that message right away. this is not efficient but i do not see why this can be the reason for this issue.

As far as i see, this is rclpy executor performance problem.