ros2 / rclpy

rclpy (ROS Client Library for Python)
Apache License 2.0
308 stars 227 forks source link

Bug, rclpy Action Client: Highly variable client call time. #1059

Open rshanor opened 1 year ago

rshanor commented 1 year ago

Bug report

Required Info:

Steps to reproduce issue

I have an Action Server and Action Client implemented in Python. I noticed that periodically, the total client round trip time took several seconds, even though the server consistently responds in 250ms. Everything is running on the same machine and I am using the env variable ROS_LOCALHOST_ONLY=1. The messages size is fairly big, containing a single image slightly larger than 1080p. This is implemented as an ActionServer to match https://github.com/basler/pylon-ros-camera/tree/humble, but I could try to rewrite as a standard service if that would help.

My server callback looks something like

    def grab_images(self, goal_handle) -> GrabImages.Result:  # type: ignore
        self.get_logger().info("Capturing single image...")
        result = GrabImages.Result()
        image_msg = self._capture_sim_image()
        result.images.append(image_msg)
        result.cam_info = self.camera_info_manager.getCameraInfo()
        result.success = True
        goal_handle.succeed()
        self.get_logger().info("Captured single image...")
        return result

And my client code looks like

    def send_goal(self) -> GrabImages.Result:
        goal_msg = GrabImages.Goal()
        self._node.get_logger().info("Sending goal.")
        future = self._action_client.send_goal_async(goal_msg)
        rclpy.spin_until_future_complete(self._node, future, timeout_sec=5.0)
        assert future.done()
        self._node.get_logger().info("Goal accepted, waiting for result.")
        goal_handle = future.result()
        result_future = goal_handle.get_result_async()
        rclpy.spin_until_future_complete(self._node, result_future, timeout_sec=5.0)
        assert result_future.done()
        result = result_future.result().result
        self._node.get_logger().info("Got a result.")
        return result

From what I can tell, it looks like the delay is between waiting for result and receiving result.

[INFO] [1671587003.917166720] [call_grab_image_action]: Sending goal.
[INFO] [1671587003.918784412] [call_grab_image_action]: Goal accepted, waiting for result.
[INFO] [1671587006.293852748] [call_grab_image_action]: Got a result.
...
[INFO] [1671587129.343684424] [call_grab_image_action]: Sending goal.
[INFO] [1671587129.346026253] [call_grab_image_action]: Goal accepted, waiting for result.
[INFO] [1671587132.675653545] [call_grab_image_action]: Got a result.

Expected behavior

Client total action time to be slightly larger than server time.

Actual behavior

On occasion, client time is 10X or more larger than server time.

Additional information

Screenshot from 2022-12-20 17-32-37

rshanor commented 1 year ago

Running with debug logging. I see Sending service response in the server logs but a long delay before Action status taken in the client logs.

Client

[INFO] [1671659682.131745101] [call_grab_image_action]: Sending goal.
[DEBUG] [1671659682.131818861] [rcl_action]: Sending action goal request
[DEBUG] [1671659682.131822305] [rcl]: Client sending service request
[DEBUG] [1671659682.131853398] [rcl_action]: Action goal request sent
[DEBUG] [1671659682.131902662] [rcl]: Timer canceled
[DEBUG] [1671659682.131915339] [rcl]: Initializing timer with period: 3000000000ns
[DEBUG] [1671659682.131922580] [rcl]: Updated timer period from '3000000000ns' to '3000000000ns'
[DEBUG] [1671659682.131942364] [rcl]: Timer canceled
[DEBUG] [1671659682.131947286] [rcl]: Initializing timer with period: 2999935849ns
[DEBUG] [1671659682.131950934] [rcl]: Updated timer period from '2999935849ns' to '2999935849ns'
[DEBUG] [1671659682.131969332] [rcl]: Timer canceled
[DEBUG] [1671659682.131974137] [rcl]: Initializing timer with period: 2999908247ns
[DEBUG] [1671659682.131977434] [rcl]: Updated timer period from '2999908247ns' to '2999908247ns'
[DEBUG] [1671659682.132024143] [rcl]: Initializing wait set with '2' subscriptions, '2' guard conditions, '1' timers, '3' clients, '6' services
[DEBUG] [1671659682.132095458] [rcl]: Initializing wait set with '2' subscriptions, '2' guard conditions, '1' timers, '3' clients, '6' services
[DEBUG] [1671659682.133729544] [rcl_action]: Taking action goal response
[DEBUG] [1671659682.133742448] [rcl]: Client taking service response
[DEBUG] [1671659682.133754879] [rcl]: Client take response succeeded: true
[DEBUG] [1671659682.133757253] [rcl_action]: Action goal response taken
[INFO] [1671659682.134132751] [call_grab_image_action]: Goal accepted, waiting for result.
[DEBUG] [1671659682.134190355] [rcl_action]: Sending action result request
[DEBUG] [1671659682.134193659] [rcl]: Client sending service request
[DEBUG] [1671659682.134214154] [rcl_action]: Action result request sent
[DEBUG] [1671659682.134268579] [rcl]: Timer canceled
[DEBUG] [1671659682.134283606] [rcl]: Initializing timer with period: 3000000000ns
[DEBUG] [1671659682.134291090] [rcl]: Updated timer period from '3000000000ns' to '3000000000ns'
[DEBUG] [1671659682.134311948] [rcl]: Timer canceled
[DEBUG] [1671659682.134316976] [rcl]: Initializing timer with period: 2999928587ns
[DEBUG] [1671659682.134320655] [rcl]: Updated timer period from '2999928587ns' to '2999928587ns'
[DEBUG] [1671659682.134336290] [rcl]: Timer canceled
[DEBUG] [1671659682.134340508] [rcl]: Initializing timer with period: 2999903637ns
[DEBUG] [1671659682.134344632] [rcl]: Updated timer period from '2999903637ns' to '2999903637ns'
[DEBUG] [1671659682.134392980] [rcl]: Initializing wait set with '2' subscriptions, '2' guard conditions, '1' timers, '3' clients, '6' services
[DEBUG] [1671659682.134446534] [rcl_action]: Taking action status
[DEBUG] [1671659682.134448980] [rcl]: Subscription taking message
[DEBUG] [1671659682.134479482] [rcl]: Subscription take succeeded: true
[DEBUG] [1671659682.134481962] [rcl_action]: Action status taken
[DEBUG] [1671659682.153070645] [rcl]: Timer canceled
[DEBUG] [1671659682.153106819] [rcl]: Initializing timer with period: 2981208927ns
[DEBUG] [1671659682.153137015] [rcl]: Updated timer period from '2981208927ns' to '2981208927ns'
[DEBUG] [1671659682.153201044] [rcl]: Initializing wait set with '2' subscriptions, '2' guard conditions, '1' timers, '3' clients, '6' services
[DEBUG] [1671659682.153286712] [rcl]: Initializing wait set with '2' subscriptions, '2' guard conditions, '1' timers, '3' clients, '6' services
[DEBUG] [1671659682.368585382] [rcl_action]: Taking action status
[DEBUG] [1671659682.368608456] [rcl]: Subscription taking message
[DEBUG] [1671659682.368701045] [rcl]: Subscription take succeeded: true
[DEBUG] [1671659688.946816840] [rcl_action]: Action status taken
Traceback (most recent call last):
    assert result_future.done()
AssertionError

And server

[DEBUG] [1671659682.131994830] [rcl]: Service server taking service request
[DEBUG] [1671659682.132022257] [rcl]: Service take request succeeded: true
[DEBUG] [1671659682.133360429] [robodk_sim_camera.action_server]: New goal request with ID: [ 60  22  78  33 163 196  71 243 146  43  48 200 194 203 238 238]
[DEBUG] [1671659682.133560286] [rcl]: Sending service response
[DEBUG] [1671659682.133931800] [robodk_sim_camera.action_server]: New goal accepted: [ 60  22  78  33 163 196  71 243 146  43  48 200 194 203 238 238]
[DEBUG] [1671659682.134421058] [robodk_sim_camera.action_server]: Executing goal with ID [ 60  22  78  33 163 196  71 243 146  43  48 200 194 203 238 238]
[INFO] [1671659682.134647087] [robodk_sim_camera]: Capturing single image...
[DEBUG] [1671659682.368423588] [rcl]: Timer successfully reset
[DEBUG] [1671659682.368447476] [rcl]: Updated timer period from '459884619236ns' to '459617440767ns'
[INFO] [1671659682.368759227] [robodk_sim_camera]: Captured single image...
[DEBUG] [1671659682.369134753] [robodk_sim_camera.action_server]: Goal with ID [ 60  22  78  33 163 196  71 243 146  43  48 200 194 203 238 238] finished with state 4
[DEBUG] [1671659682.369311198] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '2' timers, '0' clients, '13' services
[DEBUG] [1671659682.369407977] [rcl]: Service server taking service request
[DEBUG] [1671659682.369421793] [rcl]: Service take request succeeded: true
[DEBUG] [1671659682.370370704] [robodk_sim_camera.action_server]: Result request received for goal with ID: [ 60  22  78  33 163 196  71 243 146  43  48 200 194 203 238 238]
[DEBUG] [1671659682.374861538] [rcl]: Sending service response
[DEBUG] [1671659682.377828502] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '2' timers, '0' clients, '13' services
[DEBUG] [1671659682.377947648] [rcl]: Initializing wait set with '0' subscriptions, '2' guard conditions, '2' timers, '0' clients, '13' services
rshanor commented 1 year ago

And FWIW, I noticed the same behavior with rmw_cyclonedds_cpp

rshanor commented 1 year ago

I just noticed that the server is leaking memory like crazy. When I take out the line result.images.append(image_msg), I do not observe the leak. The below plot is over a couple minutes at most.

image

fujitatomoya commented 1 year ago
[DEBUG] [1671659682.131853398] [rcl_action]: Action goal request sent
-> Goal request sent by client.

...

[DEBUG] [1671659682.133560286] [rcl]: Sending service response
-> Goal accepted by server, sending goal response.

...

[DEBUG] [1671659682.133757253] [rcl_action]: Action goal response taken
-> Goal response received by client.
[INFO] [1671659682.134132751] [call_grab_image_action]: Goal accepted, waiting for result.
[DEBUG] [1671659682.134190355] [rcl_action]: Sending action result request
[DEBUG] [1671659682.134193659] [rcl]: Client sending service request
[DEBUG] [1671659682.134214154] [rcl_action]: Action result request sent
-> Result request sent by client.

...

[DEBUG] [1671659682.370370704] [robodk_sim_camera.action_server]: Result request received for goal with ID: [ 60  22  78  33 163 196  71 243 146  43  48 200 194 203 238 238]
[DEBUG] [1671659682.374861538] [rcl]: Sending service response
-> Result response sent by server.

...

[DEBUG] [1671659682.368585382] [rcl_action]: Taking action status
[DEBUG] [1671659682.368608456] [rcl]: Subscription taking message
[DEBUG] [1671659682.368701045] [rcl]: Subscription take succeeded: true
-> This is status topic, status succeeded delivered to client.
[DEBUG] [1671659688.946816840] [rcl_action]: Action status taken
-> Why this takes 5 seconds to take the message out of the queue?

Why we are not seeing message Taking action result response?

rshanor commented 1 year ago

Not sure. My ActionClient is initialized as follows, if it matters at all.

        self._action_client = ActionClient(
            node, GrabImages, action_name, callback_group=MutuallyExclusiveCallbackGroup()
        )
rshanor commented 1 year ago

If I am not mistake, the only possibly meaningful code between Subscription take succeeded and Action #Type taken is TRACEPOINT(rcl_take, (const void *)ros_message);.

The rcl_action code in question:

// \internal Takes an action client specific topic message.
#define TAKE_MESSAGE(Type) \
  RCUTILS_LOG_DEBUG_NAMED(ROS_PACKAGE_NAME, "Taking action " #Type); \
  if (!rcl_action_client_is_valid(action_client)) { \
    return RCL_RET_ACTION_CLIENT_INVALID;  /* error already set */ \
  } \
  RCL_CHECK_ARGUMENT_FOR_NULL(ros_ ## Type, RCL_RET_INVALID_ARGUMENT); \
  rmw_message_info_t message_info; /* ignored */ \
  rcl_ret_t ret = rcl_take( \
    &action_client->impl->Type ## _subscription, ros_ ## Type, &message_info, NULL); \
  if (RCL_RET_OK != ret) { \
    if (RCL_RET_SUBSCRIPTION_TAKE_FAILED == ret) { \
      return RCL_RET_ACTION_CLIENT_TAKE_FAILED; \
    } \
    if (RCL_RET_BAD_ALLOC == ret) { \
      return RCL_RET_BAD_ALLOC; \
    } \
    return RCL_RET_ERROR; \
  } \
  RCUTILS_LOG_DEBUG_NAMED(ROS_PACKAGE_NAME, "Action " #Type " taken"); \
  return RCL_RET_OK;

rcl_ret_t
rcl_action_take_status(
  const rcl_action_client_t * action_client,
  void * ros_status)
{
  TAKE_MESSAGE(status);
}

Which calls the following rcl code

rcl_ret_t
rcl_take(
  const rcl_subscription_t * subscription,
  void * ros_message,
  rmw_message_info_t * message_info,
  rmw_subscription_allocation_t * allocation
)
{
  RCUTILS_LOG_DEBUG_NAMED(ROS_PACKAGE_NAME, "Subscription taking message");
  if (!rcl_subscription_is_valid(subscription)) {
    return RCL_RET_SUBSCRIPTION_INVALID;  // error message already set
  }
  RCL_CHECK_ARGUMENT_FOR_NULL(ros_message, RCL_RET_INVALID_ARGUMENT);

  // If message_info is NULL, use a place holder which can be discarded.
  rmw_message_info_t dummy_message_info;
  rmw_message_info_t * message_info_local = message_info ? message_info : &dummy_message_info;
  *message_info_local = rmw_get_zero_initialized_message_info();
  // Call rmw_take_with_info.
  bool taken = false;
  rmw_ret_t ret = rmw_take_with_info(
    subscription->impl->rmw_handle, ros_message, &taken, message_info_local, allocation);
  if (ret != RMW_RET_OK) {
    RCL_SET_ERROR_MSG(rmw_get_error_string().str);
    return rcl_convert_rmw_ret_to_rcl_ret(ret);
  }
  RCUTILS_LOG_DEBUG_NAMED(
    ROS_PACKAGE_NAME, "Subscription take succeeded: %s", taken ? "true" : "false");
  TRACEPOINT(rcl_take, (const void *)ros_message);
  if (!taken) {
    return RCL_RET_SUBSCRIPTION_TAKE_FAILED;
  }
  return RCL_RET_OK;
}
koki-ogura commented 1 year ago

This issue was resolved in PR#1070.