ros2 / rclpy

rclpy (ROS Client Library for Python)
Apache License 2.0
291 stars 224 forks source link

KeyError in ActionServer._execute_goal #1236

Open rafael1193 opened 7 months ago

rafael1193 commented 7 months ago

Bug report

Required Info:

Steps to reproduce issue

Expected behavior

Action goal result is set without problem.

Actual behavior

The action server fails because self._result_futures[bytes(goal_uuid)].set_result(result_response) in the ActionServer _execute_goal method raises a KeyError exception:

Traceback (most recent call last):
  File "/home/rbailonr/adauv_ws/install/adauv_bluebox/lib/adauv_bluebox/vehicle.py", line 333, in main
    executor.spin()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 294, in spin
    self.spin_once()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 791, in _spin_once_impl
    future.result()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
    raise self.exception()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
    self._handler.send(None)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 352, in _execute_goal
    self._result_futures[bytes(goal_uuid)].set_result(result_response)
KeyError: b'EM$\xc6\xb3YA:\xa9\xa1_\xd1\xeb\x88\x06\x88'

Additional information

This occurrence of this behavior is inconsistent, but only happens in accelerated time.

It looks like a synchronization issue with ActionServer._execute_expire_goals that removes the goal result future from the _result_futures dictionary before _execute_goal sets its result. Checking not self._result_futures[goal_uuid].done() before removing the expired goal seems to mitigate the issue, but the actual source of the problem may be elsewhere.

Barry-Xu-2018 commented 7 months ago

It should not be a bug.
There is a Result caching in action design (Please refer to https://design.ros2.org/articles/actions.html).
Users can decide how long the goal result will be retained. However, it is not guaranteed that the goal result will be consistently retained even if it has not been accessed.

https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L200

fujitatomoya commented 7 months ago

Have a node emit clock messages faster than wall clock moves (at x60)

@Barry-Xu-2018 this will make the result expired (removed) 60 times faster in the action server. i think precisely that this situation could happen, but generating KeyError exception is not designed behavior?

i think we can check if the key bytes(goal_uuid) exists in _result_futures dictionary, and, goal_handle is also remove from _goal_handles. probably it can print warning that says "Goal is already expired before completion", what do you think?

Barry-Xu-2018 commented 7 months ago

i think precisely that this situation could happen, but generating KeyError exception is not designed behavior?

Yes. Throwing an exception is not a correct behavior in this scenario.

i think we can check if the key bytes(goal_uuid) exists in _result_futures dictionary, and, goal_handle is also remove from _goal_handles. probably it can print warning that says "Goal is already expired before completion", what do you think?

Yes. I'll double-check the code logic to see if there are any other places that may need modification when this issue occurs.

Barry-Xu-2018 commented 7 months ago

https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L327-L353

https://github.com/ros2/rclpy/blob/220d714b2b6da81a4abd6a804e3ed4ee8cfd7c3f/rclpy/rclpy/action/server.py#L413-L419

These two parts involve asynchronous operations, so to completely avoid the issue, locking (asyncio.Lock) should be added when operating on self._result_futures.

rafael1193 commented 7 months ago

Thank you very much for your help.

In hindsight, some expiring timeout message was to be expected but not the action server failing. I have increased result_timeout accordingly and the synchronization problem does not seem to appear anymore.

fujitatomoya commented 4 days ago

@Barry-Xu-2018 i am almost sure this has been fixed with https://github.com/ros2/rcl/pull/1121, what do you think?

i think the only thing missing here is to check the goal_uuid key in _result_futures during _execute_get_result_request call. this could generate the key error when the result timeout is too short or simulation time like this issue report.

what do you think?

fujitatomoya commented 4 days ago

i think the only thing missing here is to check the goal_uuid key in _result_futures during _execute_get_result_request call.

I was thinking that the following patch, but i found that is not needed. because it already checks the goal_uuid in the _goal_handles. probably we can adjust the debug message a bit.

https://github.com/ros2/rclpy/blob/2de5c8ea447a1614a7c36d0fcf2250cefc21a594/rclpy/rclpy/action/server.py#L413-L420

diff --git a/rclpy/rclpy/action/server.py b/rclpy/rclpy/action/server.py
index ce60309..06b93d4 100644
--- a/rclpy/rclpy/action/server.py
+++ b/rclpy/rclpy/action/server.py
@@ -472,8 +472,13 @@ class ActionServer(Generic[GoalT, ResultT, FeedbackT], Waitable['ServerGoalHandl

         # There is an accepted goal matching the goal ID, register a callback to send the
         # response as soon as it's ready
-        self._result_futures[bytes(goal_uuid)].add_done_callback(
-            functools.partial(self._send_result_response, request_header))
+        if bytes(goal_uuid) in self._result_futures:
+            self._result_futures[bytes(goal_uuid)].add_done_callback(
+                functools.partial(self._send_result_response, request_header))
+        else:
+            # This can happen when result_timeout is too short, or high-freq simulation time
+            self._logger.warn(
+                'Requested result has been expired already (goal ID: {0})'.format(goal_uuid))

     async def _execute_expire_goals(self, expired_goals: Tuple[GoalInfo, ...]) -> None:
         for goal in expired_goals:
Barry-Xu-2018 commented 4 days ago

@Barry-Xu-2018 i am almost sure this has been fixed with ros2/rcl#1121, what do you think?

If result_timeout is still set to 0, this issue might still occur. This is because the current design doesn't ensure that the result is retrieved before it gets deleted. So it's the user's responsibility to set a reasonable result timeout.

ros2/rcl#1121 isn't aimed at this issue.

i think the only thing missing here is to check the goal_uuid key in _result_futures during _execute_get_result_request call. this could generate the key error when the result timeout is too short or simulation time like this issue report.

what do you think?

Yes. This check is necessary.