Closed fujitatomoya closed 2 months ago
@SteveMacenski @clalancette @AlexeyMerzlyakov what do you think?
One minute is better yet than 10 seconds, but doesn't really address the underlying problem. We need a different mechanic to expire goals that isn't based on request time (e.g. last-updated time? last-result-requested time?)
But, I'll take incremental improvements where I can get it. Nav2's workaround of increasing the time solves my immediate problems like this, but still leaves every other user that isn't extremely well plugged into the on-goings of Nav2 / rclcpp development in the lurch. So for their sake to help mask more of the problem in the meantime, I would be very supportive of a move up to 1 minute.
+1 this is a good suggestion
last-updated time? last-result-requested time?
i see, maybe we can rely on server process or event that resets the timeout. (e.g publish_feedback
, publish_status
or result_request_received
) thanks for the idea.
the followings are PRs just change the default value.
Thank Alexey! I’m just escalating his good sleuthing, that is his idea!
Thanks so much for the time to help address this issue @fujitatomoya!
@fujitatomoya, thank you for the making an attention on this issue! Yes, it will handle the immediate problem, so I would like to see these two PR-s to be merged. In long-term perspective, I agree that the change is required to be in timeout handling mechanism. It seems, that initially the value was decreased in RCL as a workaround to reduce memory consumption on unused actions. However, from this point of view, the timeout is better to be calculated from latest feedback/status/any other event in action server-client chain, rather than from request starting time (as it made currently).
Anyway, this is also a solution for now, so I would be OK to apply it.
This issue has been mentioned on ROS Discourse. There might be relevant details there:
https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2023-09-21/33733/2
If I understand correctly I think there are two bugs. One here, and one in the Python API.
rclpy
API appears to rely on the user to request the result when the goal is accepted, but that seems error prone to me. That seems to be what ros2/rclcpp#2101 is about. I think we should move that issue to ros2/rclpy
.A little info on the design. The client that sent the goal is expected to ask for the result as soon as it learns the goal has been accepted by the server.
The result should be kept for action completion time + timeout in case someone other than the original client wants to look at it.
What looks like it is working as intended is the client only expires goals in a terminal state. See here where the expire logic skips active goals (accepted, executing, canceling). That means even with the expire timer issue any client can get the result as long as they do so before the action completes. Second, the C++ action client properly requests the result as soon as the goal is accepted.
I saw some comments about how this could be a problem for actions that take hours to run, but I think that case is fine. Instead I think the issue would be when the timeout was very small and the goal was completed faster than the client could request the result.
Maybe in the Python API we should require the user of the Python action client to give a callback to be called when the result is ready? Or, maybe we should always request a result and give them a future they can keep or discard? Either way I think the get result service should be called here.
@sloretz thank you for sharing comments.
I might be mistaken, so i need to check.
The expire timeout is being calculated from the time the goal was accepted. It should be calculated from the time the goal is completed.
true, this is against https://design.ros2.org/articles/actions.html#result-caching
The rclpy API appears to rely on the user to request the result when the goal is accepted
i think rclcpp
does rely on the user to request the result too.
there seems to be 2 ways to call send_request_result
.
async_send_goal
request with options.result_callback
.async_get_result
request.rclpy
needs to call ClientGoalHandle::get_result_async
from user application, i believe this is likely to be called within user callback through the future from send_goal_async
.
That seems to be what Disabled result caching: Result gets erased before result callback is registered rclcpp#2101 is about. I think we should move that issue to ros2/rclpy.
https://github.com/ros2/rclcpp/issues/2101 creates the action server with rclcpp
, so i believe the problem is in rclcpp
.
https://github.com/ros2/rclcpp/issues/2101 action client calls ClientGoalHandle::get_result_async
immediately after goal is accepted.
action server completed the work right away, and the goal state becomes GOAL_EVENT_SUCCEED
, and with result_timeout is set to 0, it will expire the goal immediately.
after all, i think what needs to be fixed is result_timeout should be applied against action completion time
, for doing that we need to save action terminal state time
(the time that goal becomes terminal state) for each goal internally.
what do you think?
@iuhilnehc-ynos @Barry-Xu-2018 could you check this when you have time? i would like to have the 2nd opinion. thanks in advance.
after all, i think what needs to be fixed is result_timeout should be applied against action completion time, for doing that we need to save action terminal state time(the time that goal becomes terminal state) for each goal internally.
Totally agree. This is an important fixed point.
According to the design, it is possible to not get result if timeout is set as 0.
Based on current implementation (rclcpp/rclpy), on service side, goal process start while goal request is accepted. If the goal processing time is very short and the timeout is set to 0, it is possible that the goal result is deleted before getting the user's request result.
If we want to ensure that users always receive the goal result when the timeout is set to 0, the current interface may need to be modified.
@Barry-Xu-2018 @iuhilnehc-ynos you guys have bandwidth for this fix? that would be really appreciated.
CC: @clalancette @sloretz
Okay. There are 2 issues.
The expire timeout is being calculated from the time the goal was accepted. It should be calculated from the time the goal is completed.
We will fix this.
While the timeout is set to 0, make sure that users always receive the goal result.
We'll further consider the detailed solution and discuss it here. Since this is a special case, we don't want to introduce significant changes just to fix this issue. Significant changes could potentially introduce new problems.
While the timeout is set to 0, make sure that users always receive the goal result.
i am not really sure about this comment.
https://design.ros2.org/articles/actions.html#result-caching explains well.
If the timeout is configured to have value -1, then goal results will be “kept forever” (until the action server shuts down). If the timeout is configured to have value 0, then goal results are discarded immediately (after responding to any pending result requests).
if the timeout is set to 0, once goal has been completed, server checks pending result requests
to send the result, and then discard the goal immediately. in this case, client makes sure to send the result request before server complete the goal. unless, result will not be ready for clients.
if the timeout is set to 0, once goal has been completed, server checks pending result requests to send the result, and then discard the goal immediately. in this case, client makes sure to send the result request before server complete the goal. unless, result will not be ready for clients.
While goal processing time is very short (just send goal accept to client), client doesn't send goal result to service.
Current implementation must get the response of goal accept and then can send request of goal result. That is, we cannot make sure that client sends the result request before server complete the goal.
So consider whether there is a solution for this problem. Of course, using enough timeout can avoid this problem.
Understood. there will be always racy condition between result requested
and goal completed
in this case.
I think that is why we have own timeout option that server can manage by itself, because clients might be gone already.
saying, for example,
result requested
flag. (this client might be gone after sending the goal request, the result will never be delivered to client.)so i think current design with timeout on server after goal completion makes sense. but i am open for more options and ideas 👍 thanks
While goal processing time is very short (just send goal accept to client), client doesn't send goal result to service. Current implementation must get the response of goal accept and then can send request of goal result. That is, we cannot make sure that client sends the result request before server complete the goal.
I agree with you that a goal result could be missed if the goal executed very fast and the result timeout was zero, but I think that's a case of a missconfigured goal timeout in the application rather than a bug that needs to be fixed here.
btw,
If the timeout is configured to have value -1
i guess, we would take this as if negative
with typedef int64_t rcutils_duration_value_t;
. we can check -1
, but what other negative value means?
server can keep the result after completion, at least one client requests the result. (what if no client requests the result? caching unnecessary and old result would be problem.) how about sending the goal request with result requested flag. (this client might be gone after sending the goal request, the result will never be delivered to client.)
I've also considered the above solution, and as you described, there are some unavoidable issues. This is because it's impossible to determine when the service will receive the client's request result. How long to retain the goal result is a problem.
I tend to agree with sloretz's point of view. This should be resolved by asking users to set an appropriate timeout.
@Barry-Xu-2018 do we have any update on this issue?
No update.
According to my comments https://github.com/ros2/rcl/issues/1103#issuecomment-1760819558, I will submit a PR to fix first issue. About second issue, it should depend on appropriate timeout set by user (Not a bug).
closing this issue in favor of https://github.com/ros2/rcl/pull/1121
Bug report
Required Info:
Description
If the user application uses default
rcl_action_server_options_t
,goal_handle
will be considered as expired after 10 seconds.as described in https://github.com/ros-planning/navigation2/issues/3765, that is so likely to take more than 10 seconds to set the goal result before expired(10 seconds) once accepted. https://github.com/ros-planning/navigation2/issues/3765#issuecomment-1689951628 analyzes and work-around this issue with setting 30 seconds via
rcl_action_server_options_t.result_timeout
.Consideration / Proposal
rcl_action
: increase the timeout from 10 seconds to 1 minute in default. (15 minutes are too long though.) https://github.com/ros2/rcl/pull/1012 reduced the timeout into 10 seconds, but thinking about the use case such as Nav2 relies on ROS 2 action, 10 seconds is short in default. (backport required to Iron)rclpy_action
: result_timeout default should be set to rcl's default accordingly. (currently this is set to900
seconds.)Related Issue
timeout == 0
does not work)