rcl_action: result_timeout should be started on goal completion

fujitatomoya commented 1 year ago

Bug report

Required Info:

Operating System:
- Ubuntu 22.04 (Any)
Installation type:
- Any but Iron and Rolling
Version or commit hash:
- N.A
DDS implementation:
- Any
Client library (if applicable):
- Any (rclcpp, rclpy)

Description

If the user application uses default rcl_action_server_options_t, goal_handle will be considered as expired after 10 seconds.

as described in https://github.com/ros-planning/navigation2/issues/3765, that is so likely to take more than 10 seconds to set the goal result before expired(10 seconds) once accepted. https://github.com/ros-planning/navigation2/issues/3765#issuecomment-1689951628 analyzes and work-around this issue with setting 30 seconds via rcl_action_server_options_t.result_timeout.

Consideration / Proposal

rcl_action : increase the timeout from 10 seconds to 1 minute in default. (15 minutes are too long though.) https://github.com/ros2/rcl/pull/1012 reduced the timeout into 10 seconds, but thinking about the use case such as Nav2 relies on ROS 2 action, 10 seconds is short in default. (backport required to Iron)
rclpy_action : result_timeout default should be set to rcl's default accordingly. (currently this is set to 900 seconds.)

Related Issue

https://github.com/ros2/rclcpp/issues/2101 (timeout == 0 does not work)

fujitatomoya commented 1 year ago

@SteveMacenski @clalancette @AlexeyMerzlyakov what do you think?

SteveMacenski commented 1 year ago

One minute is better yet than 10 seconds, but doesn't really address the underlying problem. We need a different mechanic to expire goals that isn't based on request time (e.g. last-updated time? last-result-requested time?)

But, I'll take incremental improvements where I can get it. Nav2's workaround of increasing the time solves my immediate problems like this, but still leaves every other user that isn't extremely well plugged into the on-goings of Nav2 / rclcpp development in the lurch. So for their sake to help mask more of the problem in the meantime, I would be very supportive of a move up to 1 minute.

+1 this is a good suggestion

fujitatomoya commented 1 year ago

last-updated time? last-result-requested time?

i see, maybe we can rely on server process or event that resets the timeout. (e.g publish_feedback, publish_status or result_request_received) thanks for the idea.

the followings are PRs just change the default value.

SteveMacenski commented 1 year ago

Thank Alexey! I’m just escalating his good sleuthing, that is his idea!

Thanks so much for the time to help address this issue @fujitatomoya!

AlexeyMerzlyakov commented 1 year ago

@fujitatomoya, thank you for the making an attention on this issue! Yes, it will handle the immediate problem, so I would like to see these two PR-s to be merged. In long-term perspective, I agree that the change is required to be in timeout handling mechanism. It seems, that initially the value was decreased in RCL as a workaround to reduce memory consumption on unused actions. However, from this point of view, the timeout is better to be calculated from latest feedback/status/any other event in action server-client chain, rather than from request starting time (as it made currently).

Anyway, this is also a solution for now, so I would be OK to apply it.

ros-discourse commented 1 year ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/ros-2-tsc-meeting-minutes-2023-09-21/33733/2

sloretz commented 1 year ago

If I understand correctly I think there are two bugs. One here, and one in the Python API.

The expire timeout is being calculated from the time the goal was accepted. It should be calculated from the time the goal is completed.
The rclpy API appears to rely on the user to request the result when the goal is accepted, but that seems error prone to me. That seems to be what ros2/rclcpp#2101 is about. I think we should move that issue to ros2/rclpy.

A little info on the design. The client that sent the goal is expected to ask for the result as soon as it learns the goal has been accepted by the server.

The purpose of [the get result] service is to get the final result of a goal. After a goal has been accepted the client should call this service to receive the result

The result should be kept for action completion time + timeout in case someone other than the original client wants to look at it.

The server should cache the result once it is ready so multiple clients have to opportunity to get it. This is also useful for debugging/introspection tools. [...] To free up resources, the server should discard the result after a configurable timeout period. The timeout can be set as part of options to the action server.

What looks like it is working as intended is the client only expires goals in a terminal state. See here where the expire logic skips active goals (accepted, executing, canceling). That means even with the expire timer issue any client can get the result as long as they do so before the action completes. Second, the C++ action client properly requests the result as soon as the goal is accepted.

I saw some comments about how this could be a problem for actions that take hours to run, but I think that case is fine. Instead I think the issue would be when the timeout was very small and the goal was completed faster than the client could request the result.

Maybe in the Python API we should require the user of the Python action client to give a callback to be called when the result is ready? Or, maybe we should always request a result and give them a future they can keep or discard? Either way I think the get result service should be called here.

fujitatomoya commented 1 year ago

@sloretz thank you for sharing comments.

I might be mistaken, so i need to check.

The expire timeout is being calculated from the time the goal was accepted. It should be calculated from the time the goal is completed.

true, this is against https://design.ros2.org/articles/actions.html#result-caching

The rclpy API appears to rely on the user to request the result when the goal is accepted

i think rclcpp does rely on the user to request the result too.

there seems to be 2 ways to call send_request_result.

call async_send_goal request with options.result_callback.

https://github.com/ros2/rclcpp/blob/0644f220a2f84f1f1f56c61b065228f6cc32420d/rclcpp_action/include/rclcpp_action/client.hpp#L459-L460

call async_get_result request.

https://github.com/ros2/rclcpp/blob/0644f220a2f84f1f1f56c61b065228f6cc32420d/rclcpp_action/include/rclcpp_action/client.hpp#L512

rclpy needs to call ClientGoalHandle::get_result_async from user application, i believe this is likely to be called within user callback through the future from send_goal_async.

That seems to be what Disabled result caching: Result gets erased before result callback is registered rclcpp#2101 is about. I think we should move that issue to ros2/rclpy.

https://github.com/ros2/rclcpp/issues/2101 creates the action server with rclcpp, so i believe the problem is in rclcpp. https://github.com/ros2/rclcpp/issues/2101 action client calls ClientGoalHandle::get_result_async immediately after goal is accepted. action server completed the work right away, and the goal state becomes GOAL_EVENT_SUCCEED, and with result_timeout is set to 0, it will expire the goal immediately.

after all, i think what needs to be fixed is result_timeout should be applied against action completion time, for doing that we need to save action terminal state time(the time that goal becomes terminal state) for each goal internally.

what do you think?

fujitatomoya commented 12 months ago

@iuhilnehc-ynos @Barry-Xu-2018 could you check this when you have time? i would like to have the 2nd opinion. thanks in advance.

Barry-Xu-2018 commented 12 months ago

after all, i think what needs to be fixed is result_timeout should be applied against action completion time, for doing that we need to save action terminal state time(the time that goal becomes terminal state) for each goal internally.

Totally agree. This is an important fixed point.

According to the design, it is possible to not get result if timeout is set as 0.
Based on current implementation (rclcpp/rclpy), on service side, goal process start while goal request is accepted. If the goal processing time is very short and the timeout is set to 0, it is possible that the goal result is deleted before getting the user's request result.
If we want to ensure that users always receive the goal result when the timeout is set to 0, the current interface may need to be modified.

fujitatomoya commented 11 months ago

@Barry-Xu-2018 @iuhilnehc-ynos you guys have bandwidth for this fix? that would be really appreciated.

CC: @clalancette @sloretz

Barry-Xu-2018 commented 11 months ago

Okay. There are 2 issues.

The expire timeout is being calculated from the time the goal was accepted. It should be calculated from the time the goal is completed.

We will fix this.
While the timeout is set to 0, make sure that users always receive the goal result.

We'll further consider the detailed solution and discuss it here. Since this is a special case, we don't want to introduce significant changes just to fix this issue. Significant changes could potentially introduce new problems.

fujitatomoya commented 11 months ago

While the timeout is set to 0, make sure that users always receive the goal result.

i am not really sure about this comment.

https://design.ros2.org/articles/actions.html#result-caching explains well.

If the timeout is configured to have value -1, then goal results will be “kept forever” (until the action server shuts down). If the timeout is configured to have value 0, then goal results are discarded immediately (after responding to any pending result requests).

if the timeout is set to 0, once goal has been completed, server checks pending result requests to send the result, and then discard the goal immediately. in this case, client makes sure to send the result request before server complete the goal. unless, result will not be ready for clients.

Barry-Xu-2018 commented 11 months ago

if the timeout is set to 0, once goal has been completed, server checks pending result requests to send the result, and then discard the goal immediately. in this case, client makes sure to send the result request before server complete the goal. unless, result will not be ready for clients.

While goal processing time is very short (just send goal accept to client), client doesn't send goal result to service.
Current implementation must get the response of goal accept and then can send request of goal result. That is, we cannot make sure that client sends the result request before server complete the goal.
So consider whether there is a solution for this problem. Of course, using enough timeout can avoid this problem.

fujitatomoya commented 11 months ago

Understood. there will be always racy condition between result requested and goal completed in this case. I think that is why we have own timeout option that server can manage by itself, because clients might be gone already.

saying, for example,

server can keep the result after completion, at least one client requests the result. (what if no client requests the result? caching unnecessary and old result would be problem.)
how about sending the goal request with result requested flag. (this client might be gone after sending the goal request, the result will never be delivered to client.)

so i think current design with timeout on server after goal completion makes sense. but i am open for more options and ideas 👍 thanks

sloretz commented 11 months ago

While goal processing time is very short (just send goal accept to client), client doesn't send goal result to service. Current implementation must get the response of goal accept and then can send request of goal result. That is, we cannot make sure that client sends the result request before server complete the goal.

I agree with you that a goal result could be missed if the goal executed very fast and the result timeout was zero, but I think that's a case of a missconfigured goal timeout in the application rather than a bug that needs to be fixed here.

fujitatomoya commented 11 months ago

btw,

If the timeout is configured to have value -1

i guess, we would take this as if negative with typedef int64_t rcutils_duration_value_t;. we can check -1, but what other negative value means?

Barry-Xu-2018 commented 11 months ago

server can keep the result after completion, at least one client requests the result. (what if no client requests the result? caching unnecessary and old result would be problem.) how about sending the goal request with result requested flag. (this client might be gone after sending the goal request, the result will never be delivered to client.)

I've also considered the above solution, and as you described, there are some unavoidable issues. This is because it's impossible to determine when the service will receive the client's request result. How long to retain the goal result is a problem.

I tend to agree with sloretz's point of view. This should be resolved by asking users to set an appropriate timeout.

fujitatomoya commented 10 months ago

@Barry-Xu-2018 do we have any update on this issue?

Barry-Xu-2018 commented 10 months ago

No update.
According to my comments https://github.com/ros2/rcl/issues/1103#issuecomment-1760819558, I will submit a PR to fix first issue. About second issue, it should depend on appropriate timeout set by user (Not a bug).

fujitatomoya commented 2 months ago

closing this issue in favor of https://github.com/ros2/rcl/pull/1121

ros2 / rcl