ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
536 stars 417 forks source link

Disabled result caching: Result gets erased before result callback is registered #2101

Open benthie opened 1 year ago

benthie commented 1 year ago

Bug report

Required Info:

Steps to reproduce issue

I used the Fibonacci ActionServer (C++) and ActionClient (Python) and extended them as follows to reproduce the bug.

In the server I passed in options that disable the result caching and reduced the execute() function to return quicker.

#include <functional>
#include <memory>
#include <thread>

#include "action_tutorials_interfaces/action/fibonacci.hpp"
#include "rclcpp/rclcpp.hpp"
#include "rclcpp_action/rclcpp_action.hpp"
#include "rclcpp_components/register_node_macro.hpp"

#include "action_tutorials_cpp/visibility_control.h"

namespace action_tutorials_cpp
{
class FibonacciActionServer : public rclcpp::Node
{
public:
  using Fibonacci = action_tutorials_interfaces::action::Fibonacci;
  using GoalHandleFibonacci = rclcpp_action::ServerGoalHandle<Fibonacci>;

  ACTION_TUTORIALS_CPP_PUBLIC
  explicit FibonacciActionServer(const rclcpp::NodeOptions & options = rclcpp::NodeOptions())
  : Node("fibonacci_action_server", options)
  {
    using namespace std::placeholders;

    auto serverOptions = rcl_action_server_get_default_options();
    serverOptions.result_timeout.nanoseconds = 0;

    this->action_server_ = rclcpp_action::create_server<Fibonacci>(
      this,
      "fibonacci",
      std::bind(&FibonacciActionServer::handle_goal, this, _1, _2),
      std::bind(&FibonacciActionServer::handle_cancel, this, _1),
      std::bind(&FibonacciActionServer::handle_accepted, this, _1),
      serverOptions);
  }

private:
  rclcpp_action::Server<Fibonacci>::SharedPtr action_server_;

  rclcpp_action::GoalResponse handle_goal(
    const rclcpp_action::GoalUUID & uuid,
    std::shared_ptr<const Fibonacci::Goal> goal)
  {
    RCLCPP_INFO(this->get_logger(), "Received goal request with order %d", goal->order);
    (void)uuid;
    return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE;
  }

  rclcpp_action::CancelResponse handle_cancel(
    const std::shared_ptr<GoalHandleFibonacci> goal_handle)
  {
    RCLCPP_INFO(this->get_logger(), "Received request to cancel goal");
    (void)goal_handle;
    return rclcpp_action::CancelResponse::ACCEPT;
  }

  void handle_accepted(const std::shared_ptr<GoalHandleFibonacci> goal_handle)
  {
    using namespace std::placeholders;
    // this needs to return quickly to avoid blocking the executor, so spin up a new thread
    std::thread{std::bind(&FibonacciActionServer::execute, this, _1), goal_handle}.detach();
  }

  void execute(const std::shared_ptr<GoalHandleFibonacci> goal_handle)
  {
    RCLCPP_INFO(this->get_logger(), "Executing goal");
    const auto goal = goal_handle->get_goal();
    auto result = std::make_shared<Fibonacci::Result>();
    result->sequence.push_back(42);
    goal_handle->succeed(result);
    RCLCPP_INFO(this->get_logger(), "Executing done");
  }
};  // class FibonacciActionServer

}  // namespace action_tutorials_cpp

RCLCPP_COMPONENTS_REGISTER_NODE(action_tutorials_cpp::FibonacciActionServer)

In the client I added some more logging to both goal_response_callback and get_result_callback.

import rclpy
from rclpy.action import ActionClient
from rclpy.node import Node

from action_tutorials_interfaces.action import Fibonacci

class FibonacciActionClient(Node):

    def __init__(self):
        super().__init__('fibonacci_action_client')
        self._action_client = ActionClient(self, Fibonacci, 'fibonacci')

    def send_goal(self, order):
        goal_msg = Fibonacci.Goal()
        goal_msg.order = order

        self._action_client.wait_for_server()

        self._send_goal_future = self._action_client.send_goal_async(goal_msg)
        self._send_goal_future.add_done_callback(self.goal_response_callback)

    def goal_response_callback(self, future):
        goal_handle = future.result()
        if not goal_handle.accepted:
            self.get_logger().info('Goal rejected :(')
            return

        self.get_logger().info('Goal accepted :)')

        self.get_logger().info("goal_response_callback: get_result_async")
        self._get_result_future = goal_handle.get_result_async()
        self.get_logger().info("goal_response_callback: add_done_callback(result_callback)")
        self._get_result_future.add_done_callback(self.get_result_callback)
        self.get_logger().info("goal_response_callback: done")

    def get_result_callback(self, future):
        result = future.result()
        self.get_logger().info('Result: status {}'.format(result.status))
        self.get_logger().info('Result: {0}'.format(result.result.sequence))
        rclpy.shutdown()

def main(args=None):
    rclpy.init(args=args)

    action_client = FibonacciActionClient()

    action_client.send_goal(10)

    rclpy.spin(action_client)

if __name__ == '__main__':
    main()

Expected behavior

According to the ROS2 ActionServer design article I'd expect the server to discard the results "after responding to any pending result requests". The client should log "result_callback: status 4", which is STATUS_SUCCEEDED.

Actual behavior

The server discards the result as soon as it detects it as expired (see server.cpp L500). In some cases the discarding happens before the client has registered the result callback and thus the result is lost and at the time the result_callback is executed the client logs "result_callback: status 0", which is STATUS_UNKNOWN.

Additional information

I also put the rclcpp_action code in my overlay and added and changed some debug messages to log_level WARN so that I could see when the result gets discarded and when STATUS_UNKNOWN is returned by the server. Here is the log in which one can see the failure.

------------------------------------------------------------------------------------------------------------------------
Client
------------------------------------------------------------------------------------------------------------------------
[INFO] [1676025199.686814735] [fibonacci_action_client]: Goal accepted :)
[INFO] [1676025199.687044809] [fibonacci_action_client]: goal_response_callback: get_result_async
[INFO] [1676025199.687553497] [fibonacci_action_client]: goal_response_callback: add_done_callback(result_callback)
[INFO] [1676025199.687765941] [fibonacci_action_client]: goal_response_callback: done
[INFO] [1676025199.688158300] [fibonacci_action_client]: Result: status 0
[INFO] [1676025199.693995159] [fibonacci_action_client]: Result: array('i')

------------------------------------------------------------------------------------------------------------------------
Server
------------------------------------------------------------------------------------------------------------------------
[INFO] [1676025199.677291824] [fibonacci_action_server]: Received goal request with order 10
[DEBUG] [1676025199.677356341] [rcl]: Sending service response
[WARN] [1676025199.677414186] [fibonacci_action_server.rclcpp_action]: Accepted goal 912881d7a81d4a17b8ac8d6e89d14b0
[DEBUG] [1676025199.677736839] [rcl]: Waiting without timeout
[DEBUG] [1676025199.677742997] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1676025199.677747903] [rcl]: Guard condition in wait set is ready
[DEBUG] [1676025199.677753235] [rcl]: Waiting without timeout
[DEBUG] [1676025199.677755049] [rcl]: Timeout calculated based on next scheduled timer: false
[INFO] [1676025199.677770234] [fibonacci_action_server]: Executing goal
[DEBUG] [1676025199.677945233] [rcl]: Timer successfully reset
[DEBUG] [1676025199.677950962] [rcl]: Updated timer period from '0ns' to '0ns'
[DEBUG] [1676025199.677958279] [rcl]: Waiting with timeout: 0s + 0ns
[DEBUG] [1676025199.677961209] [rcl]: Timeout calculated based on next scheduled timer: true
[INFO] [1676025199.678002260] [fibonacci_action_server]: Executing done
[DEBUG] [1676025199.679048075] [rcl]: Timer in wait set is ready
[DEBUG] [1676025199.679071412] [rcl]: Timer canceled
[WARN] [1676025199.679082202] [fibonacci_action_server.rclcpp_action]: Expired goal 912881d7a81d4a17b8ac8d6e89d14b0
[DEBUG] [1676025199.679193269] [rcl]: Timer canceled
[DEBUG] [1676025199.679202156] [rcl]: Waiting without timeout
[DEBUG] [1676025199.679204135] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1676025199.679207484] [rcl]: Guard condition in wait set is ready
[DEBUG] [1676025199.679211155] [rcl]: Waiting without timeout
[DEBUG] [1676025199.679212952] [rcl]: Timeout calculated based on next scheduled timer: false
[DEBUG] [1676025199.687154346] [rcl]: Service in wait set is ready
[DEBUG] [1676025199.687177283] [rcl]: Service server taking service request
[DEBUG] [1676025199.687183095] [rcl]: Service take request succeeded: true
[WARN] [1676025199.687225064] [fibonacci_action_server.rclcpp_action]: Goal does not exist 912881d7a81d4a17b8ac8d6e89d14b0
[WARN] [1676025199.687228270] [fibonacci_action_server.rclcpp_action]: Creating result response with STATUS_UNKNOWN

I think this problem only applies to the Python API, because in C++ the result callback is part of the SendGoalOptions? I understand that the server cannot predict that some Python ActionClient is going to request the result soon. And I am not sure whether this an actual bug or rather a design decision and known issue the client has to deal with. As far as I understand, the Python API does not offer the ActionClient a way to register interest in the result at any earlier point in time. If there should be a way to do so, I am happy to try it out.

We only saw the issue in our Python based tests were the server was executing a rather short execution callback.

fujitatomoya commented 1 year ago

@benthie thanks for creating issue.

we have the same problem with https://github.com/ros2/ros2/commit/1f5bd8ed43beea199dabe48bc8023af3aba9806c rolling.

i think this is not expected behavior as designed.

fujitatomoya commented 2 months ago

@Barry-Xu-2018 can we also close this aligned with https://github.com/ros2/rcl/issues/1103?

Barry-Xu-2018 commented 2 months ago

can we also close this aligned with https://github.com/ros2/rcl/issues/1103?

Yes. I tried reproduced codes with the latest rolling code.
The desired result can be gotten.

[INFO] [1722307939.302519491] [fibonacci_action_client]: Goal accepted :)
[INFO] [1722307939.302758793] [fibonacci_action_client]: goal_response_callback: get_result_async
[INFO] [1722307939.303069126] [fibonacci_action_client]: goal_response_callback: add_done_callback(result_callback)
[INFO] [1722307939.303270590] [fibonacci_action_client]: goal_response_callback: done
[INFO] [1722307939.303829785] [fibonacci_action_client]: Result: status 0
[INFO] [1722307939.304037665] [fibonacci_action_client]: Result: array('i')
fujitatomoya commented 2 months ago

@Barry-Xu-2018

Yes. I tried reproduced codes with the latest rolling code. The desired result can be gotten.

are you sure about this?

I really do not think so... i think that client should print Result: status 4, STATUS_SUCCEEDED https://docs.ros2.org/foxy/api/action_msgs/msg/GoalStatus.html according to the https://design.ros2.org/articles/actions.html#result-caching

fujitatomoya commented 2 months ago

IMO this problem still stays after https://github.com/ros2/rcl/issues/1103

Barry-Xu-2018 commented 2 months ago

I really do not think so... i think that client should print Result: status 4, STATUS_SUCCEEDED

Yes, my mistake. The expected status value is 4, not 0.

IMO this problem still stays after https://github.com/ros2/rcl/issues/1103

Yes. I will further check this.

Barry-Xu-2018 commented 2 months ago

When the goal is completed, ServerGoalHandle::succeed() will be called. https://github.com/ros2/rclcpp/blob/45adf6565f99042a5fdcad7b8830e69f5efb03e5/rclcpp_action/include/rclcpp_action/server_goal_handle.hpp#L207-L215

Then call https://github.com/ros2/rclcpp/blob/45adf6565f99042a5fdcad7b8830e69f5efb03e5/rclcpp_action/include/rclcpp_action/server.hpp#L477-L493

shared_this->publish_status(); If there are requests for the goal result, it will be sent here. If the goal is completed quickly, the action client may not have received the request for the result yet.

notify_goal_terminal_state() -> rcl_action_notify_goal_done() -> _recalculate_expire_timer() In _recalculate_expire_timer(), expire_timer will be set with new period 0 (timeout is set to 0). Executor will call ServerBase::execute_check_expired_goals() to remove all expired goal.
So when the timeout is set to 0, the goal will be quickly removed once it is completed.

Currently, after receiving goal acceptance message, action client sending the request for the goal result. Considering the time needed for message processing and transmission (especially for Python, which has a longer processing time), when the timeout is set to 0 in action server, it is very likely that the action client will not receive the result.

If the timeout is configured to have value 0, then goal results are discarded immediately (after responding to any pending result requests).

The current implementation follows the design description. The design does not specify that setting the timeout to 0 guarantees that the action client will receive a result at least once.

If the goal is accepted, but the goal result cannot be obtained (like this issue), the user needs to consider whether the timeout value set for the action server is reasonable. So I tend to think this is not a bug.

fujitatomoya commented 1 month ago

According to the ROS2 ActionServer design article I'd expect the server to discard the results "after responding to any pending result requests".

so there is no pending result when the goal succeeds. that means that result request comes back with UNKNOWN goal. this is juset because by the time the server receives the result request, the goal already succeeds and expires.

I think this problem only applies to the Python API, because in C++ the result callback is part of the SendGoalOptions? I

i do not think so. you are right that rclcpp provides the result callback function with async_send_goal, but what it does is almost same with your example with rclpy.

https://github.com/ros2/rclcpp/blob/2739327ee9af0256c7edcbccb810513af0018851/rclcpp_action/include/rclcpp_action/client.hpp#L469-L471

that sends the result request on the goal response callback, technically this is the same situation with the rclpy sample in the report header. this does not guarantee the result is cached at least once the server sends the result to the client, or any timing advantage either. probably using rclcpp can mitigate this problem situation, but the same problem could occur.

i believe that currently user needs to set up the timeout to make sure there is enough time to request the results.

if we want to support this issue, i think we need to sends the result request flag with goal request at the same time. and when the goal is accepted on the server side, the server caches that result request internally until the server responses at least one result. but this generates another problem, what if nobody, no clients request the result at all? the server will keep the result forever...

i think current design and behavior works for user application, any thoughts?