`spin_some` easily deadlocks action clients

asasine commented 6 months ago

Bug report

When running an rclcpp_action::Client<T> in using spin_some() and a rclcpp::Rate, if the server you connect to publishes feedback at a faster rate than your rclcpp::Rate, your client waits for the goal to be accepted forever, effectively becoming deadlocked.

Required Info:

Operating System:
- Ubuntu 20.04
Installation type:
- Binaries
Version or commit hash:
- Galactic
DDS implementation:
- Cyclone DDS
Client library (if applicable):
- rclcpp

Note: based on my understanding of the root cause (see Additional information), I believe this bug exists in rolling too, but I've only tested on Galactic as that's what I have access to at the moment.

Steps to reproduce issue

The below server and client reproduce 100% of the time on my machine when running over loopback. Compilation settings shouldn't matter. Once built, they can be run with ros2 run ... or directly with ./path/to/binary. I'm using action_tutorials_interfaces/action/Fibonacci, but this should repro with any action interface.

Server

```cpp #include #include #include int main(int argc, char *argv[]) { rclcpp::init(argc, argv); auto node = rclcpp::Node::make_shared("fast_feedback_server"); using Action = action_tutorials_interfaces::action::Fibonacci; std::shared_ptr> goal_handle = nullptr; auto server = rclcpp_action::create_server( node, "fast_feedback", [&](const rclcpp_action::GoalUUID &, std::shared_ptr) { return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE; }, [&](const std::shared_ptr>) { RCLCPP_INFO(node->get_logger(), "Cancellation requested, accepting."); goal_handle = nullptr; return rclcpp_action::CancelResponse::ACCEPT; }, [&](const std::shared_ptr> gh) { RCLCPP_INFO(node->get_logger(), "Goal was accepted"); goal_handle = gh; }); rclcpp::Rate rate(100.0); // 100 Hz rclcpp::executors::SingleThreadedExecutor executor; executor.add_node(node); RCLCPP_INFO(node->get_logger(), "Server spinning, ready to accept goals."); while (rclcpp::ok()) { executor.spin_some(); // NOTE: this server intentionally does not send a result if (goal_handle) { goal_handle->publish_feedback(std::make_unique()); } rate.sleep(); } return 0; } ``` Notes: 1. The server spins at 100 Hz. 2. Feedback is published at the same rate. 3. The goal is never completed as it's not relevant to this bug.

Client

```cpp #include #include #include int main(int argc, char *argv[]) { rclcpp::init(argc, argv); auto node = rclcpp::Node::make_shared("fast_feedback_client"); using Action = action_tutorials_interfaces::action::Fibonacci; using Client = rclcpp_action::Client; auto client = rclcpp_action::create_client(node, "fast_feedback"); if (!client->wait_for_action_server()) { RCLCPP_ERROR(node->get_logger(), "Action server not available after waiting"); return 1; } // NOTE: no feedback_callback is set Client::SendGoalOptions send_goal_options; bool goal_response = false; send_goal_options.goal_response_callback = [&](std::shared_future) { RCLCPP_INFO(node->get_logger(), "Goal response."); goal_response = true; }; RCLCPP_INFO(node->get_logger(), "Sending goal."); auto future = client->async_send_goal(Action::Goal(), send_goal_options); rclcpp::Rate rate(10.0); // 10 Hz, 10% of the server's rate rclcpp::executors::SingleThreadedExecutor executor; executor.add_node(node); while (rclcpp::ok() && !goal_response) { RCLCPP_INFO_THROTTLE(node->get_logger(), *node->get_clock(), 10000, "Waiting for goal response."); executor.spin_some(); rate.sleep(); } RCLCPP_INFO_STREAM(node->get_logger(), "Goal response received: " << std::boolalpha << goal_response); return 0; } ``` Notes 1. The client's rate in this example is 10% of the server's rate to make the repro foolproof for different machines/network conditions, but I've seen the bug reproduce reliably when the client is >90% of the server's rate, and even repro most of the time when they have the same rate. 2. No feedback callback is set on `SendGoalOptions` as the deadlock is not dependent on one. 3. The client exits once the goal is accepted since that's all that's relevant to this bug.

Expected behavior

The action client will accept the goal, perhaps after working through a short backlog of feedback up to the queue depth in the feedback subscription's QoS.

Actual behavior

The server has accepted and started working on the goal but the client never sees that. The client continuously logs Waiting for goal response. in the repro and debug logs have Received feedback for unknown goal. Ignoring.... Debug logs show Client in wait set is ready, which I presume to be the goal response, but it's never taken, so the client deadlocks.

Additional information

I did some digging and think I have tracked down the root cause, some other contributing factors, and some related impacts.

rclcpp_action::Client<T> derives from rclcpp_action::ClientBase derives from rclcpp::Waitable, and these are scheduled for work in an rclcpp::Executor as a single executable, not multiple executables of their constituent parts. When using rclcpp::Executor::spin_some, which only collects work from ready entities once per call, the rclcpp::Waitable is scheduled once, regardless of the amount of work it has ready. This should be fine, but the rclcpp::ClientBase::take_data implementation only yields a single executable per call and therefore the subsequent execute invoked by rclcpp::Executor only performs one thing on the client rather than everything that's ready.

https://github.com/ros2/rclcpp/blob/5e14a283b67940526a20c365b827d8090a552bad/rclcpp_action/src/client.cpp#L550-L687

In most scenarios, this is fine, but it deadlocks in this specific scenario: a server publishing feedback faster than the client is spinning (specifically, faster than it calls rclcpp::Executor::wait_for_work). This can happen with any of the spin* implementations but happens most readily with spin_some. Why? Because rclcpp::ClientBase::take_data only returns the first ready executable and it prioritizes feedback over responses. The implementation of rclcpp::ClientBase::take_data and execute take and execute in this order: feedback, status, goal response, result response, cancel response.

When you combine the implementation of rclcpp::Waitable that only returns execution data representing a single unit of work, the implicit prioritization mechanism built into ClientBase, and an action server whose feedback topic publishes faster than the spin rate of the client, you get a client that always has ready feedback for an unknown goal, ironically never executing the goal response that feedback is for.

Some ancillary effects of this: the goal response is never taken, the client never creates the goal handle, and therefore cancellation isn't easy. Cancellation can be done through rclcpp_action::Client<T>::async_cancel_all_goals but that may have adverse side effects depending on your scenario. Additionally, the client never becomes "result aware" for the goal since that occurs when the goal response is processed, so result responses are permanently lost if they occur during the deadlock. Feedback is also always dropped because it's for an unknown goal.

mauropasse commented 6 months ago

The issue can be fixed by re-ordering ClientBase::execute and ClientBase::take_data() such as:

if (pimpl_->is_feedback_ready) { ... }
else if (pimpl_->is_status_ready) {...}

go as the last conditions. So it'd be for both take_data and execute:

if (pimpl_->is_goal_response_ready) { }
else if (pimpl_->is_result_response_ready) {}
else if (pimpl_->is_cancel_response_ready) {}
else if (pimpl_->is_feedback_ready) {}
else if (pimpl_->is_status_ready) {}

Otherwise feedback & status will always have priority over server responses. Another alternative is using the EventsExecutor instead of the SingleThreadedExecutor, since the former always executes events in the order in which they were produced (no feedback over result priority).

asasine commented 6 months ago

I think reordering would work.

What of the fact that rclcpp::Waitable has to be taken multiple times to execute, unlike other executables like subscription/clients/etc? Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?

EventsExecutor might work if event.num_events corresponds to the sum of the return values of the various get_number_of_ready_*() methods on rclcpp::Waitable. I haven't dug deep enough to know if it does yet.

https://github.com/ros2/rclcpp/blob/5e14a283b67940526a20c365b827d8090a552bad/rclcpp/src/rclcpp/experimental/executors/events_executor/events_executor.cpp#L341-L344

mauropasse commented 6 months ago

Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?

The idea is that one single execute() of the executor, executes a single piece of work. If a rclcpp::Waitable has many events, it has to execute them separately. When calling executor->spin_once(), we expect to execute a single event and not all from a waitable.

PD: I've already tested with the EventsExecutor and it works. Nevertheless I think this has to be fixed for the SingleThreadedExecutor in rolling.

ros2 / rclcpp