Open asasine opened 6 months ago
The issue can be fixed by re-ordering ClientBase::execute and ClientBase::take_data() such as:
if (pimpl_->is_feedback_ready) { ... }
else if (pimpl_->is_status_ready) {...}
go as the last conditions. So it'd be for both take_data
and execute
:
if (pimpl_->is_goal_response_ready) { }
else if (pimpl_->is_result_response_ready) {}
else if (pimpl_->is_cancel_response_ready) {}
else if (pimpl_->is_feedback_ready) {}
else if (pimpl_->is_status_ready) {}
Otherwise feedback & status will always have priority over server responses.
Another alternative is using the EventsExecutor instead of the SingleThreadedExecutor
, since the former always executes events in the order in which they were produced (no feedback over result priority).
I think reordering would work.
What of the fact that rclcpp::Waitable
has to be taken multiple times to execute, unlike other executables like subscription/clients/etc? Is the design of Waitable
intended to have implementations execute all of their work on execute(...)
or only some of it?
EventsExecutor
might work if event.num_events
corresponds to the sum of the return values of the various get_number_of_ready_*()
methods on rclcpp::Waitable
. I haven't dug deep enough to know if it does yet.
Is the design of Waitable intended to have implementations execute all of their work on execute(...) or only some of it?
The idea is that one single execute()
of the executor, executes a single piece of work. If a rclcpp::Waitable
has many events, it has to execute them separately.
When calling executor->spin_once()
, we expect to execute a single event and not all from a waitable.
PD: I've already tested with the EventsExecutor
and it works.
Nevertheless I think this has to be fixed for the SingleThreadedExecutor
in rolling
.
Bug report
When running an
rclcpp_action::Client<T>
in usingspin_some()
and arclcpp::Rate
, if the server you connect to publishes feedback at a faster rate than yourrclcpp::Rate
, your client waits for the goal to be accepted forever, effectively becoming deadlocked.Required Info:
Note: based on my understanding of the root cause (see
Additional information
), I believe this bug exists in rolling too, but I've only tested on Galactic as that's what I have access to at the moment.Steps to reproduce issue
The below server and client reproduce 100% of the time on my machine when running over loopback. Compilation settings shouldn't matter. Once built, they can be run with
ros2 run ...
or directly with./path/to/binary
. I'm usingaction_tutorials_interfaces/action/Fibonacci
, but this should repro with any action interface.Server
```cpp #includeClient
```cpp #includeExpected behavior
The action client will accept the goal, perhaps after working through a short backlog of feedback up to the queue depth in the feedback subscription's QoS.
Actual behavior
The server has accepted and started working on the goal but the client never sees that. The client continuously logs
Waiting for goal response.
in the repro and debug logs haveReceived feedback for unknown goal. Ignoring...
. Debug logs showClient in wait set is ready
, which I presume to be the goal response, but it's never taken, so the client deadlocks.Additional information
I did some digging and think I have tracked down the root cause, some other contributing factors, and some related impacts.
rclcpp_action::Client<T>
derives fromrclcpp_action::ClientBase
derives fromrclcpp::Waitable
, and these are scheduled for work in anrclcpp::Executor
as a single executable, not multiple executables of their constituent parts. When usingrclcpp::Executor::spin_some
, which only collects work from ready entities once per call, therclcpp::Waitable
is scheduled once, regardless of the amount of work it has ready. This should be fine, but therclcpp::ClientBase::take_data
implementation only yields a single executable per call and therefore the subsequentexecute
invoked byrclcpp::Executor
only performs one thing on the client rather than everything that's ready.https://github.com/ros2/rclcpp/blob/5e14a283b67940526a20c365b827d8090a552bad/rclcpp_action/src/client.cpp#L550-L687
In most scenarios, this is fine, but it deadlocks in this specific scenario: a server publishing feedback faster than the client is spinning (specifically, faster than it calls
rclcpp::Executor::wait_for_work
). This can happen with any of thespin*
implementations but happens most readily withspin_some
. Why? Becauserclcpp::ClientBase::take_data
only returns the first ready executable and it prioritizes feedback over responses. The implementation ofrclcpp::ClientBase::take_data
andexecute
take and execute in this order: feedback, status, goal response, result response, cancel response.When you combine the implementation of
rclcpp::Waitable
that only returns execution data representing a single unit of work, the implicit prioritization mechanism built intoClientBase
, and an action server whose feedback topic publishes faster than the spin rate of the client, you get a client that always has ready feedback for an unknown goal, ironically never executing the goal response that feedback is for.Some ancillary effects of this: the goal response is never taken, the client never creates the goal handle, and therefore cancellation isn't easy. Cancellation can be done through
rclcpp_action::Client<T>::async_cancel_all_goals
but that may have adverse side effects depending on your scenario. Additionally, the client never becomes "result aware" for the goal since that occurs when the goal response is processed, so result responses are permanently lost if they occur during the deadlock. Feedback is also always dropped because it's for an unknown goal.