ros2 / rmw_cyclonedds

ROS 2 RMW layer for Eclipse Cyclone DDS
Apache License 2.0
114 stars 90 forks source link

Action-server fails (with an action-client to self) #47

Closed orduno closed 5 years ago

orduno commented 5 years ago

Bug report

Required Info:

Steps to reproduce issue

Expected behavior

Action server receives the request and starts sending feedback:

[INFO] [minimal_action_server]: Received goal request with order 10
[INFO] [minimal_action_server]: Executing goal
[INFO] [minimal_action_server]: Publish Feedback
[INFO] [minimal_action_server]: Publish Feedback
...

Actual behavior

Node crashes:

[INFO] [minimal_action_server]: Received goal request with order 10
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  error taking goal response: error not set
[INFO] [minimal_action_server]: Executing goal

Additional information

As you inspect the code you'll notice something a bit unorthodox: we have an action-server created on MinimalActionServer and an action-client on an internal node, both actions have the same type and name. Also, nodes are spinning on separate executors.

The idea behind this is that the user has the option to request the execution of an action by publishing a /trigger message. This is a simplified version of something we're doing in ros2/nav2.

As mentioned above, if the user sends the request through the action interface the node crashes. However if the user sends the request as a message:

ros2 topic pub /trigger std_msgs/Empty

it is successfully processed :thinking:

rotu commented 5 years ago

@orduno Thanks for distilling this down to a digestible issue!

orduno commented 5 years ago

Also, I wonder if we might run into the same issue if we replace the actions with services. I haven't tried that though.

eboasson commented 5 years ago

The issue is — or appears to be — that I incorrectly assumed all client code would be able to handle spurious events.

Based on that assumption the Cyclone DDS RMW implementation filters the responses to a client inside rmw_take_response. Thus, currently, any time a message of the client's response topic arrives, the executor is triggered and (indirectly) calls rmw_take_response (in rmw_cyclonedds_cpp). That function deserialises the response, looks at the header and sets the (out param) taken to true if it was really meant for this client and to false if not.

The RMW_RET_OK with taken = false gets converted into a RCL_RET_CLIENT_TAKE_FAILED by rclcpp and then into RCL_RET_ACTION_CLIENT_TAKE_FAILED for the action-specific stuff. The description of the error code simply says: "if take failed but no error occurred". Whatever that means 🙂.

That brings one to the action client executor: https://github.com/ros2/rclcpp/blob/b8f723708757b0d0b7f29de47544798bafc24a6e/rclcpp_action/src/client.cpp#L406-L416 where RCL_RET_ACTION_CLIENT_TAKE_FAILED is not ignored. Change the code for at least to of the takes to a no-op if that error is returned and it works (there are bunch of them there).

The regular client executor in https://github.com/ros2/rclcpp/blob/b8f723708757b0d0b7f29de47544798bafc24a6e/rclcpp/src/rclcpp/executor.cpp#L404 simply does nothing when this error is returned, which makes me think that the original intent may well have been to allow such spurious events. So maybe my assumption wasn't entirely unreasonable.

I suspect FastRTPS filters the responses at an earlier stage and avoids the spurious wakeup.

I am not quite sure what the rules of the game are here. I think it would make sense for rclcpp to treat all such spurious events the same, whether that'd be ignoring them or erroring out on them. I have found this one location, but there may be more and so I have no idea of the feasibility of this route to fixing the issue.

It is certainly sensible to avoid generating such spurious events in the first place. That's possible, right now there are various routes one could take. Cyclone's current hack for content filtering may be a bit primitive but it can handle this case, or one could use a query condition or do the filtering inside rmw_wait. I'll have a think on how to do this inside Cyclone's RMW layer.

eboasson commented 5 years ago

Anybody feel like trying out PR #48? I've only had a chance to try it on Eloquent so far.

eboasson commented 5 years ago

I'm pretty sure the fixes that went into rclcpp_action should mean it everything now works fine. But it would be nice if someone could confirm with the full nav2 stack just to be sure ...

orduno commented 5 years ago

It's working great! The original issue https://github.com/ros-planning/navigation2/issues/1124 is solved.

Tested using: cyclonedds 76fa68808682a15dd8aff26da20622ac574fa1a1 rmw_cyclonedds e959e66f0810134d6016ab0a33753b0c6891b1d0 rclcpp_action 2716d3e81ed1bfd8b1afe8d07bdeda7d3a4ea61b

By the way, great performance on a multi-robot scenario too! -- I tested with 4 simulated robots each with independent navigations stacks running on the same domain, no issues with discovery, services or actions.

eboasson commented 5 years ago

Thanks for confirming and for the kind words, @orduno !

I think it's time to close the issue. Should things start failing again, please create a new one.