ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.52k stars 1.28k forks source link

Timed out while waiting for action server to acknowledge goal request for compute_path_through_poses (or follow_path) #4278

Closed DylanDeCoeyer-Quimesis closed 5 months ago

DylanDeCoeyer-Quimesis commented 5 months ago

Bug report

Required Info:

Steps to reproduce issue

The issue occurs randomly when the bt_navigator sends a goal to the compute_path_through_poses action provided by the planner_server node. The BT node used to send the request is ComputePathThroughPosesAction. In my case, the ComputePathThroughPoses action request contains a list of 3 goals.

Expected behavior

The planner_server receives and acknowledges the request, computes the path, then returns the result to the bt_navigator.

Actual behavior

The planner_server receives the request and computes the path, but the bt_navigator does not seem to received the acknowledgement and ends up triggering the timeout. Here are the logs, with some additional information.

Apr 25 13:22:17.620047 [bt_navigator-12] [INFO]: Path is obsolete, computing a new one
Apr 25 13:22:17.624897 [planner_server-9] [INFO]: Attempting to find a path from (8.61, 2.92) to (4.99, 3.79).
Apr 25 13:22:17.625304 [planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (8.612866, 2.919776), in costmap (96, 39)
Apr 25 13:22:17.625636 [planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (4.992264, 3.793334), in costmap (59, 47)
Apr 25 13:22:17.625954 [planner_server-9] [INFO]: Attempting to find a path from (4.99, 3.79) to (3.01, 3.79).
Apr 25 13:22:17.626311 [planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (4.992264, 3.793334), in costmap (59, 47)
Apr 25 13:22:17.626693 [planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (3.015000, 3.793334), in costmap (40, 47)
Apr 25 13:22:17.627004 [planner_server-9] [INFO]: Attempting to find a path from (3.01, 3.79) to (1.01, 5.79).
Apr 25 13:22:17.627244 [planner_server-9] [INFO]: SmacPlannerHybrid: Start pose in world: (3.015000, 3.793334), in costmap (40, 47)
Apr 25 13:22:17.628079 [planner_server-9] [INFO]: SmacPlannerHybrid: Goal pose in world: (1.015000, 5.793334), in costmap (20, 67)
Apr 25 13:22:17.669789 [bt_navigator-12] [WARN]: Behavior tree engine cycle is slower than expected: 59 ms (timeout is 50 ms)
Apr 25 13:22:17.670153 [bt_navigator-12] [WARN]: Time required to tick the tree is 58 ms
Apr 25 13:22:17.702070[bt_navigator-12] [WARN]: Timed out while waiting for action server to acknowledge goal request for compute_path_through_poses
Apr 25 13:22:17.707512 [bt_navigator-12] [WARN]: Behavior tree engine cycle is slower than expected: 50 ms (timeout is 50 ms)

Additional information

I noticed that the issue also happens with the follow_path action. But it never happens when the bt_navigator triggers my custom actions.

Relevant bt_navigator configuration.

bt_navigator:
  ros__parameters:
    default_server_timeout: 100
    bt_loop_duration: 50

When the issue occurs, the CPU load is not especially high (~50%).

SteveMacenski commented 5 months ago

I don't have a good answer for you about this, I have not seen this and this isn't a known cause for any of the flakiness in our system testing (which we call these servers regularly from)

I'm curious if its related to your BT tick rate, you're running it at 20 hz so maybe its too slow or you need to increase the timeout? Have you tried this with other DDS vendors like Cyclone and still see it occurring?

DylanDeCoeyer-Quimesis commented 5 months ago

@SteveMacenski thanks for your answer. I forgot to mention that we use a discovery server.

I tried to switch to CycloneDDS, but encountered the following error : https://github.com/ros2/rmw_cyclonedds/issues/458 . The proposed solution does not work for me at the moment. I'll keep investigating on monday.

DylanDeCoeyer-Quimesis commented 5 months ago

Hi, I finally managed to fix the Failed to find a free participant index for domain 0 error by using the proposed solution.

Our robot has been running with CycloneDDS for the past 2 days, and the action acknowledgment timeout error has not occurred once. This is very encouraging since it used to occur a few times per day with Fast-DDS.

@SteveMacenski Regarding your remark about the BT tick rate, why would it be too slow ? I don't see a direct link with the observed timeout. Also, the chosen value of 100ms for the timeout seems already high to me. A normal action acknowledgement is usually received within a few ms. I don't think that increasing it will help.

SteveMacenski commented 5 months ago

Interesting. You may want to file a ticket with RMW Fast DDS (https://github.com/ros2/rmw_fastrtps) that some current behavior has caused you an issue and link back here. I'm sure they're going to ask you to come up with some more minimal test they can reproduce so they can dig into it when they have the cycles.

I didn't know if one, other, both, or neither was the cause of the issue, I was giving you some options of things to test to see what made a behavioral change that I could dig into :-) It seems that the DDS or RMW is a likely cause so its worth making folks over there aware of this in case it strikes a chord.

You can safely ignore the tick rate stuff! I think that largely closes this ticket out for Nav2? Not much we can do on the RMW front unfortunately over here.