rclcpp_action goal handle timeout logic clearing long-running goals

AlexeyMerzlyakov commented 11 months ago

This issue describes the flaky issues appeared with test_waypoint_follower system test

Bug report

Required Info:

Operating System:
- Ubuntu 22.04.3
ROS2 Version:
- ROS2 Rolling (built from sources)
Version or commit hash:
- b6f3b028 (and older 822560ce)
DDS implementation:
- Cyclone DDS

Steps to reproduce issue

Enable visualization for Waypoint Follower system test by following patch:


diff --git a/nav2_system_tests/src/waypoint_follower/test_case_py.launch b/nav2_system_tests/src/waypoint_follower/test_case_py.launch
index adeb839c..c7194fe9 100755
--- a/nav2_system_tests/src/waypoint_follower/test_case_py.launch
+++ b/nav2_system_tests/src/waypoint_follower/test_case_py.launch
@@ -40,6 +40,8 @@ def generate_launch_description():
 bringup_dir = get_package_share_directory('nav2_bringup')
 params_file = os.path.join(bringup_dir, 'params/nav2_params.yaml')

rviz_config_file = os.path.join(bringup_dir, 'rviz', 'nav2_default_view.rviz')
Replace the use_astar setting on the params file

configured_params = RewrittenYaml( source_file=params_file, @@ -59,6 +61,13 @@ def generate_launch_description(): '--minimal_comms', world], output='screen'),
Launch rviz
Node(
package='rviz2',
executable='rviz2',
arguments=['-d', rviz_config_file],
output='screen'),

 # TODO(orduno) Launch the robot state publisher instead
 #              using a local copy of TB3 urdf file
 Node(


2. Run `test_waypoint_follower` manually, or using the attached below stress-test script:
[run_stress.sh.txt](https://github.com/ros-planning/navigation2/files/12380528/run_stress.sh.txt)

Expected behavior

Test passes 30 times in a row.

Actual behavior

The following problems were found during the test:

Problem1: No action

When the robot is moving near the obstacle, it could stuck. Sometimes, this causes recovery behaviors that being triggered. If this recovery was triggered (moreover, it was detected that only "Received request to clear entirely the local_costmap" event is enough for that), and robot successfully reached its goal, but the NavigateToPoseNavigator action returns UNKNOWN status instead of SUCCEEDED, which causes main WaypointFollower cycle to hang:

https://github.com/ros-planning/navigation2/assets/60094858/1731f55f-7fe7-491b-9bdf-1a1f84cf701b

Problem2: Incorrect out-of-map status

Sometimes, when running set waypoint outside of map test scenario, missed_waypoint contains INVALID_PATH=103 status instead of expected ComputePathToPose.Goal().GOAL_OUTSIDE_MAP=204 which gives the following error messages:

[bt_navigator-12] [INFO] [1692161747.546793612] [bt_navigator]: Begin navigating from current location to (100.00, 100.00)
[planner_server-10] [WARN] [1692161747.548072894] [planner_server]: GridBased plugin failed to plan from (0.70, 0.19) to (100.00, 100.00): "Goal Coordinates of(100.000000, 100.000000) was outside bounds"
[planner_server-10] [WARN] [1692161747.548158462] [planner_server]: [compute_path_to_pose] [ActionServer] Aborting handle.
[bt_navigator-12] [ERROR] [1692161747.577081372] [bt_navigator]: Goal failed
[bt_navigator-12] [WARN] [1692161747.577271398] [bt_navigator]: [navigate_to_pose] [ActionServer] Aborting handle.
[tester_node-16] [INFO] [1692161747.949221718] [nav2_waypoint_tester]: Goal failed to process all waypoints, missed 1 wps.
[tester_node-16] Traceback (most recent call last):
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 271, in <module>
[tester_node-16]     main()
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 233, in main
[tester_node-16]     assert test.action_result.missed_waypoints[0].error_code == \
[tester_node-16] AssertionError

Problem3: Robot is getting lost after setting initial pose second time

In the stop on failure test with bogous waypoint test scenario there is test.setInitialPose(starting_pose) called second time while robot is not placed on its initial pose. This causes robot to lose its position. In most cases, robot still operating and reaching its goal; but sometimes since its position is getting lost, robot continues its uncontrolled movement between pillars causing testcase to fail by timeout: Screenshot_2023-08-16_09-41-43_cr Screenshot_2023-08-16_09-41-50_cr

AlexeyMerzlyakov commented 11 months ago

The Problem3 is being resolved by removing second setInitialPose() in "stop on failure test with bogous waypoint" case, so robot became to move on map correctly:

diff --git a/nav2_system_tests/src/waypoint_follower/tester.py b/nav2_system_tests/src/waypoint_follower/tester.py
index 79604767..5a9a89cc 100755
--- a/nav2_system_tests/src/waypoint_follower/tester.py
+++ b/nav2_system_tests/src/waypoint_follower/tester.py
@@ -236,9 +236,9 @@ def main(argv=sys.argv[1:]):
     # stop on failure test with bogous waypoint
     test.setStopFailureParam(True)
     bwps = [[-0.52, -0.54], [100.0, 100.0], [0.58, 0.52]]
-    starting_pose = [-2.0, -0.5]
+    #starting_pose = [-2.0, -0.5]
     test.setWaypoints(bwps)
-    test.setInitialPose(starting_pose)
+    #test.setInitialPose(starting_pose)
     result = test.run(True, False)
     assert not result
     result = not result

AlexeyMerzlyakov commented 11 months ago

The Problem1 and Problem2 are under investigation, I'll write the details as they will be arrived.

AlexeyMerzlyakov commented 10 months ago

Analysis shown that Problem2 appears in case when FollowPath BT node returned non-zero INVALID_PATH=103 status for the preemption goals (for some reasons), while next outside-of-map goal scenario adds GOAL_OUTSIDE_MAP=204 status to ComputePathToPose BT node blackboard. BtActionServer<ActionT>::populateErrorCode() routine will select the status with highest priority (INVALID_PATH < GOAL_OUTSIDE_MAP) and will return status from previous run.

So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive bt_->run() runs, and thus keeping errors from previous BT run.

The following patch seems to work for me:

diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
index 79b1df5b..e0fc13e9 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
@@ -196,6 +196,11 @@ protected:
    */
   void populateErrorCode(typename std::shared_ptr<typename ActionT::Result> result);

+  /**
+   * @brief Setting BT error codes to success. Used to clean blackboard between different BT runs
+   */
+  void cleanErrorCodes();
+
   // Action name
   std::string action_name_;

diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
index 11110b09..a60fa69f 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
@@ -265,6 +272,8 @@ void BtActionServer<ActionT>::executeCallback()
       action_server_->terminate_all(result);
       break;
   }
+
+  cleanErrorCodes();
 }

 template<class ActionT>
@@ -291,6 +302,18 @@ void BtActionServer<ActionT>::populateErrorCode(
   }
 }

+template<class ActionT>
+void BtActionServer<ActionT>::cleanErrorCodes()
+{
+  if (!blackboard_) {
+    RCLCPP_WARN(logger_, "Blackboard is not created to be cleaned");
+  }
+
+  for (const auto & error_code : error_code_names_) {
+    blackboard_->set<unsigned short>(error_code, 0);
+  }
+}
+
 }  // namespace nav2_behavior_tree

 #endif  // NAV2_BEHAVIOR_TREE__BT_ACTION_SERVER_IMPL_HPP_

@SteveMacenski, doesn't this approach break whole BT logic, I hope?

AlexeyMerzlyakov commented 10 months ago

The Problem1 root cause is still unclear. Some observations on it:

From BT side, when the robot reached its goal, NavigateToPoseNavigator calls succeeded_current() for its action server:

  switch (rc) {
    case nav2_behavior_tree::BtStatus::SUCCEEDED:
      RCLCPP_INFO(logger_, "Goal succeeded");
      action_server_->succeeded_current(result);
      break;

which should trigger SUCCEEDED status of the goal.

However, from WaypointFollower side, the WaypointFollower::resultCallback() gets UNKNOWN status of this result, which seems really strange:

  switch (result.code) {
    case rclcpp_action::ResultCode::SUCCEEDED:
      current_goal_status_.status = ActionStatus::SUCCEEDED;
      return;
    case rclcpp_action::ResultCode::ABORTED:
      current_goal_status_.status = ActionStatus::FAILED;
      current_goal_status_.error_code = result.result->error_code;
      return;
    case rclcpp_action::ResultCode::CANCELED:
      current_goal_status_.status = ActionStatus::FAILED;
      return;
    case rclcpp_action::ResultCode::UNKNOWN:  // Alexey's debugging
      RCLCPP_INFO(get_logger(), "Got UNKNOWN result from BT");
      current_goal_status_.status = ActionStatus::FAILED;
      return;

From first glance, it might seem that the issue to be in the RCLCPP. However, this problem appears every time when recovery RecoveryFallback switched during the navigation, and, opposite, does not appear during normal navigation subtree operation w/o recoveries. This makes me think, that during switching to recoveries something is happening with nav2_msgs::action::NavigateToPose action server-client chain (e.g. RCLCPP context was changed, new pointer (re)initiated, or something else).

SteveMacenski commented 10 months ago

So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive bt_->run() runs, and thus keeping errors from previous BT run.

Oh, that's a really good find Alexey! Yes, resetting the blackboard for the error codes is sensible at the end of a task and/or when halted due to cancellation of the navigation action.

+  if (!blackboard_) {
+    RCLCPP_WARN(logger_, "Blackboard is not created to be cleaned");
+  }

You should be able to skip this, that blackboard is part of the factory construction so it should always exist at this point.

However, this problem appears every time when recovery RecoveryFallback switched during the navigation, and, opposite, does not appear during normal navigation subtree operation w/o recoveries.

That is odd. The best that I can think of is that the WPF is sending the Nav2Pose to the BT which itself then sends the appropriate commands to the task servers. When we enter recovery, the task servers are halted when we execute recoveries and then re-tasked once we finish and start navigating again. So the commands to planner/controller/etc are re-issued and thus have a new UUID / goal handle. But we should still be in the same goal handle for Nav2Pose since all that happens internally to the BT which the Nav2Pose action server doesn't track.

Perhaps this is another error code issue or something we're we're propagating the error code from the halted task servers instead of the final, complete end result code? That's the only server-client changes I can see happening.

Note these shenanigans we have to do with checking that the goal ID is right https://github.com/ros-planning/navigation2/blob/main/nav2_waypoint_follower/src/waypoint_follower.cpp#L307-L312

Try changing the debug to an info in the simple action server to make sure that the current handle is actually active / we're returning success on it?

  void succeeded_current(
    typename std::shared_ptr<typename ActionT::Result> result =
    std::make_shared<typename ActionT::Result>())
  {
    std::lock_guard<std::recursive_mutex> lock(update_mutex_);

    if (is_active(current_handle_)) {
      debug_msg("Setting succeed on current goal.");  // Here and check logs
      current_handle_->succeed(result);
      current_handle_.reset();
    }
  }

AlexeyMerzlyakov commented 10 months ago

Problem1 bugfix:

I've added the checks and got that the goal_id was always the same for BT ActionServer and in resultCallback() for WPF each time during 3-waypoints test scenario phase:

[tester_node-16] [INFO] [1692773504.064704366] [nav2_waypoint_tester]: Received amcl_pose
[tester_node-16] [INFO] [1692773504.718885261] [nav2_waypoint_tester]: Received amcl_pose
[controller_server-8] [INFO] [1692773504.958734862] [controller_server]: Passing new path to controller.
[tester_node-16] [INFO] [1692773505.812664647] [nav2_waypoint_tester]: Received amcl_pose
[controller_server-8] [INFO] [1692773505.914895209] [controller_server]: Reached the goal!
[bt_navigator-12] [INFO] [1692773505.947715042] [bt_navigator]: Goal succeeded
[bt_navigator-12] ALEX: succeeded_current: result.goal_id = c3a927770364788626b154153d323c1
<- this goal_id was send to succeeded_current() by BT
[bt_navigator-12] [INFO] [1692773505.948040788] [bt_navigator]: ALEX: succeeded_current
[waypoint_follower-13] ALEX: WaypointFollower::resultCallback: result.goal_id = c3a927770364788626b154153d323c1
<- the same goal_id received from BT. No issue here

[waypoint_follower-13] [INFO] [1692773506.576328268] [waypoint_follower]: ALEX: got UNKNOWN result from BT
[waypoint_follower-13] [INFO] [1692773506.626355822] [waypoint_follower]: Failed to process waypoint 2, moving to next.
[waypoint_follower-13] [INFO] [1692773506.626475290] [waypoint_follower]: Completed all 3 waypoints requested.
[tester_node-16] [INFO] [1692773506.634029050] [nav2_waypoint_tester]: Goal failed to process all waypoints, missed 1 wps.
[tester_node-16] Traceback (most recent call last):
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 249, in <module>
[tester_node-16]     main()
[tester_node-16]   File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 214, in main
[tester_node-16]     assert result
[tester_node-16] AssertionError

So, this made me think that this is related to RCLCPP and check it. After some analysis, it was found the same problem described in https://github.com/ros2/rclcpp/issues/2101. RCLCPP considers the the result status not updated more than 10 seconds as UNKNOWN and returns it back to WPF in resultCallback(). This also explains, why the Problem1 appeared only on recoveries situation.

The problem did solved by increasing the result_timeout for BtActionServer to 30 seconds:

diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
index 11110b09..8782a996 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
@@ -109,12 +111,17 @@ bool BtActionServer<ActionT>::on_configure()
   // Support for handling the topic-based goal pose from rviz
   client_node_ = std::make_shared<rclcpp::Node>("_", options);

+  rcl_action_server_options_t server_options = rcl_action_server_get_default_options();
+  server_options.result_timeout.nanoseconds = RCL_S_TO_NS(30);  // 30 seconds timeout to get the result from BT
+
   action_server_ = std::make_shared<ActionServer>(
     node->get_node_base_interface(),
     node->get_node_clock_interface(),
     node->get_node_logging_interface(),
     node->get_node_waitables_interface(),
-    action_name_, std::bind(&BtActionServer<ActionT>::executeCallback, this));
+    action_name_, std::bind(&BtActionServer<ActionT>::executeCallback, this),
+    nullptr, std::chrono::milliseconds(500), false, server_options);

   // Get parameters for BT timeouts
   int timeout;

The value of 30 seconds was chosen by numerous experiments on 3-waypoints initial scenario runs and obtaining the timings required for recoveries on 3rd most difficult waypoint: Correct_timeout

(Sure, this timeout might be set as ROS-parameter for bt_navigator, rather than hardcoded above, if needed)

Also, it was found that due to short distances of navigation, the robot operates much more stable on prune_distance = 1.0 instead of default 2.0 meters value (see https://github.com/ros-planning/navigation2/pull/3098 for more details). Local experiments on https://github.com/ros-planning/navigation2/commit/822560ce551afde513deb8be5e5c6a750f8e6fc0 shows that probability of recovery situation in 3-waypoints scenario for waypoint follower system test as follows:

	Runs	Recoveries	Recovery probability, %
Prune 1m	48	5	10%
Prune 2m (default)	46	14	30%

So, prune_distance and forward_prune_distance are considered be shortened for WaypointFollwer system test to 1.0 meter. Actually, the problem that these parameters are not described in nav2_params.yaml and RewrittenYaml can't replace non-existing parameters yet (this is some kind of stuck situation for now). The best solution - is to add new parameters declaring ability API to RewrittenYam, which will also help to avoid boilerplate works synchronizing Costmap Filters system tests YAML's with the main one.

Finally, I'd like to propose to treat UNKNOWN (default case) statuses in WPF main loop as failures, which will force WPF to fail, rather than hang for infinite time.

SteveMacenski commented 10 months ago

After some analysis, it was found the same problem described in https://github.com/ros2/rclcpp/issues/2101. RCLCPP considers the the result status not updated more than 10 seconds as UNKNOWN and returns it back to WPF in resultCallback()

Oh what the hell... How does this not cause our end-users problems as well for the WPF, the controller server, and the BT task in general? Those are frequently running tasks longer than 10 seconds.

Any time limit for how long a task runs in the action server is too low since we could have navigation requests (or WPF, or controller, etc) that could take hours to complete. Lets set this to absolute max anywhere we send actions within Nav2 and inform users about this problem publicly // comment on that ticket for a default to be maxed out or disabled.

SteveMacenski commented 10 months ago

I think that logic needs to be updated so its not based on a timeout but a timeout after an event (e.g. after the goal has been completed or aborted) so that we don't delete currently running goals. Can you file a ticket on that and tag me and @clalancette ?

AlexeyMerzlyakov commented 10 months ago

New PR #3785 with fix of the most of issues described in this ticket. The problem with RCLCPP action timeout to be handled in next separate step, I'd suppose.

SteveMacenski commented 10 months ago

Got it, @pepisg messaged me yesterday on Slack regarding re-adding the 15 minute timeout on the WPF server specifically as a temporary stop gap. Since the controller updates goals each time its preempted and planning occurs quickly, I think its only an issue with the WPF internally to Nav2.

... but is definitely still an issue for users of Nav2's base action servers for Nav2Pose and NavThroughPoses, which bears some thought about a permanent solution in rclcpp. Is that something you're open to working on @AlexeyMerzlyakov ?

Thanks for your diligent work on this and uncovering an rclcpp issue!

SteveMacenski commented 10 months ago

The params have been merged in, so that should be a good medium term solution until a more permanent change can be made in rclcpp itself. As such, I renamed the ticket

fujitatomoya commented 7 months ago

@SteveMacenski @AlexeyMerzlyakov the fix is ready https://github.com/ros2/rcl/pull/1121, it would be appreciated if you can try and review! thanks in advance.

CC: @Barry-Xu-2018

SteveMacenski commented 7 months ago

So much thanks @barry-xu-2018!

SteveMacenski commented 6 months ago

@barry-xu-2018 can this be closed with your recent PR?

Barry-Xu-2018 commented 6 months ago

@SteveMacenski

https://github.com/ros2/rcl/pull/1121 isn't merged by now.
Perhaps we need to wait for the merge before closing this issue.

ros-navigation / navigation2