Open AlexeyMerzlyakov opened 11 months ago
The Problem3 is being resolved by removing second setInitialPose()
in "stop on failure test with bogous waypoint" case, so robot became to move on map correctly:
diff --git a/nav2_system_tests/src/waypoint_follower/tester.py b/nav2_system_tests/src/waypoint_follower/tester.py
index 79604767..5a9a89cc 100755
--- a/nav2_system_tests/src/waypoint_follower/tester.py
+++ b/nav2_system_tests/src/waypoint_follower/tester.py
@@ -236,9 +236,9 @@ def main(argv=sys.argv[1:]):
# stop on failure test with bogous waypoint
test.setStopFailureParam(True)
bwps = [[-0.52, -0.54], [100.0, 100.0], [0.58, 0.52]]
- starting_pose = [-2.0, -0.5]
+ #starting_pose = [-2.0, -0.5]
test.setWaypoints(bwps)
- test.setInitialPose(starting_pose)
+ #test.setInitialPose(starting_pose)
result = test.run(True, False)
assert not result
result = not result
The Problem1 and Problem2 are under investigation, I'll write the details as they will be arrived.
Analysis shown that Problem2 appears in case when FollowPath
BT node returned non-zero INVALID_PATH=103
status for the preemption goals (for some reasons), while next outside-of-map goal scenario adds GOAL_OUTSIDE_MAP=204
status to ComputePathToPose
BT node blackboard. BtActionServer<ActionT>::populateErrorCode()
routine will select the status with highest priority (INVALID_PATH < GOAL_OUTSIDE_MAP
) and will return status from previous run.
So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive bt_->run()
runs, and thus keeping errors from previous BT run.
The following patch seems to work for me:
diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
index 79b1df5b..e0fc13e9 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server.hpp
@@ -196,6 +196,11 @@ protected:
*/
void populateErrorCode(typename std::shared_ptr<typename ActionT::Result> result);
+ /**
+ * @brief Setting BT error codes to success. Used to clean blackboard between different BT runs
+ */
+ void cleanErrorCodes();
+
// Action name
std::string action_name_;
diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
index 11110b09..a60fa69f 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
@@ -265,6 +272,8 @@ void BtActionServer<ActionT>::executeCallback()
action_server_->terminate_all(result);
break;
}
+
+ cleanErrorCodes();
}
template<class ActionT>
@@ -291,6 +302,18 @@ void BtActionServer<ActionT>::populateErrorCode(
}
}
+template<class ActionT>
+void BtActionServer<ActionT>::cleanErrorCodes()
+{
+ if (!blackboard_) {
+ RCLCPP_WARN(logger_, "Blackboard is not created to be cleaned");
+ }
+
+ for (const auto & error_code : error_code_names_) {
+ blackboard_->set<unsigned short>(error_code, 0);
+ }
+}
+
} // namespace nav2_behavior_tree
#endif // NAV2_BEHAVIOR_TREE__BT_ACTION_SERVER_IMPL_HPP_
@SteveMacenski, doesn't this approach break whole BT logic, I hope?
The Problem1 root cause is still unclear. Some observations on it:
From BT side, when the robot reached its goal, NavigateToPoseNavigator
calls succeeded_current()
for its action server:
switch (rc) {
case nav2_behavior_tree::BtStatus::SUCCEEDED:
RCLCPP_INFO(logger_, "Goal succeeded");
action_server_->succeeded_current(result);
break;
which should trigger SUCCEEDED
status of the goal.
However, from WaypointFollower side, the WaypointFollower::resultCallback()
gets UNKNOWN
status of this result, which seems really strange:
switch (result.code) {
case rclcpp_action::ResultCode::SUCCEEDED:
current_goal_status_.status = ActionStatus::SUCCEEDED;
return;
case rclcpp_action::ResultCode::ABORTED:
current_goal_status_.status = ActionStatus::FAILED;
current_goal_status_.error_code = result.result->error_code;
return;
case rclcpp_action::ResultCode::CANCELED:
current_goal_status_.status = ActionStatus::FAILED;
return;
case rclcpp_action::ResultCode::UNKNOWN: // Alexey's debugging
RCLCPP_INFO(get_logger(), "Got UNKNOWN result from BT");
current_goal_status_.status = ActionStatus::FAILED;
return;
From first glance, it might seem that the issue to be in the RCLCPP. However, this problem appears every time when recovery RecoveryFallback
switched during the navigation, and, opposite, does not appear during normal navigation subtree operation w/o recoveries. This makes me think, that during switching to recoveries something is happening with nav2_msgs::action::NavigateToPose
action server-client chain (e.g. RCLCPP context was changed, new pointer (re)initiated, or something else).
So this problem, I consider, appear because of we do not clean BT node statuses between two consecutive bt_->run() runs, and thus keeping errors from previous BT run.
Oh, that's a really good find Alexey! Yes, resetting the blackboard for the error codes is sensible at the end of a task and/or when halted due to cancellation of the navigation action.
+ if (!blackboard_) {
+ RCLCPP_WARN(logger_, "Blackboard is not created to be cleaned");
+ }
You should be able to skip this, that blackboard is part of the factory construction so it should always exist at this point.
However, this problem appears every time when recovery RecoveryFallback switched during the navigation, and, opposite, does not appear during normal navigation subtree operation w/o recoveries.
That is odd. The best that I can think of is that the WPF is sending the Nav2Pose to the BT which itself then sends the appropriate commands to the task servers. When we enter recovery, the task servers are halted when we execute recoveries and then re-tasked once we finish and start navigating again. So the commands to planner/controller/etc are re-issued and thus have a new UUID / goal handle. But we should still be in the same goal handle for Nav2Pose since all that happens internally to the BT which the Nav2Pose action server doesn't track.
Perhaps this is another error code issue or something we're we're propagating the error code from the halted task servers instead of the final, complete end result code? That's the only server-client changes I can see happening.
Note these shenanigans we have to do with checking that the goal ID is right https://github.com/ros-planning/navigation2/blob/main/nav2_waypoint_follower/src/waypoint_follower.cpp#L307-L312
Try changing the debug to an info in the simple action server to make sure that the current handle is actually active / we're returning success on it?
void succeeded_current(
typename std::shared_ptr<typename ActionT::Result> result =
std::make_shared<typename ActionT::Result>())
{
std::lock_guard<std::recursive_mutex> lock(update_mutex_);
if (is_active(current_handle_)) {
debug_msg("Setting succeed on current goal."); // Here and check logs
current_handle_->succeed(result);
current_handle_.reset();
}
}
Problem1 bugfix:
I've added the checks and got that the goal_id
was always the same for BT ActionServer and in resultCallback()
for WPF each time during 3-waypoints test scenario phase:
[tester_node-16] [INFO] [1692773504.064704366] [nav2_waypoint_tester]: Received amcl_pose
[tester_node-16] [INFO] [1692773504.718885261] [nav2_waypoint_tester]: Received amcl_pose
[controller_server-8] [INFO] [1692773504.958734862] [controller_server]: Passing new path to controller.
[tester_node-16] [INFO] [1692773505.812664647] [nav2_waypoint_tester]: Received amcl_pose
[controller_server-8] [INFO] [1692773505.914895209] [controller_server]: Reached the goal!
[bt_navigator-12] [INFO] [1692773505.947715042] [bt_navigator]: Goal succeeded
[bt_navigator-12] ALEX: succeeded_current: result.goal_id = c3a927770364788626b154153d323c1
<- this goal_id was send to succeeded_current() by BT
[bt_navigator-12] [INFO] [1692773505.948040788] [bt_navigator]: ALEX: succeeded_current
[waypoint_follower-13] ALEX: WaypointFollower::resultCallback: result.goal_id = c3a927770364788626b154153d323c1
<- the same goal_id received from BT. No issue here
[waypoint_follower-13] [INFO] [1692773506.576328268] [waypoint_follower]: ALEX: got UNKNOWN result from BT
[waypoint_follower-13] [INFO] [1692773506.626355822] [waypoint_follower]: Failed to process waypoint 2, moving to next.
[waypoint_follower-13] [INFO] [1692773506.626475290] [waypoint_follower]: Completed all 3 waypoints requested.
[tester_node-16] [INFO] [1692773506.634029050] [nav2_waypoint_tester]: Goal failed to process all waypoints, missed 1 wps.
[tester_node-16] Traceback (most recent call last):
[tester_node-16] File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 249, in <module>
[tester_node-16] main()
[tester_node-16] File "/home/amerzlyakov/nav2_ws/src/navigation2/nav2_system_tests/src/waypoint_follower/tester.py", line 214, in main
[tester_node-16] assert result
[tester_node-16] AssertionError
So, this made me think that this is related to RCLCPP and check it.
After some analysis, it was found the same problem described in https://github.com/ros2/rclcpp/issues/2101. RCLCPP considers the the result status not updated more than 10 seconds as UNKNOWN and returns it back to WPF in resultCallback()
. This also explains, why the Problem1 appeared only on recoveries situation.
The problem did solved by increasing the result_timeout
for BtActionServer
to 30 seconds
:
diff --git a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
index 11110b09..8782a996 100644
--- a/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
+++ b/nav2_behavior_tree/include/nav2_behavior_tree/bt_action_server_impl.hpp
@@ -109,12 +111,17 @@ bool BtActionServer<ActionT>::on_configure()
// Support for handling the topic-based goal pose from rviz
client_node_ = std::make_shared<rclcpp::Node>("_", options);
+ rcl_action_server_options_t server_options = rcl_action_server_get_default_options();
+ server_options.result_timeout.nanoseconds = RCL_S_TO_NS(30); // 30 seconds timeout to get the result from BT
+
action_server_ = std::make_shared<ActionServer>(
node->get_node_base_interface(),
node->get_node_clock_interface(),
node->get_node_logging_interface(),
node->get_node_waitables_interface(),
- action_name_, std::bind(&BtActionServer<ActionT>::executeCallback, this));
+ action_name_, std::bind(&BtActionServer<ActionT>::executeCallback, this),
+ nullptr, std::chrono::milliseconds(500), false, server_options);
// Get parameters for BT timeouts
int timeout;
The value of 30 seconds
was chosen by numerous experiments on 3-waypoints initial scenario runs and obtaining the timings required for recoveries on 3rd most difficult waypoint:
(Sure, this timeout might be set as ROS-parameter for bt_navigator
, rather than hardcoded above, if needed)
Also, it was found that due to short distances of navigation, the robot operates much more stable on prune_distance = 1.0
instead of default 2.0
meters value (see https://github.com/ros-planning/navigation2/pull/3098 for more details).
Local experiments on https://github.com/ros-planning/navigation2/commit/822560ce551afde513deb8be5e5c6a750f8e6fc0 shows that probability of recovery situation in 3-waypoints scenario for waypoint follower system test as follows:
Runs | Recoveries | Recovery probability, % | |
---|---|---|---|
Prune 1m | 48 | 5 | 10% |
Prune 2m (default) | 46 | 14 | 30% |
So, prune_distance
and forward_prune_distance
are considered be shortened for WaypointFollwer system test to 1.0
meter. Actually, the problem that these parameters are not described in nav2_params.yaml
and RewrittenYaml
can't replace non-existing parameters yet (this is some kind of stuck situation for now). The best solution - is to add new parameters declaring ability API to RewrittenYam
, which will also help to avoid boilerplate works synchronizing Costmap Filters system tests YAML's with the main one.
Finally, I'd like to propose to treat UNKNOWN
(default case) statuses in WPF main loop as failures, which will force WPF to fail, rather than hang for infinite time.
After some analysis, it was found the same problem described in https://github.com/ros2/rclcpp/issues/2101. RCLCPP considers the the result status not updated more than 10 seconds as UNKNOWN and returns it back to WPF in resultCallback()
Oh what the hell... How does this not cause our end-users problems as well for the WPF, the controller server, and the BT task in general? Those are frequently running tasks longer than 10 seconds.
Any time limit for how long a task runs in the action server is too low since we could have navigation requests (or WPF, or controller, etc) that could take hours to complete. Lets set this to absolute max anywhere we send actions within Nav2 and inform users about this problem publicly // comment on that ticket for a default to be maxed out or disabled.
I think that logic needs to be updated so its not based on a timeout but a timeout after an event (e.g. after the goal has been completed or aborted) so that we don't delete currently running goals. Can you file a ticket on that and tag me and @clalancette ?
New PR #3785 with fix of the most of issues described in this ticket. The problem with RCLCPP action timeout to be handled in next separate step, I'd suppose.
Got it, @pepisg messaged me yesterday on Slack regarding re-adding the 15 minute timeout on the WPF server specifically as a temporary stop gap. Since the controller updates goals each time its preempted and planning occurs quickly, I think its only an issue with the WPF internally to Nav2.
... but is definitely still an issue for users of Nav2's base action servers for Nav2Pose and NavThroughPoses, which bears some thought about a permanent solution in rclcpp. Is that something you're open to working on @AlexeyMerzlyakov ?
Thanks for your diligent work on this and uncovering an rclcpp issue!
The params have been merged in, so that should be a good medium term solution until a more permanent change can be made in rclcpp itself. As such, I renamed the ticket
@SteveMacenski @AlexeyMerzlyakov the fix is ready https://github.com/ros2/rcl/pull/1121, it would be appreciated if you can try and review! thanks in advance.
CC: @Barry-Xu-2018
So much thanks @barry-xu-2018!
@barry-xu-2018 can this be closed with your recent PR?
@SteveMacenski
https://github.com/ros2/rcl/pull/1121 isn't merged by now.
Perhaps we need to wait for the merge before closing this issue.
This issue describes the flaky issues appeared with
test_waypoint_follower
system testBug report
Required Info:
Steps to reproduce issue
rviz_config_file = os.path.join(bringup_dir, 'rviz', 'nav2_default_view.rviz')
Replace the
use_astar
setting on the params fileconfigured_params = RewrittenYaml( source_file=params_file, @@ -59,6 +61,13 @@ def generate_launch_description(): '--minimal_comms', world], output='screen'),
Launch rviz
Node(
package='rviz2',
executable='rviz2',
arguments=['-d', rviz_config_file],
output='screen'),
Expected behavior
Test passes 30 times in a row.
Actual behavior
The following problems were found during the test:
Problem1: No action
When the robot is moving near the obstacle, it could stuck. Sometimes, this causes recovery behaviors that being triggered. If this recovery was triggered (moreover, it was detected that only "Received request to clear entirely the local_costmap" event is enough for that), and robot successfully reached its goal, but the
NavigateToPoseNavigator
action returnsUNKNOWN
status instead ofSUCCEEDED
, which causes mainWaypointFollower
cycle to hang:https://github.com/ros-planning/navigation2/assets/60094858/1731f55f-7fe7-491b-9bdf-1a1f84cf701b
Problem2: Incorrect out-of-map status
Sometimes, when running
set waypoint outside of map
test scenario,missed_waypoint
containsINVALID_PATH=103
status instead of expectedComputePathToPose.Goal().GOAL_OUTSIDE_MAP=204
which gives the following error messages:Problem3: Robot is getting lost after setting initial pose second time
In the
![Screenshot_2023-08-16_09-41-50_cr](https://github.com/ros-planning/navigation2/assets/60094858/e30f4ae6-4d80-4139-8214-a60ef15d32ea)
stop on failure test with bogous waypoint
test scenario there istest.setInitialPose(starting_pose)
called second time while robot is not placed on its initial pose. This causes robot to lose its position. In most cases, robot still operating and reaching its goal; but sometimes since its position is getting lost, robot continues its uncontrolled movement between pillars causing testcase to fail by timeout: