ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

Nav2 plans path through walls when there is no other way to reach the goal #1999

Closed pfedom closed 4 years ago

pfedom commented 4 years ago

Bug report

Required Info:

End of edit.

Expected behavior

The robot searches for a path to the goal and stops with an error message when the goal cannot be reached.

Actual behavior

The path planner plans a path through a wall. The robot drives toward the wall, slows down a bit, but keeps going and hits the wall.

Additional information

I use an omnidirectional robot platform with one laser scanner in front (180 degrees) and an Intel T265 camera as the odometry source. During this test, slam_toolbox was running in online_async mode. I encountered the same behavior when running nav2 without SLAM on a prerecorded map.


Is there a parameter or something else that needs to be set so the wall has maximum cost and a path will never be planned through it? Picture of the wrongly planned path: image

Here is my nav2_params.yaml (truncated):

amcl:
  ros__parameters:
    use_sim_time: False
    alpha1: 0.2
    alpha2: 0.2
    alpha3: 0.25
    alpha4: 0.2
    alpha5: 0.2
    base_frame_id: "base_link"
    beam_skip_distance: 0.5
    beam_skip_error_threshold: 0.9
    beam_skip_threshold: 0.3
    do_beamskip: false
    global_frame_id: "map"
    lambda_short: 0.1
    laser_likelihood_max_dist: 2.0
    laser_max_range: 25.0
    laser_min_range: 0.05
    laser_model_type: "likelihood_field"
    max_beams: 60
    max_particles: 2000
    min_particles: 500
    odom_frame_id: "odom"
    pf_err: 0.05
    pf_z: 0.99
    recovery_alpha_fast: 0.0
    recovery_alpha_slow: 0.0
    resample_interval: 1
    robot_model_type: "omnidirectional"
    save_pose_rate: 0.5
    sigma_hit: 0.2
    tf_broadcast: true
    transform_tolerance: 1.0
    update_min_a: 0.2
    update_min_d: 0.25
    z_hit: 0.9
    z_max: 0.05
    z_rand: 0.5
    z_short: 0.05
    scan_topic: scan

amcl_map_client:
  ros__parameters:
    use_sim_time: False

amcl_rclcpp_node:
  ros__parameters:
    use_sim_time: False

bt_navigator:
  ros__parameters:
    use_sim_time: False
    global_frame: map
    robot_base_frame: base_link
    odom_topic: /odom
    default_bt_xml_filename: "navigate_w_replanning_and_recovery.xml"
    plugin_lib_names:

SteveMacenski commented 4 years ago

Please reformat your ticket and provide specific reproduction instructions in the TB3 default world (and any parameter changes from default you had to make to have this occur).

pfedom commented 4 years ago

With the turtlebot it's all still default. On the real robot, the major changes are in the DWB controller velocities and accelerations. Furthermore, the robot has a rectangular footprint rather than a circular one (radius) like the turtlebot. As mentioned in the edit: the turtlebot slows down but keeps moving and then hits the wall (see the gif I added). I think setting a bigger inflation radius in my real setup should slow down the process because the robot slows down earlier. Do I understand this right? But is it expected behavior that the path gets planned through a wall?

SteveMacenski commented 4 years ago

I don't see any attempt at planning into that space in your gif; I just see a robot driving into a wall that is inside the minimum range of the laser. It wouldn't surprise me if the laser gazebo plugin doesn't properly handle collision objects there, resulting in that little nub in the map. But I don't see any action items here.

I can't reproduce any replanning outside of valid space in the default TB3 world without doing really unreasonable things to try to get it to do so, like spawning inside walls or putting it in an impossible position.

gramss commented 4 years ago

I cannot get it to display nicely on GitHub, but I hope this video makes the issue clearer: Play Video here

SteveMacenski commented 4 years ago

This is super strange. I was just testing on my branch working on Hybrid-A* and I don't see this at all, but the current main branch does, thanks for bringing this up. So something changed after Jun 17 that lets this happen.

naiveHobo commented 4 years ago

It looks like the bug was introduced in e850db02ba88e03e2b6519e2b2ad51bbdc9f97f6 after the fix to the RecoveryNode logic. I reverted this patch and everything seems to work as expected, but only because the RecoveryNode then doesn't retry ComputePathToPose after it fails the first time and the planner-specific recovery is called (which is what this patch was supposed to fix).

Even after reverting e850db02ba88e03e2b6519e2b2ad51bbdc9f97f6, this bug can still be reproduced by increasing the number of retries in the planning branch to anything > 1:

<RecoveryNode number_of_retries="2" name="ComputePathToPose">
    <ComputePathToPose goal="{goal}" path="{path}" planner_id="GridBased"/>
    <ClearEntireCostmap name="ClearGlobalCostmap-Context" service_name="global_costmap/clear_entirely_global_costmap"/>
</RecoveryNode>

naiveHobo commented 4 years ago

So I figured out the actual reason behind this bug.

Currently, when you give a goal for which there is no valid path, the first ComputePathToPose call fails since the planner is not able to plan a valid path. The RecoveryNode then calls ClearEntireCostmap and clears the global costmap. In the very next iteration, within a fraction of a second, the RecoveryNode retries ComputePathToPose, and this time the planner plans on an empty global costmap, since the costmap was cleared and hasn't been repopulated yet. The planner is able to find a straight path to the invalid goal, ComputePathToPose succeeds, and we have the bug!!

I added a 1 second sleep just above this line to test this and it checks out; everything works as expected: https://github.com/ros-planning/navigation2/blob/506384f2bcf32978428278f65567bfaf429d6d63/nav2_planner/src/planner_server.cpp#L238
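
For illustration, here is a minimal sketch of the kind of debugging change being described; the function name is a placeholder and this is not the actual planner_server.cpp code:

// A minimal sketch of the debugging change described above; this is not the
// actual planner_server.cpp source, and the function is a placeholder.
#include <chrono>
#include <thread>

void compute_plan_after_clear_delay()
{
  // Give the global costmap roughly one update cycle to repopulate from sensor
  // data after ClearEntireCostmap, before asking the planner for a path.
  std::this_thread::sleep_for(std::chrono::seconds(1));

  // ... the existing getPlan(start, goal, planner_id) call would follow here ...
}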

SteveMacenski commented 4 years ago

Where was this originally introduced? Was it from a BT update to include clearing costmaps as a failure action in planning? I don't see how the commit hash you linked to could cause that issue (since N has always been > 1, I believe).

I see the server_timeout for clear costmaps is 10ms https://github.com/ros-planning/navigation2/blob/f237751f557b2e921e08f96d4749bfe3f2b2b091/nav2_bt_navigator/src/bt_navigator.cpp#L126 and, looking at the service caller https://github.com/ros-planning/navigation2/blob/main/nav2_behavior_tree/include/nav2_behavior_tree/bt_service_node.hpp#L102, we shouldn't be returning from the service node until after it has completed the clear costmaps action.

So I'd be curious what the return code from check_future is. If it's timeout, we just need to increase that timeout; if it's success, then there's something wrong in costmap_2d because we should not be returning from the service until things are good again. It wouldn't surprise me at all if this was actually another DDS issue and services were failing again, and it's manifesting for us in this way because it's timing out mid-execution.
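
As a standalone illustration of that check (not the bt_service_node.hpp implementation), a probe like the following calls the clear-costmap service with the 10ms timeout mentioned above and branches on the future's return code instead of assuming success; it assumes nav2_msgs/srv/ClearEntireCostmap is the service type:

// Sketch only: call the global costmap clearing service and report whether the
// call succeeded, timed out, or was interrupted.
#include <chrono>
#include <memory>
#include "rclcpp/rclcpp.hpp"
#include "nav2_msgs/srv/clear_entire_costmap.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("clear_costmap_probe");

  auto client = node->create_client<nav2_msgs::srv::ClearEntireCostmap>(
    "global_costmap/clear_entirely_global_costmap");

  auto request = std::make_shared<nav2_msgs::srv::ClearEntireCostmap::Request>();
  auto future = client->async_send_request(request);

  // Wait no longer than the server_timeout and inspect the result code.
  auto rc = rclcpp::spin_until_future_complete(
    node, future, std::chrono::milliseconds(10));

  if (rc == rclcpp::FutureReturnCode::SUCCESS) {
    RCLCPP_INFO(node->get_logger(), "clear costmap service returned success");
  } else if (rc == rclcpp::FutureReturnCode::TIMEOUT) {
    RCLCPP_WARN(node->get_logger(), "clear costmap service timed out");
  } else {
    RCLCPP_ERROR(node->get_logger(), "clear costmap service call was interrupted");
  }

  rclcpp::shutdown();
  return 0;
}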

If service calls were indeed failing mid-execution, the sleep you added would have the same effect, because the service call was received by the costmap but not finished, so when the planner is planning, the clearing is still underway, causing that overlap. If that is indeed what's happening (not that the timeout isn't big enough, but that service calls are failing), then we really need a better way to identify that situation or get around it, if this is an issue that will come up over and over again.

But it's not clear to me where this got introduced.

naiveHobo commented 4 years ago

Where was this originally introduced?

https://github.com/ros-planning/navigation2/pull/1823 fixed a RecoveryNode error because of which the main child of the recovery node was not retried after executing the second child. Before this patch, the flow of the recovery node was: first child (fails) -> second child (succeeds) -> recovery node (succeeds).

After this patch, the flow of the recovery node was changed to: first child (fails) -> second child (succeeds) -> first child (succeeds) -> recovery node (succeeds).

I don't see how the commit hash you linked to could cause that issue (since N has always been > 1, I believe).

The number of retries for the planner-specific recovery node has always been 1, which is why this issue was masked before https://github.com/ros-planning/navigation2/pull/1823:
https://github.com/ros-planning/navigation2/blob/a30816f2939e441f2054df0f40775adb76b57997/nav2_bt_navigator/behavior_trees/navigate_w_replanning_and_recovery.xml#L11

I don't think the clear costmap service returns a timeout; the service executes correctly and ends up clearing the costmap. I think this issue is caused by the time it takes to rebuild the costmap after it has been cleared. The planner server calls getPlan() right after the clear costmap service executes, which makes me think that the planner actually ends up planning on the cleared costmap before the sensor data is received and the costmap is rebuilt.

The sleep I added just before getPlan() is called gives the costmap enough time to receive sensor data and rebuild its layers, after which the planner is able to plan correctly. This should confirm that the planner ends up planning on an empty costmap if it's called right after the clear costmap service executes and returns success.

SteveMacenski commented 4 years ago

After this patch, the flow of the recovery node was changed to: first child (fails) -> second child (succeeds) -> first child (succeeds) -> recovery node (succeeds).

For the recovery node, I don't think that's good behavior, I think the original was correct. If the planner is now passing, we shouldn't be triggering the clear costmap again. Imagine this recovery was backing up or something, we'd be backing up, in a good state, and then backing up again.

The sleep I added just before getPlan() is called gives the costmap enough time to receive sensor data and rebuild its layers, after which the planner is able to plan correctly. This should confirm that the planner ends up planning on an empty costmap if it's called right after the clear costmap service executes and returns success.

I would just make sure it's not timing out, because that's something we can more easily do something about. If it really just needs time, that becomes more complicated to deal with. We'd need to keep flags between clears and updates so that we can wait until the costmap is valid again before using it.

naiveHobo commented 4 years ago

If the planner is now passing, we shouldn't be triggering the clear costmap again. Imagine this recovery was backing up or something, we'd be backing up, in a good state, and then backing up again.

I think you misinterpreted my comment. The RecoveryNode is the parent node which triggers its second child when the first child fails. The second child is assumed to be the recovery behavior that you want to execute when your main execution branch (RecoveryNode's first child) fails. Taking the planning branch as an example:

<RecoveryNode number_of_retries="1" name="ComputePathToPose">
    <ComputePathToPose goal="{goal}" path="{path}" planner_id="GridBased"/>
    <ClearEntireCostmap name="ClearGlobalCostmap-Context" service_name="global_costmap/clear_entirely_global_costmap"/>
</RecoveryNode>

Before #1823, the RecoveryNode would have returned success right after ClearEntireCostmap returned success, without re-ticking ComputePathToPose. The flow in that case would be: ComputePathToPose (fails) -> ClearEntireCostmap (succeeds). At this point, the RecoveryNode just returns success without going back to ComputePathToPose. This is also why the planning-through-walls issue never showed up before #1823.

After #1823, the flow becomes: ComputePathToPose (fails) -> ClearEntireCostmap (succeeds) -> ComputePathToPose (succeeds). The first child of the RecoveryNode is ticked again after the recovery (second child) executes successfully, and at this point the RecoveryNode returns success. This should be the expected behavior, right?
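
To make that control flow concrete, here is a compact sketch of the described retry logic; it is not the nav2_behavior_tree RecoveryNode implementation, just the behavior described above expressed as plain C++:

// first_child plays the role of ComputePathToPose, second_child the role of
// ClearEntireCostmap; number_of_retries mirrors the BT XML attribute.
#include <functional>

enum class Status { SUCCESS, FAILURE };

Status tick_recovery_node(
  const std::function<Status()> & first_child,
  const std::function<Status()> & second_child,
  int number_of_retries)
{
  for (int retry = 0; retry <= number_of_retries; ++retry) {
    if (first_child() == Status::SUCCESS) {
      return Status::SUCCESS;   // main branch worked; no (further) recovery needed
    }
    if (retry == number_of_retries) {
      break;                    // retries exhausted
    }
    if (second_child() != Status::SUCCESS) {
      break;                    // the recovery itself failed
    }
    // Recovery succeeded: loop around and re-tick the first child,
    // which is the behavior introduced by #1823.
  }
  return Status::FAILURE;
}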

I would just make sure it's not timing out, because that's something we can more easily do something about.

I checked the return code from the ClearEntireCostmap service; it returns success. I'm 100% sure now that it's the planner planning on an empty costmap.

SteveMacenski commented 4 years ago

This should be the expected behavior, right?

That does sound right, I misread the recovery node (succeeds) part - I thought it referred to the recovery behavior action. God, the nomenclature we used in this is going to be messing with my head for years to come, isn't it?

I checked the return code from the ClearEntireCostmap service; it returns success. I'm 100% sure now that it's the planner planning on an empty costmap.

Ok.

My knee-jerk solution would be to add a variable to Costmap2DROS indicating whether it was cleared, and if so, wait for a new update before planning. Or, in the clear costmap services themselves, make the last step an update of the costmap so that by the time the service returns, it has already done a sync.
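
A rough sketch of the first idea, with hypothetical names (Costmap2DROS has no such flag today, per the comment above); the point is just to block planning until the costmap has been refreshed at least once after a clear:

// Hypothetical helper, not part of the Costmap2DROS API.
#include <atomic>
#include <chrono>
#include <thread>

class CostmapClearGuard
{
public:
  // To be called by the clear-costmap service once it has wiped the layers.
  void markCleared() {cleared_.store(true);}

  // To be called by the costmap update loop after a full sensor-driven update.
  void markUpdated() {cleared_.store(false);}

  // To be called by the planner server before getPlan(): block until at least
  // one fresh update has happened since the last clear.
  void waitUntilValid() const
  {
    while (cleared_.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  }

private:
  std::atomic<bool> cleared_{false};
};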

SteveMacenski commented 4 years ago

@pfedom @naiveHobo please review / test this patch in https://github.com/ros-planning/navigation2/pull/2044

simutisernestas commented 4 years ago

@SteveMacenski I'm getting similar behavior at regular intervals. I thought that this patch might fix it, but unfortunately it doesn't help at all. The main difference is that I'm not using amcl; the map and localization are provided by SLAM at 1 Hz. The rest of the configuration is the same. Are these related by any chance? You can see the path planning failures in the attached screenshots; sometimes the path just goes straight through obstacles. I could provide a step-by-step reproduction here if that's relevant, or should I file a new issue?

Screenshot from 2020-11-02 17-53-05

Screenshot from 2020-11-02 17-53-13

Screenshot from 2020-11-02 17-53-51

SteveMacenski commented 4 years ago

I'd very much like to merge a solution for this this week; I will be more actively dealing with #2009 and #2044.

@simutisernestas please use our default configs of things to test against these patches. We don't need to introduce any additional changes to fix unrelated bugs

SteveMacenski commented 4 years ago

https://github.com/ros-planning/navigation2/pull/2090 is merging imminently to resolve this.

nedo99 commented 3 years ago

Is there a solution for Foxy? The PRs here cannot be applied to the foxy-devel branch.