ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.52k stars 1.28k forks source link

Reconfiguring planner server fails #3768

Closed Aposhian closed 1 year ago

Aposhian commented 1 year ago

Bug report

Required Info:

Steps to reproduce issue

ros2 launch nav2_bringup navigation.launch.py
ros2 lifecycle set /planner_server deactivate
ros2 lifecycle set /planner_server cleanup
ros2 lifecycle set /planner_server configure

Expected behavior

Planner server reconfigures with no issue

Actual behavior

Planner server throws an exception and crashes.

[planner_server-3] [INFO] [1692651875.999218146] [planner_server]: Configuring
[planner_server-3] [INFO] [1692651875.999248989] [global_costmap.global_costmap]: Configuring
[planner_server-3] [INFO] [1692651876.000645715] [global_costmap.global_costmap]: Using plugin "static_layer"
[planner_server-3] [INFO] [1692651876.000701514] [global_costmap.global_costmap]: Subscribing to the map topic (/map) with transient local durability
[planner_server-3] [INFO] [1692651876.000898492] [global_costmap.global_costmap]: Initialized plugin "static_layer"
[planner_server-3] [INFO] [1692651876.000919229] [global_costmap.global_costmap]: Using plugin "obstacle_layer"
[planner_server-3] [INFO] [1692651876.000967905] [global_costmap.global_costmap]: Subscribed to Topics: scan
[planner_server-3] [INFO] [1692651876.001283202] [global_costmap.global_costmap]: Initialized plugin "obstacle_layer"
[planner_server-3] [INFO] [1692651876.001304907] [global_costmap.global_costmap]: Using plugin "inflation_layer"
[planner_server-3] [INFO] [1692651876.001413722] [global_costmap.global_costmap]: Initialized plugin "inflation_layer"
[planner_server-3] terminate called after throwing an instance of 'std::runtime_error'
[planner_server-3]   what():  Node '/global_costmap/global_costmap' has already been added to an executor.
[planner_server-3] [INFO] [1692651876.003720405] [planner_server]: Created global planner plugin GridBased of type nav2_navfn_planner/NavfnPlanner

Additional information

I think the underlying reason for this might be that planner server runs costmap_ros in its own thread, while costmap_ros already creates its own thread. https://github.com/ros-planning/navigation2/blob/6ef3d7bfd0f617bd0751cdbe7a58c3f9f66fe882/nav2_planner/src/planner_server.cpp#L88 https://github.com/ros-planning/navigation2/blob/6ef3d7bfd0f617bd0751cdbe7a58c3f9f66fe882/nav2_costmap_2d/src/costmap_2d_ros.cpp#L253

SteveMacenski commented 1 year ago

Thanks for the re-report that this is still an issue (I closed the original report of this since I thought the Humble sync's change resolved this).

while costmap_ros already creates its own thread

That thread is for the TF2 buffer updates and the plugins themselves, not the core costmap which the planner server version entails. You can note that by the callback group that it is given. The thread established in the planner/controller server deals with things like the footprint subscription/publication, the costmap publication, and clear costmap service.

While its a little messy, I was concerned that long-updates in the costmap utils would block updating of sensor data to the individual layers causing data to be dropped (which is bad for the costmap to do!). We might be able to re-add this all into one executor again and make costmap layers each themselves handle their own executor, but I decided at the time that N executors wasn't a great option in terms of compute overhead.

Anyway... its been like that for a long time now and I don't think that the fact that there are the 2 executors is the cause of any problem or else we'd see this happen on the first startup as well. The issue we have is on resetting only, so I think something's not being shutdown / reset that needs to be.

The update from 1.1.9 is that we moved the planner/controller server node threads to be in configure not the constructor so that resetting could work https://github.com/ros-planning/navigation2/pull/3548 @BriceRenaudeau do you see this problem?

Perhaps we need to have the costmap configure happen after the costmapthread creation in on_configure of the planner/controller server? Or there's something not being reset properly (maybe an explicit executor_ / callbackgroup reset?).

BriceRenaudeau commented 1 year ago

We moved to Iron and we don't have this issue.

We are using the nav2_lifecycle_manager and nav2_lifecycle_manager_client to activate/reset the navigation nodes.

SteveMacenski commented 1 year ago

@Aposhian can you confirm you don't see this on Iron? If so, that gives me a relatively easy diff to look at for the solution. I have Jury Duty next week (ugh) so I should have some downtime to look at this between events

Aposhian commented 1 year ago

I can check on iron

Aposhian commented 1 year ago

Yes I can confirm this issue doesn't show up on iron, but it does on humble. I confirmed with the following for both humble and iron:

docker run -it --rm --name nav2-test ros:$ROS_DISTRO
apt update && apt install ros-$ROS_DISTRO-navigation2 ros-$ROS_DISTRO-nav2* ros-$ROS_DISTRO-rmw-cyclonedds-cpp
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 launch nav2_bringup navigation_launch.py
docker exec -it nav2-test /ros_entrypoint.sh bash
ros2 lifecycle set /planner_server cleanup
ros2 lifecycle set /planner_server configure
SteveMacenski commented 1 year ago

Did this not used to happen in Humble or is this just something new you're experiencing because you're trying something new (e.g. checking if a regression or just uncovering an issue)?

If uncovering, I suspect why Iron works is due to https://github.com/ros-planning/navigation2/commit/4aaefad1162d5edcba26878cdb4a2f11322ada99 fixing some lifecycle transition items which wasn't backported to Humble due to ABI breaking changes. Its possible however that we could find another way to do it in Humble with less major changes.

If its a regression, I'm at a bit of a loss unless it has nothing to do with Nav2 at all but rclcpp. In which case, I'd ask you to try the second-to-last Humble release to see if its still an issue to show that its not a change we made but something lower-level. At that point, a trace would be the only way to see what line is specifically triggering and try to backtrack from there

Aposhian commented 1 year ago

I'm not sure if this is something new to the latest humble release. This was triggered by controller_server segfaulting unexpectedly and triggering a lifecycle reset of everything. I'll file an issue on that separate cause when I have more info.

SteveMacenski commented 1 year ago

I'm going to play with this this afternoon and figure something out

SteveMacenski commented 1 year ago

@Aposhian try this on for size, seems to work for me: https://github.com/ros-planning/navigation2/pull/3786

SteveMacenski commented 1 year ago

Closing, issue resolved