Closed GoesM closed 3 months ago
So that, the bug seems to be caused because
action_server_
is still executing but mppi has been freed.
But, I'm confused about such result because I notice there's already a action_server_->deactivate()
to make sure the thread of action_server_
shutdown before cleanup.
So I check all related functions , and I noticed those lines:
It's found that computeAndPublishVelocity()
is called following while(rclcpp::ok())
But, due to our experiments before, we could know that rclcpp::ok()
might bring some exception errors for waiting threads exit, here's the link to our discussion before:
https://github.com/SteveMacenski/slam_toolbox/issues/691
Shall we also replace the rclcpp::ok()
by a bool flag ?
same happens while using RPP, see: https://github.com/ros-navigation/navigation2/issues/4438 (duplicate)
@GoesM do you see any of the following errors with deactivate? https://github.com/ros-navigation/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L305-L320
I assume you only see this one shutdown when the system is currently operating, not when the robot is staying still and you shutdown the system (otherwise that thread / action server wouldn't be active in the first place). Thus, I would think you'd see one of these 2 errors / warnings.
If we set the server active state to false in the deactive function, it should be accessible by the controller server here to be checking it, which runs every cycle. So, at most we should be 1 cycle away from checking that we're not active anymore and thus should exit the control loop. The action server should be waiting 500ms while trying to shut down (https://github.com/ros-navigation/navigation2/blob/main/nav2_util/include/nav2_util/simple_action_server.hpp#L315) which should be plenty for the controller's loop to end since the control rates should pretty much always be 10hz+.
So... this seems like we shouldn't be having an issue, regardless of rclcpp::ok
and you should be seeing error / warning logs about some failures on deactivate()
that we could use to debug further.
Firstly, sorry for my previous misunderstanding regarding this bug.
To ensure the accuracy of my fix, I conducted several experiments to verify my hypothesis.
After these experiments, I invalidated my previous analysis, but fortunately, I now have a very definite answer.
This is still an issue caused by costmap_ros_
as an independent lifecycle node responding to the Ctrl+C
signal before nav2_controller
, resulting in a null pointer bug.
Therefore, this issue and #4438 are actually additional tickets to issue #4437, which will be resolved through a single PR.
Adding a delay before rclcpp::ok()
does not trigger the bug.
std::this_thread::sleep_for(std::chrono::seconds(2));
rclcpp::WallRate loop_rate(controller_frequency_);
while (rclcpp::ok()) {
Adding a delay after rclcpp::ok()
but before computeVelocityCommands()
almost always triggers the bug.
updateGlobalPath();
std::this_thread::sleep_for(std::chrono::seconds(2));
computeAndPublishVelocity();
Inserting a delay at the same position as in Experiment 2 but replacing rclcpp::ok()
with a bool flag
method still almost always triggers the bug.
This indicates that rclcpp::ok()
is not the cause. My previous speculative analysis was incorrect.
Inserting a delay before computeVelocityCommands()
makes the deactivate
process of action_server_
very slow, while costmap_ros_
, as an independent ROS2-LifecycleNode, has already responded to the Ctrl+C
signal;
At this time, nav2_controller
is attempting to execute computeVelocityCommands()
and will call different controller plugins. However, plugins like mppi_controller
, regulated_pure_pursuit_controller
, and the default dwb_controller
all call functions of costmap_ros_->()
, leading to null pointer access.
regulated_pure_pursuit_controller
accesses costmap_ros_->getCostmap()
:mppi_controller
accesses costmap_ros_->worldToMap()
:dwb_controller
accesses costmap_ros_->getSizeInCellsX()
:Merged!
Bug report
Required Info:
Steps to reproduce issue
The bug occurs during normal use of nav2_mppi_controller. By changing the controller plugin to MPPI while keeping the rest of the configuration the same as the official default, start navigation2. The YAML configuration for changing the controller plugin to MPPI is:
Expected behavior
no NullPtr bug occured.
Actual behavior
the Asan report of this NullPtr bug is as following:
Additional information
It's a shutdown-issue
First, based on my execution logs, I can confirm this is a shutdown issue.
the casue analysis
Below is an analysis of the cause of this bug:
The actionserver binds the
computeControl()
function as a callback function:https://github.com/ros-navigation/navigation2/blob/635880d3d6ad1f23eb75d793409189c653797f58/nav2_controller/src/controller_server.cpp#L228-L234
Therefore, when the
action_server_
receives a task command, it will invoke the function pathcomputeControl()
->computeAndPublishVelocity()
->mppi::PathHandler::transformPath()
-> .......NOTICE: from here, the
action_server_
access the internal memory ofnav2_mppi_controller
, during thecomputeControl()
execution.So that, the bug seems to be caused because
action_server_
is still executing butmppi
has been freed.