ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.52k stars 1.28k forks source link

BUG REPORT [ the NullPtr bug within communication between two pkgs -- nav2_planner & nav2_costmap_2d] #3940

Closed GoesM closed 11 months ago

GoesM commented 11 months ago

<1> Bug Description

Bug Type: NullPtr referenced

nav2_planner referenced a NULL-Pointer of nav2_costmap_2d

workstation_environment set

ROS system navigation2 TURTLEBOT3_MODEL
ROS2-humble nav2_humble "waffle"

code location

/navigation2/nav2_costmap_2d/src/layered_costmap.cpp

/navigation2/nav2_planner/src/planner_server.cpp

function isCurrent() from /navigation2/nav2_costmap_2d/src/layered_costmap.cpp

bool LayeredCostmap::isCurrent()
{
  current_ = true;
  for (vector<std::shared_ptr<Layer>>::iterator plugin = plugins_.begin();
    plugin != plugins_.end(); ++plugin)
  {
    current_ = current_ && ((*plugin)->isCurrent() || !(*plugin)->isEnabled());
  }
  for (vector<std::shared_ptr<Layer>>::iterator filter = filters_.begin();
    filter != filters_.end(); ++filter)
  {
    current_ = current_ && ((*filter)->isCurrent() || !(*filter)->isEnabled());
  }
  return current_;
}

isCurrent() is accessed by /navigation2/nav2_planner/src/planner_server.cpp

void PlannerServer::waitForCostmap()
{
  // Don't compute a plan until costmap is valid (after clear costmap)
  rclcpp::Rate r(100);
  while (!costmap_ros_->isCurrent()) {
    r.sleep();
  }
}

<2> References [ log_files ]

More details are provided here.

function calling stack [by Asan report]

=================================================================
==23106==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000000a1 (pc 0x7f18e85ce4e0 bp 0x7f18da8b7af0 sp 0x7f18da8b7a18 T54)
==23106==The signal is caused by a WRITE memory access.
==23106==Hint: address points to the zero page.
    #0 0x7f18e85ce4e0 in nav2_costmap_2d::LayeredCostmap::isCurrent() (/.../..../...../navigation2/install/nav2_costmap_2d/lib/libnav2_costmap_2d_core.so+0xda4e0) (BuildId: 41e31d3f75e7a644fd0534297548b67bd616808d)
    #1 0x7f18e8d79c87 in nav2_planner::PlannerServer::waitForCostmap() (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x179c87) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #2 0x7f18e8d70740 in nav2_planner::PlannerServer::computePlan() (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x170740) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #3 0x7f18e8e74909 in nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::work() (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x274909) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #4 0x7f18e8e73d88 in std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::ComputePathToPose> >)::'lambda'()> >, void>::operator()() const (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x273d88) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #5 0x7f18e8e73aa9 in std::enable_if<is_invocable_r_v<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::ComputePathToPose> >)::'lambda'()> >, void>&>, std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> >::type std::__invoke_r<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter>, std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::ComputePathToPose> >)::'lambda'()> >, void>&>(std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::ComputePathToPose> >)::'lambda'()> >, void>&) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x273aa9) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #6 0x7f18e8e73930 in std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::ComputePathToPose> >)::'lambda'()> >, void> >::_M_invoke(std::_Any_data const&) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x273930) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #7 0x7f18e8e73546 in std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x273546) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #8 0x7f18e6a99ee7 in __pthread_once_slow nptl/./nptl/pthread_once.c:116:7
    #9 0x7f18e8e7170a in std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::handle_accepted(std::shared_ptr<rclcpp_action::ServerGoalHandle<nav2_msgs::action::ComputePathToPose> >)::'lambda'()> >, void>::_M_run() (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x27170a) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #10 0x7f18e6edc252  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdc252) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)
    #11 0x7f18e6a94ac2 in start_thread nptl/./nptl/pthread_create.c:442:8
    #12 0x7f18e6b26a3f  misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
...
...
...
>Thread T15 created by T0 here:
    #0 0x55be89e317dc in __interceptor_pthread_create (/.../..../...../navigation2/install/nav2_planner/lib/nav2_planner/planner_server+0x977dc) (BuildId: 595dafd2fd88fddc63cc968e55abb47694f6d722)
    #1 0x7f18e6edc328 in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State> >, void (*)()) (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdc328) (BuildId: e37fe1a879783838de78cbc8c80621fa685d58a2)
    #2 0x7f18e8e5262c in nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::SimpleActionServer(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeClockInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeLoggingInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeWaitablesInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>, std::function<void ()>, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, rcl_action_server_options_s const&) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x25262c) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #3 0x7f18e8e4fd78 in nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>::SimpleActionServer<std::shared_ptr<nav2_util::LifecycleNode> >(std::shared_ptr<nav2_util::LifecycleNode>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()>, std::function<void ()>, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool, rcl_action_server_options_s const&) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x24fd78) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #4 0x7f18e8d9c42d in std::_MakeUniq<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose> >::__single_object std::make_unique<nav2_util::SimpleActionServer<nav2_msgs::action::ComputePathToPose>, std::shared_ptr<nav2_util::LifecycleNode>, char const (&) [21], std::_Bind<void (nav2_planner::PlannerServer::* (nav2_planner::PlannerServer*))()>, std::nullptr_t, std::chrono::duration<long, std::ratio<1l, 1000l> >, bool>(std::shared_ptr<nav2_util::LifecycleNode>&&, char const (&) [21], std::_Bind<void (nav2_planner::PlannerServer::* (nav2_planner::PlannerServer*))()>&&, std::nullptr_t&&, std::chrono::duration<long, std::ratio<1l, 1000l> >&&, bool&&) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x19c42d) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #5 0x7f18e8d6e702 in nav2_planner::PlannerServer::on_configure(rclcpp_lifecycle::State const&) (/.../..../...../navigation2/install/nav2_planner/lib/libplanner_server_core.so+0x16e702) (BuildId: 88d2332dc0a7b289359c7c3dfdca327d13582e9f)
    #6 0x7f18e91a0b8c  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x28b8c) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
==23106==ABORTING

planner_server work_log [by ros_node_log]

We have met the same bug more than 50 times totally, just in one week

Each time, the planner_server met the bug after a [INFO] as: "Received request to clear entirely the global_costmap",

The following is representative log_files of planner_server from these 50 attempts

...
...
[INFO] [*****3176.955120154] [planner_server]: Configuring plugin GridBased of type NavfnPlanner
[INFO] [*****3176.955580448] [planner_server]: Planner Server has GridBased  planners available.
[INFO] [*****3177.746643121] [planner_server]: Activating
[INFO] [*****3177.746690562] [global_costmap.global_costmap]: Activating
[INFO] [*****3177.746702439] [global_costmap.global_costmap]: Checking transform
[INFO] [*****3177.746767047] [global_costmap.global_costmap]: Timed out waiting for transform from base_link to map to become available, tf error: Lookup would require extrapolation into the past.  Requested time 0.647000 but the earliest data is at time 1.201000, when looking up transform from frame [base_link] to frame [map]
[INFO] [*****3178.246867321] [global_costmap.global_costmap]: Timed out waiting for transform from base_link to map to become available, tf error: Lookup would require extrapolation into the past.  Requested time 1.157000 but the earliest data is at time 1.201000, when looking up transform from frame [base_link] to frame [map]
[INFO] [*****3178.747014193] [global_costmap.global_costmap]: start
[INFO] [*****3179.798119907] [planner_server]: Activating plugin GridBased of type NavfnPlanner
[INFO] [*****3179.799260621] [planner_server]: Creating bond (planner_server) to lifecycle manager.
[INFO] [*****3180.284098882] [global_costmap.global_costmap]: Message Filter dropping message: frame 'base_scan' at time 0.601 for reason 'the timestamp on the message is earlier than all the data in the transform cache'
[WARN] [*****3196.562101008] [planner_server]: GridBased: failed to create plan with tolerance 0.50.
[WARN] [*****3196.562149795] [planner_server]: Planning algorithm GridBased failed to generate a valid path to (0.56, 1.51)
[WARN] [*****3196.562165888] [planner_server]: [compute_path_to_pose] [ActionServer] Aborting handle.
[INFO] [*****3196.580592290] [global_costmap.global_costmap]: Received request to clear entirely the global_costmap
[INFO] [*****3196.642112157] [global_costmap.global_costmap]: Received request to clear entirely the global_costmap
[WARN] [*****3196.753239141] [planner_server]: GridBased: failed to create plan with tolerance 0.50.
[WARN] [*****3196.753289160] [planner_server]: Planning algorithm GridBased fa

at this line, the planner_server shutdown suddenly.

<3> Analysis

In what situations bugs would happen?

When executing instructions [ nav2_goal action sent by user ], planner-node need the current costmap-result so that would access the pointer variable [ costmapros ] from nav2_costmap_2d

However, it seems that there's no checks by planner_server before accessing the pointer,

There may be a coincidental collision causing the bug:

nav2_costmap_2d node happens to have reached its lifecycle or is undergoing recalculation due to changes in sensors_msg (like odom and scan) . At this stage, the pointer may have been released, and at this point, the planner_server coincidentally called its pointer for the need of executing an action instruction , finally resulting in a null pointer access.

<4> POC design

This seems to be a concurrency problem caused by multithreaded execution, with a high frequency of bugs but also full of coincidences, so we cannot provide a 100% successful POC design. But we can provide some ideas to try and trigger the bug:

Method 1:

At present, it seems that when a certain situation occurs (such as changes in sensor information or end of life cycle), it will lead to nav2 costmap 2d's operation -- "clear entirely the global_costmap".

Therefore, it could be tried:

when program executing nav2goal aciton normally, we could send some interference sensor messages, constantly make nav2 costmap_2d perform the operations of clearing or recalculating, at the same time observing if the same bug will occur.

Method 2:

this method is just used for checking the bug, but not a real POC:

when program executing nav2_goal aciton normally, try to restart nodes related to nav2costmap 2d;

SteveMacenski commented 11 months ago

I don't understand what you think is the problem.

waitForCostmap is called once the system is lifecycle activated and processing. All plugins and filters have been initialized and started processing at this point. There should be no situation where these pointers are invalid.

They are not, however, thread locked it does not appear, so perhaps you're referring to memory corruptions with enabled_ and current_?

Your report shows that one of the traces shows on_configure - which I see no possible way waitForCostmap is called in that method or how that could possibly relate to repeated clearing costmap operations.

Thus, your report to me is unclear and I don't see a potential problem / resolution without some more information from you @GoesM

GoesM commented 11 months ago

we've confirmed that sometimes members in layered_costmap_ ( like plugins_, filters_ ) in nav2_planner/src/planner_server.cpp were changed into invalid pointers after the function isCurrent() was referenced, and then it always lead to a crash of the program.

  bool isCurrent()
  {
    return layered_costmap_->isCurrent();
  }

Sorry that our question wasn't specific enough. We will check and provide more details, Thanks for replying.

GoesM commented 11 months ago

here's our PR #3958, to decribe the bug more clearly and the solution. Check it please : ) @SteveMacenski

SteveMacenski commented 11 months ago

Closing ticket, moving discussion to PR which should be merged after you fix some linting / rebase