ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.46k stars 1.24k forks source link

AMCL core dumps due to a race condition when used with the map server. #4537

Closed techtasie closed 1 month ago

techtasie commented 1 month ago

Bug report

Required Info:

Steps to reproduce issue

In the core dump, you see that map_ can be a null pointer and therefore you get an unhandled null pointer exception. Uncommenting those lines circumvented the crash for us:

diff --color -ur navigation2-release-release-jazzy-nav2_amcl-1.3.1-1/src/amcl_node.cpp navigation2-release-release-jazzy-nav2_amcl-1.3.1-1-new/src/amcl_node.cpp
--- navigation2-release-release-jazzy-nav2_amcl-1.3.1-1/src/amcl_node.cpp   2024-06-25 20:01:09.000000000 +0200
+++ navigation2-release-release-jazzy-nav2_amcl-1.3.1-1-new/src/amcl_node.cpp   2024-07-15 17:43:12.792477289 +0200
@@ -540,13 +540,13 @@
       global_frame_id_.c_str());
     return;
   }
-  if (abs(msg->pose.pose.position.x) > map_->size_x ||
-    abs(msg->pose.pose.position.y) > map_->size_y)
-  {
-    RCLCPP_ERROR(
-      get_logger(), "Received initialpose from message is out of the size of map. Rejecting.");
-    return;
-  }
+  // if (abs(msg->pose.pose.position.x) > map_->size_x ||
+  //   abs(msg->pose.pose.position.y) > map_->size_y)
+  // {
+  //   RCLCPP_ERROR(
+  //     get_logger(), "Received initialpose from message is out of the size of map. Rejecting.");
+  //   return;
+  // }

   // Overriding last published pose to initial pose
   last_published_pose_ = *msg;

There is a race condition between the receiving of the map topic subscriber and setting the initial pose. On our robot, this race condition almost always happens but could be hard to reproduce on other setups.

Expected behavior

AMCL localization at initial pose or at least not crashing.

Actual behavior

AMCL coredumps

Additional information

I have attached our core dump. For the core dump, I rebuilt only AMCL in Debug mode; everything else was built as ReleaseWithDebug. core.amcl.1000.c11f12581e0949ccb1dcd9e3d30b45d0.13740.1721053490000000.tar.gz We are using a custom map server that may experience a slight delay when starting up. Therefore, AMCL could be launched without a map.

SteveMacenski commented 1 month ago

That was recently added in https://github.com/ros-navigation/navigation2/pull/4416 by @GoesM during his fuzzing testing.

That makes sense, I think we should check if map_ != nullptr as part of that if statement before dereferencing it. @techtasie can you submit that change and make sure it solves your issue?

i.e.

if (map_ && (abs(msg->pose.pose.position.x) > map_->size_x || abs(msg->pose.pose.position.y) > map_->size_y))
techtasie commented 1 month ago

@SteveMacenski Thank you for your fast reply I will test it thoroughly tomorrow during robocup.

GoesM commented 1 month ago

Oh, so sorry for that! This is entirely caused by my mistake! I didn't notice that this map might be a null pointer.

techtasie commented 1 month ago

Don't we need to still check for the nullptr like this?

  if ( (!first_map_received_) || (map_ && (abs(msg->pose.pose.position.x) > map_->size_x ||
    abs(msg->pose.pose.position.y) > map_->size_y))
  {
    RCLCPP_ERROR(
      get_logger(), "Received initialpose from message is out of the size of map. Rejecting.");
    return;
  }
techtasie commented 1 month ago

I thing the way it is supposed to work is that the initial pose fires directly after the map has been received. Because in some circumstances you want to set the initial pose but get the map a second or so after amcl launch.

GoesM commented 1 month ago

I thing the way it is supposed to work is that the initial pose fires directly after the map has been received. Because in some circumstances you want to set the initial pose but get the map a second or so after amcl launch.

I see. Thanks for your share √ 😀

SteveMacenski commented 1 month ago

I think in that case we just need to check for map_ != nullptr, no? That would have been the behavior prior to the change (i.e. could set the initial pose before a map) so I think that's solid to continue to do

GoesM commented 1 month ago

That would have been the behavior prior to the change

no problem, I misunderstood something before and I've understood it now.

SteveMacenski commented 1 month ago

https://github.com/ros-navigation/navigation2/pull/4605 to resolve. thanks for the note @techtasie !