ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.52k stars 1.28k forks source link

A large value for the inflation-radius parameter in globalcostmap could cause WRITE-memory-access-error. #4062

Closed GoesM closed 8 months ago

GoesM commented 8 months ago

Bug report

Required Info:

Steps to reproduce issue

To adapt to scenarios where the robot within high-speed needs to decelerate in advance, we tried setting a larger inflation_radius for the global_costmap in the configuration and launch navigation2 under this setup.

Here is our launch command:

source install/setup.bash
source /opt/ros/humble/setup.bash
export TURTLEBOT3_MODEL=waffle
export GAZEBO_MODEL_PATH=$GAZEBO_MODEL_PATH:/opt/ros/humble/share/turtlebot3_gazebo/models
ros2 launch nav2_bringup tb3_simulation_launch.py params_file:=my_nav2_params.yaml

there's only one difference between my_nav2_params.yaml and defaulted nav2_params.yaml, ( we specially set the inflation_radius of global_costmap as 2333), as following:

#my_nav2_params.yaml
......
global_costmap:
      ......
      inflation_layer:
        inflation_radius: 2333.3
    ......

Expected behavior

We expected to see the robot's performance within a larger inflation_radius, but we encountered a program crash instead.

Actual behavior

We found WRITE-memory-access-error during the costmap calculation process.

Additional information

asan-report

In order to identify the cause of the program crash, we attempted to use the Asanitizer to capture detailed information about the crash,as following:

#asan_report
=================================================================
==87154==ERROR: AddressSanitizer: SEGV on unknown address 0x7f75b3e9e000 (pc 0x7f75bef84aca bp 0x0000081bbafa sp 0x7fffc773d110 T0)
==87154==The signal is caused by a WRITE memory access.
    #0 0x7f75bef84aca in nav2_costmap_2d::InflationLayer::computeCaches() (/home/.../ROS_ws/nav2/install/nav2_costmap_2d/lib/liblayers.so+0x184aca) (BuildId: 053d6c4190bf3c85c0b7d1884d980b5209d377f4)
    #1 0x7f75bef84863 in nav2_costmap_2d::InflationLayer::matchSize() (/home/.../ROS_ws/nav2/install/nav2_costmap_2d/lib/liblayers.so+0x184863) (BuildId: 053d6c4190bf3c85c0b7d1884d980b5209d377f4)
    #2 0x7f75bef8379a in nav2_costmap_2d::InflationLayer::onInitialize() (/home/.../ROS_ws/nav2/install/nav2_costmap_2d/lib/liblayers.so+0x18379a) (BuildId: 053d6c4190bf3c85c0b7d1884d980b5209d377f4)
    #3 0x7f75bed22088 in nav2_costmap_2d::Layer::initialize(nav2_costmap_2d::LayeredCostmap*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tf2_ros::Buffer*, std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::shared_ptr<rclcpp::CallbackGroup>) (/home/.../ROS_ws/nav2/install/nav2_costmap_2d/lib/libnav2_costmap_2d_core.so+0xd7088) (BuildId: 1ff17a2f2827088023e2e75f00edafe9ef8d043c)
    #4 0x7f75bed2c18a in nav2_costmap_2d::Costmap2DROS::on_configure(rclcpp_lifecycle::State const&) (/home/.../ROS_ws/nav2/install/nav2_costmap_2d/lib/libnav2_costmap_2d_core.so+0xe118a) (BuildId: 1ff17a2f2827088023e2e75f00edafe9ef8d043c)
    #5 0x7f75bf8ddb8c  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x28b8c) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #6 0x7f75bf8e8a89  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x33a89) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #7 0x7f75bf8dd365 in rclcpp_lifecycle::LifecycleNode::configure() (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x28365) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #8 0x7f75bf56b9e8 in nav2_planner::PlannerServer::on_configure(rclcpp_lifecycle::State const&) (/home/rozz-gy/ROS_ws/fuzz_nav2/install/nav2_planner/lib/libplanner_server_core.so+0x16b9e8) (BuildId: 9577178dbbd58bcd57de6e7a3301f7aadab1ef76)
    #9 0x7f75bf8ddb8c  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x28b8c) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #10 0x7f75bf8e8a89  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x33a89) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #11 0x7f75bf8d5dbf in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >) (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x20dbf) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #12 0x7f75bf8d7985 in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::* (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&) (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x22985) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #13 0x7f75bf8e6735  (/opt/ros/humble/lib/librclcpp_lifecycle.so+0x31735) (BuildId: ccb1b39efb3eaee63de490386a8850a5752f307d)
    #14 0x7f75be4c8295  (/opt/ros/humble/lib/librclcpp.so+0xea295) (BuildId: 79589eb7552577357ad155ddcc914b38e18f0811)
    #15 0x7f75be4c5c79 in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>) (/opt/ros/humble/lib/librclcpp.so+0xe7c79) (BuildId: 79589eb7552577357ad155ddcc914b38e18f0811)
    #16 0x7f75be4c5fe5 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) (/opt/ros/humble/lib/librclcpp.so+0xe7fe5) (BuildId: 79589eb7552577357ad155ddcc914b38e18f0811)
    #17 0x7f75be4cd89f in rclcpp::executors::SingleThreadedExecutor::spin() (/opt/ros/humble/lib/librclcpp.so+0xef89f) (BuildId: 79589eb7552577357ad155ddcc914b38e18f0811)
    #18 0x7f75be4cdab4 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) (/opt/ros/humble/lib/librclcpp.so+0xefab4) (BuildId: 79589eb7552577357ad155ddcc914b38e18f0811)
    #19 0x558caf9bc975 in main (/home/.../ROS_ws/nav2/install/nav2_planner/lib/nav2_planner/planner_server+0xeb975) (BuildId: 7a20b60ecc77aeff2e67a6ba4a294fcdddbdd0a2)
    #20 0x7f75bd229d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #21 0x7f75bd229e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #22 0x558caf8fc514 in _start (/home/.../ROS_ws/nav2/install/nav2_planner/lib/nav2_planner/planner_server+0x2b514) (BuildId: 7a20b60ecc77aeff2e67a6ba4a294fcdddbdd0a2)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV (/home/.../ROS_ws/nav2/install/nav2_costmap_2d/lib/liblayers.so+0x184aca) (BuildId: 053d6c4190bf3c85c0b7d1884d980b5209d377f4) in nav2_costmap_2d::InflationLayer::computeCaches()
==87154==ABORTING
our analysis

focus on code lines : inflation_layer.cpp#L354-L361

We noticed that, vector<> is used here for computing caches. However, if the inflation_radius is larger than 2000, the size of vector (is equal to cache_length_ * cache_length_ ) would be larger than 10^9, which could lead to issues with memory allocation for the vector.

In addition, we have a try to change these code lines as following, to empirically confirm our correct identification of the error in the code.

  // based on the inflation radius... compute distance and cost caches
  if (cell_inflation_radius_ != cached_cell_inflation_radius_) {
    cached_costs_.resize(cache_length_ * cache_length_);
    cached_distances_.resize(cache_length_ * cache_length_);

    // code for debug
   std::cout << "before write" << std::endl;
   cached_distances_[ (cache_length_ -1) * cache_length_ - 1 ] = 1.0;
   std::cout << "after write" << std::endl;
   //  end 

    for (unsigned int i = 0; i < cache_length_; ++i) {
      for (unsigned int j = 0; j < cache_length_; ++j) {
        cached_distances_[i * cache_length_ + j] = hypot(i, j);
      }
    }

as a result, we catched the output as following:

before write
Segmentation fault (core dumped)

It seems that using the vector<>.resize() function here may not handle the corresponding memory allocation correctly.

suggestion:

If the navigation2-stack is willing to support calculations with such large data, could we consider changing the vector to use a Linked List Array to prevent such memory errors?

SteveMacenski commented 8 months ago

2km is an outrageous size for an inflation radius - you don't have a 2km wide costmap, do you?

I don't consider that to be a legit solution to the problem you're seeking to address and I wouldn't support messing with lower-level data structures that have been proven in production for 10+ years to support that :wink:

Moreover, inflation radius is a double, not an int, so I'm not sure that your snippet actually works with ROS 2 static parameter type declarations.

It seems that using the vector<>.resize() function here may not handle the corresponding memory allocation correctly.

That seems to point that your problem. If my back of the envelope calculations are right, (233320)^2 is the size of the cache then. If we have doubles, thats `8` bytes. That is over 8.7GB on RAM. That's a ridiculous request unless you're on a high performance computing platform with 20+ GB of RAM. This doesn't seem to point to any issue with Nav2, but what you're doing is probably a little misguided :-)

GoesM commented 8 months ago

Moreover, inflation radius is a double, not an int, so I'm not sure that your snippet actually works with ROS 2 static parameter type declarations.

Sorry, I made a typo here. The input we used was 2333.3, and I have corrected it in the original text

2km is an outrageous size for an inflation radius - you don't have a 2km wide costmap, do you?

I am curious about how navigation2-stack performs on micro robots, and I consider the unit of numbers to be mm during use, which is why I use large values like 2k (which means it represents 2m instead of 2km).

That's a ridiculous request unless you're on a high performance computing platform with 20+ GB of RAM.

It makes sense! It's not very possible to deploy 20+ GB of RAM on a micro robot, so this is indeed caused by my incorrect use.

Thanks for your patience : )

GoesM commented 5 months ago

I think it's also a ticket for #4005 , so I link it to that issue.