ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.53k stars 1.28k forks source link

planner_server dies when clearing costmap: Segmentation fault in nav2_voxel_grid #2781

Closed janx8r closed 2 years ago

janx8r commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

I use the package turtlebot3_navigation2 to let a simulated turtlebot navigate in an environment with many dynamic obstacles (simulated humans). In long simulations, where after reaching the goal a new goal is set again and again, this error happens sometimes (about once in 10 tries). Subjectively, the error usually happens when many dynamic obstacles come close to the robot.

Expected behavior

The planner_server should not crash when clearing the costmap.

Actual behavior

In various situations (but probably not all situations) where a costmap (both local and global) is to be cleared, planner_server crashes. 4s later all nodes bonded to it are stopped. The robot does not move anymore.

Additional information

Here are the outputs from all nodes except planner_server:

[controller_server-13] [INFO] [1642770375.384868426] [controller_server]: Passing new path to controller.
[controller_server-13] [INFO] [1642770376.434876989] [controller_server]: Passing new path to controller.
[controller_server-13] [INFO] [1642770377.435171565] [controller_server]: Passing new path to controller.
[controller_server-13] [INFO] [1642770378.484868835] [controller_server]: Passing new path to controller.
[controller_server-13] [INFO] [1642770379.484867185] [controller_server]: Passing new path to controller.
[lifecycle_manager-18] [INFO] [1642770385.536995074] [lifecycle_manager_navigation]: Have not received a heartbeat from planner_server.
[lifecycle_manager-18] [ERROR] [1642770385.537065639] [lifecycle_manager_navigation]: CRITICAL FAILURE: SERVER planner_server IS DOWN after not receiving a heartbeat for 4000 ms. Shutting down related nodes.

Here are the gdb outputs of planner_server:

[WARN] [1642770380.484649057] [planner_server]: GridBased: failed to create plan with tolerance 0.50.
[WARN] [1642770380.484748077] [planner_server]: Planning algorithm GridBased failed to generate a valid path to (4.00, 4.00)
[WARN] [1642770380.484793487] [planner_server_rclcpp_node]: [compute_path_to_pose] [ActionServer] Aborting handle.
[Thread 0x7fffa3fff700 (LWP 74252) exited]
[INFO] [1642770380.502613407] [global_costmap.global_costmap]: Received request to clear entirely the global_costmap
[New Thread 0x7fffc4ff9700 (LWP 74253)]
--Type <RET> for more, q to quit, c to continue without paging--

Thread 18 "planner_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc5ffb700 (LWP 54874)]
nav2_voxel_grid::VoxelGrid::ClearVoxelInMap::operator() (z_mask=32768, 
    offset=4294967039, this=<synthetischer Zeiger>)
    at /home/jan/social_navigation/webots-menge-ros/ws_galactic/src/navigation2/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp:380
380           *col &= ~(z_mask);  // clear unknown and clear cell

It seems that the raytraceLine function triggers a segmentation fault in rare cases.

My configuration file is burger.yaml from ros-galactic-turtlebot3-navigation2 (source apt):

amcl:
  ros__parameters:
    use_sim_time: False
    alpha1: 0.2
    alpha2: 0.2
    alpha3: 0.2
    alpha4: 0.2
    alpha5: 0.2
    base_frame_id: "base_footprint"
    beam_skip_distance: 0.5
    beam_skip_error_threshold: 0.9
    beam_skip_threshold: 0.3
    do_beamskip: false
    global_frame_id: "map"
    lambda_short: 0.1
    laser_likelihood_max_dist: 2.0
    laser_max_range: 100.0
    laser_min_range: -1.0
    laser_model_type: "likelihood_field"
    max_beams: 60
    max_particles: 2000
    min_particles: 500
    odom_frame_id: "odom"
    pf_err: 0.05
    pf_z: 0.99
    recovery_alpha_fast: 0.0
    recovery_alpha_slow: 0.0
    resample_interval: 1
    robot_model_type: "differential"
    save_pose_rate: 0.5
    sigma_hit: 0.2
    tf_broadcast: true
    transform_tolerance: 1.0
    update_min_a: 0.2
    update_min_d: 0.25
    z_hit: 0.5
    z_max: 0.05
    z_rand: 0.5
    z_short: 0.05
    scan_topic: scan

amcl_map_client:
  ros__parameters:
    use_sim_time: False

amcl_rclcpp_node:
  ros__parameters:
    use_sim_time: False

bt_navigator:
  ros__parameters:
    use_sim_time: False
    global_frame: map
    robot_base_frame: base_link
    odom_topic: /odom
    default_bt_xml_filename: "navigate_w_replanning_and_recovery.xml"
    bt_loop_duration: 10
    default_server_timeout: 20
    enable_groot_monitoring: True
    groot_zmq_publisher_port: 1666
    groot_zmq_server_port: 1667
    plugin_lib_names:
    - nav2_compute_path_to_pose_action_bt_node
    - nav2_compute_path_through_poses_action_bt_node
    - nav2_follow_path_action_bt_node
    - nav2_back_up_action_bt_node
    - nav2_spin_action_bt_node
    - nav2_wait_action_bt_node
    - nav2_clear_costmap_service_bt_node
    - nav2_is_stuck_condition_bt_node
    - nav2_goal_reached_condition_bt_node
    - nav2_goal_updated_condition_bt_node
    - nav2_initial_pose_received_condition_bt_node
    - nav2_reinitialize_global_localization_service_bt_node
    - nav2_rate_controller_bt_node
    - nav2_distance_controller_bt_node
    - nav2_speed_controller_bt_node
    - nav2_truncate_path_action_bt_node
    - nav2_goal_updater_node_bt_node
    - nav2_recovery_node_bt_node
    - nav2_pipeline_sequence_bt_node
    - nav2_round_robin_node_bt_node
    - nav2_transform_available_condition_bt_node
    - nav2_time_expired_condition_bt_node
    - nav2_distance_traveled_condition_bt_node
    - nav2_single_trigger_bt_node
    - nav2_is_battery_low_condition_bt_node
    - nav2_navigate_through_poses_action_bt_node
    - nav2_navigate_to_pose_action_bt_node
    - nav2_remove_passed_goals_action_bt_node
    - nav2_planner_selector_bt_node
    - nav2_controller_selector_bt_node
    - nav2_goal_checker_selector_bt_node

bt_navigator_rclcpp_node:
  ros__parameters:
    use_sim_time: False

controller_server:
  ros__parameters:
    use_sim_time: False
    controller_frequency: 20.0
    min_x_velocity_threshold: 0.001
    min_y_velocity_threshold: 0.5
    min_theta_velocity_threshold: 0.001
    failure_tolerance: 0.3
    progress_checker_plugin: "progress_checker"
    goal_checker_plugins: ["general_goal_checker"] 
    controller_plugins: ["FollowPath"]

    # Progress checker parameters
    progress_checker:
      plugin: "nav2_controller::SimpleProgressChecker"
      required_movement_radius: 0.5
      movement_time_allowance: 10.0

    general_goal_checker:
      stateful: True
      plugin: "nav2_controller::SimpleGoalChecker"
      xy_goal_tolerance: 0.25
      yaw_goal_tolerance: 0.25

    # DWB parameters
    FollowPath:
      plugin: "dwb_core::DWBLocalPlanner"
      debug_trajectory_details: True
      min_vel_x: 0.0
      min_vel_y: 0.0
      max_vel_x: 0.22
      max_vel_y: 0.0
      max_vel_theta: 1.0
      min_speed_xy: 0.0
      max_speed_xy: 0.22
      min_speed_theta: 0.0
      # Add high threshold velocity for turtlebot 3 issue.
      # https://github.com/ROBOTIS-GIT/turtlebot3_simulations/issues/75
      acc_lim_x: 2.5
      acc_lim_y: 0.0
      acc_lim_theta: 3.2
      decel_lim_x: -2.5
      decel_lim_y: 0.0
      decel_lim_theta: -3.2
      vx_samples: 20
      vy_samples: 5
      vtheta_samples: 20
      sim_time: 1.7
      linear_granularity: 0.05
      angular_granularity: 0.025
      transform_tolerance: 0.2
      xy_goal_tolerance: 0.25
      trans_stopped_velocity: 0.25
      short_circuit_trajectory_evaluation: True
      stateful: True
      critics: ["RotateToGoal", "Oscillation", "BaseObstacle", "GoalAlign", "PathAlign", "PathDist", "GoalDist"]
      BaseObstacle.scale: 0.02
      PathAlign.scale: 32.0
      PathAlign.forward_point_distance: 0.1
      GoalAlign.scale: 24.0
      GoalAlign.forward_point_distance: 0.1
      PathDist.scale: 32.0
      GoalDist.scale: 24.0
      RotateToGoal.scale: 32.0
      RotateToGoal.slowing_factor: 5.0
      RotateToGoal.lookahead_time: -1.0

controller_server_rclcpp_node:
  ros__parameters:
    use_sim_time: False

local_costmap:
  local_costmap:
    ros__parameters:
      update_frequency: 5.0
      publish_frequency: 2.0
      global_frame: odom
      robot_base_frame: base_link
      use_sim_time: False
      rolling_window: true
      width: 3
      height: 3
      resolution: 0.05
      robot_radius: 0.1
      plugins: ["obstacle_layer", "voxel_layer", "inflation_layer"]
      inflation_layer:
        plugin: "nav2_costmap_2d::InflationLayer"
        cost_scaling_factor: 1.0
        inflation_radius: 0.55
      obstacle_layer:
        plugin: "nav2_costmap_2d::ObstacleLayer"
        enabled: True
        observation_sources: scan
        scan:
          topic: /scan
          max_obstacle_height: 2.0
          clearing: True
          marking: True
          data_type: "LaserScan"
      voxel_layer:
        plugin: "nav2_costmap_2d::VoxelLayer"
        enabled: True
        publish_voxel_map: True
        origin_z: 0.0
        z_resolution: 0.05
        z_voxels: 16
        max_obstacle_height: 2.0
        mark_threshold: 0
        observation_sources: scan
        scan:
          topic: /scan
          max_obstacle_height: 2.0
          clearing: True
          marking: True
          data_type: "LaserScan"
          raytrace_max_range: 3.0
          raytrace_min_range: 0.0
          obstacle_max_range: 2.5
          obstacle_min_range: 0.0
      static_layer:
        map_subscribe_transient_local: True
      always_send_full_costmap: True
  local_costmap_client:
    ros__parameters:
      use_sim_time: False
  local_costmap_rclcpp_node:
    ros__parameters:
      use_sim_time: False

global_costmap:
  global_costmap:
    ros__parameters:
      update_frequency: 1.0
      publish_frequency: 1.0
      global_frame: map
      robot_base_frame: base_link
      use_sim_time: True
      robot_radius: 0.1
      resolution: 0.05
      track_unknown_space: true
      plugins: ["static_layer", "obstacle_layer", "voxel_layer", "inflation_layer"]
      obstacle_layer:
        plugin: "nav2_costmap_2d::ObstacleLayer"
        enabled: True
        observation_sources: scan
        scan:
          topic: /scan
          max_obstacle_height: 2.0
          clearing: True
          marking: True
          data_type: "LaserScan"
          raytrace_max_range: 3.0
          raytrace_min_range: 0.0
          obstacle_max_range: 2.5
          obstacle_min_range: 0.0
      voxel_layer:
        plugin: "nav2_costmap_2d::VoxelLayer"
        enabled: True
        publish_voxel_map: True
        origin_z: 0.0
        z_resolution: 0.05
        z_voxels: 16
        max_obstacle_height: 2.0
        mark_threshold: 0
        observation_sources: scan
        scan:
          topic: /scan
          max_obstacle_height: 2.0
          clearing: True
          marking: True
          data_type: "LaserScan"
          raytrace_max_range: 3.0
          raytrace_min_range: 0.0
          obstacle_max_range: 2.5
          obstacle_min_range: 0.0
      static_layer:
        plugin: "nav2_costmap_2d::StaticLayer"
        map_subscribe_transient_local: True
      inflation_layer:
        plugin: "nav2_costmap_2d::InflationLayer"
        cost_scaling_factor: 3.0
        inflation_radius: 0.55
      always_send_full_costmap: True
  global_costmap_client:
    ros__parameters:
      use_sim_time: False
  global_costmap_rclcpp_node:
    ros__parameters:
      use_sim_time: False

map_server:
  ros__parameters:
    use_sim_time: False
    yaml_filename: "map.yaml"

map_saver:
  ros__parameters:
    use_sim_time: False
    save_map_timeout: 5.0
    free_thresh_default: 0.25
    occupied_thresh_default: 0.65
    map_subscribe_transient_local: True

planner_server:
  ros__parameters:
    expected_planner_frequency: 20.0
    use_sim_time: False
    planner_plugins: ["GridBased"]
    GridBased:
      plugin: "nav2_navfn_planner/NavfnPlanner"
      tolerance: 0.5
      use_astar: false
      allow_unknown: true

planner_server_rclcpp_node:
  ros__parameters:
    use_sim_time: False

recoveries_server:
  ros__parameters:
    costmap_topic: local_costmap/costmap_raw
    footprint_topic: local_costmap/published_footprint
    cycle_frequency: 10.0
    recovery_plugins: ["spin", "backup", "wait"]
    spin:
      plugin: "nav2_recoveries/Spin"
    backup:
      plugin: "nav2_recoveries/BackUp"
    wait:
      plugin: "nav2_recoveries/Wait"
    global_frame: odom
    robot_base_frame: base_link
    transform_timeout: 0.1
    use_sim_time: true
    simulate_ahead_time: 2.0
    max_rotational_vel: 1.0
    min_rotational_vel: 0.4
    rotational_acc_lim: 3.2

robot_state_publisher:
  ros__parameters:
    use_sim_time: False

waypoint_follower:
  ros__parameters:
    loop_rate: 2000
    stop_on_failure: false
    waypoint_task_executor_plugin: "wait_at_waypoint"   
    wait_at_waypoint:
      plugin: "nav2_waypoint_follower::WaitAtWaypoint"
      enabled: True
      waypoint_pause_duration: 200

Jan

SteveMacenski commented 2 years ago

Please provide your configuration file. If you're able to compile with debug flags and get a traceback (guide: https://navigation.ros.org/tutorials/docs/get_backtrace.html) that would give us line numbers and see specifically where it was invoked. That will be pretty important if we're going to try to find it.

This sounds similar to an issue that @AlexeyMerzlyakov resolved in the second half of last year, so I think this would be good for him to take a look at once he has the information I requested above since he's familiar with that part of the codebase.

janx8r commented 2 years ago

I added traceback for nav2_voxel_grid and reproduced the error. Above is updated gdb output from planner_server. Apparently it is line 380 in voxel_grid.hpp. The line also looks like it could cause segmentation faults.

I have added my configuration above.

janx8r commented 2 years ago

I made a few test outputs of the calculation of offset (voxel_grid.hpp line 255) and could observe the following situation:

offset(31918) = (unsigned int)min_y0(124) * size_x_(256) + (unsigned int)min_x0(174)
offset(31918) = (unsigned int)min_y0(124) * size_x_(256) + (unsigned int)min_x0(174)
offset(31918) = (unsigned int)min_y0(124) * size_x_(256) + (unsigned int)min_x0(174)
offset(0) = (unsigned int)min_y0(0) * size_x_(256) + (unsigned int)min_x0(0)
--Type <RET> for more, q to quit, c to continue without paging--

Thread 18 "planner_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc5ffb700 (LWP 85777)]
nav2_voxel_grid::VoxelGrid::ClearVoxelInMap::operator() (z_mask=32768, 
    offset=4294967039, this=<synthetischer Zeiger>)
    at /home/jan/social_navigation/webots-menge-ros/ws_galactic/src/navigation2/nav2_voxel_grid/include/nav2_voxel_grid/voxel_
380           *col &= ~(z_mask);  // clear unknown and clear cell

The offset 0 is suddenly 4294967039 after passing it to ClearVoxelInMap::operator(). The bresenham3D function (voxel_grid.hpp line 294) calls ClearVoxelInMap::operator() and passes offset to it.

  inline void bresenham3D(
    ActionType at, OffA off_a, OffB off_b, OffC off_c,
    unsigned int abs_da, unsigned int abs_db, unsigned int abs_dc,
    int error_b, int error_c, int offset_a, int offset_b, int offset_c, unsigned int & offset,
    unsigned int & z_mask, unsigned int max_length = UINT_MAX)
janx8r commented 2 years ago
scale: 1  offset: 30621  offset_dx:  0  offset_dy:  0    dx:  0           dy:  0           x0: 157.741  y0: 119.876  min_x0: 157.741  min_y0: 119.876  x1: 157.868  y1: 119.933  dist: 0.139428
scale: 1  offset: 30621  offset_dx:  0  offset_dy:  0    dx:  0           dy:  0           x0: 157.741  y0: 119.876  min_x0: 157.741  min_y0: 119.876  x1: 157.837  y1: 119.917  dist: 0.104754
scale: 1  offset: 30621  offset_dx:  0  offset_dy:  0    dx:  0           dy:  0           x0: 157.741  y0: 119.876  min_x0: 157.741  min_y0: 119.876  x1: 157.826  y1: 119.91   dist: 0.092021
scale: 1  offset: 30621  offset_dx:  0  offset_dy:  0    dx:  0           dy:  0           x0: 157.741  y0: 119.876  min_x0: 157.741  min_y0: 119.876  x1: 157.769  y1: 119.887  dist: 0.030343
scale: 1  offset: 0      offset_dx: -1  offset_dy: -256  dx: -2147483491  dy: -2147483529  x0: 157.741  y0: 119.876  min_x0: -nan     min_y0: -nan     x1: 157.741  y1: 119.876  dist: 0
--Type <RET> for more, q to quit, c to continue without paging--

Thread 18 "planner_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc1ffb700 (LWP 146808)]
nav2_voxel_grid::VoxelGrid::ClearVoxelInMap::operator() (z_mask=32768, 
    offset=4294967039, this=<synthetischer Zeiger>)
    at /home/jan/social_navigation/webots-menge-ros/ws_galactic/src/navigation2/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp:389
389           *col &= ~(z_mask);  // clear unknown and clear cell

After examining the variables shown above at runtime, I was able to find the error. In the function raytraceLine (voxel_grid.hpp) dist is calculated. dist can become 0 in rare cases. As it is divided by dist, min_x0 and min_y0 then become NaN. Subsequently, this causes the offset for accessing the data to be calculated incorrectly, resulting in a Segmentation Fault. I suggest to change lines 235-240 in voxel_grid.hpp

    double scale = std::min(1.0, max_length / dist);

    // Updating starting point to the point at distance min_length from the initial point
    double min_x0 = x0 + (x1 - x0) / dist * min_length;
    double min_y0 = y0 + (y1 - y0) / dist * min_length;
    double min_z0 = z0 + (z1 - z0) / dist * min_length;

to:

    double scale, min_x0, min_y0, min_z0;
    if(dist > 0.0){
      scale = std::min(1.0, max_length / dist);

      // Updating starting point to the point at distance min_length from the initial point
      min_x0 = x0 + (x1 - x0) / dist * min_length;
      min_y0 = y0 + (y1 - y0) / dist * min_length;
      min_z0 = z0 + (z1 - z0) / dist * min_length;
    }
    else{
      scale = 1.0;
      min_x0 = x0;
      min_y0 = y0;
      min_z0 = z0;
    }

This ensures that for the rare case dist == 0, the pixel containing the start and end point of the line is processed (i.e. marked or cleared depending on the context)

SteveMacenski commented 2 years ago

That sounds like a very reasonable solution, can you submit a PR? Alexey is out currently sick so he's not able to review right now, but once he's back, I want him to give it just a glance over to make sure he's OK with it. But otherwise, this looks good to me!

The only major thing I want him to look at is what the value of scale should be (e.g. 1.0 or 0.0)

janx8r commented 2 years ago

I think that the value of scale doesn't matter in this special case. scale is only multiplied by abs_dx, abs_dy or abs_dz after that, which are always 0 when dist == 0. I have decided for scale = 1.0 because

scale = std::min(1.0, max_length / dist);

with dist -> 0 would result in a 1.0.

AlexeyMerzlyakov commented 2 years ago

Apologies for my late reply. I've checked the case. Indeed: this is corner case where initial and final points are the same for raytraceLine(p0, p1) which is not taken into account in its algorithms (by code it is always implied that (x0,y0,z0) and (x1,y1,z1) are different, so we never have had expected NAN-s there). I've checked the patch proposed @janx8r: we can not just ignore the bresenham3D() for same point, but need to handle this correctly as made above. This point will be marked in that case. So, the patch proposed looks good to me.