Closed janx8r closed 2 years ago
Please provide your configuration file. If you're able to compile with debug flags and get a traceback (guide: https://navigation.ros.org/tutorials/docs/get_backtrace.html) that would give us line numbers and see specifically where it was invoked. That will be pretty important if we're going to try to find it.
This sounds similar to an issue that @AlexeyMerzlyakov resolved in the second half of last year, so I think this would be good for him to take a look at once he has the information I requested above since he's familiar with that part of the codebase.
I added traceback for nav2_voxel_grid and reproduced the error. Above is updated gdb output from planner_server. Apparently it is line 380 in voxel_grid.hpp. The line also looks like it could cause segmentation faults.
I have added my configuration above.
I made a few test outputs of the calculation of offset (voxel_grid.hpp line 255) and could observe the following situation:
offset(31918) = (unsigned int)min_y0(124) * size_x_(256) + (unsigned int)min_x0(174)
offset(31918) = (unsigned int)min_y0(124) * size_x_(256) + (unsigned int)min_x0(174)
offset(31918) = (unsigned int)min_y0(124) * size_x_(256) + (unsigned int)min_x0(174)
offset(0) = (unsigned int)min_y0(0) * size_x_(256) + (unsigned int)min_x0(0)
--Type <RET> for more, q to quit, c to continue without paging--
Thread 18 "planner_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc5ffb700 (LWP 85777)]
nav2_voxel_grid::VoxelGrid::ClearVoxelInMap::operator() (z_mask=32768,
offset=4294967039, this=<synthetischer Zeiger>)
at /home/jan/social_navigation/webots-menge-ros/ws_galactic/src/navigation2/nav2_voxel_grid/include/nav2_voxel_grid/voxel_
380 *col &= ~(z_mask); // clear unknown and clear cell
The offset 0 is suddenly 4294967039 after passing it to ClearVoxelInMap::operator(). The bresenham3D function (voxel_grid.hpp line 294) calls ClearVoxelInMap::operator() and passes offset to it.
inline void bresenham3D(
ActionType at, OffA off_a, OffB off_b, OffC off_c,
unsigned int abs_da, unsigned int abs_db, unsigned int abs_dc,
int error_b, int error_c, int offset_a, int offset_b, int offset_c, unsigned int & offset,
unsigned int & z_mask, unsigned int max_length = UINT_MAX)
scale: 1 offset: 30621 offset_dx: 0 offset_dy: 0 dx: 0 dy: 0 x0: 157.741 y0: 119.876 min_x0: 157.741 min_y0: 119.876 x1: 157.868 y1: 119.933 dist: 0.139428
scale: 1 offset: 30621 offset_dx: 0 offset_dy: 0 dx: 0 dy: 0 x0: 157.741 y0: 119.876 min_x0: 157.741 min_y0: 119.876 x1: 157.837 y1: 119.917 dist: 0.104754
scale: 1 offset: 30621 offset_dx: 0 offset_dy: 0 dx: 0 dy: 0 x0: 157.741 y0: 119.876 min_x0: 157.741 min_y0: 119.876 x1: 157.826 y1: 119.91 dist: 0.092021
scale: 1 offset: 30621 offset_dx: 0 offset_dy: 0 dx: 0 dy: 0 x0: 157.741 y0: 119.876 min_x0: 157.741 min_y0: 119.876 x1: 157.769 y1: 119.887 dist: 0.030343
scale: 1 offset: 0 offset_dx: -1 offset_dy: -256 dx: -2147483491 dy: -2147483529 x0: 157.741 y0: 119.876 min_x0: -nan min_y0: -nan x1: 157.741 y1: 119.876 dist: 0
--Type <RET> for more, q to quit, c to continue without paging--
Thread 18 "planner_server" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffc1ffb700 (LWP 146808)]
nav2_voxel_grid::VoxelGrid::ClearVoxelInMap::operator() (z_mask=32768,
offset=4294967039, this=<synthetischer Zeiger>)
at /home/jan/social_navigation/webots-menge-ros/ws_galactic/src/navigation2/nav2_voxel_grid/include/nav2_voxel_grid/voxel_grid.hpp:389
389 *col &= ~(z_mask); // clear unknown and clear cell
After examining the variables shown above at runtime, I was able to find the error. In the function raytraceLine (voxel_grid.hpp) dist is calculated. dist can become 0 in rare cases. As it is divided by dist, min_x0 and min_y0 then become NaN. Subsequently, this causes the offset for accessing the data to be calculated incorrectly, resulting in a Segmentation Fault. I suggest to change lines 235-240 in voxel_grid.hpp
double scale = std::min(1.0, max_length / dist);
// Updating starting point to the point at distance min_length from the initial point
double min_x0 = x0 + (x1 - x0) / dist * min_length;
double min_y0 = y0 + (y1 - y0) / dist * min_length;
double min_z0 = z0 + (z1 - z0) / dist * min_length;
to:
double scale, min_x0, min_y0, min_z0;
if(dist > 0.0){
scale = std::min(1.0, max_length / dist);
// Updating starting point to the point at distance min_length from the initial point
min_x0 = x0 + (x1 - x0) / dist * min_length;
min_y0 = y0 + (y1 - y0) / dist * min_length;
min_z0 = z0 + (z1 - z0) / dist * min_length;
}
else{
scale = 1.0;
min_x0 = x0;
min_y0 = y0;
min_z0 = z0;
}
This ensures that for the rare case dist == 0, the pixel containing the start and end point of the line is processed (i.e. marked or cleared depending on the context)
That sounds like a very reasonable solution, can you submit a PR? Alexey is out currently sick so he's not able to review right now, but once he's back, I want him to give it just a glance over to make sure he's OK with it. But otherwise, this looks good to me!
The only major thing I want him to look at is what the value of scale
should be (e.g. 1.0 or 0.0)
I think that the value of scale doesn't matter in this special case. scale is only multiplied by abs_dx, abs_dy or abs_dz after that, which are always 0 when dist == 0. I have decided for scale = 1.0 because
scale = std::min(1.0, max_length / dist);
with dist -> 0 would result in a 1.0.
Apologies for my late reply. I've checked the case. Indeed: this is corner case where initial and final points are the same for raytraceLine(p0, p1)
which is not taken into account in its algorithms (by code it is always implied that (x0,y0,z0)
and (x1,y1,z1)
are different, so we never have had expected NAN-s there). I've checked the patch proposed @janx8r: we can not just ignore the bresenham3D()
for same point, but need to handle this correctly as made above. This point will be marked in that case. So, the patch proposed looks good to me.
Bug report
Required Info:
Steps to reproduce issue
I use the package turtlebot3_navigation2 to let a simulated turtlebot navigate in an environment with many dynamic obstacles (simulated humans). In long simulations, where after reaching the goal a new goal is set again and again, this error happens sometimes (about once in 10 tries). Subjectively, the error usually happens when many dynamic obstacles come close to the robot.
Expected behavior
The planner_server should not crash when clearing the costmap.
Actual behavior
In various situations (but probably not all situations) where a costmap (both local and global) is to be cleared, planner_server crashes. 4s later all nodes bonded to it are stopped. The robot does not move anymore.
Additional information
Here are the outputs from all nodes except planner_server:
Here are the gdb outputs of planner_server:
It seems that the raytraceLine function triggers a segmentation fault in rare cases.
My configuration file is burger.yaml from ros-galactic-turtlebot3-navigation2 (source apt):
Jan