ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.54k stars 1.28k forks source link

costmap_2d::VoxelLayer::raytraceFreeSpace fails with because clearing_endpoints_ is empty #2692

Closed m2-farzan closed 2 years ago

m2-farzan commented 2 years ago

Bug report

Required Info:

Steps to reproduce issue

ros2 launch nav2_bringup tb3_simulation_launch.py slam:=True

Expected behavior

Launch a gazebo simulation with turtlebot3 in a closed environment, with Nav2 running in slam mode. User should be able to mark "2D pose estimate" and "Nav2 Goal" in rviz, and the robot should move to target position.

Actual behavior

nav2_controller node fails at startup with no error message. As a consequence, "Goal was rejected by server" is printed when user selects a goal in rviz

Additional information

I recompiled everything with debug info (-DCMAKE_BUILD_TYPE=RelWithDebInfo) and got this backtrace:

#0  0x00007ffff74a0d22 in raise () from /usr/lib/libc.so.6
#1  0x00007ffff748a862 in abort () from /usr/lib/libc.so.6
#2  0x00005555555be5fa in std::__replacement_assert (__file=<optimized out>, 
    __line=<optimized out>, __function=<optimized out>, __condition=<optimized out>)
    at /usr/include/c++/11.1.0/x86_64-pc-linux-gnu/bits/c++config.h:504
#3  0x00007ffff7c6dcf8 in std::vector<unsigned char, std::allocator<unsigned char> >::front (
    this=0x7fffb00042c8) at /usr/include/c++/11.1.0/bits/stl_vector.h:1121
#4  std::vector<unsigned char, std::allocator<unsigned char> >::front (this=0x7fffb00042c8)
    at /usr/include/c++/11.1.0/bits/stl_vector.h:1121
#5  sensor_msgs::impl::PointCloud2IteratorBase<float, float, unsigned char, sensor_msgs::msg::PointCloud2_<std::allocator<void> >, sensor_msgs::PointCloud2Iterator>::PointCloud2IteratorBase (this=0x7fffc1ff7fe0, cloud_msg=..., field_name=...)
    at /opt/ros2/galactic/include/sensor_msgs/impl/point_cloud2_iterator.hpp:275
#6  0x00007fffe476d80f in sensor_msgs::PointCloud2Iterator<float>::PointCloud2Iterator (
    field_name="x", cloud_msg=..., this=0x7fffc1ff7fe0)
    at /opt/ros2/galactic/include/sensor_msgs/point_cloud2_iterator.hpp:295
#7  nav2_costmap_2d::VoxelLayer::raytraceFreespace (this=0x555555892980, 
    clearing_observation=..., min_x=0x55555584cf78, min_y=0x55555584cf80, 
    max_x=0x55555584cf88, max_y=0x55555584cf90)
    at /home/mostafa/navigation2/nav2_costmap_2d/plugins/voxel_layer.cpp:350
#8  0x00007fffe4770fb6 in nav2_costmap_2d::VoxelLayer::updateBounds (this=0x555555892980, 
    robot_x=-1.9999543006678138, robot_y=-0.50000042315649595, 
    robot_yaw=0.00053837867434282133, min_x=0x55555584cf78, min_y=0x55555584cf80, 
    max_x=0x55555584cf88, max_y=0x55555584cf90)
    at /home/mostafa/navigation2/nav2_costmap_2d/plugins/voxel_layer.cpp:162
#9  0x00007ffff7bfd3f6 in nav2_costmap_2d::LayeredCostmap::updateMap (this=0x55555584ced0, 
    robot_x=-1.9999543006678138, robot_y=-0.50000042315649595, 
    robot_yaw=robot_yaw@entry=0.00053837867434282133)
    at /home/mostafa/navigation2/nav2_costmap_2d/src/layered_costmap.cpp:161
#10 0x00007ffff7c07bab in nav2_costmap_2d::Costmap2DROS::updateMap (this=0x5555557713b0)
    at /home/mostafa/navigation2/nav2_costmap_2d/src/costmap_2d_ros.cpp:458
#11 0x00007ffff7c08098 in nav2_costmap_2d::Costmap2DROS::mapUpdateLoop (this=0x5555557713b0, 
    frequency=<optimized out>)
    at /home/mostafa/navigation2/nav2_costmap_2d/src/costmap_2d_ros.cpp:412
#12 0x00007ffff78623c4 in std::execute_native_thread_routine (__p=0x555555ac9c40)
    at /build/gcc/src/gcc/libstdc++-v3/src/c++11/thread.cc:82
#13 0x00007ffff79cc259 in start_thread () from /usr/lib/libpthread.so.0
#14 0x00007ffff75625e3 in clone () from /usr/lib/libc.so.6

So if I'm not wrong, the problem is that in this line: https://github.com/ros-planning/navigation2/blob/7befc4975d3df4e252677a7a2b7c1d6c783fbd20/nav2_costmap_2d/plugins/voxel_layer.cpp#L350

*clearing_endpoints_ contains no data.

Running again and stepping through these lines: https://github.com/ros-planning/navigation2/blob/7befc4975d3df4e252677a7a2b7c1d6c783fbd20/nav2_costmap_2d/plugins/voxel_layer.cpp#L326-L342

I found that the code inside that if branch never runs. In fact, when I run print publish_clearing_points, gdb says that it's optimized out.

I tried reverting aeccbe7a931b2820930eeb7cbd6f62af2c6f5c9 and the problem was gone. I could set a goal for the robot and it moved to that point just fine. Maybe that gives us a clue.

SteveMacenski commented 2 years ago

A PR would be appreciated, I haven't run into this yet. I'd try to debug from those changes what the issue may be. This would be relatively low priority for us right now from the maintainer's side, but would be more than happy to review a PR or help work through the problem with you

SteveMacenski commented 2 years ago

I'm actually testing with the voxel layer right now on Galactic binaries and haven't run into an issue, are you sure you've enabled a clearing source? That is not default enabled. In the situation that one is not enabled, it shouldn't crash, so we should fix that, but I think that may be more the root of your issue.

Edit: tested with that off too, still don't see any issue

SteveMacenski commented 2 years ago

We could check if the width * height = 0, if so, pass, what do you think?

Edit: from https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/plugins/voxel_layer.cpp#L309-L311 it should already be doing that.

I'm looking over this and really not seeing where the possible issue is unless there's been modifications to the software.

The point of this part of the code is to populate clearingendpoints so of course it should be empty at that point. You could remove the if() around this block https://github.com/ros-planning/navigation2/blob/galactic/nav2_costmap_2d/plugins/voxel_layer.cpp#L336-L342 and would that help? There's no reason that that only needs to be done if publishing, so if running that code in general fixes the problem, we're not doing any harm by changing that.

SteveMacenski commented 2 years ago

Have you tried this on Ubuntu or some supported ROS 2 OS? I wonder if that's at play here, that should definitely not be optimized out so I wonder if there's a bug somewhere since Arch is usually a little... buggy by nature.

m2-farzan commented 2 years ago

I did not explicitly enable a clearing source. However, since tb3_simulation_launch.py loads nav2_params.yaml, there is actually a clearing source enabled and it's /scan. I verified this by inspecting the params tmp file and also gdb breakpoint at obstacle_layer.cpp:198.

We could check if the width * height = 0, if so, pass, what do you think?

There is actually a check equivalent to width * height = 0 for clearing_observation.cloud_ in voxel_layer.cpp:309 but in my case, gdb shows that clearing_observation.cloud_->height=1 and clearing_observation.cloud_->width=325. Full gdb info I got at line 309:

(gdb) print *clearing_observation.cloud_
$4 = {header = {stamp = {sec = 9, nanosec = 0}, frame_id = "odom"}, height = 1, width = 325, 
  fields = std::vector of length 5, capacity 5 = {{name = "x", offset = 0, datatype = 7 '\a', count = 1}, {name = "y", offset = 4, 
      datatype = 7 '\a', count = 1}, {name = "z", offset = 8, datatype = 7 '\a', count = 1}, {name = "intensity", offset = 12, 
      datatype = 7 '\a', count = 1}, {name = "index", offset = 16, datatype = 5 '\005', count = 1}}, is_bigendian = false, 
  point_step = 20, row_step = 6500, data = std::vector of length 6500, capacity 6500 = {96 '`', 121 'y', 126 '~', 63 '?', 
    204 '\314', 50 '2', 250 '\372', 189 '\275', 29 '\035', 81 'Q', 11 '\v', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 
    7 '\a', 0 '\000', 0 '\000', 0 '\000', 144 '\220', 16 '\020', 116 't', 63 '?', 132 '\204', 127 '\177', 150 '\226', 189 '\275', 
    0 '\000', 98 'b', 11 '\v', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 8 '\b', 0 '\000', 0 '\000', 0 '\000', 188 '\274', 
    255 '\377', 111 'o', 63 '?', 48 '0', 145 '\221', 180 '\264', 188 '\274', 194 '\302', 104 'h', 11 '\v', 62 '>', 0 '\000', 
    0 '\000', 0 '\000', 0 '\000', 9 '\t', 0 '\000', 0 '\000', 0 '\000', 192 '\300', 253 '\375', 54 '6', 189 '\275', 158 '\236', 
    142 '\216', 17 '\021', 190 '\276', 65 'A', 250 '\372', 12 '\f', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 10 '\n', 
    0 '\000', 0 '\000', 0 '\000', 0 '\000', 183 '\267', 197 '\305', 189 '\275', 68 'D', 4 '\004', 237 '\355', 189 '\275', 
    158 '\236', 15 '\017', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 11 '\v', 0 '\000', 0 '\000', 0 '\000', 
    208 '\320', 224 '\340', 250 '\372', 189 '\275', 40 '(', 242 '\362', 174 '\256', 189 '\275', 101 'e', 26 '\032', 13 '\r', 
    62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 12 '\f', 0 '\000', 0 '\000', 0 '\000', 72 'H', 72 'H', 22 '\026', 190 '\276', 
    184 '\270', 240 '\360', 98 'b', 189 '\275', 122 'z', 36 '$', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 13 '\r', 
    0 '\000', 0 '\000', 0 '\000', 240 '\360', 254 '\376', 20 '\024', 190 '\276', 64 '@', 154 '\232', 160 '\240', 188 '\274', 
    36 '$', 36 '$', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 14 '\016', 0 '\000', 0 '\000', 0 '\000', 160 '\240', 
    222 '\336', 22 '\026', 190 '\276', 128 '\200', 208 '\320', 128 '\200', 60 '<', 16 '\020', 37 '%', 13 '\r', 62 '>', 0 '\000', 
    0 '\000', 0 '\000', 0 '\000', 15 '\017', 0 '\000', 0 '\000', 0 '\000', 24 '\030', 20 '\024', 13 '\r', 190 '\276', 112 'p', 
    168 '\250', 95 '_', 61 '=', 87 'W', 33 '!', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 16 '\020', 0 '\000', 
    0 '\000', 0 '\000'...}, is_dense = false}

So the problem is that clearing_endpoints_ is empty (not clearing_observation.cloud_) and I admittedly couldn't figure out what the purpose of clearingendpoints is, but I think it's evident that should publish_clearing_points be false, clearing_endpoints_ will be empty and voxel_layer.cpp:350 will crash.

It surprises me that Ubuntu binaries don't show that problem, and maybe that's a hint. BTW, Did you exactly run ros2 launch nav2_bringup tb3_simulation_launch.py slam:=True and it worked, or you tested voxel_layers independently?

m2-farzan commented 2 years ago

Have you tried this on Ubuntu or some supported ROS 2 OS? I wonder if that's at play here, that should definitely not be optimized out so I wonder if there's a bug somewhere since Arch is usually a little... buggy by nature.

Yes, that optimized out is still a mystery (gcc bug?)... I'll try to see if I can reproduce this on Ubuntu or older versions of gcc...

SteveMacenski commented 2 years ago

clearing_endpoints_ should be empty prior, its being populated in this process in order to publish for debug. It's not something used in the algorithm, its for tuning and debugging.

I have a different simulation setup with more sensors on a different robot as part of another project right now, but I test daily with the launch file you mention and have never experienced it.

Try removing the if() block here (e.g. keep interior code, just remove the if branching it): https://github.com/ros-planning/navigation2/blob/a9c55026c2cda08a44e7eeff895db72c8823ae84/nav2_costmap_2d/plugins/voxel_layer.cpp#L336-L342, that shouldn't do anything, but if the compiler is optimizing it out, lets just stop it from doing that. It doesn't do the program any harm to run that anyway, honestly it would probably be good to be complete.

Else - I think this sounds like an Arch / new buggy compiler version problem. I'd recommend trying on another compute platform like Ubuntu just to sanity check

SteveMacenski commented 2 years ago

Did you get a chance to test on another platform / compiler / version?

m2-farzan commented 2 years ago

Working on it... I'll report in a few days.

m2-farzan commented 2 years ago

Summary

I was inadvertently passing CXXFLAGS="-Wp,-D_GLIBCXX_ASSERTIONS" to the compiler. Not passing that flag solves the problem (or at least hides it under the rug).

Context

I observed the issue when I was trying to build an Arch package. Arch's package builder (makepkg) sets some compiler flags automatically. It was one of those flags that caused the problem. I should have remembered to at least mention the flags. Sorry for that.

Testing on Ubuntu

I tested on Ubuntu 20.04 (gcc 9.3). The error can be reproduced there too, if and only if contmap_2d is built with CXXFLAGS="-Wp,-D_GLIBCXX_ASSERTIONS". Without the flag, everything works fine both on Ubuntu and Arch.

Possible Actions

  1. Conservative approach: Everything works with default settings. Don't touch it if it works.
  2. Purist approach: _GLIBCXX_ASSERTIONS is revealing a problem. The error occurs at std::vector::front because the vector is empty (Assertion '__builtin_expect(!this->empty(), true)' failed). It happens to be fine on glibc with assertions disabled, but calling front on empty vector generally causes undefined behavior (see this). It can be fixed by the simple solution you mentioned (removing if around lines 337-341) and I'd be happy to create a PR if we should take this approach.

<optimized out> mystery

I didn't get <optimized out> on Ubuntu (the value of publish_clearing_points was false there). Also, I think it's not a problem even on Arch: Turns out that <optimized out> does not necessarily mean that the value was statically calculated. It could also mean that the compiler has decided to store the value elsewhere, or it has reordered the instructions. To make sure that the boolean expression is correctly evaluated at runtime, I created a subscriber to be counted (ros2 topic echo local_costmap/clearing_endpoints sensor_msgs/msg/PointCloud2), and verified that the branch inside if block was executed.

SteveMacenski commented 2 years ago

It can be fixed by the simple solution you mentioned (removing if around lines 337-341) and I'd be happy to create a PR if we should take this approach.

I'm totally happy with that, please submit a PR. No reason to keep something around that might cause issues to future Arch users if that's a default flag.