Closed m2-farzan closed 2 years ago
A PR would be appreciated, I haven't run into this yet. I'd try to debug from those changes what the issue may be. This would be relatively low priority for us right now from the maintainer's side, but would be more than happy to review a PR or help work through the problem with you
I'm actually testing with the voxel layer right now on Galactic binaries and haven't run into an issue, are you sure you've enabled a clearing source? That is not default enabled. In the situation that one is not enabled, it shouldn't crash, so we should fix that, but I think that may be more the root of your issue.
Edit: tested with that off too, still don't see any issue
We could check if the width * height = 0, if so, pass, what do you think?
Edit: from https://github.com/ros-planning/navigation2/blob/main/nav2_costmap_2d/plugins/voxel_layer.cpp#L309-L311 it should already be doing that.
I'm looking over this and really not seeing where the possible issue is unless there's been modifications to the software.
The point of this part of the code is to populate clearingendpoints so of course it should be empty at that point. You could remove the if()
around this block https://github.com/ros-planning/navigation2/blob/galactic/nav2_costmap_2d/plugins/voxel_layer.cpp#L336-L342 and would that help? There's no reason that that only needs to be done if publishing, so if running that code in general fixes the problem, we're not doing any harm by changing that.
Have you tried this on Ubuntu or some supported ROS 2 OS? I wonder if that's at play here, that should definitely not be optimized out so I wonder if there's a bug somewhere since Arch is usually a little... buggy by nature.
I did not explicitly enable a clearing source. However, since tb3_simulation_launch.py loads nav2_params.yaml, there is actually a clearing source enabled and it's /scan
. I verified this by inspecting the params tmp file and also gdb breakpoint at obstacle_layer.cpp:198.
We could check if the width * height = 0, if so, pass, what do you think?
There is actually a check equivalent to width * height = 0 for clearing_observation.cloud_
in voxel_layer.cpp:309 but in my case, gdb shows that clearing_observation.cloud_->height=1
and clearing_observation.cloud_->width=325
. Full gdb info I got at line 309:
(gdb) print *clearing_observation.cloud_
$4 = {header = {stamp = {sec = 9, nanosec = 0}, frame_id = "odom"}, height = 1, width = 325,
fields = std::vector of length 5, capacity 5 = {{name = "x", offset = 0, datatype = 7 '\a', count = 1}, {name = "y", offset = 4,
datatype = 7 '\a', count = 1}, {name = "z", offset = 8, datatype = 7 '\a', count = 1}, {name = "intensity", offset = 12,
datatype = 7 '\a', count = 1}, {name = "index", offset = 16, datatype = 5 '\005', count = 1}}, is_bigendian = false,
point_step = 20, row_step = 6500, data = std::vector of length 6500, capacity 6500 = {96 '`', 121 'y', 126 '~', 63 '?',
204 '\314', 50 '2', 250 '\372', 189 '\275', 29 '\035', 81 'Q', 11 '\v', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000',
7 '\a', 0 '\000', 0 '\000', 0 '\000', 144 '\220', 16 '\020', 116 't', 63 '?', 132 '\204', 127 '\177', 150 '\226', 189 '\275',
0 '\000', 98 'b', 11 '\v', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 8 '\b', 0 '\000', 0 '\000', 0 '\000', 188 '\274',
255 '\377', 111 'o', 63 '?', 48 '0', 145 '\221', 180 '\264', 188 '\274', 194 '\302', 104 'h', 11 '\v', 62 '>', 0 '\000',
0 '\000', 0 '\000', 0 '\000', 9 '\t', 0 '\000', 0 '\000', 0 '\000', 192 '\300', 253 '\375', 54 '6', 189 '\275', 158 '\236',
142 '\216', 17 '\021', 190 '\276', 65 'A', 250 '\372', 12 '\f', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 10 '\n',
0 '\000', 0 '\000', 0 '\000', 0 '\000', 183 '\267', 197 '\305', 189 '\275', 68 'D', 4 '\004', 237 '\355', 189 '\275',
158 '\236', 15 '\017', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 11 '\v', 0 '\000', 0 '\000', 0 '\000',
208 '\320', 224 '\340', 250 '\372', 189 '\275', 40 '(', 242 '\362', 174 '\256', 189 '\275', 101 'e', 26 '\032', 13 '\r',
62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 12 '\f', 0 '\000', 0 '\000', 0 '\000', 72 'H', 72 'H', 22 '\026', 190 '\276',
184 '\270', 240 '\360', 98 'b', 189 '\275', 122 'z', 36 '$', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 13 '\r',
0 '\000', 0 '\000', 0 '\000', 240 '\360', 254 '\376', 20 '\024', 190 '\276', 64 '@', 154 '\232', 160 '\240', 188 '\274',
36 '$', 36 '$', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 14 '\016', 0 '\000', 0 '\000', 0 '\000', 160 '\240',
222 '\336', 22 '\026', 190 '\276', 128 '\200', 208 '\320', 128 '\200', 60 '<', 16 '\020', 37 '%', 13 '\r', 62 '>', 0 '\000',
0 '\000', 0 '\000', 0 '\000', 15 '\017', 0 '\000', 0 '\000', 0 '\000', 24 '\030', 20 '\024', 13 '\r', 190 '\276', 112 'p',
168 '\250', 95 '_', 61 '=', 87 'W', 33 '!', 13 '\r', 62 '>', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 16 '\020', 0 '\000',
0 '\000', 0 '\000'...}, is_dense = false}
So the problem is that clearing_endpoints_
is empty (not clearing_observation.cloud_
) and I admittedly couldn't figure out what the purpose of clearingendpoints is, but I think it's evident that should publish_clearing_points
be false, clearing_endpoints_
will be empty and voxel_layer.cpp:350 will crash.
It surprises me that Ubuntu binaries don't show that problem, and maybe that's a hint. BTW, Did you exactly run ros2 launch nav2_bringup tb3_simulation_launch.py slam:=True
and it worked, or you tested voxel_layers independently?
Have you tried this on Ubuntu or some supported ROS 2 OS? I wonder if that's at play here, that should definitely not be optimized out so I wonder if there's a bug somewhere since Arch is usually a little... buggy by nature.
Yes, that optimized out is still a mystery (gcc bug?)... I'll try to see if I can reproduce this on Ubuntu or older versions of gcc...
clearing_endpoints_
should be empty prior, its being populated in this process in order to publish for debug. It's not something used in the algorithm, its for tuning and debugging.
I have a different simulation setup with more sensors on a different robot as part of another project right now, but I test daily with the launch file you mention and have never experienced it.
Try removing the if()
block here (e.g. keep interior code, just remove the if branching it): https://github.com/ros-planning/navigation2/blob/a9c55026c2cda08a44e7eeff895db72c8823ae84/nav2_costmap_2d/plugins/voxel_layer.cpp#L336-L342, that shouldn't do anything, but if the compiler is optimizing it out, lets just stop it from doing that. It doesn't do the program any harm to run that anyway, honestly it would probably be good to be complete.
Else - I think this sounds like an Arch / new buggy compiler version problem. I'd recommend trying on another compute platform like Ubuntu just to sanity check
Did you get a chance to test on another platform / compiler / version?
Working on it... I'll report in a few days.
I was inadvertently passing CXXFLAGS="-Wp,-D_GLIBCXX_ASSERTIONS"
to the compiler. Not passing that flag solves the problem (or at least hides it under the rug).
I observed the issue when I was trying to build an Arch package. Arch's package builder (makepkg) sets some compiler flags automatically. It was one of those flags that caused the problem. I should have remembered to at least mention the flags. Sorry for that.
I tested on Ubuntu 20.04 (gcc 9.3). The error can be reproduced there too, if and only if contmap_2d
is built with CXXFLAGS="-Wp,-D_GLIBCXX_ASSERTIONS"
. Without the flag, everything works fine both on Ubuntu and Arch.
_GLIBCXX_ASSERTIONS
is revealing a problem. The error occurs at std::vector::front
because the vector is empty (Assertion '__builtin_expect(!this->empty(), true)' failed
). It happens to be fine on glibc with assertions disabled, but calling front
on empty vector generally causes undefined behavior (see this). It can be fixed by the simple solution you mentioned (removing if
around lines 337-341) and I'd be happy to create a PR if we should take this approach.<optimized out>
mysteryI didn't get <optimized out>
on Ubuntu (the value of publish_clearing_points
was false
there). Also, I think it's not a problem even on Arch: Turns out that <optimized out>
does not necessarily mean that the value was statically calculated. It could also mean that the compiler has decided to store the value elsewhere, or it has reordered the instructions. To make sure that the boolean expression is correctly evaluated at runtime, I created a subscriber to be counted (ros2 topic echo local_costmap/clearing_endpoints sensor_msgs/msg/PointCloud2
), and verified that the branch inside if
block was executed.
It can be fixed by the simple solution you mentioned (removing if around lines 337-341) and I'd be happy to create a PR if we should take this approach.
I'm totally happy with that, please submit a PR. No reason to keep something around that might cause issues to future Arch users if that's a default flag.
Bug report
Required Info:
Steps to reproduce issue
Expected behavior
Launch a gazebo simulation with turtlebot3 in a closed environment, with Nav2 running in slam mode. User should be able to mark "2D pose estimate" and "Nav2 Goal" in rviz, and the robot should move to target position.
Actual behavior
nav2_controller node fails at startup with no error message. As a consequence, "Goal was rejected by server" is printed when user selects a goal in rviz
Additional information
I recompiled everything with debug info (
-DCMAKE_BUILD_TYPE=RelWithDebInfo
) and got this backtrace:So if I'm not wrong, the problem is that in this line: https://github.com/ros-planning/navigation2/blob/7befc4975d3df4e252677a7a2b7c1d6c783fbd20/nav2_costmap_2d/plugins/voxel_layer.cpp#L350
*clearing_endpoints_
contains no data.Running again and stepping through these lines: https://github.com/ros-planning/navigation2/blob/7befc4975d3df4e252677a7a2b7c1d6c783fbd20/nav2_costmap_2d/plugins/voxel_layer.cpp#L326-L342
I found that the code inside that
if
branch never runs. In fact, when I runprint publish_clearing_points
, gdb says that it'soptimized out
.I tried reverting aeccbe7a931b2820930eeb7cbd6f62af2c6f5c9 and the problem was gone. I could set a goal for the robot and it moved to that point just fine. Maybe that gives us a clue.