ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

Fast-RTPS services and network discovery regression (Local costmap not appearing or clear costmap not called) #1772

Closed dkuenster closed 4 years ago

dkuenster commented 4 years ago

Bug report

Required Info:

Steps to reproduce issue

Use tb3_simulation_launch.py to start the Gazebo simulation and the Nav2 stack. Then use "2D Pose Estimate" to localize the robot.

Expected behavior

Local Costmap should show every time, e.g: correct_launch

Actual behavior

In more than 50% of all initializations the Local Costmap doesn't show, eg: faulty_launch

Echoing /local_costmap/costmap shows that the costmap contains all zeroes, even though the robot is in the same position as in the working case, where the costmap contains actual values. Rviz doesn't report any issues with the topics.
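
As a rough illustration of the "all zeroes" check, here is a minimal sketch that summarizes a flat list of cell values, such as the data field of a nav_msgs/OccupancyGrid message copied out of the echo output. The function name and the sample grids are hypothetical and not taken from the attached logs.

```python
from collections import Counter


def summarize_costmap(data):
    """Summarize a flat, row-major list of costmap cell values."""
    counts = Counter(data)
    total = len(data)
    nonzero = total - counts.get(0, 0)
    return {
        "cells": total,
        "nonzero_cells": nonzero,
        # An empty list is reported as not all-zero so that it stands out.
        "all_zero": total > 0 and nonzero == 0,
        "max_cost": max(data) if data else None,
    }


if __name__ == "__main__":
    # Hypothetical 3x3 grids: one like the faulty launch, one populated.
    faulty_like = [0] * 9
    working_like = [0, 0, 100, 0, 50, 0, 0, 0, 0]
    print(summarize_costmap(faulty_like))
    print(summarize_costmap(working_like))
```

A non-zero nonzero_cells count in the working case and all_zero being true in the faulty case would match what the echo output shows.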

Additional information

console_output_empty_costmap.txt console_output_working_costmap.txt

I can't find any differences or errors in the console output. Can anyone reproduce the issue or has any idea what is happening?

SteveMacenski commented 4 years ago

Is this the same rviz window or a new one on a new navigation launch? Can you verify that if you toggle the rviz display types for the costmap (or relaunch rviz) that it appears? I think what you're seeing has nothing to do with navigation but rather a failure in the visualization tools.

Even 0s would show up here with a boundary because of the changes in transparency between the 2 costmap settings. I think you would see 0s on the costmap in your pictures if the costmap were actually being shown.

Not sure it relates, but Buster is also not a Tier 1 supported OS, so it may be that the DDS vendors / RMW layers don't do detection properly on it. Not sure it's related, but it certainly could be.

I was 4/4 in launching them just now - so might want to take a second look and make sure what's happening is what you think is happening.

naiveHobo commented 4 years ago

Not sure it relates, but Buster is also not a Tier 1 supported OS

I face this issue too on Ubuntu 18.04 so it shouldn't be related to Buster.

naiveHobo commented 4 years ago

I looked into this a bit more, the local costmap is set to all 0s for some reason.

As @dkuenster said, this happens more than 50% of the time when starting up the simulation. When it happens, only the static layer seems to be working for the global costmap as well. If a static layer is added to the local costmap, it also has non-zero values and shows up every time.

I'm not completely sure what the error with the rest of the layers is exactly, trying to look into it.

Screenshot from 2020-05-28 02-12-40

SteveMacenski commented 4 years ago

Your image makes it seem like the laser is up and running, given that I see some red in the center pole that is off the map (from robot localization quality and laserscans). I get your point if that happens, but this isn't a good example of it.

Let me know what you find out.

dkuenster commented 4 years ago

Is this the same rviz window or a new one on a new navigation launch? Can you verify that if you toggle the rviz display types for the costmap (or relaunch rviz) that it appears? I think what you're seeing has nothing to do with navigation but rather a failure in the visualization tools.

It was a new window with a new navigation launch. When switching the visualization, I get the same result that can be seen in the screenshot of @naiveHobo.

dkuenster commented 4 years ago

I also found that each time the Local Costmap doesn't appear, the Controller only gets 0 as initial velocity in the twist message of the computeVelocityCommand, despite the robot moving and the odom topic containing the correct velocities. On a start where the Local Costmap starts correctly on the other hand the actual current velocity gets passed to the controller.

The pose parameter however works correctly in both cases.

dkuenster commented 4 years ago

While echo constantly shows msgs on the "odom" topic in both cases, the OdomSubscriber in the Controller gets messages on some starts and on others the callback method never gets called. Each time it doesn't get messages, we also get the problem with the local costmap plugins, as soon as we set the initial pose. I don't know how it is related, but something seems to go wrong before we even set an initial pose.

dkuenster commented 4 years ago

While echo constantly shows msgs on the "odom" topic in both cases, the OdomSubscriber in the Controller gets messages on some starts and on others the callback method never gets called. Each time it doesn't get messages, we also get the problem with the local costmap plugins, as soon as we set the initial pose. I don't know how it is related, but something seems to go wrong before we even set an initial pose.

Same problem with the LaserScanSubscriber in the Obstacle Layer. On the starts where the OdomSubscriber callback never gets called, the callback in the LaserScan subscriber also doesn't get called despite echo showing messages on "scan".

SteveMacenski commented 4 years ago

Just to verify, what you're describing are specific instances of topics that are being published that have not yet connected to the costmaps, correct?

Can you try seeing if switching DDS vendors to Cyclone DDS resolves those issues? I'm wondering if there was a regression or an issue with the local discovery with Fast-RTPS. What version of ROS2 are you on right now (eloquent, master, foxy, etc)

dkuenster commented 4 years ago

Just to verify, what you're describing are specific instances of topics that are being published that have not yet connected to the costmaps, correct?

Yes.

Can you try seeing if switching DDS vendors to Cyclone DDS resolves those issues? I'm wondering if there was a regression or an issue with the local discovery with Fast-RTPS. What version of ROS2 are you on right now (eloquent, master, foxy, etc)

Switching to Cyclone DDS indeed solves this problem. Also switching back to version v1.10.0 of Fast-RTPS, as suggested in #1788 solves the problem. So it seems to be an issue introduced in newer versions of Fast-RTPS.

SteveMacenski commented 4 years ago

Ah ok, yeah that appears to be the same issue at #1788 and https://github.com/ros2/ros2/issues/931. Can you quickly verify that the commit https://github.com/eProsima/Fast-DDS/commit/a9bd1a9003adb7ca80c0f6854de58e181059de94 is the offender? If so, we can merge these 2 tickets together and track them.

dkuenster commented 4 years ago

Yes, it works right until commit https://github.com/eProsima/Fast-DDS/commit/d5c9d6bcd4fdfe7edadb137c6203a2db8d01154f (the commit right before https://github.com/eProsima/Fast-DDS/commit/a9bd1a9003adb7ca80c0f6854de58e181059de94) and then breaks on https://github.com/eProsima/Fast-DDS/commit/a9bd1a9003adb7ca80c0f6854de58e181059de94
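
The commit hunt above amounts to a bisection over an ordered commit list. The following is only a sketch of that idea, not the procedure that was actually used: the commit labels, the run_once callback, and the trial count are hypothetical. Because the failure is intermittent, a commit is treated as bad if any one of several launches fails.

```python
def first_bad_commit(commits, is_bad):
    """Return the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is good, commits[-1] is bad, and every commit
    after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


def flaky_is_bad(run_once, trials=5):
    """Build an is_bad check that repeats an intermittent test.

    run_once(commit) should return True when a single launch works.
    """
    def check(commit):
        return any(not run_once(commit) for _ in range(trials))
    return check


if __name__ == "__main__":
    # Hypothetical history: c0..c4 work, c5 and later are broken.
    history = [f"c{i}" for i in range(8)]
    simulated_launch = lambda commit: int(commit[1:]) < 5
    print(first_bad_commit(history, flaky_is_bad(simulated_launch)))
```

With a failure rate above 50% per launch, a handful of trials per commit makes it unlikely that a broken commit is mistaken for a good one, at the cost of longer test runs.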

SteveMacenski commented 4 years ago

I'm rolling the scope of https://github.com/ros-planning/navigation2/issues/1788 into this one so we have 1 ticket per issue, and renaming this issue to "Fast-RTPS services and network discovery regression". We should track that upstream issue but also potentially move to Cyclone DDS for development, since that doesn't exhibit the issue.

MiguelCompany commented 4 years ago

I checked this using commit 69977cd83d9040df3422d8a2e564715b6002f3fb + current ros2 master, and running several experiments. For each experiment I followed this procedure:

  1. Start wireshark capture
  2. run RMW_IMPLEMENTATION=<impl> ros2 launch nav2_bringup tb3_simulation_launch.py 2>&1 | tee console.txt
  3. Wait for everything to start (including gazebo showing the turtlebot waffle)
  4. Use 2D Pose Estimate button
  5. Wait for local_costmap status showing increasing reception counts
  6. Use navigation goal
  7. Wait for navigation to complete
  8. Close rviz
  9. Stop and export wireshark capture
  10. Move files from ~/.ros/log into /ros-log
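
Step 2 can be scripted so that the same launch is repeated for each RMW implementation. This is a minimal sketch that only builds the command line and the environment; the helper names and the per-implementation log file names are hypothetical, and nothing is launched here.

```python
import os
import shlex

# Launch command from step 2 of the procedure.
LAUNCH_CMD = ["ros2", "launch", "nav2_bringup", "tb3_simulation_launch.py"]


def launch_spec(rmw_impl, base_env=None):
    """Return (argv, env) for one experiment with the given RMW implementation."""
    env = dict(os.environ if base_env is None else base_env)
    env["RMW_IMPLEMENTATION"] = rmw_impl
    return list(LAUNCH_CMD), env


def as_shell_line(rmw_impl, log_file="console.txt"):
    """Render the equivalent shell line, including the tee to a log file."""
    cmd = " ".join(shlex.quote(part) for part in LAUNCH_CMD)
    return (
        f"RMW_IMPLEMENTATION={shlex.quote(rmw_impl)} {cmd} "
        f"2>&1 | tee {shlex.quote(log_file)}"
    )


if __name__ == "__main__":
    for impl in ("rmw_cyclonedds_cpp", "rmw_fastrtps_cpp"):
        print(as_shell_line(impl, f"console_{impl}.txt"))
```

The (argv, env) pair from launch_spec could be passed to subprocess.run(argv, env=env) to start the launch from Python instead of a shell.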

As I work with Windows, I ran the experiments using VirtualBox to run Ubuntu Focal on a virtual machine.

I have checked with rmw_cyclonedds_cpp and rmw_fastrtps_cpp. For the latter, I have checked with eProsima/Fast-DDS@b710b1f53a4ecf6c92f87661347a93c46e5f4854 (current head of the 2.0.x branch) as well as with eProsima/Fast-DDS@d5c9d6bcd4fdfe7edadb137c6203a2db8d01154f.

I have never been able to see the expected image. Sometimes rviz crashed. Other times I could navigate correctly, but the local costmap was not shown. A summary of the results so far:

| ROS 2 repos file | rmw implementation | result | result files |
|---|---|---|---|
| master | rmw_cyclonedds_cpp | rviz crashed after step 4 | here |
| master | rmw_cyclonedds_cpp | navigation complete. local costmap not shown | here |
| master | rmw_fastrtps_cpp | rviz crashed after step 4 | here |
| master | rmw_fastrtps_cpp | navigation complete. local costmap not shown | here |
| Fast-DDS-d5c9d6bcd | rmw_fastrtps_cpp | navigation complete. local costmap not shown | here |
| Fast-DDS-d5c9d6bcd | rmw_fastrtps_cpp | navigation complete. local costmap not shown | here |

My impression is that now that both implementations have workarounds to make services more reliable, this issue is always reproduced, so maybe there is something wrong in navigation2 that is now reproducibly failing.

NB: It would be nice if someone could check this with RTI connext

SteveMacenski commented 4 years ago

[rviz2-4] what(): InternalErrorException: Cannot create GL vertex buffer in GLHardwareVertexBuffer::GLHardwareVertexBuffer at /home/miguel/ros2_master/build/rviz_ogre_vendor/ogre-v1.12.1-prefix/src/ogre-v1.12.1/RenderSystems/GL/src/OgreGLHardwareVertexBuffer.cpp (line 46)

For rviz crashing, I can't help you on that unless it's a result of the navigation2 plugins, but I don't think that's the case. If you run with debug symbols and it's our fault, I'll look into it, but I think that's rviz.

Keep in mind it's not just about the costmap showing up. The issue we're talking about is services, which those experiments don't do anything to measure. Services can be trivially tested without the navigation stack with some simple call-response nodes.

@daisukes thoughts? I'm not read up on or tracking fast-rtps commits, so those hashes and the specific changes don't mean much to me (I'm an expert in robotics, not DDS/networking). Have you reproduced the service problem at all from the reports? That's the best starting point, and it's one I have also experienced and that we still see in the navigation2 CI. Once you've reproduced the problem, I think it will be clearer to show that those changes actually fix the underlying problem.

daisukes commented 4 years ago

@SteveMacenski

As I investigated the commits of Fast-DDS, it worked fine until this commit. I tested with this simple service test code https://github.com/ros2/ros2/issues/931#issuecomment-639489955

terminal 1 $ ros2 launch nav2_bringup tb3_simulation_launch.py     # and give an initial position
terminal 2 $ ros2 run service_test service_test

RMW_IMPLEMENTATION=rmw_cyclonedds_cpp 
[INFO] [1594423870.718509112] [rclcpp]: 0 Successed
[INFO] [1594423872.919848288] [rclcpp]: 0 Successed
[INFO] [1594423874.913237309] [rclcpp]: 0 Successed
...

unset RMW_IMPLEMENTATION (default Fast-RTPS)
[INFO] [1594423963.004730778] [rclcpp]: 0 service not available.
[INFO] [1594423968.228786391] [rclcpp]: 0 service not available.
[ERROR] [1594423974.496116010] [rclcpp]: 0 Failed
[INFO] [1594423979.727908774] [rclcpp]: 0 service not available
...
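
To compare the two runs at a glance, the outcomes in logs like these can be tallied. This is a minimal sketch: the line format is inferred only from the excerpts above, and the function name is hypothetical, so the pattern may need adjusting for other output.

```python
import re
from collections import Counter

# Matches lines such as "[INFO] [1594423870.718509112] [rclcpp]: 0 Successed".
# The trailing period is optional because the excerpts show both variants.
LINE_RE = re.compile(
    r"^\[(INFO|ERROR)\] \[(\d+\.\d+)\] \[rclcpp\]: \d+ (.*?)\.?\s*$"
)

CYCLONE_LOG = """\
[INFO] [1594423870.718509112] [rclcpp]: 0 Successed
[INFO] [1594423872.919848288] [rclcpp]: 0 Successed
[INFO] [1594423874.913237309] [rclcpp]: 0 Successed
...
"""

FASTRTPS_LOG = """\
[INFO] [1594423963.004730778] [rclcpp]: 0 service not available.
[INFO] [1594423968.228786391] [rclcpp]: 0 service not available.
[ERROR] [1594423974.496116010] [rclcpp]: 0 Failed
[INFO] [1594423979.727908774] [rclcpp]: 0 service not available
...
"""


def tally(log_text):
    """Count each outcome message; lines that do not match are skipped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(3)] += 1
    return counts


if __name__ == "__main__":
    print("cyclonedds:", dict(tally(CYCLONE_LOG)))
    print("fastrtps:", dict(tally(FASTRTPS_LOG)))
```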

We also had rviz2 crash when using the latest binary (after June 25th), so we use the source build with rviz2 v8.1.1, not v8.2.0. I'm not sure if it is a v8.2.0 problem or a binary problem. https://github.com/ros2/ros2/commit/fc010c9a297eceaedb398213dea14d5ad5d67844#diff-215a2eb6c7ad8b20796a9fceb48f8cc7

SteveMacenski commented 4 years ago

Can you file a ticket on rviz2 for that if one doesn't exist? Make sure someone knows there's a problem.

Thanks for the experiment and specification. That will definitely help clear things up.

daisukes commented 4 years ago

FYI: I made a ticket https://github.com/ros2/rviz/issues/574

MiguelCompany commented 4 years ago

@SteveMacenski @daisukes It seems we found the issue. Could you give a try to eProsima/Fast-DDS#1295 ?

daisukes commented 4 years ago

@MiguelCompany I have built the branch and confirmed that the service_test works well and also my own simulation works well with RMW_IMPLEMENTATION=rmw_fastrtps_cpp. Thank you!

MiguelCompany commented 4 years ago

@SteveMacenski As eProsima/Fast-DDS#1295 has been merged, and @daisukes checked correct behavior, I think this issue can be closed?

SteveMacenski commented 4 years ago

@MiguelCompany has it been released into foxy?

MiguelCompany commented 4 years ago

@MiguelCompany has it been released into foxy?

I don't think so, but I think we should ask @jacobperron about it.

SteveMacenski commented 4 years ago

@naiveHobo there's been a foxy sync so this might be OK now

jacobperron commented 4 years ago

Fast-DDS 2.0.0 is currently the version in Foxy. Once a 2.0.1 tag exists, we can make a new release containing eProsima/Fast-DDS#1295.

MiguelCompany commented 4 years ago

@jacobperron v2.0.1 has been released, please go ahead 😉

MiguelCompany commented 4 years ago

@SteveMacenski @daisukes v2.0.1 has long ago been released into foxy. This and related issues should have been solved.

SteveMacenski commented 4 years ago

I confirmed it's been released now - closing.