osrf / ros2_test_cases

Tracking of tests to be performed on a ROS 2 release
16 stars 4 forks source link

Full System Testing: Nav2 Edition P1: The Exector Strikes Back #1042

Closed SteveMacenski closed 1 year ago

SteveMacenski commented 1 year ago

Someone please tag amd debian cyclone jammy. Should there be a second ticket if I'm also going to do fastdds? I'll likely test binary too but I'm not committing to that up front :-)

clalancette commented 1 year ago

Someone please tag amd debian cyclone jammy. Should there be a second ticket if I'm also going to do fastdds? I'll likely test binary too but I'm not committing to that up front :-)

Yeah, a separate ticket for fastdds would be preferred. That way it is easier to make this a standard part of the testing in the future if we want to.

SteveMacenski commented 1 year ago
SteveMacenski commented 1 year ago

@clalancette Ticket from Iron testing --> https://github.com/colcon/colcon-core/issues/554 Colcon has warnings when executed. Should I report this anywhere else?

SteveMacenski commented 1 year ago

Another: https://github.com/ros2/rosidl/issues/743 --> Previously at https://github.com/ros2/rosidl_typesupport_fastrtps/issues/28 but I think from re-reading it, that's incorrectly attributed to Fast-DDS of another (but related) problem and filed in that rosidl repo. Ours is a more central issue in rosidl_generator_cpp which only appeared after the last Rolling sync and now in Iron.

@clalancette I think this should be looked at soon after release - but I don't think show stopping for release. If it was only nav2_msgs, honestly not a big deal on the scale of things, but sensor_msgs would also exhibit this issue with messages such as http://docs.ros.org/en/api/sensor_msgs/html/msg/BatteryState.html containing a number of const definitions.

SteveMacenski commented 1 year ago

https://github.com/osrf/ros2_test_cases/issues/1044 is the Fast-DDS version (linking to be able to find later)

SteveMacenski commented 1 year ago

Woof. Nav2 appears to be broken but I don't think its Iron, though it could I suppose :shrug:. We're seeing Nav2 receiving AMCL localization service requests, but not honoring the content of them (and seems to randomly just bounce around) & Gazebo Classic appears to be producing data that doesn't seems sensible for where the robot truly is in the environment + reporting that the robot is outside of the local costmap bounds for update. This feels like TF, services, and/or gazebo issues which I doubt are related to the core packages but maybe some regression we have within Nav2 itself. This happened around the time of the last Rolling sync, but that could be entirely coincidental.

Once its moving, the planning and control stack seems to be functional from really quick testing.

I'm currently building a branch that I nuked all changes back to March 1 to see if it still is a problem - at which point it should be clear. I narrowed down from our CI results the last job I can clearly know it didn't happen in and the first when it appears. I can't see any commits in that range that touch any of this, but I flashed back much further than even those. Not at a point yet I'm asking for any particular help, just giving an update and summary in case this turns into a multi-day excursion

SteveMacenski commented 1 year ago

OK, now I'm asking for some attention, I believe this is not Nav2 related and is instead ROS 2 core regressions.

I've backed out all of the commits since last Christmas on Nav2 and I'm still seeing the same problem unchanged on Iron binaries.

I was able to find the last nightly build on Rolling that succeeded, and it was the night of April 13th (early morning hours of April 14th). This corresponds with the last Rolling release on the 14th before the Iron fork which broke Nav2's CI (rosidl issue, not relevant). Since the first build after resolving that CI issue, we see all system tests failing with this issue described above since.

To confirm from a different direction, I then repeated this exercise using Humble and Nav2's current state without rolling back commits and things work perfectly (after a couple of 2 line changes for deprecated API). So the Iron vs Humble and Pre-last-Rolling-release shows that difference in behavior. So in summary, I think something in the last Rolling release before the fork broke some stuff that we see in Iron now.

I scanned through the list of packages updated in the last sync and I see RMW, DDS, and client library stuff as well as TF2. Regressions here could explain the behavior. Nothing else in the list stood out to me, though I would have added Gazebo to that list if the sync included any updates there.

I know this isn't very specific about where or what, but this is a flag worth raising early :flags:

SteveMacenski commented 1 year ago

Followup - I went through the commits in the time range in rclcpp and TF2 but nothing stood out to me, so I took a more emperical approach.

Building Humble's version of geometry2 (just selected something arbitrarily old but relatively new not to have API issues) in the Nav2 workspace seems to make the problems disappear. While Rviz and other tools are using the binary versions, Nav2 luckily is self contained enough that at least all the Nav2 modules can use that version of TF2 instead when built alongside it.

That seems to resolve the issues, which is welcome news that I don't need to dig into rclXXX or the RMW vendors.

As best I can tell, there's only 2 commits that seem like they might be causing an issue that have been in the ros2 branch after march:

I'm in the process of rebuilding geometry2 + nav2 without these to see if either is the culprit, but perhaps someone involved with those decisions (@clalancette) might have an educated guess looking at the recent commit history https://github.com/ros2/geometry2/commits/rolling

SteveMacenski commented 1 year ago

OK, I've identified it. Compiling TF next to Nav2, I bisected commits from the last time I knew it was working properly for us 0.30.0 and now and found the behavior breaks when https://github.com/ros2/geometry2/commit/3bac620777043a1c8761cd74c1bb814667341b65 is introduced (and by extension https://github.com/ros2/geometry2/commit/7afeb6c2b300be5a6b97eed0fa822a1d294e7b77).

I'm not going to speculate as to why in particular, I don't know, but its without a doubt the cause. This commit breaks ABI so we can't just revert it and ship Iron if its something important. So the options are to revert or try to fix it before release, but I can't make the value call of that.

CC @roncapat, in case anything I've said here sounds familiar to you that you might know the reason for an issue to proposal a solution!

@clalancette it looks like you merged that PR so

clalancette commented 1 year ago

OK, I've identified it. Compiling TF next to Nav2, I bisected commits from the last time I knew it was working properly for us 0.30.0 and now and found the behavior breaks when ros2/geometry2@3bac620 is introduced (and by extension ros2/geometry2@7afeb6c).

Thanks for running this one to ground, that is really helpful. I'll comment more on https://github.com/ros2/geometry2/issues/600

Yadunund commented 1 year ago

@SteveMacenski is it safe to mark this test as a success?

SteveMacenski commented 1 year ago

I will not be able to complete this. I spent the 2 days I had allocated for Iron testing on debugging the TF issue and unfortunately I need to be conscious of my time

These are good to have around for next year, but unfortunately I just won't have time in the next 2 weeks to come back to this

Yadunund commented 1 year ago

Understood. Thanks for update!

clalancette commented 1 year ago

Some of this got tested, which is good. Since we are out of the tutorial party now, I'm going to close this out as untested.