Move Nav2 CI to 24.04 / Rolling

tonynajjar commented 4 months ago

Basic Info

Info	Please fill out this column
Ticket(s) this addresses	(add tickets here #1)
Primary OS tested on	(Ubuntu, MacOS, Windows)
Robotic platform tested on	(Steve's Robot, gazebo simulation of Tally, hardware turtlebot)
Does this PR contain AI generated software?	(No; Yes and it is marked inline in the code)

Description of contribution in a few bullet points

Various fixes for Nav2 Rolling on Noble

Description of documentation updates required from your changes

Future work that may be required in bullet points

For Maintainers:

[ ] Check that any new parameters added are updated in navigation.ros.org
[ ] Check that any significant change is added to the migration guide
[ ] Check that any new features OR changes to existing behaviors are reflected in the tuning guide
[ ] Check that any new functions have Doxygen added
[ ] Check that any new features have test coverage
[ ] Check that any new plugins is added to the plugins page
[ ] If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists

mergify[bot] commented 3 months ago

@tonynajjar, your PR has failed to build. Please check CI outputs and resolve issues. You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

SteveMacenski commented 3 months ago

@tonynajjar my only beef left here is the cmake changes - I don't want to remove flags for our project due to dependency issues. Do you know how many files-ish (not exact; is it a lot or a little) have a problem due to this? If small, we can do what we did with other dependencies (example):

// turn off the specific warning. Can also use "-Wall"
#pragma GCC diagnostic ignored "-Wunused-but-set-variable"

#include <boost/uuid/uuid.hpp>
#include <boost/uuid/uuid_generators.hpp>
#include <boost/uuid/uuid_io.hpp>
#include <boost/lexical_cast.hpp>

// turn the warnings back on
#pragma GCC diagnostic pop

Though, I'm not sure why -finline-limit=10000000 was removed from MPPI?

mergify[bot] commented 3 months ago

@tonynajjar, your PR has failed to build. Please check CI outputs and resolve issues. You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

tonynajjar commented 3 months ago

Okay, I restored the global compile flags and disabled them only where necessary. Currently we only know that the nav2_behavior_tree error will be resolved at the next sync. I'm not sure how to resolve the ones from nav2_system_tests and nav2_mppi_controller since the error originate from third-party code.

mergify[bot] commented 3 months ago

@tonynajjar, your PR has failed to build. Please check CI outputs and resolve issues. You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

SteveMacenski commented 3 months ago

At last review, this looks good to me. @ruffsl any thoughts?

Otherwise, CI is failing to build (still looks to be on Jammy?) but no more code-related concerns. I'll probably file a few tickets when we merge as reminders but nothing that I'm going to concern you with

tonynajjar commented 3 months ago

CI is failing to build (still looks to be on Jammy?)

yes because ghcr.io/ros-planning/navigation2:main is still on jammy, since the Update CI image has been failing for a while, probably since ubuntu noble migration. @ruffsl any idea why it fails?

I'll probably file a few tickets when we merge as reminders but nothing that I'm going to concern you with

Yep, didn't plan on leaving simple TODOs in the code

ruffsl commented 3 months ago

@ruffsl any idea why it fails?

Something funky is going on. Looking at the latest workflow run for the Update CI Image, why is it trying to login to GHCR using my own user name? Did something screwy happen with the GitHub org/repo transfer?

Starting job container
  /usr/bin/docker --config /home/runner/work/_temp/.docker_b1f940ca-3298-4b86-adc9-d03dea1cd05a login ghcr.io -u ruffsl --password-stdin

For example, for that same workflow run linked above, my username is also associated with the trigger commit on main:

Triggered via schedule 11 hours ago @ruffsl bf291d7 main

But when clicking at the commit link, @SteveMacenski is the author of that commit, as I haven't pushed in a while.

At first I had suspected that perhaps the context ${{ github.repository_owner }} now somehow resolves to ruffsl, E.g:

https://github.com/ros-navigation/navigation2/actions/runs/9124355843/workflow#L73

But that thankfully doesn't seem to be the case, as seen from the next workflow job Rebuild CI Image:

https://github.com/ros-navigation/navigation2/actions/runs/9124355843/job/25088411537#step:4:4

This github org/repo has no Actions secrets and variables, so not sure how else the docker login user could get altered.

ruffsl commented 3 months ago

Oh no, we left behind our CI images at the old ros-planning GitHub org, and our workflow is also just as confused:

/usr/bin/docker --config /home/runner/work/_temp/.docker_fec47f9d-77ff-4c82-92d2-c6f945d07d45 pull ghcr.io/ros-navigation/navigation2:main https://github.com/ros-navigation/navigation2/actions/runs/9107994592/job/25037900242#step:2:19

Oh well, I'm unsure if GitHub packages even can be migrated, but we should probably delete them to avoid use confusion.

I'll build the images for the builder target stage locally, then create and push them to a new container registry repo for our new GitHub org. That should hopefully bootstrap everything.

mergify[bot] commented 3 months ago

@tonynajjar, your PR has failed to build. Please check CI outputs and resolve issues. You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

ruffsl commented 3 months ago

Error response from daemon: Head "https://ghcr.io/v2/ros-navigation/navigation2/manifests/main": unauthorized https://app.circleci.com/pipelines/github/ros-navigation/navigation2/11638/workflows/99f2d6a4-4d2e-4fed-a149-6f9c7f95008d/jobs/35768?invite=true#step-0-452_109

@SteveMacenski , could I ask you to change the package visibility for this new container registry to public?

https://github.com/orgs/ros-navigation/packages/container/navigation2/settings

Public Make this package visible to anyone. Setting is disabled by organization administrators.

Then CircleCI, and any end users, could download the image. Once that's public, then just retrigger the CircleCI for this PR and it should rev right back up.

I'm going to refactor the CI for GitHub action towards the end of this month, so not to concerned with the funky stuff above.

SteveMacenski commented 3 months ago

I think it should be done? Please let me know if I missed something

ruffsl commented 3 months ago

@tonynajjar , looks like the CI is now working but the build is still failing:

Summary: 34 packages finished [42min 33s] 1 package failed: nav2_mppi_controller 1 package aborted: nav2_smac_planner 3 packages had stderr output: nav2_common nav2_mppi_controller nav2_simple_commander 2 packages not processed https://app.circleci.com/pipelines/github/ros-navigation/navigation2/11638/workflows/a432e259-88ed-4d15-89e1-36b6a335661f/jobs/35773?invite=true#step-117-51302_41

It's the weekend so my eyes glaze over from the wall of gcc errors, but it odd in that I had this branch building locally just fine, albeit a different Dockerfile I'm refactoring. Did you use any Cmake flags to ignore any warnings when building locally?

mergify[bot] commented 3 months ago

@tonynajjar, your PR has failed to build. Please check CI outputs and resolve issues. You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

tonynajjar commented 3 months ago

Very odd, that's the same error that I get when building nav2_mppi_controller on main, it's supposed to be solved with the latest changes though... Anyway this lead me to do a better fix and with that one looks like the build succeeds!

tonynajjar commented 3 months ago

@SteveMacenski system tests obviously fail because of gazebo missing. Would you want to disable system tests needing gazebo and merge this or wait for the gazebo transition to be done first?

SteveMacenski commented 3 months ago

I was hoping we wouldn't be this close to the wire on the new gazebo migration, but I guess we are where we are.

Comment them out but add a ticket which points to the commit where they are commented out to add back in once we have a migration to a newer simulator. We should make sure these are not forgotten.

SteveMacenski commented 3 months ago

By the way, I see many more test failures than those shown above when I run this locally on the jazzy-desktop-full image. Reproduce with (change the builders to your liking, I like to limit it when running in the background it doesn't thrash my machine while doing other foreground work).

mkdir -p jazzy_ws/src
cd jazzy_ws/src 
git clone git@github.com:ros-navigation/navigation2
cd navigation2
git fetch origin pull/4298/head:BR1
git checkout BR1
cd ../../../
sudo docker run -it --net=host -v .:/jazzy_ws  osrf/ros:jazzy-desktop-full

sudo apt update
rosdep update
rosdep install --from-paths src --ignore-src -y -r --skip-keys "slam_toolbox gazebo_ros_pkgs turtlebot3_gazebo"
colcon build --parallel-workers 2
colcon test --parallel-workers 2 --packages-skip nav2_system_tests

Failures all resolved - see https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2125809383 which discusses what I think is root cause of the WPF / Smac / Actions bugs

The following failed with all tests passing, but issues in shutdown like

    Warning: class_loader.ClassLoader: SEVERE WARNING!!! Attempting to unload library while objects created by this loader exist in the heap! You should delete your objects before attempting to unload the library or destroying the ClassLoader. The library will NOT be unloaded.
             at line 127 in ./src/class_loader.cpp
    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    -- run_test.py: return code -11
    -- run_test.py: inject classname prefix into gtest result file '/jazzy_ws/build/nav2_smac_planner/test_results/nav2_smac_planner/test_nodehybrid.gtest.xml'
    -- run_test.py: verify result file '/jazzy_ws/build/nav2_smac_planner/test_results/nav2_smac_planner/test_nodehybrid.gtest.xml'
  >>>

NodeHybridTest.test_node_hybrid, AStarTest.test_a_star_2d, SmacTest.test_smac_se2, SmacTest.test_smac_lattice, SmootherTest.test_full_smoother, WPTest.test_dynamic_parameters

GoalUpdaterTestFixture.test_get_latest_goal_update Failure RESOLVED - Nav2's problem Tester.testCollisionPointsMarkers Failure RESOLVED Nav2's problem PointPortTest.test_wrong_syntax Failure RESOLVED - Change in BT.CPP error handing WPF Test RESOLVED - Seemingly from rclcpp regression or behavior change on shutdown? LINTING RESOLVED - not a big deal Action failures RESOLVED - Unclear, but partially from rclcpp regression or behavior change on shutdown? Though definitely some seems to be on us in Nav2 unless the continued action server was also a regression (but I think its safe to say that part may be us) Behavior tree node failures RESOLVED - BT.CPP bug/regression causing crash if input configs don't contain all possible ports for the tests (CC @facontidavide: even if that was an intentional change, I expect that a crash isn't the expected behavior rather an exception or log. Though seems like maybe its a regression?)

SteveMacenski commented 3 months ago

I started looking into some of them, for instance the waypoint follower test can be made to pass by adding the following to the end of the test_dynamic_parameters test

  follower->deactivate();
  follower->cleanup();
  follower.reset();

Even though these should automatically be called on destruction which was tested to work prior to the 24.04 migration https://github.com/ros-navigation/navigation2/blob/main/nav2_util/src/lifecycle_node.cpp#L77-L106

This error is new to us, so maybe its just straight up crashing at the RMW layer by running the rcl preshutdown procedure?

    [INFO] [1716245596.048208666] [waypoint_follower]: Running Nav2 LifecycleNode rcl preshutdown (waypoint_follower)
    Failed to publish log message to rosout: cannot publish data, at ./src/rmw_publish.cpp:66, at ./src/rcl/publisher.c:284
    [INFO] [1716245596.048288331] [waypoint_follower]: Destroying bond (waypoint_follower) to lifecycle manager.
    Failed to publish log message to rosout: cannot publish data, at ./src/rmw_publish.cpp:66, at ./src/rcl/publisher.c:284

I don't put it past us that we're misusing the API, but this is a new failure mode from Iron/Rolling on 22.04 which would either be a regression in RMW or the misuse that we have (?) is now causing the concern. We should be able to destroy objects at the end of tests without this kind of failure.

Edit: The tests in this case don't crash, but seems to deadlock after the tests pass. The ~sub-10-second test hangs and eventually the testing framework kills the test over the timeout is exceeded (1-2 min) while cleaning up the test after the test itself reports a PASS in some of the test failure models reported above. I think that points to the regression thesis (or something to do with sharing the 22.04 kernal from the 24.04 docker - since you say it works on your end)?

    [INFO] [1716246519.780695572] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    Failed to publish log message to rosout: cannot publish data, at ./src/rmw_publish.cpp:66, at ./src/rcl/publisher.c:284
    [INFO] [1716246519.780739136] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to publish log message to rosout: cannot publish data, at ./src/rmw_publish.cpp:66, at ./src/rcl/publisher.c:284
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    [INFO] [1716246519.781438012] [costmap]: Destroying

    >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
    This error state is being overwritten:

      'cannot publish data, at ./src/rmw_publish.cpp:66, at ./src/rcl/publisher.c:201'

    with this new error message:

      'publisher's context is invalid, at ./src/rcl/publisher.c:423'

    rcutils_reset_error() should be called after error handling to avoid this.

    [       OK ] SmacTest.test_smac_lattice_reconfigure (1392 ms)
    [----------] 2 tests from SmacTest (1507 ms total)

    [----------] Global test environment tear-down
    [==========] 2 tests from 1 test suite ran. (1507 ms total)
    [  PASSED  ] 2 tests.
    Warning: class_loader.ClassLoader: SEVERE WARNING!!! Attempting to unload library while objects created by this loader exist in the heap! You should delete your objects before attempting to unload the library or destroying the ClassLoader. The library will NOT be unloaded.
             at line 127 in ./src/class_loader.cpp
    [INFO] [1716333662.281861223] [global_costmap.global_costmap]: Destroying
    [ERROR] [1716333662.281946878] [global_costmap.global_costmap]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368
    [WARN] [1716333662.281975528] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    [ERROR] [1716333662.282100052] [global_costmap.global_costmap.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:200
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    -- run_test.py: return code -11
    -- run_test.py: inject classname prefix into gtest result file '/jazzy_ws/build/nav2_smac_planner/test_results/nav2_smac_planner/test_smac_lattice.gtest.xml'
    -- run_test.py: verify result file '/jazzy_ws/build/nav2_smac_planner/test_results/nav2_smac_planner/test_smac_lattice.gtest.xml'
  >>>

tonynajjar commented 3 months ago

Looks like ament_uncrustify from CI does not agree with the one I'm running locally (rolling image) and there is no common ground. Something to look into

SteveMacenski commented 3 months ago

This is the action test failure:

   [----------] Global test environment tear-down
    [==========] 7 tests from 1 test suite ran. (5276 ms total)
    [  PASSED  ] 7 tests.
    [WARN] [1716327862.626602630] [fibonacci_server_node]: [fibonacci] [ActionServer] Current goal was not completed successfully.
    [WARN] [1716327862.626609212] [fibonacci_server_node]: [fibonacci] [ActionServer] Client requested to cancel the goal. Cancelling.

    >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
    This error state is being overwritten:

      'feedback publisher is invalid, at ./src/rcl_action/action_server.c:940'

    with this new error message:

      'cannot publish data, at ./src/rmw_publish.cpp:66'

    rcutils_reset_error() should be called after error handling to avoid this.
    <<<
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

    >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
    This error state is being overwritten:

      'Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262'

    with this new error message:

      'cannot publish data, at ./src/rmw_publish.cpp:66'

    rcutils_reset_error() should be called after error handling to avoid this.
    <<<
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

    >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
    This error state is being overwritten:

      'Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262'

    with this new error message:

      'cannot publish data, at ./src/rmw_publish.cpp:66'

    rcutils_reset_error() should be called after error handling to avoid this.
    <<<
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'

    >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
    This error state is being overwritten:

      'Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262'

    with this new error message:

      'cannot publish data, at ./src/rmw_publish.cpp:66'

    rcutils_reset_error() should be called after error handling to avoid this.
    <<<
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    [ERROR] [1716327862.627785482] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:200
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    [ERROR] [1716327862.627801920] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:200
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    [ERROR] [1716327862.627815406] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:200
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    [ERROR] [1716327862.627862083] [fibonacci_server_node.rclcpp]: Error in destruction of rcl subscription handle: Failed to delete datareader, at ./src/subscription.cpp:52, at ./src/rcl/subscription.c:200
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    [ERROR] [1716327862.627889188] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    [ERROR] [1716327862.627910522] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    [ERROR] [1716327862.627930226] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    [ERROR] [1716327862.627949664] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    [ERROR] [1716327862.627970082] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    [ERROR] [1716327862.627990988] [fibonacci_server_node.rclcpp]: Error in destruction of rcl service handle: Fail in delete datawriter, at ./src/rmw_service.cpp:104, at ./src/rcl/service.c:262
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    [ERROR] [1716327862.628012608] [fibonacci_server_node.rclcpp]: Error in destruction of rcl publisher handle: cannot publish data, at ./src/rmw_publish.cpp:66, at ./src/rcl/publisher.c:201
    -- run_test.py: return code -11
    -- run_test.py: inject classname prefix into gtest result file '/jazzy_ws/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml'
    -- run_test.py: verify result file '/jazzy_ws/build/nav2_util/test_results/nav2_util/test_actions.gtest.xml'
  >>>

The actions passed during this run, but when bringing down the file, that's when all the issues occur. I fixed some of it by adjusting the shutdown procedure. I basically just more explicitly handle the shutdown of the node before rclcpp::shutdown() is called. That shouldn't be necessary! rclcpp::shutdown() historically has kicked spinning nodes out of spin() rather than throwing a ton of errors:

rclcpp::spin(node->get_node_base_interface());
node->on_term();
node.reset()

But doing the following fixed that (where stop_ is set in a new TearDown method)

    while (rclcpp::ok() && !stop_.load()) {
      rclcpp::spin_some(node->get_node_base_interface());
      std::this_thread::sleep_for(10ms);
    }
    node->on_term();
    node.reset();

The last problem boils down to a -11 crash when I try to reset the action server's shared pointer in on_term which is strange. I found that if I deactivated the action server before trying to reset it that went away. I speculate that it was still running, I'm not sure why it would be running now after the 24.04/Jazzy changes but not before, however. I can't explain that.

This however is another datapoint that either (A) there's been an intentional change in ROS Core to be super sensitive to exit conditions or (B) there's some regressions in shutdown.

Nodes rclcpp::spin(node->get_node_base_interface()); are deadlocking when rclcpp::shutdown() is called rather than kicking it out of the spin function as detailed in this comment.
Spinning some isn't working the first time at least in one example that Tony reported in https://github.com/ros-navigation/navigation2/pull/4298#discussion_r1608866660
Pre-shutdown callbacks weren't triggered in WPF (so probably all of Nav2) causing the WPF test failure, but if we set what that would have done explicitly it works https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2121351566
Various messages in https://github.com/ros-navigation/navigation2/pull/4298#issuecomment-2121319831 that are still yet to be investigated - but all related to failing to shutdown cleanly it appears at first glance

SteveMacenski commented 3 months ago

@tonynajjar merge in changes from https://github.com/ros-navigation/navigation2/tree/BR2 into your branch. This addresses Actions and the BT util conversion failures. It also addresses the BT node unit tests, ~I didn't get to them all but I got a few of them done to test the general pattern~ (done). It doesn't seem like it matters how we populate them, just that they're initialized. It looks like we need to specify all BT input ports now in the SetUp function of the tests. See below

The behavior tree node unit test failures I'm pretty sure are BT.CPP related. Analysis in https://github.com/ros-navigation/navigation2/commit/ebd9fcb45d222c7f07865582a4fd877585f60f51#r142258931 but seems like getInput is crashing for known but not understood reasons. The crash is bad it should throw or log or something, that seems like a bug in BT.CPP we're uncovering. Either way, the crashes in the nav2_behavior_tree are not RMW related and those we can fix & report the behavior to Davide to see if its actionable on his end or just something we should patch here.

Overall, good progress today! That leaves the last real big problem being the exit problems to get Nav2's CI good to go for Jazzy. Then, a post-mortem summary to see what things we identified (if any) are actionable for the ROS team to be aware of

facontidavide commented 3 months ago

Behavior tree node failures RESOLVED - BT.CPP bug/regression causing crash if input configs don't contain all possible ports for the tests (CC @facontidavide: even if that was an intentional change, I expect that a crash isn't the expected behavior rather an exception or log. Though seems like maybe its a regression?)

Yes, I should fix it and have am exception instead.

SteveMacenski commented 3 months ago

With the recent pushes, the remaining are the test bringdown failures which are just N instances of presumably the same problem. All the tests exhibit the Fsat-DDS shared memory log errors, but not blocking our CI motion to 24.04. Nor is the fact that there may be some issues (?) on bring down upstream once we find a workaround that gets us moving. But these probably should be followed up upon to those appropriate project leads for their awareness (which I've started).

Will be looking into the last thing this afternoon.

SteveMacenski commented 3 months ago

OK, fixed. It looks like another instance of the destruction not working like described in the WPF comment above. To my unsophisticated eye, it looks like some of the context / RMW items aren't still active during the pre-shutdown callbacks, causing exceptions / internal ROS 2 things to crash / deadlock. When we fully set things down so that our pre-shutdown callback doesn't have to do any work or is avoided altogether by destroying the object before the rclcpp shutdown, then we avoid the problem. I think this last, short one does a good job at highlighting this:

    [INFO] [1716413808.442739807] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716413808.442783529] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
    -- run_test.py: return code -11

@tonynajjar pull in my BR2 branch and this should be good to go pending any other issues that comes up from the CircleCI job versions of this

SteveMacenski commented 3 months ago

@tonynajjar do you see the nav2_costmap_2d.DynParamTestNode test failure locally? I do not but CI is reporting it :(

tonynajjar commented 3 months ago

@tonynajjar do you see the nav2_costmap_2d.DynParamTestNode test failure locally? I do not but CI is reporting it :(

Just tried on rolling locally 10 times and I didn't get it

SteveMacenski commented 3 months ago

Maybe break the cache with v23? Or the docker image https://github.com/ros-navigation/navigation2/blob/f671d38f73f543cb5c5adc6b0ff24b32c0ccc227/.circleci/config.yml#L488 ?

ros-navigation / navigation2