ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0
514 stars 410 forks source link

Possible regression in rcl preshutdown callbacks - context invalid? #2547

Open SteveMacenski opened 1 month ago

SteveMacenski commented 1 month ago

Bug report

Required Info:

Description

I've noticed in the Nav2 transition in both unit tests and now at the system level that there appears to be a regression in the handling of the context and the pre-shutdown callbacks starting with the transition to 24.04 and seen in both Jazzy and current rolling. There has been some extended discussion wrt the unit tests in the comments:

But I'm also now seeing it on shutdown of Nav2 at the system level now that we have Gazebo Harmonic working with Nav2. We run our rcl preshutdown callbacks to transition the nodes that may not have been properly transitioned to shutdown before Control+C was hit. This is executing as expected with logs:

[component_container_isolated-8] [INFO] [1717089793.781486696] [controller_server]: Running Nav2 LifecycleNode rcl preshutdown (controller_server)
[component_container_isolated-8] [INFO] [1717089793.781516996] [controller_server]: Destroying bond (controller_server) to lifecycle manager.
[component_container_isolated-8] [INFO] [1717089793.781521867] [smoother_server]: Running Nav2 LifecycleNode rcl preshutdown (smoother_server)
[component_container_isolated-8] [INFO] [1717089793.781526590] [smoother_server]: Destroying bond (smoother_server) to lifecycle manager.
...

But then also contain traces from rcl which seem related to publishers failing to do their thing because the context is invalid at the stage of pre-shutdown. We see only publisher failures when we have the system properly cycled down, but if we have the Nav2 lifecycle nodes still active so services are called to transition them, then we see the service eq as well.

[component_container_isolated-8] [INFO] [1717089793.874960887] [controller_server]: Destroying
[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368
[component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)
...

I clipped the logs to show one server in particular, but this repeats for all instances.

This can be reproduced in Nav2 using the 24.04/Rolling + the simulator in https://github.com/ros-navigation/navigation2/pull/3634. Drive around for a moment, stop, and hit control+C. We previously had a clean shutdown without crash. It can also be more minimally reproduced in the PR https://github.com/ros-navigation/navigation2/pull/4298/files TEST(WPTest, test_dynamic_parameters) test changes -- where we manually shut things down & reset the object internally to the test to avoid the need for the rcl pre-shutdown to do work, thereby bypassing it

SteveMacenski commented 1 month ago

I don't know if its related, but also point out this issue as well discovered which points to issues in spin() exiting cleanly when Control+C is hit so that rclcpp::shutdown or others can be called once the node is kicked out. It seems related, but I can file a separate ticket for it. Similarly to that @tonynajjar has another ticket on spin_some not working the first time its called in some cases.

fujitatomoya commented 1 month ago

@SteveMacenski

https://github.com/ros2/rclcpp/pull/2545 can fix that graceful shutdown case, and generated by https://github.com/ros2/rclcpp/pull/2528. (sorry...)

SteveMacenski commented 1 month ago

No worries, shit happens. Another good reason Nav2's around using all these bells and whistles as a canary in the coalmine. Only used up like a day on this so not bad at all in the scheme of things. Lovely that there's already a solution! Thanks @fujitatomoya

fujitatomoya commented 1 month ago

https://github.com/ros2/rclcpp/pull/2545 fixes LifecycleNode to avoid leaving in unknow state, if the object is destroyed. (originally this is issued https://github.com/ros2/rclcpp/issues/997).

But if the context is gracefully shutdown (e.g deferred signal handler via SIGINT, which is the exactly case here), the context is already shutdown, and that makes the lifecycle node dtor fails to publish the transition (lifecyclenode should be shutdown to avoid unknow state for sensors and devices but fails since context is not valid anymore). https://github.com/ros2/rclcpp/pull/2545 bypass this situation at this moment, shutdown is called only when the context is valid. this is tagged with TODO until we figure out how to address this condition.

https://github.com/ros2/rclcpp/pull/2545 CI is running now, once i have green light, i will merge it to rolling.

SteveMacenski commented 1 month ago

BTW: this worked fine those pre-24.04 Rolling (22.04 Rolling didn't exhibit this) so there was something that changed that caused this, it was fully functioning for ~a year prior. I think my understanding of your response is that there's a bit of a catch-22, so mentioning this as a "I think there should be a solution, no?"

fujitatomoya commented 1 month ago

@SteveMacenski i think there are 2 problems here,

[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368 [component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)

I am sure these messages are generated by https://github.com/ros2/rclcpp/pull/2528 as explained https://github.com/ros2/rclcpp/issues/2547#issuecomment-2140512199, and https://github.com/ros2/rclcpp/pull/2545 can fix it.

note: https://github.com/ros2/rclcpp/pull/2528 IS NOT merged to jazzy. (https://github.com/ros2/rclcpp/pull/2542 staging but merged), so we should not meet these message with jazzy.

24.04 and seen in both Jazzy and current rolling. There has been some extended discussion wrt the unit tests in the comments:

these 3 issues are with rmw_fastrtps, right? this looks the same problem with https://github.com/ros2/rmw_fastrtps/issues/761

SteveMacenski commented 1 month ago

The second comment is definitely the same as this.

The first is perhaps? Its also another 'things failing on shutdown' example and you see similar logs with the pre-shutdown (the third is a summarized version of the same). I didn't assume it was a Fast-DDS specific problem, but :shrug:

    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
fujitatomoya commented 1 month ago

@SteveMacenski

https://github.com/ros2/rclcpp/pull/2545 has been merged to rolling, the following waning and error should not be printed even with graceful shutdown.

[component_container_isolated-8] [ERROR] [1717089793.875015316] [controller_server]: Unable to start transition 5 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368 [component_container_isolated-8] [WARN] [1717089793.875026497] [rclcpp_lifecycle]: Shutdown error in destruction of LifecycleNode: final state(unconfigured)

SteveMacenski commented 1 month ago

Great! Currently in the middle of other firefighting with getting Nav2 out for Jazzy, but I'll test this out once the nightly docker containers update and I'm between burning buildings. If you're confident that it addresses the issue, feel free to close

fujitatomoya commented 3 weeks ago

@SteveMacenski after https://github.com/ros2/rclcpp/pull/2545 is merged in rolling, are you guys still having the other issue?

    [INFO] [1716243365.746525522] [costmap]: Running Nav2 LifecycleNode rcl preshutdown (costmap)
    [INFO] [1716243365.746611543] [costmap]: Destroying bond (costmap) to lifecycle manager.
    Failed to delete datawriter, at ./src/publisher.cpp:45 during '__function__'
    cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
    Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'