ros2 / rclcpp

rclcpp (ROS Client Library for C++)
Apache License 2.0

Lifecycle destructor calls shutdown while in shuttingdown intermediate state #2520

Closed. oysstu closed this issue 3 weeks ago.

oysstu commented 2 months ago

Bug report

Required Info:

Steps to reproduce issue

Shutdown transition was added to the lifecycle node destructor in #2450 (iron backport: #2490). This currently triggers shutdown if the node state is anything other than finalized. Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition? I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

Not sure if it is related to the publisher being invalid or not, but the node is clearly in the shuttingdown state, and not one of the primary states.

Expected behavior

Only attempt transition in primary states unconfigured, inactive, and active.

Actual behavior

Shutdown transition is attempted even in intermediate transition states (e.g., shuttingdown).

Ref. the following destructor snippet https://github.com/ros2/rclcpp/blob/5f912eb58e0f6f1718592ece0935f9a0115d03ec/rclcpp_lifecycle/src/lifecycle_node.cpp#L156-L169
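
The proposed guard can be sketched as follows. This is a minimal, self-contained sketch and not the actual rclcpp implementation; the numeric state IDs mirror `lifecycle_msgs/msg/State`, and `should_attempt_shutdown` is a hypothetical helper name.

```cpp
#include <cstdint>

// State IDs mirroring lifecycle_msgs/msg/State.
constexpr std::uint8_t PRIMARY_STATE_UNCONFIGURED = 1;
constexpr std::uint8_t PRIMARY_STATE_INACTIVE = 2;
constexpr std::uint8_t PRIMARY_STATE_ACTIVE = 3;
constexpr std::uint8_t PRIMARY_STATE_FINALIZED = 4;
constexpr std::uint8_t TRANSITION_STATE_SHUTTINGDOWN = 12;

// Hypothetical helper: only attempt the shutdown transition from a
// non-finalized primary state; skip it for finalized and for any
// intermediate transition state such as shuttingdown.
bool should_attempt_shutdown(std::uint8_t current_state_id)
{
  switch (current_state_id) {
    case PRIMARY_STATE_UNCONFIGURED:
    case PRIMARY_STATE_INACTIVE:
    case PRIMARY_STATE_ACTIVE:
      return true;
    default:
      return false;
  }
}
```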

fujitatomoya commented 1 month ago

Would it make sense for this to check whether the node is in one of the primary states before sending the shutdown transition?

i think this makes sense. we cannot change the state when it is in transition state anyway. if that is not in the primary state, we can print the warning that it cannot shutdown during destructor.

Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

the lifecycle node com interface cannot be finalized yet at that point, https://github.com/ros2/rcl/blob/f19067726b9bc2d0b90ac0d0f246d1d7821d41ee/rcl_lifecycle/src/rcl_lifecycle.c#L264, since `std::unique_ptr<LifecycleNodeInterfaceImpl> impl_;` is not destroyed until after LifecycleNode's dtor body runs.

fujitatomoya commented 1 month ago

https://github.com/ros2/rclcpp/pull/2522 reverts https://github.com/ros2/rclcpp/pull/2450; i will try again once the jazzy release is settled.

fujitatomoya commented 1 month ago

Just posting what needs to be done after jazzy release.

fujitatomoya commented 1 month ago

So, continuing from https://github.com/ros2/rclcpp/issues/2520#issuecomment-2099615372, i took a quick look at the implementation.

I've seen the following warning quite a bit when terminating a launched node using ctrl+c.

this could be a different problem, related to destruction order. @oysstu can you provide a minimal example to reproduce the issue?

Unable to start transition 7 from current state shuttingdown: Could not publish transition: publisher's context is invalid, at ./src/rcl/publisher.c:423, at ./src/rcl_lifecycle.c:368

this is NOT because the node is in a transition state. this is because the context is already destroyed before the lifecycle node dtor runs.

this is the same problem that https://github.com/ros2/rclcpp/pull/2522#issue-2286526246 reported. it can be observed with rclcpp_lifecycle/test_lifecycle_publisher, and it is a bug in the test code. rclcpp::shutdown() (TearDown) will be called before LifecycleNode::~LifecycleNode, which means the context will be invalid at node dtor time, so https://github.com/ros2/rclcpp/pull/2450 generates the error because it tries to publish the state after the transition.

1st we need to fix this, https://github.com/ros2/rclcpp/pull/2527

fujitatomoya commented 1 month ago

@oysstu it would be really appreciated if you can take a look at,

oysstu commented 1 month ago

I'm traveling for the next couple of weeks, but I'll try to follow up as best as I can. I noticed now that the rclcpp context is shut down by the signal handler, and not by me after the executor stops spinning. See the following output:

^C[WARNING] [launch]: user interrupted with ctrl-c (SIGINT)
[WARNING] [launch_ros.actions.lifecycle_node]: Abandoning wait for the '/namespace/node/change_state' service response, due to shutdown.
[node_exec-1] [INFO] [1715459945.276584858] [rclcpp]: signal_handler(signum=2)
[node_exec-1] [DEBUG] [1715459945.276609123] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459945.276634450] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459945.276641853] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459945.276674554] [rclcpp]: deferred_signal_handler(): shutting down rclcpp::Context @ 0x5ef029a5b700, because it had shutdown_on_signal == true
[node_exec-1] [DEBUG] [1715459945.276678071] [rcl]: Shutting down ROS client library, for context at address: 0x5ef029a5ba40
[node_exec-1] [DEBUG] [1715459945.276689672] [rcl]: Finalizing publisher
[node_exec-1] [DEBUG] [1715459945.276920120] [rcl]: Publisher finalized
[node_exec-1] [DEBUG] [1715459945.276944746] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[ERROR] [node_exec-1]: process[node_exec-1] failed to terminate '5' seconds after receiving 'SIGINT', escalating to 'SIGTERM'
[INFO] [node_exec-1]: sending signal 'SIGTERM' to process[node_exec-1]
[node_exec-1] [INFO] [1715459950.335149560] [rclcpp]: signal_handler(signum=15)
[node_exec-1] [DEBUG] [1715459950.335167252] [rclcpp]: signal_handler(): notifying deferred signal handler
[node_exec-1] [DEBUG] [1715459950.335197609] [rclcpp]: deferred_signal_handler(): woken up due to SIGINT/SIGTERM or uninstall
[node_exec-1] [DEBUG] [1715459950.335203910] [rclcpp]: deferred_signal_handler(): shutting down
[node_exec-1] [DEBUG] [1715459950.335207287] [rclcpp]: deferred_signal_handler(): waiting for SIGINT/SIGTERM or uninstall
[node_exec-1] [ERROR] [1715459954.328207103] [namespace.node]: Failed to finish transition 1. Current state is now: unconfigured (Could not publish transition: publisher's context is invalid, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl/src/rcl/publisher.c:423, at /tmp/makepkg/ros2-iron-base/src/ros2/src/ros2/rcl/rcl_lifecycle/src/rcl_lifecycle.c:368)

Shutdown is called from here: https://github.com/ros2/rclcpp/blob/343b29b617b163ad72b9fe3f6441dd4ed3d3af09/rclcpp/src/rclcpp/signal_handler.cpp#L271

oysstu commented 1 month ago

Some possible solutions come to mind.

  1. Check if the context/publisher is still valid in the destructor and skip transition if invalid
  2. Register a shutdown callback for the context in the lifecycle node constructor and handle the transitioning there. Remove callback in destructor. This is something I've used to do shutdown transitioning previously.
  3. A combination of the two, to ensure the transition happens whether the context is shut down first or the destructor is called first
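
Option 2 can be illustrated with a self-contained model. The real API would be along the lines of `rclcpp::Context::add_on_shutdown_callback` / `remove_on_shutdown_callback`; `MiniContext` and `DemoLifecycleNode` below are hypothetical stand-ins, not rclcpp code.

```cpp
#include <functional>
#include <list>
#include <utility>

// Hypothetical stand-in for the context's shutdown-callback mechanism.
class MiniContext
{
public:
  using Handle = std::list<std::function<void()>>::iterator;

  Handle add_on_shutdown_callback(std::function<void()> cb)
  {
    // Newest first, so callbacks run in reverse registration order.
    return callbacks_.insert(callbacks_.begin(), std::move(cb));
  }

  void remove_on_shutdown_callback(Handle handle) {callbacks_.erase(handle);}

  void shutdown()
  {
    for (auto & cb : callbacks_) {cb();}
    callbacks_.clear();
  }

private:
  std::list<std::function<void()>> callbacks_;
};

// Hypothetical node: it transitions on context shutdown, and if it is
// destroyed first, it de-registers the callback and transitions from
// the dtor while the context is still valid.
class DemoLifecycleNode
{
public:
  explicit DemoLifecycleNode(MiniContext & ctx)
  : ctx_(ctx),
    handle_(ctx.add_on_shutdown_callback([this] {shutdown_transition();})) {}

  ~DemoLifecycleNode()
  {
    if (!finalized_) {
      ctx_.remove_on_shutdown_callback(handle_);  // no dangling callback
      shutdown_transition();
    }
  }

  bool finalized() const {return finalized_;}

private:
  void shutdown_transition() {finalized_ = true;}

  MiniContext & ctx_;
  MiniContext::Handle handle_;
  bool finalized_ = false;
};
```

This covers both orderings: context shutdown first (callback fires, dtor skips) and destruction first (dtor removes the callback and transitions itself).
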

fujitatomoya commented 1 month ago

I believe that whether the context is valid or invalid, we should clean up the lifecycle node and finalize it, to avoid leaving devices in an unknown state; that is the original issue https://github.com/ros2/rclcpp/issues/997.

adding a shutdown callback for the context would work, but if the user also adds a pre-shutdown callback on the context, the callbacks might not run in the expected order. (we might call Lifecycle::shutdown before the user's cleanup callbacks.)

I think we can call the LifecycleNode::shutdown in the destructor, but publishing state is best effort? (not error, but warning) any thoughts?

oysstu commented 1 month ago

adding a shutdown callback for the context would work, but if user also adds the preshutdown callback on the context, it would not work as expected order. (we might call Lifecycle::shutdown before user's cleanup callbacks.)

Assuming that the shutdown callbacks are called in the reverse order of registration, the callback registered in the constructor would be the last one to be called relative to the lifetime of the node (i.e., any references to the node would be obtained by the user after the constructor). If the shutdown callback handles the shutdown transition on rclcpp::shutdown, and the node destructor handles it otherwise, are not all cases covered? Assuming that the node destructor does nothing if the context/publisher is invalid.

I think we can call the LifecycleNode::shutdown in the destructor, but publishing state is best effort? (not error, but warning) any thoughts?

Hmm, I don't know enough about the lifecycle QoS and its intended design to comment on this.

fujitatomoya commented 1 month ago

Assuming that the shutdown callbacks are called in the reverse order of registration,

this is true.

i was thinking that a user might do something like: keep pointers to nodes, and reset those pointers in the context's pre-shutdown hook to clean everything up since the context is shutting down. and then this LifecycleNode system callback would be accessing invalid memory? that was my concern off the top of my head...

maybe there could be more ideas, let's keep this open for the fix! thanks for iterating, i will try to bring this issue up at a WG meeting.

oysstu commented 1 month ago

i was thinking that user would do something like, have the pointers for nodes, and reset those pointers in context shutdown pre-shutdown hook to clean everything up since the context is shutting down. and then this LifecycleNode system callback would be calling the invalid memory segment?

I've used a weak_ptr to refer to the node without extending its lifetime. This only works when the node is managed by a shared_ptr, though. If the lifecycle interface were instead defined in a NodeLifecycleInterface class and contained in the node as a shared_ptr, this could be guaranteed (i.e., similar to the other interfaces such as NodeBaseInterface).

I've also de-registered the context shutdown callback in the lifecycle destructor, so that also should avoid dangling callbacks. That is good if an application creates and deletes a bunch of nodes. This alone is not sufficient though, since there could be a race condition between the destructor and shutdown callback (which is why a shared_ptr/weak_ptr is needed).
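
The weak_ptr idea can be sketched like this. Self-contained sketch with hypothetical names (`DemoNode`, `make_shutdown_callback`), assuming the node is owned by a shared_ptr:

```cpp
#include <functional>
#include <memory>

// Hypothetical node with a shutdown transition.
struct DemoNode
{
  bool finalized = false;
  void shutdown_transition() {finalized = true;}
};

// The context shutdown callback holds only a weak_ptr, so it neither
// extends the node's lifetime nor touches a destroyed node: if the
// destructor ran first, lock() fails and the callback is a no-op.
std::function<void()> make_shutdown_callback(const std::shared_ptr<DemoNode> & node)
{
  std::weak_ptr<DemoNode> weak = node;
  return [weak]() {
      if (auto locked = weak.lock()) {
        locked->shutdown_transition();
      }
    };
}
```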

fujitatomoya commented 1 month ago

@oysstu just fyi,

Only attempt transition in primary states unconfigured, inactive, and active. Shutdown transition is attempted for intermediate transitions (e.g., shutting down).

this original problem should now be fixed in the LifecycleNode class.

https://github.com/ros2/rclcpp/issues/2520#issuecomment-2106028772 the signal case can be discussed more here. (when the context is shut down gracefully by the deferred signal thread)

fujitatomoya commented 1 month ago

I end up having https://github.com/ros2/rclcpp/pull/2545 as follow up for https://github.com/ros2/rclcpp/pull/2528.

the following backport PRs need to include https://github.com/ros2/rclcpp/pull/2545, so i am holding them.

g-arjones commented 3 weeks ago

I believe that either context is valid or invalid, we should clean up the lifecycle node to finalize it to avoid leaving the devices unknown state, that is the original issue https://github.com/ros2/rclcpp/issues/997.

As demonstrated in #2553, that issue cannot be addressed from the dtor, because the subclasses' dtors are called before the base class's, and it's illegal to call a subclass's methods after its dtor has run. So I think the proper way to address it would be to use the context's shutdown callback, as suggested by @oysstu, which is what we have been using (and which is completely broken now with that change).

fujitatomoya commented 3 weeks ago

I think the proper way to address that would be to https://github.com/ros2/rclcpp/issues/2520#issuecomment-2106080946 which is what we have been using (and is completely broken now with that change)

@g-arjones thanks for the information and issue. in that case, what about when the context is still valid? can we just leave the device or sensor in an unknown state until the context is destroyed?

g-arjones commented 3 weeks ago

I think that's the best that can be done by the base class (that was actually what the author of the feature request was referring to, since he mentioned CTRL+C). If more fine-grained control of the state is required, then applications must handle the transitions themselves.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains #2554).
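
The C++ rule being referenced here: during a base-class destructor, the object's dynamic type has already reverted to the base class, so a virtual call resolves to the base override rather than the subclass's. A minimal illustration with hypothetical names:

```cpp
#include <string>

std::string last_call;  // records which override actually ran

struct Base
{
  virtual std::string on_shutdown() {return "Base::on_shutdown";}

  virtual ~Base()
  {
    // By the time ~Base runs, ~Derived has already finished and the
    // dynamic type is Base again: this does NOT reach
    // Derived::on_shutdown.
    last_call = on_shutdown();
  }
};

struct Derived : Base
{
  std::string on_shutdown() override {return "Derived::on_shutdown";}
};
```

This is why a shutdown transition driven from the LifecycleNode base dtor can never invoke the user's on_shutdown override.
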

oysstu commented 3 weeks ago

@g-arjones thanks for the information and issue. in that case, what about when the context is still valid? can we just leave the device or sensor in an unknown state until the context is destroyed?

The C++ RAII paradigm would suggest that since the subclass initializes the resource, it is up to the subclass dtor to handle releasing the resource properly. On a more distributed level, we use a manager node to handle transitioning when not terminating through exceptions or CTRL+C (e.g. something like the nav2 lifecycle manager). For us, the nominal shutdown pattern used is deactivate-cleanup-shutdown unless one of the intermediate transitions fail, in which case the direct shutdown transition is used (e.g. active-shutdown).
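
That RAII pattern, sketched with hypothetical names: the subclass that acquires the device triggers the transition in its own destructor, while its members and overrides are still intact, with the direct shutdown path as a fallback when the nominal deactivate-cleanup-shutdown sequence did not run.

```cpp
int devices_released = 0;  // global, only so the dtor's effect is observable

// Hypothetical lifecycle-node subclass owning a device: it guarantees
// the shutdown transition itself, in its own dtor, where calling its
// own methods is still well-defined.
struct SensorNode
{
  bool shut_down = false;

  void on_shutdown()
  {
    ++devices_released;  // release the device exactly once
    shut_down = true;
  }

  ~SensorNode()
  {
    // If no manager node drove deactivate-cleanup-shutdown, fall back
    // to the direct shutdown path here.
    if (!shut_down) {on_shutdown();}
  }
};
```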

g-arjones commented 3 weeks ago

Same here :+1:

g-arjones commented 3 weeks ago

Just to expand on that: since it's the subclasses that are initializing/handling devices and sensor states, they are the ones that should be responsible for ensuring the states are not unknown (in their dtors and/or on_shutdown)

fujitatomoya commented 3 weeks ago

@oysstu @g-arjones @gabrielfpacheco thanks for the discussion and opinions.

all comments here make sense to me.

Calling a subclass method from a base class destructor is just not possible in C++ (which also explains https://github.com/ros2/rclcpp/issues/2554).

this is also true.

Just to expand on that, since it's the subclasses that are initializing/handling devices and sensor states they are the ones that should be responsible for ensuring the states are not unknown (in their dtors and/or on_shutdown)

hmm, so that is the user's responsibility, and the base class never calls shutdown.

i am not sure at this moment. i am okay with rolling back everything, but before that i can bring this up with the other maintainers.

g-arjones commented 3 weeks ago

i am okay to roll back everything but before that i can bring this up with other maintainers.

Thank you, I would really appreciate that. This change created a lot of problems for us.

fujitatomoya commented 3 weeks ago

yeah, sorry that happened. i really appreciate your feedback and comments.

g-arjones commented 3 weeks ago

No reason to apologize. Keep up the good work! :+1:

fujitatomoya commented 3 weeks ago

@oysstu @g-arjones @gabrielfpacheco CC: @mjcarroll @wjwwood @clalancette

so i came to the conclusion that i will roll back calling shutdown in the LifecycleNode dtor, including all backports.

the reason is pretty much what we discussed here. we are generating another complication (hard to anticipate, and one the user cannot avoid: https://github.com/ros2/rclcpp/issues/2554) to address the one issue (https://github.com/ros2/rclcpp/issues/2553#issuecomment-2155380727). until we can figure out more fine-grained control for user applications, we should roll back the behavior so that users can be responsible for lifecycle management.

so at this moment, it is the user application's responsibility to make sure the device or sensor is shut down, to avoid leaving it in an unknown state. (either by calling shutdown in the subclass or by using a context shutdown callback.)

LifecycleNode could register a shutdown callback for the context in the lifecycle node constructor and handle the shutdown there. but i think this only works if the node is managed by a shared_ptr, so that the callback can use a weak_ptr to check whether the object is still valid. i think it would be simpler to leave this to the user application instead of providing something that sometimes works and sometimes does not.

as a follow-up, i guess something we can improve is a doc section saying that the user needs to call shutdown if necessary, plus a warning message that checks the current state and, if the node is not shut down, warns the user that the LifecycleNode was not shut down... to let the user know this.

what do you think? any concerns and thoughts?

again, thanks for pointing out my oversight and sharing!

g-arjones commented 3 weeks ago

@fujitatomoya Thank you for the update :+1:

i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

Yeah, that's what makes the most sense to me too.

I guess something we can improve is doc section that user needs to call shutdown if necessary

Definitely...

and warning message just to check the current state and if that is not shut down, we can warn the user that LifecycleNode is not shut down, ... to let the user know this.

Yeah, I guess that's fine. I'm just not sure which log level that should go in, since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine and even common practice in many examples and tutorials.

fujitatomoya commented 3 weeks ago

I'm just not sure which log level that should go in, since for many applications not finalizing a node before terminating the process (or destroying its instance) is perfectly fine and even common practice in many examples and tutorials.

i guess DEBUG. i am not sure if any user application intentionally exits without shutting down, but i understand your concern that a warning could be surprising for users all of a sudden.

oysstu commented 3 weeks ago

I think this is the correct move right now. The context shutting down before the node is cleaned up points to either a programming/logic error or something exceptional happening. It might be best to leave it to the user to decide the actions to take.

LifecycleNode can register a shutdown callback for the context in the lifecycle node constructor and handle the shutdown there. but i think this only works if the node is managed by shared_ptr, so that the context can use weak_ptr to see if the object is still in valid. i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

As I mentioned, it would be possible to do this if the lifecycle functionality were implemented in a base interface contained in a shared pointer in the lifecycle class. I believe implementing this would enable some generic programming: treating the lifecycle node like a regular node through its base interfaces, while still having the additional lifecycle interface for the lifecycle-specific functionality.

Right now, the lifecycle API is a weird mix of functional programming and inheritance. I think it would be better to have the lifecycle class intended for use with inheritance, and disallow the direct callback functions. If the lifecycle functionality was implemented in a base interface it could be added to any node if users prefer the functional style.

gabrielfpacheco commented 3 weeks ago

I appreciate your effort for looking into this, @fujitatomoya.

i think it would be simpler to leave this to user application instead of providing something sometimes works but sometimes does not.

It makes sense to me that the user is the one responsible for rightfully finalizing the node since this may be application-specific.

Rolling back and improving documentation seems the right move for me as well.

fujitatomoya commented 3 weeks ago

@oysstu @g-arjones @gabrielfpacheco thanks for the quick response.

i will start reverting, i will try some doc and debug print follow-up after reverting.

fujitatomoya commented 3 weeks ago

rollback PRs,

to close the following issues,

fujitatomoya commented 3 weeks ago

see https://github.com/ros2/rclcpp/issues/2520#issuecomment-2159253427, all related PRs are rolled back and merged.

i will go ahead and close this issue; feel free to reopen if i missed anything.

fujitatomoya commented 3 weeks ago

@oysstu @g-arjones @gabrielfpacheco just FYI, this is closed.

g-arjones commented 3 weeks ago

@fujitatomoya Thanks a lot! Any idea when the PR to rosdistro will be open?

fujitatomoya commented 3 weeks ago

i am not sure about that. you are talking about humble, right? CC: @audrow

g-arjones commented 3 weeks ago

you are talking about humble, right?

Exactly...

fujitatomoya commented 3 weeks ago

@audrow @clalancette do we have any plan for a humble sync?

audrow commented 2 weeks ago

@audrow @clalancette do we have any plan for a humble sync?

I'll be doing syncs every two weeks or so. (A sync is in progress right now.)

As for a patch, I'm going to aim for the week of July 22nd. Let me know if I should do a patch sooner.

fujitatomoya commented 2 weeks ago

@audrow thanks for sharing the plan. CC: @g-arjones @gabrielfpacheco

g-arjones commented 2 weeks ago

I would really appreciate if this could be released sooner. It's an important bug fix (kind of a critical one, at least for us). Is that feasible?

ros-discourse commented 2 weeks ago

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/new-packages-for-humble-hawksbill-2024-06-20/38277/2

fujitatomoya commented 2 weeks ago

@audrow can you confirm https://discourse.ros.org/t/new-packages-for-humble-hawksbill-2024-06-20/38277/2 ?

g-arjones commented 2 weeks ago

It looks like rclcpp wasn't included, unfortunately. Still broken...

fujitatomoya commented 2 weeks ago

@g-arjones i had a chat with @audrow; that was just a sync, so community packages only. the next patch is scheduled for mid-July at this moment.

g-arjones commented 1 week ago

I would have expected something like this to have a higher priority but I guess we will have to wait. Thank you for the notice!