ros2 / design

Design documentation for ROS 2.0 effort
http://design.ros2.org/
Apache License 2.0
215 stars 194 forks source link

Lifecycle primary state error transitions #283

Open thebyohazard opened 4 years ago

thebyohazard commented 4 years ago

The purpose of this PR is clarify where error transitions from primary states should be in the node lifecycle design. The original conversation is over in rcl_interfaces PR#97. In that conversation, I suggested having an additional transition from Inactive to ErrorProcessing, having in mind a node that connects to external hardware. If there is an error in the external hardware when the node is inactive, then there is error recovery code that needs to be executed, and a transition to ErrorProcessing is the natural choice for that. I have edited the diagrams and the article to show and describe such a transition.

Also in the other PR, @tfoote suggested we discuss here whether or not there should also be a transition from Unconfigured to ErrorProcessing:

Originally there wasn't expected to be anything happening in the inactive state or configured states, but there may be other threads or activity that could cause transitions (such as hardware errors) that would change the status. So to that end adding a transition from Unconfigured to ErrorProcessing would also make sense since something could go wrong in that state too, potentially recoverable with some error recovery.

The reason I didn't include this at first in the other PR was that I have been designing my lifecycle nodes to where nothing happens in the node until configuration and no connections to the hardware are made until configuration, so it was not possible to have an error occur in the unconfigured state. Also, the design says regarding the Unconfigured state,

In this state there is expected to be no stored state

So my contrived example of a way such a transition could maybe be undesirable: if the onError transition expects some object to be filled out, it may not be filled out if the node isn't configured yet. But to that I'd say that an onError transition should probably check that anyway. Besides, it's the code in the lifecycle node that would raise the error in the first place, so if you don't need to raise an error, don't raise it.

On the argument for the transition, the design currently says regarding the ErrorProcessing state:

It is possible to enter this state from any state where user code will be executed.

And I think there definitely could be some kind of user code being executed in the Unconfigured state if a node has some combination of lifecycle components and also non-lifecycle components. The non-lifecycle components could always be running even when the node is unconfigured and then cause the errors.

So I guess my vote is to add the Unconfigured to ErrorProcessing error transition, but I honestly haven't delved too deeply into lifecycle to know what else it might break.

thebyohazard commented 4 years ago

One other thing I noticed: the original diagram doesn't show transitions for onDeactivate[FAILURE], onCleanup[FAILURE], onShutdown[FAILURE], or onError[Error Raised], which are in the implementation. Shall I add them to this PR, or should that be a separate branch/PR for documentation purposes?

fujitatomoya commented 4 years ago

@thebyohazard

Shall I add them to this PR, or should that be a separate branch/PR for documentation purposes?

I believe those should be added along with implementation.

sloretz commented 4 years ago

@Karsten1987 @wjwwood friendly ping :)

thebyohazard commented 4 years ago

@fujitatomoya I agree with you that a shutdown failure should go back to the previous state to retry. You could be relying on the shutdown code to execute and that can't be guaranteed right now. However, this and the related PRs are enhancements, but making that change to the implementation could potentially be a breaking change for existing systems. If we were to fix it, I definitely think a separate PR in rcl_lifecycle should be opened other than the one that adds the error transitions. I also don't like the idea of the design not matching the implementation, so I would vote to leave the design matching the current implementation and making another set of PRs to work on fixing shutdown failure to have the expected behavior.

ivanpauno commented 3 years ago

@Karsten1987 @wjwwood friendly ping

fujitatomoya commented 3 years ago

related PRs based on this design,

I think that some follow-up will be required (rebase, requested changes, and test code is missing), if @thebyohazard is not responding and nobody working on this, I'm willing to do this.

UomoJallo commented 2 years ago

@fujitatomoya @Karsten1987 @wjwwood any update about this PR? Or what do you suggest in case of a detected error while in ACTIVE state?

Should the node try first to deactivate itself and then cleanup to reach the UNCONFIGURED state? Even if it should be better to avoid self transitions...

fujitatomoya commented 2 years ago

could be related to https://github.com/ros2/rclcpp/issues/1880 (just leaving the reference)

ros-discourse commented 1 year ago

This pull request has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/deferrable-canceleable-lifecycle-transitions/32318/1