Layered handling of node and (sub-)system errors

chcorbato commented 4 years ago

from #47

This is in the context of our example case, the laser_driver error. We want to elaborate on the layered approach we discussed in the last MROS meeting. This is how I interpret our desired design (please comment if something is not correct or clear):

1. First, the laser_driver's own error-handling code tries to recover from the error in the ErrorProcessing transition state.

(from here it is a related but different issue)

2. If it does not succeed (I guess that means the node does not transition to Active), the ModeManager tries to recover from the error using the feature/rules. For this, @jginesclavero is adding a rule in the SystemModes file of our system.

3. If there is no rule, or there is one but the alternative MODE(s) of the laser_driver are still not reached after applying it, the ModeManager reports to the MROS Metacontroller that the corresponding (sub)system(s) MODE(s) are not reachable. (see issue for the continuation of the handling of errors at the higher layers)
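The three layers above can be read as an escalation chain. The following is only a minimal sketch of that idea in plain Python, not system_modes code: the layer names and the `handle_error` helper are hypothetical, and each handler is a stand-in that simply returns whether that layer resolved the error.

```python
from typing import Callable, List, Tuple

# Each layer is a (name, handler) pair; a handler returns True if it
# resolved the error. Names and handlers are hypothetical stand-ins.
Layer = Tuple[str, Callable[[], bool]]


def handle_error(layers: List[Layer]) -> str:
    """Try each layer in order and return the name of the first one
    that resolves the error, or "unhandled" if none does."""
    for name, recover in layers:
        if recover():
            return name
    return "unhandled"


# Illustrative outcome: node-level recovery fails, the mode rule succeeds,
# so the report to the metacontroller is never needed.
layers = [
    ("node_error_processing", lambda: False),
    ("mode_manager_rule", lambda: True),
    ("report_to_metacontroller", lambda: True),
]
resolved_by = handle_error(layers)
```

Keeping the layers in an ordered list makes the escalation order explicit and lets a layer be added or dropped without touching the others.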

continuation

Currently this will be implemented in a passive way, by offering that information (see https://github.com/micro-ROS/system_modes/issues/43). But since the current target MODE cannot be reached, we were wondering (in a discussion with TUD and URJC) whether the ModeManager should report this actively and system-wide, so that the operator or any supervisory system (e.g. MROS Metacontroller) can handle it.

Proposal: Since not being able to reach the target MODE is a deviation from the expected and desired behaviour, we propose that the ModeManager uses diagnostics to report this. The MROS Metacontroller will subscribe to such diagnostic messages. (@fmrico @jginesclavero @marioney please comment if I missed something or did not convey it correctly)
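To make the proposal concrete, here is a rough sketch of what such a diagnostic report might carry. It uses a plain dataclass as a stand-in for a `diagnostic_msgs/DiagnosticStatus` message so that it runs without ROS. The level constants mirror that message type, but the `ModeDiagnostic` class, the `mode_deviation_status` helper, and the key names `target_mode` / `actual_mode` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict

# Level values mirror diagnostic_msgs/DiagnosticStatus (OK=0, WARN=1, ERROR=2).
OK, WARN, ERROR = 0, 1, 2


@dataclass
class ModeDiagnostic:
    """Plain stand-in for a DiagnosticStatus message (not the ROS type)."""
    level: int
    name: str
    message: str
    values: Dict[str, str] = field(default_factory=dict)


def mode_deviation_status(part: str, target: str, actual: str) -> ModeDiagnostic:
    """Build a status describing whether `part` reached its target mode."""
    if target == actual:
        return ModeDiagnostic(OK, part, "target mode reached")
    return ModeDiagnostic(
        ERROR,
        part,
        "target mode not reachable",
        # Hypothetical key names; a subscriber would need to agree on them.
        {"target_mode": target, "actual_mode": actual},
    )


status = mode_deviation_status("laser_driver", "NORMAL", "DEGRADED")
```

A subscriber such as the metacontroller could then filter on `level` and read the target and actual modes from `values`.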

What do you think @norro ?

norro commented 4 years ago

What the mode manager actually already senses is the deviation between the requested state/mode and the actual state/mode. This is not yet merged to master, but it is available in the feature/rules branch, because it is necessary in order to decide when to apply rules. See feature/rules:mode_inference.cpp. Reporting these deviations to diagnostics is an interesting idea.

This is again a question of timing, though. When a state/mode transition is requested, there is always and immediately a deviation, since systems/nodes will take some time to perform the transition. So the mode manager will have to decide when to report the deviation, i.e. when to assume that the transition is taking too long and the deviation can therefore be considered an erroneous deviation. Do you have an idea how/when to do this? After half a second? A second? ... @chcorbato

norro commented 4 years ago

Suggestion:

1. When a deviation is detected, wait a certain time t_0 before considering it an erroneous deviation.
2. After t_0, try to apply a rule, if an appropriate rule exists. If no rule exists, try to recover the node/system.
3. Wait a certain time t_1 and, if nothing happened, report the erroneous deviation, e.g., through diagnostics.

(t_0 and t_1 have to be configurable obviously) /cc @chcorbato @ralph-lange
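The suggested timing could be sketched as a small decision function. This is only an illustration under stated assumptions: the `Action` enum and `next_action` are hypothetical names, `elapsed` is measured from the moment the deviation was first detected, and t_1 is assumed to start counting when t_0 expires. The values 0.5 and 1.0 seconds are placeholders, not recommendations.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                                     # no deviation
    WAIT = "wait"                                     # transition may still be in progress
    APPLY_RULE_OR_RECOVER = "apply_rule_or_recover"   # step 2 of the suggestion
    REPORT = "report"                                 # step 3: e.g. through diagnostics


def next_action(elapsed: float, t_0: float, t_1: float, deviating: bool) -> Action:
    """Decide what to do `elapsed` seconds after a deviation was detected.

    Assumes t_1 starts when t_0 expires. The caller is expected to trigger
    the rule/recovery only once, even if this is polled repeatedly.
    """
    if not deviating:
        return Action.NONE
    if elapsed < t_0:
        return Action.WAIT
    if elapsed < t_0 + t_1:
        return Action.APPLY_RULE_OR_RECOVER
    return Action.REPORT


# Placeholder configuration values, in seconds.
T_0, T_1 = 0.5, 1.0
early = next_action(0.2, T_0, T_1, deviating=True)
late = next_action(2.0, T_0, T_1, deviating=True)
```

Passing t_0 and t_1 as parameters keeps them configurable per node or (sub-)system, as required above.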

chcorbato commented 4 years ago

I very much like your suggestion of a configurable time limit for each management layer!

Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?

norro commented 4 years ago

@chcorbato The feature/rules branch is merely a micro-ROS experiment for now, btw. For "2. If it does not succeed [...] tries to recover from the error using rules" I consider metacontrol (reconfiguration actions?) to be in charge. We are even happy to drop the system modes rules feature completely once the metacontrol part for this task is integrated with system modes.

chcorbato commented 4 years ago

I see. Currently @jginesclavero is trying to get results with that feature this week, by adding such a rule to the Pilot-URJC system model.

I propose we keep this test for this week and analyse the result afterwards (usefulness, problems...) to then make an informed decision to move the feature to the metacontrol part.

What do you think @norro @jginesclavero ? @norro are you available to keep supporting @jginesclavero on this today and tomorrow?

norro commented 4 years ago

Yes, I am available today and tomorrow to help with upcoming issues.

jginesclavero commented 4 years ago

Hi @norro @chcorbato ! I was testing the feature/rules branch and it works as we expected. In short, I have defined a rule that changes to DEGRADED mode (navigation with pointcloud_to_laser) if the laser_driver is not in the active state. The mode is changed immediately and it works really nicely. I have done some navigation tests where I force a laser failure: the mode changes correctly, the laser is replaced by the pointcloud, and the navigation continues.
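The logic of the rule described above can be sketched as follows. This is not the rule syntax of the feature/rules branch. It is a plain Python illustration in which the `Rule` class, its field names, and `select_mode` are hypothetical. A node whose state is unknown is treated as not active.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Rule:
    part: str            # node being watched, e.g. "laser_driver"
    required_state: str  # state the node is expected to be in
    fallback_mode: str   # system mode to request otherwise


def select_mode(rules: List[Rule], states: Dict[str, str]) -> Optional[str]:
    """Return the fallback mode of the first violated rule, or None
    if every watched node is in its required state."""
    for rule in rules:
        # A node missing from `states` counts as not being in the required state.
        if states.get(rule.part) != rule.required_state:
            return rule.fallback_mode
    return None


rules = [Rule(part="laser_driver", required_state="active", fallback_mode="DEGRADED")]
mode_on_failure = select_mode(rules, {"laser_driver": "inactive"})
mode_when_healthy = select_mode(rules, {"laser_driver": "active"})
```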

marioney commented 4 years ago

Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?

From the metacontroller's point of view, the reasoning cycle is very slow (about 2 sec), so we're safe with half of that, I guess. I'm not sure how that time affects navigation2, but I'm guessing it does not.

norro commented 3 years ago

Closing this issue soon as it has successfully been shown in the MROS pilots.

micro-ROS / system_modes

Layered handling of node and (sub-)system errors #48