osrf / capabilities

Implements the concept of capabilities as part of the robots-in-concert system.
Other
8 stars 26 forks source link

Fix stop capability exceptions #70

Closed wjwwood closed 10 years ago

wjwwood commented 10 years ago

This should address #68.

@bit-pirate can you test this branch and see if it addresses the issue for you?

I tried out your scenario and the PID problems have gone away.

I can give a postmortem of sorts. So the problem was that I was calling the internal stop on capabilities which had already been stopped because I didn't consider the fact that stopping one of a capability's dependents (reverse dependency) could also stop other dependents in the list. Therefore I would call stop on the first dependent and then when that was done call it on the second dependent, but the second dependent had already been stopped by the first. So the fix was just to add a check inside the loop to make sure the capability I am about to stop is already stopped. Since calling stop on the capability which had already been stopped raised an error it prevented the rest of the dependents from being shutdown, resulting in the inconsistent state.

I believe this pull request addresses that issue, however, I was still getting this in my testing:

[INFO] [WallTime: 1400897287.546466] Request to free usage of capability 'std_capabilities/RGBDSensor' (bond id '26891f9e-e2e8-11e3-801f-902b34db3493')
[ERROR] [WallTime: 1400897287.546847] RuntimeError: No Capability 'std_capabilities/RGBDSensor' in use
[ERROR] [WallTime: 1400897287.547099] Rapp Manager : Failed to stop capabilities. Errors: ["Error occurred while stopping capability 'std_capabilities/RGBDSensor': Exception raised while stopping cap 'std_capabilities/RGBDSensor': service [/capability_server/free_capability] responded with an error: error processing request: No Capability 'std_capabilities/RGBDSensor' in use"]

I believe what is happening here is that the app manager is calling free on std_capabilities/DifferentialMobileBase and that is causing std_capabilities/RGBDSensor to implicitly shutdown:

[INFO] [WallTime: 1400897268.640145] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id '26891f9e-e2e8-11e3-801f-902b34db3493')
[INFO] [WallTime: 1400897268.640661] Capability 'std_capabilities/DifferentialMobileBase' being stopped because it has zero references
[INFO] [WallTime: 1400897268.641024] Capability 'turtlebot_capabilities/robot_state_publisher' is being stopped because it's dependency 'std_capabilities/DifferentialMobileBase' has been stopped.
[INFO] [WallTime: 1400897268.641341] Capability 'turtlebot_capabilities/turtlebot2_bringup' is being stopped because it's dependency 'std_capabilities/RobotStatePublisher' has been stopped.
[INFO] [WallTime: 1400897268.641646] Capability 'turtlebot_capabilities/rgbd_sensor' is being stopped because it's dependency 'turtlebot_capabilities/TurtleBotBringup' has been stopped.

Then later the app manager is calling free on the std_capabilities/RGBDSensor capability which was already stopped, resulting in this:

[INFO] [WallTime: 1400897287.546466] Request to free usage of capability 'std_capabilities/RGBDSensor' (bond id '26891f9e-e2e8-11e3-801f-902b34db3493')
[ERROR] [WallTime: 1400897287.546847] RuntimeError: No Capability 'std_capabilities/RGBDSensor' in use
[ERROR] [WallTime: 1400897287.547099] Rapp Manager : Failed to stop capabilities. Errors: ["Error occurred while stopping capability 'std_capabilities/RGBDSensor': Exception raised while stopping cap 'std_capabilities/RGBDSensor': service [/capability_server/free_capability] responded with an error: error processing request: No Capability 'std_capabilities/RGBDSensor' in use"]
bit-pirate commented 10 years ago

After a week full of meetings and talks I finally got around testing this and I can confirm your results.

Now, how should we solve the rapp manager problem?!

  1. parse the raised exception and treat this specific error as no error -> don't like this solution - error-prone due to string parsing
  2. don't throw an exception when freeing a not running capability -> not sure, if this is good either
  3. return error codes in the free_capability service -> more robust then 1.; could be a solution
  4. rapp manager checks which capabilities are running before calling free_capability -> quite an overhead

Do you see another way handling this? Which one would be your preferred choice? I'm slightly favouring 3.

wjwwood commented 10 years ago

Sorry for the latency on my response, I actually think option 1 is not too bad, I agree that it is more fragile than option 3, but the type of the Exception should not change and is in the string version of the error, so I think that would be ok to parse.

I think option 2 is no good, because it makes detecting that case very difficult.

Option 4 is also not great because the capability_server is doing that check for you anyways.

One other thought is that we can have two versions of the free_capability service, one that raises an exception and one that doesn't. That would allow the rapp_manager to free without worry, but allow others to detect the case where they are freeing something that should not be.

I will try to come up with a solution and roll that into the capabilities Client API. Is the rapp_manager using the Client API or calling the service directly?

bit-pirate commented 10 years ago

One other thought is that we can have two versions of the free_capability service, one that raises an exception and one that doesn't. That would allow the rapp_manager to free without worry, but allow others to detect the case where they are freeing something that should not be

Maybe instead of having two services, we add an optional parameter to the free_capability service?

I will try to come up with a solution and roll that into the capabilities Client API. Is the rapp_manager using the Client API or calling the service directly?

The rapp manager is using the Client API.

wjwwood commented 10 years ago

Ok, @bit-pirate I have pushed 5647b5f which added some custom exceptions to the client API, which should be more reliable for you to catch selectively. I am doing string parsing, but since the client API and the ROS service API should always be in lock-step I think that is ok. I added a note at the point where the exception is raised to indicate the message format was being used in the client module.

What do you think of this solution?

bit-pirate commented 10 years ago

@wjwwood sorry for this major delay in my response.

I have adjusted the app manager code to adhere to your newest changes and it looks good. The only comment I have is a cosmetic one.

When shutting done the teleop rapp, the capability server shows this message as an error:

[ERROR] [WallTime: 1403576239.079705] RuntimeError: Cannot free Capability 'std_capabilities/RGBDSensor', because it is not running

Could we reduce this to a warning, since this is an expected situation and does not need to be erroneous?

wjwwood commented 10 years ago

I downgraded it to a warning in 2e67044, though I'm not 100% sure it makes sense. I'm happy with it if you guys are.

bit-pirate commented 10 years ago

though I'm not 100% sure it makes sense

Good point. Ideally, the client should not get into the situation where it asks to stop/free an already stopped/freed cap.

In the current case the rapp rocon_apps/robot_teleop depends on both the RGBDSensor and the DifferentialMobileBase cap. However, down the dependency tree of the RGBDSensor cap there is also the dependency on the DifferentialMobileBase cap. Stopping the DifferentialMobileBase cap would now also stop all caps depending on it. And here there might be an error.

The app manager does not stop, but free the capabilities. So when the rapp is stopped, it first frees the DifferentialMobileBase cap.

[INFO] [WallTime: 1403701919.301008] Rapp Manager : stopped rapp [rocon_apps/teleop]'.
[INFO] [WallTime: 1403701919.301422] Rapp Manager : Stopping required capabilities.
[INFO] [WallTime: 1403701919.306375] Request to free usage of capability 'std_capabilities/DifferentialMobileBase' (bond id 'eb8e9780-fc63-11e3-a9ca-f46d04929542')

And the cap server responds:

[INFO] [WallTime: 1403701919.306902] Capability 'std_capabilities/DifferentialMobileBase' being stopped because it has zero references

However, at this point the RGBDSensor cap should still have an indirect dependency on it.

[INFO] [WallTime: 1403701919.307257] Capability 'turtlebot_capabilities/robot_state_publisher' being stopped because its dependency 'std_capabilities/DifferentialMobileBase' is being stopped.
[INFO] [WallTime: 1403701919.307622] Capability 'turtlebot_capabilities/turtlebot2_bringup' being stopped because its dependency 'std_capabilities/RobotStatePublisher' is being stopped.
[INFO] [WallTime: 1403701919.307938] Capability 'turtlebot_capabilities/rgbd_sensor' being stopped because its dependency 'turtlebot_capabilities/TurtleBotBringup' is being stopped.
[INFO] [WallTime: 1403701921.447684] Capability Provider 'turtlebot_capabilities/rgbd_sensor' for Capability 'std_capabilities/RGBDSensor' has terminated.
[INFO] [WallTime: 1403701922.014306] Capability Provider 'turtlebot_capabilities/turtlebot2_bringup' for Capability 'turtlebot_capabilities/TurtleBotBringup' has terminated.
[INFO] [WallTime: 1403701922.014992] Stopping the 'turtlebot_capabilities/diagnostics' provider of the 'std_capabilities/Diagnostics' interface, because it has no dependents left.
[INFO] [WallTime: 1403701922.015352] Rapp Manager : Stopped required capability 'std_capabilities/DifferentialMobileBase'.
[INFO] [WallTime: 1403701922.019269] Request to free usage of capability 'std_capabilities/RGBDSensor' (bond id 'eb8e9780-fc63-11e3-a9ca-f46d04929542')
[WARN] [WallTime: 1403701922.019799] RuntimeError: Cannot free Capability 'std_capabilities/RGBDSensor', because it is not running
[INFO] [WallTime: 1403701922.020218] Rapp Manager : Stopped required capability 'std_capabilities/RGBDSensor'.
[INFO] [WallTime: 1403701922.020605] Rapp Manager : All required capabilities have been stopped.

Do you have a test case covering this situation?

With this in mind it might be better to revert this exception to an error. Currently can't think of a situation, where freeing a capability would throw this exception and it would be expected ... then again its already late over here. :-)

wjwwood commented 10 years ago

I don't know, I need to think about it in more detail.

In the mean time should we merge this and open a new issue to track this additional discussion? That way I can make a new release which contains the actual bug fixes in this pull request?

bit-pirate commented 10 years ago

In the mean time should we merge this and open a new issue to track this additional discussion? That way I can make a new release which contains the actual bug fixes in this pull request?

:+1: