Closed jonbinney closed 9 years ago
Two options, one is that you could use the /capability_server/event
topic, but the other is that we just add a restart_capability
service, which may or may not be easier to implement.
See: https://github.com/osrf/capabilities/blob/master/src/capabilities/launch_manager.py#L242-248
Yeah I saw that event, but there's a race condition still, right? The instant that terminated event is published, the capability instance is still in the __capability_instances
variable of the server. So if you immediately call the start_capability service, it will fail silently. There's an unknown (probably tiny, but still unknown) delay between when the terminate event gets published and when the server actually gets it and removes the capability instance.
@jonbinney That's true, we could also add an additional event, which is fired after terminate which is only used by external parties.
Basically you could detect if it was going to be removed here:
https://github.com/osrf/capabilities/blob/master/src/capabilities/server.py#L547-551
And then publish another event message a little further down, after the self.__update_graph()
:
https://github.com/osrf/capabilities/blob/master/src/capabilities/server.py#L553
An additional event would work, but it might be also be good to modify the start_capability to return failure code if it doesn't actually start a capability for any reason. Then the client can retry.
Yeah, I think you're right, it should return something if it fails to add it, question is raise or return a bool or return a enumerated return code of some sort.
The service definition for StartCapability does have a bool return flag:
string capability
string preferred_provider
---
bool successful
I think a return code with a little information would be good:
string capability
string preferred_provider
---
uint8 result
uint8 RESULT_SUCCESS=0
uint8 RESULT_ALREADY_RUNNING=1 # This capability is currently running
uint8 RESULT_ALREADY_STOPPED=2 # This capability is in the process of stopping
uint8 RESULT_ALREADY_TERMINATED=3 # This capability has been terminated, but not yet removed from the graph
but pretend i didn't define all of them as zero since that makes no sense...
:lollipop:
I don't know what you're talking about :wink:
Maybe:
RESULT_SUCCESS
RESULT_RUNNING
RESULT_STOPPING
RESULT_TERMINATED
?
That works. @wjwwood if that sounds reasonable I can put together a PR
@jonbinney sounds reasonable, that would be great thanks.
Now that start_capability returns an error in this case, clients can handle it.
We have the need to restart capabilities sometimes, and the way we do it is:
__capability_instances
here: https://github.com/osrf/capabilities/blob/c6c71a9ac17e310787242f5b75bd1e4d86349844/src/capabilities/server.py#L693Alternatively, we could try listening to the events topic for a terminate event, and then call start_capability after we see the terminate. This still has a race condition though, because even thought
LaunchManager.__monitor_process
has published a terminate event, the capability server may not have yet received it, processed it, and updated the graph to remove that capability instance.Is there another way to restart a capability that would be more reliable than this? The easiest fix I can see is that if you call the start_capability service on a capability which is already running/stopped/terminated but still in capability_instances, the service should return an error code that at least lets you know it failed. Then you have the option to wait a bit and try again.
@tlau