Starting a capability fails silently if it is currently shutting down

jonbinney commented 9 years ago

We have the need to restart capabilities sometimes, and the way we do it is:

Call stop_capability service
Wait until the capability is stopped
Call start_capability service to start capability again I haven't found any reliable way to do step (2) though. Currently we poll the get_running_capabilities service, but this stops including the capability instance once it is in the "stopped" state. If we then call the start_capability service, it silently fails to start the capability because, even though it is "stopped", it is still in __capability_instances here: https://github.com/osrf/capabilities/blob/c6c71a9ac17e310787242f5b75bd1e4d86349844/src/capabilities/server.py#L693

Alternatively, we could try listening to the events topic for a terminate event, and then call start_capability after we see the terminate. This still has a race condition though, because even thought LaunchManager.__monitor_process has published a terminate event, the capability server may not have yet received it, processed it, and updated the graph to remove that capability instance.

Is there another way to restart a capability that would be more reliable than this? The easiest fix I can see is that if you call the start_capability service on a capability which is already running/stopped/terminated but still in capability_instances, the service should return an error code that at least lets you know it failed. Then you have the option to wait a bit and try again.

@tlau

wjwwood commented 9 years ago

Two options, one is that you could use the /capability_server/event topic, but the other is that we just add a restart_capability service, which may or may not be easier to implement.

See: https://github.com/osrf/capabilities/blob/master/src/capabilities/launch_manager.py#L242-248

jonbinney commented 9 years ago

Yeah I saw that event, but there's a race condition still, right? The instant that terminated event is published, the capability instance is still in the __capability_instances variable of the server. So if you immediately call the start_capability service, it will fail silently. There's an unknown (probably tiny, but still unknown) delay between when the terminate event gets published and when the server actually gets it and removes the capability instance.

wjwwood commented 9 years ago

@jonbinney That's true, we could also add an additional event, which is fired after terminate which is only used by external parties.

wjwwood commented 9 years ago

Basically you could detect if it was going to be removed here:

https://github.com/osrf/capabilities/blob/master/src/capabilities/server.py#L547-551

And then publish another event message a little further down, after the self.__update_graph():

https://github.com/osrf/capabilities/blob/master/src/capabilities/server.py#L553

jonbinney commented 9 years ago

An additional event would work, but it might be also be good to modify the start_capability to return failure code if it doesn't actually start a capability for any reason. Then the client can retry.

wjwwood commented 9 years ago

Yeah, I think you're right, it should return something if it fails to add it, question is raise or return a bool or return a enumerated return code of some sort.

jonbinney commented 9 years ago

The service definition for StartCapability does have a bool return flag:

string capability                                                                                                                                                                                                                                                                
string preferred_provider                                                                                                                                                                                                                                                        
---                                                                                                                                                                                                                                                                              
bool successful

I think a return code with a little information would be good:

string capability                                                                                                                                                                                                                                                                
string preferred_provider                                                                                                                                                                                                                                                        
---                                                                                                                                                                                                                                                                              
uint8 result
uint8 RESULT_SUCCESS=0
uint8 RESULT_ALREADY_RUNNING=1 # This capability is currently running
uint8 RESULT_ALREADY_STOPPED=2 # This capability is in the process of stopping
uint8 RESULT_ALREADY_TERMINATED=3 # This capability has been terminated, but not yet removed from the graph

jonbinney commented 9 years ago

but pretend i didn't define all of them as zero since that makes no sense...

wjwwood commented 9 years ago

:lollipop:

wjwwood commented 9 years ago

I don't know what you're talking about :wink:

tlau commented 9 years ago

Maybe:

RESULT_SUCCESS
RESULT_RUNNING
RESULT_STOPPING
RESULT_TERMINATED

?

jonbinney commented 9 years ago

That works. @wjwwood if that sounds reasonable I can put together a PR

wjwwood commented 9 years ago

@jonbinney sounds reasonable, that would be great thanks.

jonbinney commented 9 years ago

Now that start_capability returns an error in this case, clients can handle it.

osrf / capabilities

Starting a capability fails silently if it is currently shutting down #78