Improve timeout strategy

asmodehn commented 8 years ago

Sometimes a timeout should repeatedly warn and keep looping without dying. At other times it is better to die quickly, to surface an error such as a missing node.

We need to improve the way we manage timeouts between the client and the rosnode, as well as from the rosnode to the ROS system (exposed services, etc.).

Something like a pair of timeout parameters might be what we want:

retry_timeout: time out, then retry
except_timeout: time out, then raise an exception

A method call should combine both, looping and retrying until the final exception (if an except_timeout is specified).
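
A minimal sketch of how the two timeouts could combine, to make the idea concrete. Everything here is hypothetical and is not existing pyros or pyzmp API: `call_with_timeouts`, `ExceptTimeout`, and the `attempt` callable (assumed to raise the Python 3 builtin `TimeoutError` when a single try runs out of time) are names made up for illustration.

```python
import logging
import time


class ExceptTimeout(Exception):
    """Hypothetical error raised once the except_timeout budget is spent."""


def call_with_timeouts(attempt, retry_timeout, except_timeout=None,
                       clock=time.monotonic):
    """Call attempt(timeout) until it succeeds or except_timeout elapses.

    attempt is assumed to raise TimeoutError when one try takes too long.
    With except_timeout=None the loop warns and retries forever.
    """
    start = clock()
    tries = 0
    while True:
        budget = retry_timeout
        if except_timeout is not None:
            remaining = except_timeout - (clock() - start)
            if remaining <= 0:
                raise ExceptTimeout("gave up after %d tries" % tries)
            # Never wait past the overall deadline on the last try.
            budget = min(retry_timeout, remaining)
        tries += 1
        try:
            return attempt(budget)
        except TimeoutError:
            logging.warning("try %d timed out after %.2fs, retrying",
                            tries, budget)
```

Leaving `except_timeout` unset gives the "warn and loop" behaviour, while a small `except_timeout` gives the "die quickly" behaviour, so the caller picks the policy per call.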

This might be combined with custom futures, so that we can keep doing other work while we're stuck on a timeout...
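
As a rough illustration of the futures idea, here is a sketch that uses the standard library's `concurrent.futures` rather than a custom future type. `slow_service_call` and `call_in_background` are hypothetical stand-ins, not pyros functions; the point is only that the caller polls with a short timeout and is free to do other work between polls.

```python
import concurrent.futures
import time


def slow_service_call(delay):
    """Hypothetical stand-in for a call that blocks for a while."""
    time.sleep(delay)
    return "response"


def call_in_background(executor, delay, poll_timeout=0.05):
    """Submit the call, then poll for its result without blocking for long.

    Returns the result and the number of polls that timed out.
    """
    future = executor.submit(slow_service_call, delay)
    polls = 0
    while True:
        try:
            return future.result(timeout=poll_timeout), polls
        except concurrent.futures.TimeoutError:
            polls += 1  # the caller could do other useful work here
```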

dhirajdhule commented 7 years ago

This would be really useful. Yes, sometimes if the node is not up there is no point in wasting time on a timeout. Do you have any thoughts on how this can be taken forward?

asmodehn commented 7 years ago

Not much at the moment...

There are multiple points about which I'm not fully certain yet:

1) This is a pyros issue, but it sounds like it depends on pyzmp communication and could be useful for any pyzmp user -> how about implementing this in pyzmp instead, and having pyros follow whatever timeout API pyzmp has?

2) pyzmp communication is done via services. The advantage of a service is that the client can be anything: it doesn't have to be a node, and it doesn't have to know whether the node is up or not. It only needs to know the URL (which can be hardcoded) => quickly raising an exception because the node doesn't exist might not be a valid choice here... However, it is a valid choice if the client is a node and somehow has access to the list of nodes it can reach... => maybe there should be two ways to call a service? One with a hardcoded zmq URL, one with discovery before getting the URL? Maybe some kind of node name service?

3) Always looping might also be a bad choice if nobody is watching the log and nothing changes on the system because no one knows something is broken. That's the case of an admin starting the pyros server node and going off to do something else. If pyros is running we assume it is "working"; whatever that means needs to be refined, I think... For example, a web server can run but always return an empty response or a 404, yet the web server itself is "working", and the request always times out. The choice to retry or not is left to the original initiator of the service call... and the error is meaningful enough to let them choose...
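
To make point 2 more concrete, here is a small sketch of the "two ways to call a service" idea. It is purely hypothetical: `resolve_url`, `NodeNotFound`, and the `registry` mapping of node name to URL are invented for illustration and do not exist in pyros or pyzmp. With a hardcoded URL nothing can be checked up front, so only a timeout can reveal a missing node; with a registry, an unknown node can fail immediately.

```python
class NodeNotFound(Exception):
    """Hypothetical error: discovery found no node under that name."""


def resolve_url(name_or_url, registry=None):
    """Return the URL to call.

    Without a registry the argument is treated as a hardcoded URL and is
    returned unchanged. With a registry (assumed to map node names to
    URLs), an unknown name raises NodeNotFound right away.
    """
    if registry is None:
        return name_or_url
    try:
        return registry[name_or_url]
    except KeyError:
        raise NodeNotFound(name_or_url) from None
```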

pyros-dev / pyros

Improve timeout strategy #59