project8 / dripline

Slow controls for medium scale physics experiments based on AMQP centralized messaging
http://www.project8.org/dripline

requests should time out #78

Closed laroque closed 9 years ago

laroque commented 9 years ago

While services are now able to automatically reconnect if the connection to AMQP breaks, there are a few scenarios in which a request may never receive a response:

  1. The request is processed, but the responding process has an error sending its reply. Here the responding service should be able to continue receiving and responding to requests. The fix would be for it to detect that a reconnect has happened and/or that a reply failed, and try again.
  2. The request is processed, but meanwhile the requestor's connection breaks and reconnects. Here the requestor's reply queue has been deleted and replaced with a different queue, so the reply was probably sent but was undeliverable. The responding service should have just moved on. The requesting service should either recognize that no response is coming and return from any blocking call (if it is a service itself), or advise the user of the situation (dripline_agent) and return.
  3. The request is queued but is never received by the consumer, and is lost when the consumer reconnects (when the consumer dies, its queue is purged and removed; reconnecting creates a fresh queue).
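
For case 1, the "detect that a reply failed and try again" idea could look roughly like the sketch below. This is not dripline code: `send_reply_with_retry`, `send`, and `reconnect` are hypothetical names, and the use of `ConnectionError` as the failure signal is an assumption (a real AMQP client would raise its own exception types).

```python
import time


def send_reply_with_retry(send, reply, reconnect, attempts=3, delay=0.0):
    """Try to send `reply`; on a connection error, reconnect and retry.

    `send` and `reconnect` are hypothetical callables standing in for
    whatever the service really uses to publish and to re-establish its
    connection. Returns True if the reply was sent, False otherwise.
    """
    for attempt in range(attempts):
        try:
            send(reply)
            return True
        except ConnectionError:
            # Only reconnect if there is another attempt left.
            if attempt + 1 < attempts:
                reconnect()
                time.sleep(delay)
    return False


# Simulated broker link that is down until reconnect() is called.
sent = []
state = {"connected": False, "reconnects": 0}


def flaky_send(reply):
    if not state["connected"]:
        raise ConnectionError("broker connection lost")
    sent.append(reply)


def reconnect():
    state["connected"] = True
    state["reconnects"] += 1


delivered = send_reply_with_retry(flaky_send, "result", reconnect)
```

Note that retrying only helps with case 1. In case 2 the reply queue itself is gone, so a resend would be just as undeliverable.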

The simplest solution is a hard timeout (maybe 10 seconds?) requiring a reply within that time. If no reply is received, assume none is coming and raise an exception or otherwise respond. A more sophisticated solution depends on the situation (see cases above) and should be implemented on top of a timeout, so that the timeout can still catch the other cases.
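
A minimal sketch of the hard-timeout idea, assuming replies are handed to the blocked caller through a thread-safe queue. The names `RequestTimeout` and `wait_for_reply` are hypothetical and only illustrate the pattern; this is not how dripline itself is implemented.

```python
import queue
import threading


class RequestTimeout(Exception):
    """Raised when no reply arrives within the allowed time."""


def wait_for_reply(reply_queue, timeout=10.0):
    """Block until a reply is available or `timeout` seconds elapse."""
    try:
        return reply_queue.get(timeout=timeout)
    except queue.Empty:
        # Assume no reply is coming and let the caller decide what to do.
        raise RequestTimeout(
            "no reply received within %s seconds" % timeout
        ) from None


# A reply that arrives in time is returned normally.
replies = queue.Queue()
threading.Timer(0.05, replies.put, args=("ok",)).start()
answered = wait_for_reply(replies, timeout=1.0)

# A reply that never arrives raises instead of hanging forever.
try:
    wait_for_reply(queue.Queue(), timeout=0.1)
    timed_out = False
except RequestTimeout:
    timed_out = True
```

Raising a dedicated exception lets a service return from its blocking call, while a command-line tool can catch the same exception and tell the user what happened.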

This isn't great, since it is hard to know what a reasonable timeout should really be.

laroque commented 9 years ago

In particular, right now if the ethernet repeater reconnects, the repeater provider hangs waiting for a response that will never come.

laroque commented 9 years ago

I always do this... (i.e. create some comprehensive issue, then close it when the first bit is done)... which I'm doing again. Connection.send_request() now supports timeouts. The above cases are all still valid and it would be nice to handle them more gracefully. Each should be its own feature-request issue; all are being kicked to less urgent for now.