kquick / Thespian

Python Actor concurrency library

Retrying Messages Until Success #18

Open kevincolyar opened 6 years ago

kevincolyar commented 6 years ago

I understand that the system will attempt to re-send a message once if an Actor throws an uncaught exception, and then send a PoisonMessage back if the exception is thrown again. I have several instances where an actor could die not because of a bad message, but because of systems out of my control (network, database, etc.), and I need to keep attempting to process the message until it succeeds.

My specific example is a simple actor that saves an object to a database. It is critical that the save succeeds. If the database is unavailable, an exception is thrown when it tries to save, and if the database is still down when the retry is attempted, the object is not saved and a PoisonMessage is generated. In the past I've handled this myself by keeping a queue, catching all possible exceptions, telling my actor to wake up after some duration, and attempting the save again.

I was wondering if there is a common pattern for a situation like this, possibly one that leverages the built-in failure strategy.

kquick commented 6 years ago

There are a couple of techniques I could recommend for this type of situation.

The first is that if an operation must be retried until it succeeds, you can always use a try/except block inside the Actor's receiveMessage to catch any errors. The built-in failure handling is a good default and can help the overall system be tolerant of failures and support a level of self-healing, but it's not wrong to put more in place for critical operations. Once you have captured the error, the handling depends on the type of error: a transient failure (e.g. the database is temporarily unreachable) can simply be retried after a delay, whereas a request that can never succeed should be reported back rather than retried forever.
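As a rough sketch of that first approach (the SaveRequest message type, the _save body, the ConnectionError exception, and the ten-second delay are all placeholders; the real ones depend on your database layer):

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage

class SaveRequest:
    """Placeholder request message carrying the record to persist."""
    def __init__(self, record):
        self.record = record

class DBSaveActor(Actor):
    """Retries transient database outages itself; any other exception
    still escapes to the built-in retry/PoisonMessage handling."""
    def receiveMessage(self, msg, sender):
        if isinstance(msg, WakeupMessage):
            # Time to retry the save that was deferred earlier.
            msg, sender = self._deferred
        if isinstance(msg, SaveRequest):
            try:
                self._save(msg.record)
                self.send(sender, ('saved', msg.record))   # confirm success
            except ConnectionError:
                # Transient outage: hold on to the request and retry
                # after a delay instead of letting the exception escape.
                self._deferred = (msg, sender)
                self.wakeupAfter(timedelta(seconds=10))

    def _save(self, record):
        """Placeholder for the real database write."""
        ...
```

This sketch only tracks a single deferred request at a time; a real version would keep a queue of deferred saves, much as you described.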

Another technique is for the requesting Actor to look inside the PoisonMessage it gets back, extract the original request carried within it, and re-send that request when appropriate.
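A sketch of that, reusing the SaveRequest and DBSaveActor placeholders from above (the failed request travels in the PoisonMessage's poisonMessage attribute):

```python
from thespian.actors import Actor, PoisonMessage

class Requester(Actor):
    """Re-submits save requests that come back wrapped in a PoisonMessage."""
    def receiveMessage(self, msg, sender):
        if isinstance(msg, SaveRequest):
            # Work handed to this actor: forward it to the saver.
            self.saver = self.createActor(DBSaveActor)
            self.send(self.saver, msg)
        elif isinstance(msg, PoisonMessage):
            # The saver failed twice on this request; the original
            # request is available on the PoisonMessage.
            failed = msg.poisonMessage
            if isinstance(failed, SaveRequest):
                self.send(self.saver, failed)      # re-send when appropriate
```

In practice you would probably bound the number of re-sends rather than looping indefinitely.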

Although you stated that it's "critical that it succeeds", there will be some cases where failure will still need to be dealt with, so I would build that intelligence into the system, either in the DB-saving Actor, in the requesting Actor, or (probably) both. I would also probably have the DB-saving Actor send a confirmation back to the requesting Actor when the save succeeds, so that the requesting Actor can implement its own timeout and/or recovery.
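A rough sketch of that confirmation/timeout loop on the requesting side, again reusing the placeholder names above and an arbitrary sixty-second timeout:

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage

class ConfirmingRequester(Actor):
    """Sends a save request and treats silence as failure: if no
    confirmation arrives before the timeout, the request is re-sent."""
    TIMEOUT = timedelta(seconds=60)

    def receiveMessage(self, msg, sender):
        if isinstance(msg, SaveRequest):
            # Remember the request, delegate it, and start the timeout clock.
            self.saver = self.createActor(DBSaveActor)
            self.outstanding = msg
            self.send(self.saver, msg)
            self.wakeupAfter(self.TIMEOUT)
        elif isinstance(msg, tuple) and msg and msg[0] == 'saved':
            # Confirmation from the saver: nothing left to recover.
            self.outstanding = None
        elif isinstance(msg, WakeupMessage):
            if getattr(self, 'outstanding', None) is not None:
                # No confirmation yet: re-send and restart the timer.
                self.send(self.saver, self.outstanding)
                self.wakeupAfter(self.TIMEOUT)
```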

The preceding paragraph leads to another pattern: the Monitor pattern. If Actor A is performing work on behalf of Actor B, the work can be routed through a Monitor. The Monitor's job is not to do the work (that is left to Actor A), but to observe A's success or failure for each request and to ensure that the work gets retried or failed as appropriate. This is essentially the single-responsibility principle.
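A minimal Monitor along those lines might look like this (still using the SaveRequest and DBSaveActor placeholders; the retry interval and attempt limit are arbitrary):

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage, PoisonMessage

class SaveMonitor(Actor):
    """Routes save requests to a worker (the DBSaveActor sketch above)
    and makes sure each one is confirmed, retried, or finally reported
    as failed.  The Monitor never touches the database itself."""
    MAX_ATTEMPTS = 5
    RETRY_EVERY = timedelta(seconds=15)

    def __init__(self):
        super().__init__()
        self.worker = None
        self.pending = []          # entries of [request, requester, attempts]

    def receiveMessage(self, msg, sender):
        if isinstance(msg, SaveRequest):
            # New work from a requester: remember it, then delegate it.
            if self.worker is None:
                self.worker = self.createActor(DBSaveActor)
            if not self.pending:
                self.wakeupAfter(self.RETRY_EVERY)
            self.pending.append([msg, sender, 1])
            self.send(self.worker, msg)
        elif isinstance(msg, tuple) and msg and msg[0] == 'saved':
            # The worker confirmed a record: stop tracking it and pass
            # the confirmation on to the original requester.
            for entry in list(self.pending):
                if entry[0].record == msg[1]:
                    self.send(entry[1], msg)
                    self.pending.remove(entry)
        elif isinstance(msg, PoisonMessage):
            pass                   # the periodic sweep below will re-send it
        elif isinstance(msg, WakeupMessage):
            # Periodic sweep: re-send anything unconfirmed; give up and
            # report failure once MAX_ATTEMPTS is reached.
            for entry in list(self.pending):
                request, requester, attempts = entry
                if attempts >= self.MAX_ATTEMPTS:
                    self.send(requester, ('save-failed', request.record))
                    self.pending.remove(entry)
                else:
                    entry[2] = attempts + 1
                    self.send(self.worker, request)
            if self.pending:
                self.wakeupAfter(self.RETRY_EVERY)
```

The requesters then talk only to the Monitor and simply wait for a ('saved', ...) or ('save-failed', ...) reply.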

Please feel free to suggest other patterns: I am collecting these to help put together a general guide to these types of approaches, so I'm always looking for other good ideas to add.

Regards, Kevin