kquick / Thespian

Python Actor concurrency library

Retrying Messages Until Success #18

Open kevincolyar opened 6 years ago

kevincolyar commented 6 years ago

I understand that the system will attempt to re-send a message once if an Actor throws an uncaught exception, and then send a PoisonMessage back if the exception is thrown again. I have several instances where an actor could die not because of a bad message, but because of systems out of my control (network, database, etc.), and I need to keep attempting to process the message until it succeeds.

My specific example is a simple actor that saves an object to a database. It is critical that the save succeeds. If the database is unavailable, an exception is thrown when it tries to save, and if the database is still down when the retry is attempted, the object is not saved and a PoisonMessage is generated. In the past I've handled this myself by keeping a queue, catching all possible exceptions, telling my actor to wake up after some duration, and attempting the save again.

I was wondering if there is a common pattern for a situation like this, possibly one that leverages the built-in failure strategy.

kquick commented 6 years ago

There are a couple of techniques I could recommend for this type of situation.

The first is that if an operation must be retried until it succeeds, you can always use a try/except block inside the Actor's receiveMessage to catch any errors. The built-in failure handling is a good default and can help the overall system be tolerant of failures and support a level of self-healing, but it's not wrong to put more in place for critical operations. Once you have captured the error, the handling depends on the type of error: a transient failure (e.g. the database is temporarily unreachable) can simply be retried after a delay, whereas a request that can never succeed should be reported back rather than retried forever.
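As a rough sketch of that first approach (the SaveRequest message type, the _save body, the ConnectionError exception, and the ten-second delay are all placeholders; the real ones depend on your database layer):

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage

class SaveRequest:
    """Placeholder request message carrying the record to persist."""
    def __init__(self, record):
        self.record = record

class DBSaveActor(Actor):
    """Retries transient database outages itself; any other exception
    still escapes to the built-in retry/PoisonMessage handling."""
    def receiveMessage(self, msg, sender):
        if isinstance(msg, WakeupMessage):
            # Time to retry the save that was deferred earlier.
            msg, sender = self._deferred
        if isinstance(msg, SaveRequest):
            try:
                self._save(msg.record)
                self.send(sender, ('saved', msg.record))   # confirm success
            except ConnectionError:
                # Transient outage: hold on to the request and retry
                # after a delay instead of letting the exception escape.
                self._deferred = (msg, sender)
                self.wakeupAfter(timedelta(seconds=10))

    def _save(self, record):
        """Placeholder for the real database write."""
        ...
```

This sketch only tracks a single deferred request at a time; a real version would keep a queue of deferred saves, much as you described.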

Another technique is for the requesting Actor to look inside the PoisonMessage it gets back, extract the original request carried within it, and re-send that request when appropriate.
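A sketch of that, reusing the SaveRequest and DBSaveActor placeholders from above (the failed request travels in the PoisonMessage's poisonMessage attribute):

```python
from thespian.actors import Actor, PoisonMessage

class Requester(Actor):
    """Re-submits save requests that come back wrapped in a PoisonMessage."""
    def receiveMessage(self, msg, sender):
        if isinstance(msg, SaveRequest):
            # Work handed to this actor: forward it to the saver.
            self.saver = self.createActor(DBSaveActor)
            self.send(self.saver, msg)
        elif isinstance(msg, PoisonMessage):
            # The saver failed twice on this request; the original
            # request is available on the PoisonMessage.
            failed = msg.poisonMessage
            if isinstance(failed, SaveRequest):
                self.send(self.saver, failed)      # re-send when appropriate
```

In practice you would probably bound the number of re-sends rather than looping indefinitely.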

Although you stated that it's "critical that it succeeds", there will be some cases where failure will still need to be dealt with, so I would build that intelligence into the system, either in the DB-saving Actor, in the requesting Actor, or (probably) both. I would also probably have the DB-saving Actor send a confirmation back to the requesting Actor when the save succeeds, so that the requesting Actor can implement its own timeout and/or recovery.
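A rough sketch of that confirmation/timeout loop on the requesting side, again reusing the placeholder names above and an arbitrary sixty-second timeout:

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage

class ConfirmingRequester(Actor):
    """Sends a save request and treats silence as failure: if no
    confirmation arrives before the timeout, the request is re-sent."""
    TIMEOUT = timedelta(seconds=60)

    def receiveMessage(self, msg, sender):
        if isinstance(msg, SaveRequest):
            # Remember the request, delegate it, and start the timeout clock.
            self.saver = self.createActor(DBSaveActor)
            self.outstanding = msg
            self.send(self.saver, msg)
            self.wakeupAfter(self.TIMEOUT)
        elif isinstance(msg, tuple) and msg and msg[0] == 'saved':
            # Confirmation from the saver: nothing left to recover.
            self.outstanding = None
        elif isinstance(msg, WakeupMessage):
            if getattr(self, 'outstanding', None) is not None:
                # No confirmation yet: re-send and restart the timer.
                self.send(self.saver, self.outstanding)
                self.wakeupAfter(self.TIMEOUT)
```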

The preceding paragraph leads to another pattern: the Monitor pattern. If Actor A is performing work on behalf of Actor B, the work can be routed through a Monitor. The Monitor's job is not to do the work (that is left to Actor A), but to observe A's success or failure for each request and to ensure that the work gets retried or failed as appropriate. This is essentially the single-responsibility principle.
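A minimal Monitor along those lines might look like this (still using the SaveRequest and DBSaveActor placeholders; the retry interval and attempt limit are arbitrary):

```python
from datetime import timedelta
from thespian.actors import Actor, WakeupMessage, PoisonMessage

class SaveMonitor(Actor):
    """Routes save requests to a worker (the DBSaveActor sketch above)
    and makes sure each one is confirmed, retried, or finally reported
    as failed.  The Monitor never touches the database itself."""
    MAX_ATTEMPTS = 5
    RETRY_EVERY = timedelta(seconds=15)

    def __init__(self):
        super().__init__()
        self.worker = None
        self.pending = []          # entries of [request, requester, attempts]

    def receiveMessage(self, msg, sender):
        if isinstance(msg, SaveRequest):
            # New work from a requester: remember it, then delegate it.
            if self.worker is None:
                self.worker = self.createActor(DBSaveActor)
            if not self.pending:
                self.wakeupAfter(self.RETRY_EVERY)
            self.pending.append([msg, sender, 1])
            self.send(self.worker, msg)
        elif isinstance(msg, tuple) and msg and msg[0] == 'saved':
            # The worker confirmed a record: stop tracking it and pass
            # the confirmation on to the original requester.
            for entry in list(self.pending):
                if entry[0].record == msg[1]:
                    self.send(entry[1], msg)
                    self.pending.remove(entry)
        elif isinstance(msg, PoisonMessage):
            pass                   # the periodic sweep below will re-send it
        elif isinstance(msg, WakeupMessage):
            # Periodic sweep: re-send anything unconfirmed; give up and
            # report failure once MAX_ATTEMPTS is reached.
            for entry in list(self.pending):
                request, requester, attempts = entry
                if attempts >= self.MAX_ATTEMPTS:
                    self.send(requester, ('save-failed', request.record))
                    self.pending.remove(entry)
                else:
                    entry[2] = attempts + 1
                    self.send(self.worker, request)
            if self.pending:
                self.wakeupAfter(self.RETRY_EVERY)
```

The requesters then talk only to the Monitor and simply wait for a ('saved', ...) or ('save-failed', ...) reply.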

Please feel free to suggest other patterns: I am collecting these to help put together a general guide to these types of approaches, so I'm always looking for other good ideas to add.

Regards, Kevin