oxidecomputer / steno

distributed sagas
Apache License 2.0
111 stars 10 forks source link

Sagas need better handling of undo actions that fail #26

Open bnaecker opened 2 years ago

bnaecker commented 2 years ago

The current implementation of sagas unwraps any failures from an undo action. This is not great for distributed systems where the saga actions cannot always control the state of the system their operating on. For example, one might run a saga recovery with a different version of software than ran the saga in the first place. In these cases, we'd probably like to design more nuanced error-handling that distinguishes types of such operational errors, indicates whether they're fatal or retryable, and maybe more.

It's also not clear how sagas handle invariants that they would like to assert. This would normally just abort/unwind the program, according to the disposition it was built with. One could imagine catching these and having some policy around retrying the operations, potentially up to some count, specified at creation time. It'll take some care to make sure we don't block multiple sagas, or worse, prevent those later sagas from ever running to completion if an earlier one fails.

davepacheco commented 2 years ago

For undo actions failing, I imagine we'll want to put them into a NeedsSupport state that would eventually raise a phone-home support request.

This makes me wonder if the Error type for UndoAction should be something more specific that reflects this. Right now, you could return a generic error thinking maybe it'll be retried or something, and then we wind up stopping the whole saga. On the other hand, maybe you should have to return UndoError::NeedsSupport(my_error) or something.

bnaecker commented 2 years ago

Yeah, that sounds right. An error variant that distinguishes "retryable", "fatal (and maybe raise an alert)", and probably others would be very useful. I imagine an "ignore" variant is tempting, but might be too easy to abuse.

davepacheco commented 2 years ago

I've separately been thinking that we might want to build a backoff-like policy into nodes (both actions and undo actions), so that they could indicate if the problem is believed to be transient or permanent and on with what parameters they want to retry. The NeedsSupport thing would be one of the variants you could pick if you were reporting a permanent error. I'm not sure if we want to layer these or what.