sagas may need more ways to fail, especially when interacting with external services

(FYI to @leftwo and @jmpesp, whom I chatted with about this scenario)

Suppose we have the following saga DAG:

Action: Record "new resource" in a database, with state "creating". Undo: Delete "new resource" from database.
Action: Make a request to an external service to provision the resource. Undo: Make a request to an external service to delete said resource.
Action: Record "new resource" in the database as "created". (No undo action)

In this example saga graph, we can happily move "forward" and "backwards" through saga states, but can enter an awkward state if we fail saga execution while communicating with the external service.

Suppose we do the following:

Action for (1) (record the new resource in the DB)
Action for (2) (we send the request to the external service, but do not yet write the result to the saga log)
Crash
Action for (2) (we send the request to the external service, but this time, suppose the external service is not responding)

In this scenario, if we simply perform the "Node 1 undo action", there is a chance we're leaking resources. Concretely, let's suppose the "external service request" would be to provision storage from Crucible, or to provision an instance on a sled. If we simply delete our database record, the resource is consumed, but Nexus is unaware. In this particular case, it may be more correct to record that the provisioned resources exist, but are in a "failed" state.

Proposal: I think we need a way of identifying a "different" error pathway for actions, allowing us to distinguish between "clean errors" and "fatal errors" - akin to how we are recording the results of "successful actions", it seems equally useful to record the results of "unsuccessful actions", so we can know how to treat the database records associated with resources (either deletion or marking the state as dirty).

If I'm understanding right, the problem you're describing comes from the fact that when the external service is not responding, we don't know what the state on that service is. I believe the crash and the re-execution of action (2) are a red herring -- you have the same problem even if it's the very first time that you're executing action (2) and the remote service doesn't respond. It's always possible there was a partition, then the request did complete (on the external service), but the action doesn't know that. It might time out or the request might fail due to a network error.

In this case, why not have the action (2) continue waiting and/or retrying until it has a definite answer from the remote service?

If I'm understanding right, the problem you're describing comes from the fact that when the external service is not responding, we don't know what the state on that service is.

Yeah, agreed.

I believe the crash and the re-execution of action (2) are a red herring -- you have the same problem even if it's the very first time that you're executing action (2) and the remote service doesn't respond. It's always possible there was a partition, then the request did complete (on the external service), but the action doesn't know that. It might time out or the request might fail due to a network error.

The idempotency of saga actions also relies on the idempotency of external services - for example, issuing a POST in a saga operation would be a bad idea, but a PUT would be reasonable.

In that sense, when we issue a query, at any point, regardless of whether or not it's the first time, there are are a handful of possible outcomes:

The request succeeds
The request fails, but in a way that we do know the state of the resource (e.g., informative error code back to client)
The request fails, but in a way that we do not know the state of the resource

My expectation is that if we ever get a response back that puts us into a "known" state, we can move forward (next action) or backward (undo action) appropriately. So, agreed that the "unknown" case is really what we care about.

In this case, why not have the action (2) continue waiting and/or retrying until it has a definite answer from the remote service?

I think there are a few reasons why this might be a risky idea - it's possible the remote service won't ever come back. What if the service is a crucible downstairs acting on behalf of a disk, and that disk is removed? What if the service is a sled agent, and the whole sled is removed?

If we do decide that "in those cases, it would be fine to fully unwind", do you think this is something we should be doing uniformally within saga nodes talking to external services? Just add a retry loop, possibly with a backoff?

@davepacheco basically said exactly what I was going to say. You can always get in this situation if the remote service doesn't respond. In many ways, I think this is actually related to #26. The problem is what to do when you decide to unwind, and the unwind fails. You only likely want to retry so much, for reasons like what @smklein mentioned.

My gut has been telling me when thinking about these types of issues, that we should have some separate "reliable persistent workflow" system whose job is to read the saga log and attempt to garbage collect leaked resources when an undo operation fails. We record that fact, and then attempt to cleanup after ourselves periodically. This system can also report general status of what it's doing. I do not think that we should attempt to solve this problem with a steno saga itself, as its just not the same type of thing. We also have need for such a system elsewhere at Oxide, such as for rebalancing, and failure detection.

I believe the crash and the re-execution of action (2) are a red herring -- you have the same problem even if it's the very first time that you're executing action (2) and the remote service doesn't respond. It's always possible there was a partition, then the request did complete (on the external service), but the action doesn't know that. It might time out or the request might fail due to a network error.

The idempotency of saga actions also relies on the idempotency of external services - for example, issuing a POST in a saga operation would be a bad idea, but a PUT would be reasonable.

That's right. For that reason I think we want to implement our internal APIs to be idempotent.

In that sense, when we issue a query, at any point, regardless of whether or not it's the first time, there are are a handful of possible outcomes:
* The request succeeds

* The request fails, but in a way that we _do_ know the state of the resource (e.g., informative error code back to client)

* The request fails, but in a way that we _do not_ know the state of the resource

Agreed. I think we want to structure internal APIs so that we only hit this third case when there's a network problem or the remote system is behaving pathologically (i.e., it's hanging or producing a response that we don't understand). There are two cases within this one: (1) the current failure is probably retryable (e.g. a 503 response or ECONNRESET), or (2) we have no reason to believe the failure will go away if we try again (e.g., if we get a 500 or some other response that we don't understand). I do think it'd be useful if saga actions could specify a backoff/retry policy and at the same time report errors to Steno in a way that distinguishes (1) retryable error, (2) permanent but understood error that should trigger saga unwinding, (3) unknown state that should leave the saga for an operator or support. #26 talks about this. (In the absence of explicit support for this, I think you can achieve this today with the pieces already in Nexus.) However, I don't think it solves this problem: see below.

My expectation is that if we ever get a response back that puts us into a "known" state, we can move forward (next action) or backward (undo action) appropriately. So, agreed that the "unknown" case is really what we care about.

In this case, why not have the action (2) continue waiting and/or retrying until it has a definite answer from the remote service?

I think there are a few reasons why this might be a risky idea - it's possible the remote service won't ever come back. What if the service is a crucible downstairs acting on behalf of a disk, and that disk is removed? What if the service is a sled agent, and the whole sled is removed?

If we do decide that "in those cases, it would be fine to fully unwind", do you think this is something we should be doing uniformally within saga nodes talking to external services? Just add a retry loop, possibly with a backoff?

Yeah, those things could happen. The control plane should know when those things have happened. (I think that's generally true of the kinds of permanent failures you're talking about, not just these two examples.) This sounds a lot like the idea of "retirement" in FMA. Straw man: when the control plane identifies that a Sled has been removed, then any attempts to make requests to the corresponding Sled Agent fail immediately with a code that reflects that the error is permanent. This would be straightforward to implement by checking the last known status of the Sled whenever we fetch a Sled Agent client. (I came to this because I think it's similar to how FMA retirement works in the kernel and it seems to be a nice pattern there.)

The obvious alternative is to have a sort of retry/backoff policy that eventually gives up, but I think that's not right. The duration of the problem does not determine whether it's retryable or not. The Sled might be offline for 12 hours because of a transient failure or it might have been permanently removed 10 seconds ago.

Getting this wrong in both directions is potentially expensive (in terms of dollars). I've worked on systems that were too eager to give up in the face of transient failures. The result was a lot of manual intervention required to recover from failures that were intended to be survivable, but just happened to last longer than we initially thought. For us, I could imagine this manifesting as an unexpected Sled reboot leaving a raft of sagas in some "need-to-be-kicked-by-an-operator" state. Of course, it's also going to generate a support call if a thing retries forever when it the operation will clearly never complete. These cases seem a lot rarer to me, but worth dealing with if we can.

Different sagas might want different policies, too. A typical "instance provision" saga probably shouldn't retry a single Sled for very long because a user is waiting. An "upgrade SSD firmware" saga probably should retry transient errors as long as we still want to upgrade the firmware.

@davepacheco basically said exactly what I was going to say. You can always get in this situation if the remote service doesn't respond. In many ways, I think this is actually related to #26. The problem is what to do when you decide to unwind, and the unwind fails. You only likely want to retry so much, for reasons like what @smklein mentioned.

I think you're right that this all applies to the unwind process as well. I think too in that case whether you want to keep trying isn't a function of how long it's been failing but the nature of the failure and what else you know (i.e., that the Sled is gone).

We just discussed this a bunch offline. This is an important point that's new to me that I think we need to elevate: it's basically not safe to have a Steno action that returns an error when it doesn't know the definitive state of whatever it's trying to do. To make it concrete: suppose you have an action that inserts a database record, and suppose the action returns an error when it cannot reach the database due to a transient failure. Now suppose you have this sequence:

Execute the action. The database INSERT succeeds. The SEC crashes just before recording the "action done" log message.
The SEC restarts and re-executes the action. Now you get a transient error while connecting to the database (e.g., ECONNREFUSED).
The action returns an error back to Steno (wrapping the transient error).

This would be incorrect! At step 3, Steno will unwind the saga, but not the node that just failed. Thus, you'll be left with the extra record inserted in the database that will never be cleaned up.

The takeaway is that an action must not return to Steno (success or failure) unless it knows definitively that it has completed successfully or failed, and if it failed, then it has definitely made no net change on the system (neither in this invocation nor any previous one). That sounds like a tall order, but it just means retrying transient failures until the remote system gives you a clear response.

This is not obvious. It'd be easy to think you could return transient errors back to Steno and things would be fine (and they would be, most of the time).

I think we should try to communicate this more clearly by:

documenting this along with the other constraints on action implementations (maybe in Action)
renaming ActionError::ActionFailed and ActionError::action_failed in a way that makes it more explicit that you're saying "this action has failed and made no net change to the system". (Transient vs. permanent isn't exactly the condition we care about here; it's really that you're saying "I failed and I know no changes have been made. [It's time to unwind the saga, but not this node.]"). We haven't landed on a name that I love yet -- ideas included Unwind (meaning: I'm telling you, the SEC to start unwinding), UnwindPreviousNodes (same, but clearer that we're not unwinding this one), PermanentError (which I think could be confused for "it was transient, I just gave up"), PermanentErrorKnownState (which is verbose), and Terminal (which I think is still ambiguous).
documenting this clearly with the error variants

From the discussion, I think we settled on some takeaways:

As I said in my previous comment: the current Steno ActionFailed variant should only be used for permanent (i.e., non-retryable) errors when the action knows that it hasn't made any net change to the system. This variant triggers a graceful unwinding of the saga, undoing anything that's been done so far, but it does not undo the current node.
In Nexus sagas, transient errors should always be retried indefinitely.
In Nexus, there are cases where retries will never succeed. In the cases we considered, the control plane has enough information to know this. (e.g., a Sled has been explicitly removed by an operator, so requests to that Sled Agent will never succeed). There are clean ways to provide this information to the saga (e.g., attempting to acquire a Sled Agent client could return an Error that gets converted to ActionFailed (or whatever it gets renamed to)).
Separate from all of this, we will also want an InterventionRequired variant that tells Steno that the action has determined that it's in an undefined state, and so it's not safe to proceed nor unwind. This is related to what #26 talks about.

As I said in my previous comment: the current Steno ActionFailed variant should only be used for permanent (i.e., non-retryable) errors when the action knows that it hasn't made any net change to the system. This variant triggers a graceful unwinding of the saga, undoing anything that's been done so far, but it does not undo the current node.

Based on reading this, I wonder if the variant should be changed to capture that intent: something like DesiredChangeFailed. Returning this means the action knows it didn't make the desired change to the system, and that it's not transient.

oxidecomputer / steno

sagas may need more ways to fail, especially when interacting with external services #66