oxidecomputer / steno

distributed sagas

sagas may need more ways to fail, especially when interacting with external services #66

Open smklein opened 2 years ago

smklein commented 2 years ago

(FYI to @leftwo and @jmpesp, whom I chatted with about this scenario)

Suppose we have the following saga DAG:

  1. Action: Record "new resource" in a database, with state "creating". Undo: Delete "new resource" from database.
  2. Action: Make a request to an external service to provision the resource. Undo: Make a request to an external service to delete said resource.
  3. Action: Record "new resource" in the database as "created". (No undo action)

In this example saga graph, we can happily move "forward" and "backward" through saga states, but we can enter an awkward state if saga execution fails while communicating with the external service.
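
To make the forward/unwind behavior concrete, here is a minimal sketch of saga-style execution over a list of (action, undo) pairs. It is illustrative only, not Steno's actual executor or API; the `Node`, `Action`, and `Undo` types are made up.

```rust
// Illustrative sketch only -- not Steno's real types or executor.
type Action = Box<dyn Fn() -> Result<(), String>>;
type Undo = Box<dyn Fn()>;

struct Node {
    action: Action,
    undo: Option<Undo>, // e.g. node 3 above has no undo action
}

/// Run the nodes in order.  If an action fails, run the undo actions of the
/// nodes that already completed, in reverse order.  Note that the failing
/// node's own undo is not run: the action is assumed to have made no net
/// change to the system.
fn run_saga(nodes: &[Node]) -> Result<(), String> {
    for (i, node) in nodes.iter().enumerate() {
        if let Err(error) = (node.action)() {
            for done in nodes[..i].iter().rev() {
                if let Some(undo) = &done.undo {
                    undo();
                }
            }
            return Err(error);
        }
    }
    Ok(())
}
```

The fact that the failing node's own undo is not run is what makes the scenario below awkward.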

Suppose we do the following:

  1. Execute node 1 successfully, recording the new resource as "creating".
  2. Begin executing node 2: send the provisioning request to the external service. The SEC crashes before the action's result is recorded.
  3. On recovery, node 2 is re-executed, but this time the external service does not respond, so the action fails and the saga begins to unwind.

In this scenario, if we simply perform the "Node 1 undo action", there is a chance we're leaking resources. Concretely, let's suppose the "external service request" would be to provision storage from Crucible, or to provision an instance on a sled. If we simply delete our database record, the resource is consumed, but Nexus is unaware. In this particular case, it may be more correct to record that the provisioned resources exist, but are in a "failed" state.

Proposal: I think we need a way of identifying a "different" error pathway for actions, allowing us to distinguish between "clean errors" and "fatal errors". Just as we record the results of "successful actions", it seems equally useful to record the results of "unsuccessful actions", so that we know how to treat the database records associated with those resources (either deleting them or marking their state as dirty).
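
As a rough sketch of what that could look like (the names here are hypothetical, not an existing or proposed Steno API), the error returned by an action could carry a disposition telling us what to do with the associated database record:

```rust
// Hypothetical sketch of the "two error pathways" idea; not Steno's API.
enum ActionFailure<E> {
    /// "Clean" error: the action is certain it made no net change, so the
    /// undo path can simply delete the database record.
    Clean(E),
    /// "Fatal" error: the external service may have consumed resources, so
    /// the record should be kept and marked "failed" rather than deleted.
    Fatal(E),
}
```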

davepacheco commented 2 years ago

If I'm understanding right, the problem you're describing comes from the fact that when the external service is not responding, we don't know what the state on that service is. I believe the crash and the re-execution of action (2) are a red herring -- you have the same problem even if it's the very first time that you're executing action (2) and the remote service doesn't respond. It's always possible there was a partition, then the request did complete (on the external service), but the action doesn't know that. It might time out or the request might fail due to a network error.

In this case, why not have the action (2) continue waiting and/or retrying until it has a definite answer from the remote service?

smklein commented 2 years ago

If I'm understanding right, the problem you're describing comes from the fact that when the external service is not responding, we don't know what the state on that service is.

Yeah, agreed.

I believe the crash and the re-execution of action (2) are a red herring -- you have the same problem even if it's the very first time that you're executing action (2) and the remote service doesn't respond. It's always possible there was a partition, then the request did complete (on the external service), but the action doesn't know that. It might time out or the request might fail due to a network error.

The idempotency of saga actions also relies on the idempotency of external services - for example, issuing a POST in a saga operation would be a bad idea, but a PUT would be reasonable.
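
For example, here's a sketch (using reqwest; the URL scheme and payload are hypothetical) of why a PUT to an ID chosen by the saga is safe to replay, while a POST that lets the server allocate the ID is not:

```rust
use reqwest::Client;

// Sketch only: assumes the external service exposes an idempotent
// `PUT /resources/{id}` where the caller chooses the ID up front.
async fn provision_resource(
    client: &Client,
    base_url: &str,
    resource_id: &str, // chosen by the saga before the first attempt
) -> Result<(), reqwest::Error> {
    // Replaying this PUT after a crash is harmless: the service either
    // creates the resource or observes that it already exists under the same
    // ID.  A POST to `/resources` could create a second resource on replay.
    client
        .put(format!("{base_url}/resources/{resource_id}"))
        .header("content-type", "application/json")
        .body(r#"{ "size_gb": 10 }"#)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```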

In that sense, when we issue a query, at any point, regardless of whether or not it's the first time, there are a handful of possible outcomes:

* The request succeeds

* The request fails, but in a way that we _do_ know the state of the resource (e.g., informative error code back to client)

* The request fails, but in a way that we _do not_ know the state of the resource

My expectation is that if we ever get a response back that puts us into a "known" state, we can move forward (next action) or backward (undo action) appropriately. So, agreed that the "unknown" case is really what we care about.

In this case, why not have the action (2) continue waiting and/or retrying until it has a definite answer from the remote service?

I think there are a few reasons why this might be a risky idea - it's possible the remote service won't ever come back. What if the service is a crucible downstairs acting on behalf of a disk, and that disk is removed? What if the service is a sled agent, and the whole sled is removed?

If we do decide that "in those cases, it would be fine to fully unwind", do you think this is something we should be doing uniformly within saga nodes talking to external services? Just add a retry loop, possibly with a backoff?
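
For reference, a minimal version of such a loop might look like the sketch below (assuming tokio; the `op` closure and the transient-vs-permanent classification are placeholders for whatever the node actually does):

```rust
use std::time::Duration;

// Sketch of a retry loop with capped exponential backoff.  `op` stands in for
// the call to the external service, and `is_transient` for whatever policy
// decides that an error is worth retrying.
async fn retry_with_backoff<T, E, F, Fut>(
    mut op: F,
    is_transient: impl Fn(&E) -> bool,
) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
{
    let mut delay = Duration::from_millis(250);
    let max_delay = Duration::from_secs(30);
    loop {
        match op().await {
            Ok(value) => return Ok(value),
            // Transient failure: wait and try again, backing off up to a cap.
            Err(e) if is_transient(&e) => {
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(max_delay);
            }
            // Permanent (or unclassifiable) failure: give up and let the
            // caller decide whether to unwind.
            Err(e) => return Err(e),
        }
    }
}
```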

andrewjstone commented 2 years ago

@davepacheco basically said exactly what I was going to say. You can always get in this situation if the remote service doesn't respond. In many ways, I think this is actually related to #26. The problem is what to do when you decide to unwind, and the unwind fails. You only likely want to retry so much, for reasons like what @smklein mentioned.

When thinking about these types of issues, my gut has been telling me that we should have some separate "reliable persistent workflow" system whose job is to read the saga log and attempt to garbage collect leaked resources when an undo operation fails. We record that fact, and then attempt to clean up after ourselves periodically. This system can also report the general status of what it's doing. I do not think that we should attempt to solve this problem within a Steno saga itself, as it's just not the same type of thing. We also have a need for such a system elsewhere at Oxide, such as for rebalancing and failure detection.
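
Very roughly, I'm imagining something shaped like the sketch below (every name here is a placeholder, not a concrete proposal):

```rust
use std::time::Duration;

// Hypothetical sketch of a periodic cleanup task: scan for resources whose
// undo failed, retry the cleanup, and record the outcome.  All types and
// methods here are illustrative placeholders.
trait LeakedResourceStore {
    fn list_leaked(&self) -> Vec<LeakedResource>;
    fn mark_cleaned(&self, id: &str);
}

struct LeakedResource {
    id: String,
}

async fn cleanup_task(store: impl LeakedResourceStore, try_delete: impl Fn(&str) -> bool) {
    loop {
        for leaked in store.list_leaked() {
            // Attempt the deletion again; leave the record in place if it
            // still fails so a later pass (or an operator) can retry.
            if try_delete(&leaked.id) {
                store.mark_cleaned(&leaked.id);
            }
        }
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
```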

davepacheco commented 2 years ago

I believe the crash and the re-execution of action (2) are a red herring -- you have the same problem even if it's the very first time that you're executing action (2) and the remote service doesn't respond. It's always possible there was a partition, then the request did complete (on the external service), but the action doesn't know that. It might time out or the request might fail due to a network error.

The idempotency of saga actions also relies on the idempotency of external services - for example, issuing a POST in a saga operation would be a bad idea, but a PUT would be reasonable.

That's right. For that reason I think we want to implement our internal APIs to be idempotent.

In that sense, when we issue a query, at any point, regardless of whether or not it's the first time, there are a handful of possible outcomes:

* The request succeeds

* The request fails, but in a way that we _do_ know the state of the resource (e.g., informative error code back to client)

* The request fails, but in a way that we _do not_ know the state of the resource

Agreed. I think we want to structure internal APIs so that we only hit this third case when there's a network problem or the remote system is behaving pathologically (i.e., it's hanging or producing a response that we don't understand). There are two cases within this one: (1) the current failure is probably retryable (e.g. a 503 response or ECONNRESET), or (2) we have no reason to believe the failure will go away if we try again (e.g., if we get a 500 or some other response that we don't understand). I do think it'd be useful if saga actions could specify a backoff/retry policy and at the same time report errors to Steno in a way that distinguishes (1) retryable error, (2) permanent but understood error that should trigger saga unwinding, (3) unknown state that should leave the saga for an operator or support. #26 talks about this. (In the absence of explicit support for this, I think you can achieve this today with the pieces already in Nexus.) However, I don't think it solves this problem: see below.
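
To sketch that classification (hypothetical names only, not an existing or proposed Steno type):

```rust
// Hypothetical sketch of the three failure classes described above.
enum ActionErrorKind {
    /// (1) Probably transient (e.g. a 503 or ECONNRESET): retry, ideally
    /// according to the action's backoff/retry policy.
    Retryable,
    /// (2) Permanent but understood, with no net change made: trigger saga
    /// unwinding.
    Permanent,
    /// (3) The state of the remote resource is unknown: stop and leave the
    /// saga for an operator or support.
    Unknown,
}
```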

My expectation is that if we ever get a response back that puts us into a "known" state, we can move forward (next action) or backward (undo action) appropriately. So, agreed that the "unknown" case is really what we care about.

In this case, why not have the action (2) continue waiting and/or retrying until it has a definite answer from the remote service?

I think there are a few reasons why this might be a risky idea - it's possible the remote service won't ever come back. What if the service is a crucible downstairs acting on behalf of a disk, and that disk is removed? What if the service is a sled agent, and the whole sled is removed?

If we do decide that "in those cases, it would be fine to fully unwind", do you think this is something we should be doing uniformly within saga nodes talking to external services? Just add a retry loop, possibly with a backoff?

Yeah, those things could happen. The control plane should know when those things have happened. (I think that's generally true of the kinds of permanent failures you're talking about, not just these two examples.) This sounds a lot like the idea of "retirement" in FMA. Straw man: when the control plane identifies that a Sled has been removed, then any attempts to make requests to the corresponding Sled Agent fail immediately with a code that reflects that the error is permanent. This would be straightforward to implement by checking the last known status of the Sled whenever we fetch a Sled Agent client. (I came to this because I think it's similar to how FMA retirement works in the kernel and it seems to be a nice pattern there.)
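
Here's a sketch of that straw man (none of these identifiers are real Nexus or Steno names; the point is just the shape of the check):

```rust
// Hypothetical sketch only.
enum SledStatus {
    Active,
    Removed,
}

enum ClientError {
    /// Permanent: the Sled is gone, so retrying this request can never help.
    SledRemoved,
}

struct SledAgentClient {
    address: String,
}

// Check the last known status of the Sled before handing out a client; if the
// Sled has been removed, fail immediately with a permanent error instead of
// letting callers retry forever.
fn sled_agent_client(status: SledStatus, address: &str) -> Result<SledAgentClient, ClientError> {
    match status {
        SledStatus::Removed => Err(ClientError::SledRemoved),
        SledStatus::Active => Ok(SledAgentClient { address: address.to_string() }),
    }
}
```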

The obvious alternative is to have a sort of retry/backoff policy that eventually gives up, but I think that's not right. The duration of the problem does not determine whether it's retryable or not. The Sled might be offline for 12 hours because of a transient failure or it might have been permanently removed 10 seconds ago.

Getting this wrong in both directions is potentially expensive (in terms of dollars). I've worked on systems that were too eager to give up in the face of transient failures. The result was a lot of manual intervention required to recover from failures that were intended to be survivable, but just happened to last longer than we initially thought. For us, I could imagine this manifesting as an unexpected Sled reboot leaving a raft of sagas in some "need-to-be-kicked-by-an-operator" state. Of course, it's also going to generate a support call if a thing retries forever when the operation will clearly never complete. These cases seem a lot rarer to me, but worth dealing with if we can.

Different sagas might want different policies, too. A typical "instance provision" saga probably shouldn't retry a single Sled for very long because a user is waiting. An "upgrade SSD firmware" saga probably should retry transient errors as long as we still want to upgrade the firmware.
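
One possible shape for such a per-saga policy, purely as a sketch (the fields and constants are illustrative):

```rust
use std::time::Duration;

// Illustrative sketch of a per-saga retry policy; names and values are made up.
struct RetryPolicy {
    initial_delay: Duration,
    max_delay: Duration,
    /// How long to keep retrying transient failures, or `None` to keep
    /// retrying until something else (e.g. "the Sled was removed") marks the
    /// failure permanent.
    give_up_after: Option<Duration>,
}

// An instance-provision saga might give up quickly because a user is waiting,
// while a firmware-upgrade saga might retry transient errors indefinitely.
const INSTANCE_PROVISION: RetryPolicy = RetryPolicy {
    initial_delay: Duration::from_millis(250),
    max_delay: Duration::from_secs(5),
    give_up_after: Some(Duration::from_secs(120)),
};

const FIRMWARE_UPGRADE: RetryPolicy = RetryPolicy {
    initial_delay: Duration::from_secs(1),
    max_delay: Duration::from_secs(300),
    give_up_after: None,
};
```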

davepacheco commented 2 years ago

@davepacheco basically said exactly what I was going to say. You can always get in this situation if the remote service doesn't respond. In many ways, I think this is actually related to #26. The problem is what to do when you decide to unwind, and the unwind fails. You only likely want to retry so much, for reasons like what @smklein mentioned.

I think you're right that this all applies to the unwind process as well. I think too in that case whether you want to keep trying isn't a function of how long it's been failing but the nature of the failure and what else you know (i.e., that the Sled is gone).

davepacheco commented 2 years ago

We just discussed this a bunch offline. This is an important point that's new to me that I think we need to elevate: it's basically not safe to have a Steno action that returns an error when it doesn't know the definitive state of whatever it's trying to do. To make it concrete: suppose you have an action that inserts a database record, and suppose the action returns an error when it cannot reach the database due to a transient failure. Now suppose you have this sequence:

  1. Execute the action. The database INSERT succeeds. The SEC crashes just before recording the "action done" log message.
  2. The SEC restarts and re-executes the action. Now you get a transient error while connecting to the database (e.g., ECONNREFUSED).
  3. The action returns an error back to Steno (wrapping the transient error).

This would be incorrect! At step 3, Steno will unwind the saga, but not the node that just failed. Thus, you'll be left with the extra record inserted in the database that will never be cleaned up.

The takeaway is that an action must not return to Steno (success or failure) unless it knows definitively that it has completed successfully or failed, and if it failed, then it has definitely made no net change on the system (neither in this invocation nor any previous one). That sounds like a tall order, but it just means retrying transient failures until the remote system gives you a clear response.

This is not obvious. It'd be easy to think you could return transient errors back to Steno and things would be fine (and they would be, most of the time).
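
To make that concrete, the database-insert action from the example above might look roughly like this sketch (the `insert_record` closure and the `DbError` classification are hypothetical):

```rust
use std::time::Duration;

// `insert_record` and `DbError` are hypothetical placeholders.
enum DbError {
    Transient,          // e.g. ECONNREFUSED: outcome unknown, try again
    AlreadyExists,      // definite: a previous attempt already inserted the row
    ConstraintViolated, // definite: the insert did not happen and never will
}

async fn insert_record_action(
    insert_record: impl Fn() -> Result<(), DbError>,
) -> Result<(), String> {
    loop {
        match insert_record() {
            // Definite success, including the replay-after-crash case where
            // the row from a previous attempt is already present.
            Ok(()) | Err(DbError::AlreadyExists) => return Ok(()),
            // Definite failure with no net change: safe to return an error to
            // Steno and let the saga unwind.
            Err(DbError::ConstraintViolated) => return Err("insert failed permanently".to_string()),
            // Outcome unknown: do NOT report this to Steno; retry instead.
            Err(DbError::Transient) => tokio::time::sleep(Duration::from_secs(1)).await,
        }
    }
}
```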

I think we should try to communicate this more clearly by:

davepacheco commented 2 years ago

From the discussion, I think we settled on some takeaways:

jmpesp commented 2 years ago

As I said in my previous comment: the current Steno ActionFailed variant should only be used for permanent (i.e., non-retryable) errors when the action knows that it hasn't made any net change to the system. This variant triggers a graceful unwinding of the saga, undoing anything that's been done so far, but it does not undo the current node.

Based on reading this, I wonder if the variant should be changed to capture that intent: something like DesiredChangeFailed. Returning this means the action knows it didn't make the desired change to the system, and that it's not transient.
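
Purely for illustration (this is not Steno's actual error enum), the intent might read something like:

```rust
// Illustrative only -- a hypothetical variant name, not Steno's real type.
enum ProposedActionError {
    /// The action knows it did not make the desired change, and the failure
    /// is not transient.  Steno unwinds the earlier nodes; this node's undo
    /// is not run because nothing was changed.
    DesiredChangeFailed { message: String },
}
```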