oxidecomputer / steno

distributed sagas
Apache License 2.0
107 stars 10 forks source link

undo action error type could be more explicit [incomplete] #283

Open davepacheco opened 2 months ago

davepacheco commented 2 months ago

Following up on #138: this PR makes a few changes:

The goal here is to make it more obvious to authors of undo action functions that it's rather a big deal to return an error from an undo action and they should only do that for permanent errors. This isn't a change in behavior.

This is the sort of change where we'll definitely want to convert important consumers before committing to this (i.e., landing this PR). I've started converting Omicron, and I'm honestly not sure if this is worthwhile. There are over 50 undo actions, and each may have a few code paths that need adjustments, and I'm not sure enough about this change to go do that.

Some design notes about this:

We'd want to check that this change doesn't break compatibility with previously-serialized saga logs (or else that we don't care about that).


There are plenty of bigger questions here, too, like: should Steno first-class the idea of retryable failures? I want to defer this discussion from this PR, unless we feel it invalidates this step. There are a lot of tricky questions around doing this. (What policy do we use? How does Steno tell whether an error is retryable or not? How does it report on transient failures?) I think we can best approach this by prototyping with a helper function (i.e., in Omicron) that we use to wrap undo actions. If we get to the point where we've got a stable abstraction that's a clear win and it makes sense to put it into Steno, then we can do it.

davepacheco commented 2 months ago

I started updating Omicron in oxidecomputer/omicron#5908.