oxidecomputer / steno

distributed sagas
Apache License 2.0
116 stars 10 forks source link

`SecStore` has no read methods #156

Open XAMPPRocky opened 1 year ago

XAMPPRocky commented 1 year ago

Hello 👋 While implementing a SecStore for our own persistence layer I noticed that there's no way to for a SecClient to read from a SecStore, it can only write to it. This means that it's up to the clients to implement crash recovery for their app. This seems like ideal to be implemented by the SecClient instead, which would require additional methods to SecStore to be able to read from the SecStore.

Is this something that you'd be interested in adding to steno?

davepacheco commented 1 year ago

I just paged this back in and am summarizing for myself as much as anything: I think the reason things look the way they do now is that the "write" side of this necessarily has to be driven by Steno while the "read" side can be driven by the consumer, and I tend to prefer library-like interfaces over the inversion of control associated with a framework. That's not a hard rule, though -- just how we got here.

To elaborate: Steno looks like a framework during normal operation. The consumer creates a saga and then hands control over to Steno. Steno calls back into the consumer in narrow ways to execute actions and undo actions and to persist state. We do this using traits -- SecStore in this case providing the methods to persist state.

For us, recovery happens when the SEC process starts up. This involves reading whatever was written (namely, a list of running sagas and their complete logs) and then calling SecClient::saga_resume() for each one. We implement that here: https://github.com/oxidecomputer/omicron/blob/main/nexus/db-queries/src/db/saga_recovery.rs

This part is less framework-y and more library-y in that the consumer is driving the work and handing things off to Steno.

As you suggest, I could imagine making this more framework-y by adding read methods to SecStore for listing sagas and logs and then providing a recover() function in Steno that uses those methods and calls saga_resume() for each saga found. Is that useful? The only thing I'd be worried about is that there may be critical policy choices hidden here that we'd want to let the consumer decide, like whether to list everything in one go (vs. paginated queries), whether to be able to resume sagas as they're found vs. wait for all of them, and whether to immediately call saga_start() for all the recovered sagas. This would probably come out in the implementation or else by comparing the result to the recovery code above.

XAMPPRocky commented 1 year ago

The only thing I'd be worried about is that there may be critical policy choices hidden here that we'd want to let the consumer decide

Yeah I agree, though I think there's no real downside to having these policy choices driven through configuration and traits. For example instead of adding "read" methods to SecStore, steno could have a SecDriver trait with those read methods (similar to how libs like serde have Serialize and Deserialize), that way you can decouple those policy choices from the persistence layer and have multiple kinds of drivers for a given SecStore depending on how mission critical the service is.

I also don't see a reason why steno couldn't have both right? Like if there was SecDriver trait, you could have a async fn drive<S: SecDriver>() function on SecClient, and if you're someone who just wants to let steno handle it you call that, and if you're someone who wants absolute control, you can still call it the same way you do today, you just don't call drive. That would also be a way to implement this while allowing dependents to upgrade to the newer version without needing to rewrite their recovery logic.