Expose reset events to the client... somehow

uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.

https://cadenceworkflow.io

MIT License

8.21k stars 791 forks source link

Expose reset events to the client... somehow #2947

Open Groxx opened 4 years ago

Groxx commented 4 years ago

While investigating some issues on our end, I realized it'd be nice to be able to run code whenever a workflow is reset. In particular it could be used to update external data with the new run-id. This way you could have a reliable way to maintain "this" workflow's info through resets, since they occupy a different kind of behavioral-space compared to e.g. continue-as-new (which is really a "complete and start a new one", and it will always run the first lines of the workflow).

Currently the only way I can really think to do this is to add if GetInfo(ctx).Execution.RunId != previous checks everywhere, which would be kinda ridiculous, and wouldn't be friendly to building workflow-related helper libraries (since the lib would have to add this kind of logic too).

I'm not particularly sure how to expose this in a future-friendly way, but perhaps it could be treated like a signal? Then you could do things like this in a workflow:

func work(ctx context.Context) error {
  workflow.Go(func(ctx context.Context) {
    for {
      workflow.ExecuteActivity(ctx, "update-db-to-current-runid").Get(ctx)
      workflow.NewSelector(ctx).
        AddChannel(workflow.ResetSignalChannel, func (...) { ... }). // no-op, just unblocks the loop
        Select(ctx)
    }
  })
  // the rest of the impl
}

A possible alternative could be to expose a way to run code on every decision task, and add a workflow.Yield() func or something. Then this would be a "loop. check runid. if same, yield, else update" thread/goroutine. I don't think there's a way to achieve this currently, and it could have other uses as well... but probably more dangers too.

mfateev commented 4 years ago

This is a tough one. Here are some thoughts:

Cadence can create a special decision task that includes the full history of the workflow before the reset and the reset point information.
The decider might execute the code up to the end of the workflow collecting appropriate "rollback handlers". Then execute all the rollback handlers that were collected as part of the reset decision handling.
Unfortunately the above approach is not 100% robust and safe as sometimes rollback is caused by a completely broken deployment.
Another approach would be creating some sort of special handler that would receive history that is going to be reset and directly looks at the history for actions to rollback. For example if history contains two activities athat are going to be reset then the rollback code might somehow run compensation activities. This approach is more low level, but it is more robust to bad code deployments.

wxing1292 commented 4 years ago

I guess we can add a dedicated workflow reset event to handle this case.

Before:

<before reset>
...some events...
decision task started
decision task completed / failed / timeout <---- reset point
...some events...

<after reset>
...some events...
decision task started
decision task failed (WITH addition reset info)
...some events reapplied...

After:

<before reset>
...some events...
decision task started
decision task completed / failed / timeout <---- reset point
...some events...

<after reset>
...some events...
decision task started
decision task failed (WITHOUT the addition reset info)
workflow reset (new event which contains reset related info)
...some events reapplied...

on the client side, event state machine can special handling the workflow reset event one advantage is that, workflow reset will be independent of the presence of decision, i.e. if a workflow has no decision & terminated, we can still reset it.

mfateev commented 4 years ago

@wxing1292 in your proposal the history after reset is lost. So the workflow cannot make any reasonable decision about rollback. My idea is that the workflow sees the part of the history which will be discarded and makes the decision on how to deal with it. I think it can even take over reapplication logic. Basically it gets the history to be discarded and return the list of decisions/events to reapply.

wxing1292 commented 4 years ago

@mfateev actually, I was just mentioning that since we are doing something related to workflow reset, it will be a good idea to make it work for all possible case, i.e. missing decision task started event. (according to the issue, there is not request to handle the thrown away events).

since we are on the events reapplication topic, I guess it will be possible for server to return the branch token & start event ID for client lib & client logic to select what activity to be reapplied.

longquanzheng commented 4 years ago

Let me first understand what you are proposing. @Groxx is proposing a handler in workflow code to be triggered when a workflow is reset.

@mfateev is proposing even further, it will allow handlers(rollback handlers) to be triggered per activities when each activity is reset. This is for more generic case, what we wants to do is based on not only reset, but also what has be lost after last reset.

@wxing1292 is talking about some internal improvements for reset implementation(retiring reset attr in decisionTaskFailed by reset dedicated event).

In my opinion,

@mfateev I think your idea will be better if it combined with @Groxx 's, and it won't have the issues about completely broken bad deployment --> Instead of putting rollback handlers beside the associated activities, we could simply send the information of the activities as a list that has been rollbacked as part of "Reset signal" like @Groxx mentioned. The information will contains both activity func and parameters, let the customer code in that reset handler to do compensation with the list of reset activities. For example, customer can decide not the execute the same activity(with same parameters) after that reset. This will also solve the tough problem that we were trying to solve with activity cache. It's also more generic for ChildWorkflows, SignalExternal, Timer, etc.

As for implementation, looks like it could be a purely client side feature. Reset will also schedule a non-sticky decision task which means it will achieve all history from beginning include the reset event -- so it must know it's the first decision after reset. And the reset event today already includes the reset point(base runID and the reset eventID). So client can request all the history starting from the reset point to achieve all activities/timer/childWF that has been reset/lost during this reset.

longquanzheng commented 4 years ago

@Groxx Also, I think you can simply send a signal to the workflows that has been reset to do the same thing today for your case. Since I believe you should already know which workflows has been reset. If you use reset-batch command with ES query, you can use the same query to send signals to workflows by batch.

mfateev commented 4 years ago

@longquanzheng I missed the point that after the reset the previous version of the history can be easily accessed using getHistory. So you are correct that it can be implemented as a purely client side feature.

I'm not convinced that signal channel @Groxx proposed is the best way to expose the reset. Let met think a little bit more about this.

Groxx commented 4 years ago

Yeah, my plan for the moment is to manually signal stuff I reset. It's functional, but does require two steps instead of one (so it's hard to ensure everyone always runs it), and it's dramatically more complex for binary-based resets. I'll keep the batch commands in mind for the future though, I haven't looked at that in detail yet - memos and ES queries look fantastic, I've just been too busy to build anything useful with them yet.

re how to build this request / if it's worth doing: no idea! very happy to brainstorm/discuss approaches tho :)

longquanzheng commented 4 years ago

@longquanzheng I missed the point that after the reset the previous version of the history can be easily accessed using getHistory. So you are correct that it can be implemented as a purely client side feature.

I'm not convinced that signal channel @Groxx proposed is the best way to expose the reset. Let met think a little bit more about this.

That proposal is not ideal but we could use the same idea -- allow putting a reset handler in RegisterWorkflow() API. This would save the code of Selector/Receive/Forloop and the handler signature will be more formal and clear.