temporalio / features

Behavior and history compatibility testing for Temporal SDKs
13 stars 16 forks source link

[Feature Request] Specify reset point in workflow code #69

Open mfateev opened 2 years ago

mfateev commented 2 years ago

Is your feature request related to a problem? Please describe.

Sometimes users want to update the workflow code rolling back some progress. Trivial example:

if patch("patch1") {
   sleep(10 days);
} else {
   sleep(100 days);
}

This patch is useless as if a workflow is already blocked on 100-day sleep, the patch is always going to return false, and the sleep is not going to be interrupted.

Describe the solution you'd like

Ability to reset workflow to any point by specifying the reset point in the code. Given that in many cases reset point can be reached through different code paths it is practically impossible to find the reset point just looking at a history. It is even less possible if millions of such workflows have to be reset through a batch call.

The strawman API proposal

Code before the change:

sleep(100 days);

Code after the change:

resetScope = reset("resetId");
sleep(10 days);
resetScope.close();

Any workflow that is blocked inside the resetScope (which can be modeled with real scopes and closures as well) will be rolled automatically back to the line just before reset call. After all the workflows executed the rollback the reset code can be removed.

The open question is what causes reset to happen for blocked workflows. It might require an automatically generated batch job that pokes all such blocked workflows after the code deployment.

bergundy commented 2 years ago

If we need a batch job to probe the workflows why not send a signal in the batch job to wake the workflow up?

try {
  await CancellationScope.cancellable(async () => {
    setHandler(updateSignal, () => CancellationScope.current.cancel())
    await sleep('10 days')
  })
} catch {
  // ignore cancellation
}
mfateev commented 2 years ago

Because it assumes that a workflow was coded from the beginning to listen for this signal and react accordingly.

This issue supports unplanned changes to workflow code.

bergundy commented 2 years ago

It doesn't assume that. With the technique I mentioned you can safely patch (not to be confused with the patched API) a running workflow.

bergundy commented 2 years ago

The issue with my approach is that it requires signals to be sent to new workflows as well as already running ones. But that can be fixed with:

/** TODO: find a better name for this */
async function skipOnSignal<T>(fn: () => Promise<T>, returnValueIfSkipped: T, signalName: string): Promise<T> {
  try {
    if (patched(`skip-${Math.random()}`)) {
      return returnValueIfSkipped
    }
    return await CancellationScope.cancellable(async () => {
      setHandler(defineSignal(signalName), () => CancellationScope.current.cancel())
      return await fn()
    })
  } catch (error) {
    if (isCancellation(error)) {
      return returnValueIfSkipped
    }
    throw error
  }
}

async function myWorkflowBeforeChange() {
  await sleep('10 days')
}

async function myWorkflowAfterChange() {
  await skipOnSignal(async () => {
    await sleep('10 days')
  }, undefined, 'skip-sleep')
}
mfateev commented 2 years ago

I think you missed my point that this is not about just waking up. It is to be able to fix any part of the workflow code and roll back all open workflows to a specific point in the past.

bergundy commented 2 years ago

I'd need more details here. IIUC, what you're suggesting requires the reset point to be specified ahead which is why I claimed that it's not as useful but maybe I'm misunderstanding.

mfateev commented 2 years ago

The use case is that a workflow has some bug/issue/change that requires changing the past and redoing part of a workflow. This is possible through reset. The actual reset point is at some point in the workflow code. Calculating the reset point looking at the history and correlating it to the workflow code is almost impossible. Thus the proposal is to specify it directly in the code. Sending signal in this case doesn't add any value as the newly deployed code already contain all the needed info to perform reset.

bergundy commented 2 years ago

I think I follow now.

Seems like there should be something external trigger to get the workflow to replay so it can emit this information to the runtime. Maybe we could use the replayer for that and catch this special "reset point" command. Also not sure why it should be a scope instead of a simple method, what's the reason to "end" the scope?

mfateev commented 2 years ago

An external trigger is still needed to force workflow reevaluation. I don't have a concrete design ready. It can use replayer or normal worker to determine the reset point. Then we need to decide how the reset is actually triggered. It can be in the response to the workflow task or a direct API call.

I agree that scope is not needed.

bergundy commented 2 years ago

Seems pretty straight forward to do with just a replayer and a special exception but if we think this is valuable enough we can bake it into the SDK.