microsoft / durabletask-go

The Durable Task Framework is a lightweight, embeddable engine for writing durable, fault-tolerant business logic (orchestrations) as ordinary code.
Apache License 2.0
178 stars 25 forks source link

Orchestration ID reuse policies #42

Open cgillum opened 8 months ago

cgillum commented 8 months ago

This idea comes from https://github.com/dapr/dapr/issues/7101, as it relates to reusing existing workflow IDs.

When scheduling the creation of a new orchestration, the following options should be available:

As an optional extension of this, we can consider exposing an "Execution ID" or "Generation ID" property to the orchestration which can help distinguish different instantiations of the given instance ID.

clintsinger commented 8 months ago

My vote for having the extra ID to be able to identify different runs of the workflow in logs.

olitomlinson commented 7 months ago

@cgillum

Would it be reasonable to add a hint of : SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED

An example, scenarios where workflows are being started via queue with at-least-once guarantees (which is quite common) it would be useful to have this hint because :

clintsinger commented 7 months ago

The running state is exactly the one I want to interrupt for the scenario that I originally proposed. I agree that once it is complete then rescheduling is just a matter of starting a new instance.

The original definition of TerminateIfExists didn't care if it was just scheduled or already running. The workflow would be terminated. SkipIfExists wouldn't care either way either.

olitomlinson commented 7 months ago

Yes this is why I'm proposing an additional policy, so that it doesn't interfere with the needs of your use-case (which I think is TERMINATE_IF_EXISTS)

clintsinger commented 7 months ago

So you are suggesting that one combines, for example, TERMINATE_IF_EXISTS and SKIP_IF_RUNNING_OR_COMPLETE?

As a way to differentiate what to do if it is just scheduled vs already running?

clintsinger commented 7 months ago

I am assuming this doesn't have much meaning for SKIP_IF_EXISTS and SKIP_IF_RUNNING_OR_COMPLETE since it would already be skipped from the first policy.

olitomlinson commented 7 months ago

This is how I interpret the current proposal :

Start = start/restart the orchestration No-op = leave the orchestration in its current state (do not start/restart)

Doesn't Exist Running Completed Failed Paused
TERMINATE_IF_EXISTS Start Start Start Start Start
THROW_IF_EXISTS Start No-op No-op No-op No-op
SKIP_IF_EXISTS Start No-op No-op No-op No-op

Assuming my interpretation above is correct, I would like to see the following policy added SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED

Doesn't Exist Running Completed Failed Paused
SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED Start No-op No-op Start No-op
kaibocai commented 7 months ago

Thanks @olitomlinson for the explanation on SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED I think it makes sense to support idempotent design. @cgillum what do you think about this OPTION?

cgillum commented 7 months ago

Catching up on this discussion. If I understand the proposal from @olitomlinson, the problem you're trying to solve is that you want to only reuse the orchestration ID if an existing instance is in a failed state. I think that makes sense. Would it just be "failed", or could it also include "Terminated" (and perhaps some future "Canceled" state)?

clintsinger commented 7 months ago

Wouldn't it be easier just to say RESTART_IF_FAILED (or RESCHEDULE_IF_FAILED)?

It appears to only be a rule for when the workflow is in a failed state.

olitomlinson commented 7 months ago

@cgillum I agree, it should be SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED_OR_TERMINATED - I would bucket paused and terminate together as they are both deliberate human actions, which shouldn't be automatically restarted.

Although I think going down this path is going to lead to a trying to support a huge set of highly specific, Policies which isn't sustainable.

Suggestion

Divorce the ACTION (Terminate, Skip, Throw) from the applicable STATE/S, such that a user can dial in the exact requirement.

TERMINATE_IF_EXISTS would be expressed as...

WorkflowOptions { 
    ReusePolicy = new ReusePolicy(action: Action.Terminate, states: new[]{ State.All })
}

SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED_OR_TERMINATED would be expressed as...

WorkflowOptions { 
    ReusePolicy  = new ReusePolicy(action: Action.Skip, states: new[]{ State.Running, State.Complete, State.Paused, State.Terminated})
}
clintsinger commented 7 months ago

I'd recommend using enum flags for the states field.

cgillum commented 7 months ago

I'm in agreement with the more fine-grained policy. It introduces more conceptual complexity, IMO, but it's more flexible, precise, and better aligned with existing APIs in the .NET version of the Durable Task Framework.

olitomlinson commented 7 months ago

I'd recommend using enum flags for the states field.

I've never used Enum Flags so no first hand experience, but they look great for this use-case!

kaibocai commented 7 months ago

When rethinking this proposal, I found there are some cases that are not quite clear. For ex:

ReusePolicy  = new ReusePolicy(action: Action.Skip, states: new[]{ State.Pending})

For this option, we are creating a new instance with action SKIP and target status Pending, which means it will try to create a new instance, but if it finds there is an instance in Pending status then it will skip creating the new instance. What if we have an instance in Running status, should we skip creating the new instance or should we terminate the running instance and create the new one?

Summarize the confusion in below pic

image

@clintsinger @cgillum @olitomlinson

olitomlinson commented 7 months ago

First thoughts, the first two seem straight forward :

Instance exists -> skip -> status not in target status set = terminate existing instance and create new one Instance exists -> throw -> status not in target status set = terminate existing instance and create new one

However I'm stuck on this one...

Instance exists -> terminate -> status not in target status set = ?????

clintsinger commented 7 months ago

If we think of the second condition as a "when/where" clause then perhaps we can simplify things a bit and just eliminate the "skip" condition since that could be the default for the other two if the predicate isn't matched.

Or some variation on that.

clintsinger commented 7 months ago

The "variation on that" could be that we keep the "throw" as the default path as it is today.

cgillum commented 7 months ago

Does this pseudocode seem like the right behavior?

ok = not exists

if exists:
  if skip and status_matches:
    # log a warning
    ok = True
  if terminate and status_matches:
    # terminate the instance
    ok = True

if not ok:
   # throw

if not skip:
  # create the instance
kaibocai commented 7 months ago

I think it should be

if not exists:
  # create instance
else:
  ok = false
  if skip and status_matches:
    # log a warning
    ok = True
  if terminate and status_matches:
    # terminate the instance
    ok = True

  if not ok:
    # throw
  if not skip:
    # create instance

The reason is that when customer create a instance using ctx.createOrhcestration(instanceID, action.SKIP, [status.Pending, status.Runnning]) I interpret as they still want to create an instance if there is no instance exists.
So that is to say, if no instance exists, the behavior of createOrhcestration should always be creating a new instance no matter what options used here.

Another thing to point out for the above pseudocode (Chris's as well) is that if the customer chooses the action THROW then no matter what statuses they set, it will always Throw if an instance already exists. I just want to make sure this is what we want.

@cgillum

cgillum commented 7 months ago

Yeah, it looks like my skip logic wasn't quite right, but I suppose that's what unit tests are for. :)

And I see your point about throw. Since it's the default behavior anyways, using it with status filters is meaningless. Does that suggest we could get by without an explicit throw action, since that's the default behavior anyways? It seems we only need to opt-into skip and terminate.

joebowbeer commented 2 weeks ago

Is this related to supporting a start-with-signal pattern, or is that a separate concern?

start-with-signal will send an event to a specific workflow instance and make sure that instance is running. In that order, to avoid a race.

If the reuse policy can be attached to a specific start request, then I imagine this pattern could be implemented in two separate requests (signal + start) as I described.