Open cgillum opened 8 months ago
My vote for having the extra ID to be able to identify different runs of the workflow in logs.
@cgillum
Would it be reasonable to add a hint of : SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED
An example, scenarios where workflows are being started via queue with at-least-once guarantees (which is quite common) it would be useful to have this hint because :
running
state.
running
workflow may be wasteful and incur unwanted / unanticipated side-effects when repeated from the start.complete
state
completion
paused
state
However, it would allow a workflow in a failed
state to be rescheduled, which could be desirable.
Unless I'm mistaken, It's my understanding that none of the 3 new reuse policies would accommodate the above scenario?
The running
state is exactly the one I want to interrupt for the scenario that I originally proposed. I agree that once it is complete
then rescheduling is just a matter of starting a new instance.
The original definition of TerminateIfExists
didn't care if it was just scheduled or already running. The workflow would be terminated. SkipIfExists
wouldn't care either way either.
Yes this is why I'm proposing an additional policy, so that it doesn't interfere with the needs of your use-case (which I think is TERMINATE_IF_EXISTS
)
So you are suggesting that one combines, for example, TERMINATE_IF_EXISTS
and SKIP_IF_RUNNING_OR_COMPLETE
?
As a way to differentiate what to do if it is just scheduled vs already running?
I am assuming this doesn't have much meaning for SKIP_IF_EXISTS
and SKIP_IF_RUNNING_OR_COMPLETE
since it would already be skipped from the first policy.
This is how I interpret the current proposal :
Start
= start/restart the orchestration
No-op
= leave the orchestration in its current state (do not start/restart)
Doesn't Exist | Running | Completed | Failed | Paused | |
---|---|---|---|---|---|
TERMINATE_IF_EXISTS |
Start | Start | Start | Start | Start |
THROW_IF_EXISTS |
Start | No-op | No-op | No-op | No-op |
SKIP_IF_EXISTS |
Start | No-op | No-op | No-op | No-op |
Assuming my interpretation above is correct, I would like to see the following policy added SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED
Doesn't Exist | Running | Completed | Failed | Paused | |
---|---|---|---|---|---|
SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED |
Start | No-op | No-op | Start | No-op |
Thanks @olitomlinson for the explanation on SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED
I think it makes sense to support idempotent design. @cgillum what do you think about this OPTION?
Catching up on this discussion. If I understand the proposal from @olitomlinson, the problem you're trying to solve is that you want to only reuse the orchestration ID if an existing instance is in a failed state. I think that makes sense. Would it just be "failed", or could it also include "Terminated" (and perhaps some future "Canceled" state)?
Wouldn't it be easier just to say RESTART_IF_FAILED
(or RESCHEDULE_IF_FAILED
)?
It appears to only be a rule for when the workflow is in a failed state.
@cgillum I agree, it should be SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED_OR_TERMINATED
- I would bucket paused
and terminate
together as they are both deliberate human actions, which shouldn't be automatically restarted.
Although I think going down this path is going to lead to a trying to support a huge set of highly specific, Policies which isn't sustainable.
Suggestion
Divorce the ACTION (Terminate
, Skip
, Throw
) from the applicable STATE/S, such that a user can dial in the exact requirement.
TERMINATE_IF_EXISTS
would be expressed as...
WorkflowOptions {
ReusePolicy = new ReusePolicy(action: Action.Terminate, states: new[]{ State.All })
}
SKIP_IF_RUNNING_OR_COMPLETE_OR_PAUSED_OR_TERMINATED
would be expressed as...
WorkflowOptions {
ReusePolicy = new ReusePolicy(action: Action.Skip, states: new[]{ State.Running, State.Complete, State.Paused, State.Terminated})
}
I'd recommend using enum flags for the states field.
I'm in agreement with the more fine-grained policy. It introduces more conceptual complexity, IMO, but it's more flexible, precise, and better aligned with existing APIs in the .NET version of the Durable Task Framework.
I'd recommend using enum flags for the states field.
I've never used Enum Flags so no first hand experience, but they look great for this use-case!
When rethinking this proposal, I found there are some cases that are not quite clear. For ex:
ReusePolicy = new ReusePolicy(action: Action.Skip, states: new[]{ State.Pending})
For this option, we are creating a new instance with action SKIP
and target status Pending
, which means it will try to create a new instance, but if it finds there is an instance in Pending
status then it will skip creating the new instance. What if we have an instance in Running
status, should we skip creating the new instance or should we terminate the running instance and create the new one?
Summarize the confusion in below pic
@clintsinger @cgillum @olitomlinson
First thoughts, the first two seem straight forward :
Instance exists
-> skip
-> status not in target status set
= terminate existing instance and create new one
Instance exists
-> throw
-> status not in target status set
= terminate existing instance and create new one
However I'm stuck on this one...
Instance exists
-> terminate
-> status not in target status set
= ?????
If we think of the second condition as a "when/where" clause then perhaps we can simplify things a bit and just eliminate the "skip" condition since that could be the default for the other two if the predicate isn't matched.
Or some variation on that.
The "variation on that" could be that we keep the "throw" as the default path as it is today.
Does this pseudocode seem like the right behavior?
ok = not exists
if exists:
if skip and status_matches:
# log a warning
ok = True
if terminate and status_matches:
# terminate the instance
ok = True
if not ok:
# throw
if not skip:
# create the instance
I think it should be
if not exists:
# create instance
else:
ok = false
if skip and status_matches:
# log a warning
ok = True
if terminate and status_matches:
# terminate the instance
ok = True
if not ok:
# throw
if not skip:
# create instance
The reason is that when customer create a instance using
ctx.createOrhcestration(instanceID, action.SKIP, [status.Pending, status.Runnning])
I interpret as they still want to create an instance if there is no instance exists.
So that is to say, if no instance exists, the behavior of createOrhcestration
should always be creating a new instance no matter what options used here.
Another thing to point out for the above pseudocode (Chris's as well) is that if the customer chooses the action THROW then no matter what statuses they set, it will always Throw if an instance already exists. I just want to make sure this is what we want.
@cgillum
Yeah, it looks like my skip
logic wasn't quite right, but I suppose that's what unit tests are for. :)
And I see your point about throw
. Since it's the default behavior anyways, using it with status filters is meaningless. Does that suggest we could get by without an explicit throw
action, since that's the default behavior anyways? It seems we only need to opt-into skip
and terminate
.
Is this related to supporting a start-with-signal pattern, or is that a separate concern?
start-with-signal will send an event to a specific workflow instance and make sure that instance is running. In that order, to avoid a race.
If the reuse policy can be attached to a specific start request, then I imagine this pattern could be implemented in two separate requests (signal + start) as I described.
This idea comes from https://github.com/dapr/dapr/issues/7101, as it relates to reusing existing workflow IDs.
When scheduling the creation of a new orchestration, the following options should be available:
TerminateIfExists
- Terminates any existing orchestrations with the same instance ID and then schedules a new instance as one atomic action, similar to on-demand ContinueAsNew.SkipIfExists
- If there is an existing orchestration already scheduled, then the scheduler does nothing.ThrowIfExists
- If there is an existing orchestration then the scheduler throws an exception (this is the current behavior).As an optional extension of this, we can consider exposing an "Execution ID" or "Generation ID" property to the orchestration which can help distinguish different instantiations of the given instance ID.