temporalio / temporal


Add new Workflow Id Reuse Policy: Allow Duplicate with Queueing #4386

Open: jsjeannotte opened this issue 1 year ago

jsjeannotte commented 1 year ago

Is your feature request related to a problem? Please describe. N/A

Describe the solution you'd like Currently, when a duplicate Workflow ID is submitted and the "Reject Duplicate" policy is in effect, the SDK raises an exception, which is the desired behavior. It would be useful to add a new reuse policy such as "Queue Duplicate", where the duplicate workflow would be queued (with a new state like "Queued") and started once the running execution completes.

Describe alternatives you've considered The alternative is the "master-workflow + signal + child-workflow" recipe, which reduces observability (e.g. the UI can't be used to see which workflows are queued) and increases design complexity for simple use cases.

Additional context N/A

samarabbas commented 1 year ago

@jsjeannotte can you provide more details about the use case? You referenced Reject Duplicate, which guarantees a single execution for that WorkflowID irrespective of the Close Status. Why can't you use Allow Duplicate, which would allow a new execution regardless of how the previous execution closed? Or is the scenario you have in mind to allow queuing up multiple executions simultaneously and then process them sequentially, one after another?
Can you provide more details about the source where the StartWorkflowExecution requests are coming from?

jsjeannotte commented 1 year ago

I can't use Allow Duplicate because I don't want to allow duplicates; I want to queue them instead. We're trying to replicate a feature we leverage in Jenkins and our Spinnaker Pipelines, where we can configure our executions to "Queue concurrent requests". Without that feature in Temporal, we'll have to build a queueing system ourselves and keep retrying new executions until Temporal completes the pending one.

So yes, the scenario is exactly that: Allow queuing up multiple executions simultaneously and then process them sequentially one after another.

Can you provide more details about the source where the StartWorkflowExecution requests are coming from? Mostly from a user requesting a Workflow execution to be queued.

Simplified fictional example: let's say we have a Temporal Workflow that performs offline maintenance on a Database node, and assume only one node per Database can be offline at a time. Assume the Workflow ID is "offline-replace-node-database-". The user makes a request to trigger the offline maintenance on Database A for Node X. All good. Another user makes a request to trigger an offline maintenance of Node Y on the same Database A. Since we use Reject Duplicate, the second user gets a warning, and the only option is to try again later.

Options for us are: 1) build a queuing system ourselves, so the second user receives an "All good, your request was queued" message instead of a 429, or 2) Temporal supports Queue Duplicate, so the second user receives the same "All good, your request was queued" response, but we don't have to build a queuing system :)

samarabbas commented 1 year ago

Have you considered modeling this using SignalWithStart? Basically, the idea would be that any operation on a Node is communicated to a workflow through a signal. SignalWithStart allows a workflow execution to be created if none exists. If a workflow execution already exists and there is an operation in flight, you just queue up the new operation within the workflow itself. I might actually try to build a sample that showcases this approach. Basically you are building a serialization mechanism for a resource, and doing it within a single execution is much simpler than spreading it across multiple executions.
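Just as a rough sketch of what I mean, in the Python SDK it could look something like this (the workflow, signal, and activity names below are made up for this example):

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List

from temporalio import workflow
from temporalio.client import Client


@dataclass
class MaintenanceRequest:
    database: str
    node: str


@workflow.defn
class NodeMaintenanceQueueWorkflow:
    def __init__(self) -> None:
        self._pending: List[MaintenanceRequest] = []

    @workflow.run
    async def run(self) -> None:
        while True:
            await workflow.wait_condition(lambda: bool(self._pending))
            request = self._pending.pop(0)
            # Requests are processed strictly one at a time, in arrival order.
            await workflow.execute_activity(
                "perform_offline_maintenance",
                request,
                start_to_close_timeout=timedelta(minutes=30),
            )
            # A real implementation would continue-as-new periodically
            # to keep the event history bounded.

    @workflow.signal
    def submit(self, request: MaintenanceRequest) -> None:
        self._pending.append(request)


async def request_maintenance(client: Client, database: str, node: str) -> None:
    # SignalWithStart: creates the per-database queue workflow if it doesn't
    # exist yet, and in any case delivers the "submit" signal to it.
    await client.start_workflow(
        NodeMaintenanceQueueWorkflow.run,
        id=f"offline-maintenance-database-{database}",
        task_queue="ops-automation",
        start_signal="submit",
        start_signal_args=[MaintenanceRequest(database=database, node=node)],
    )
```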

I also want to clarify some confusion around Allow Duplicate: the Workflow Id Reuse Policy only applies once the previous execution has closed. Regardless of the policy, a start request is rejected while an execution with the same Workflow ID is still running.

jsjeannotte commented 1 year ago

Yes, I often get confused about Allow Duplicate vs Reject Duplicate and the fact that Allow Duplicate actually means: allow a duplicate, but fail if that Workflow ID is still running. So yes, what I'm after is more like Allow Duplicate with Queueing (a looser version of Allow Duplicate).

Have you considered modeling this using SignalWithStart? I haven't. Having an example would indeed help me wrap my head around what you are suggesting.

For example, you mention that "you just queue up a new operation ...", which, if I understand correctly, means that I still need to build and maintain queues, right?

I also failed to mention that I provide a platform that includes Temporal as a way for our users to write their Ops automation (and more), so having the simplest interface possible helps them onboard to Temporal (for example, not having to understand Signals or long-running Workflows with ContinueAsNew for very basic things). A lot of their use cases would be solved by a single activity wrapped in a workflow (since each of these use cases is currently a single Jenkins job running a single Python script).

We've even built an abstraction that lets them wrap a single Python function into a Schedule + Workflow + Activity, so that for extremely simple use cases our users don't even have to understand how Temporal works:

The user only writes this:

register_periodic_worker(
    PeriodicWorker(
        name="demo_test_hourly_with_arg",
        interval=timedelta(hours=1),
        start_to_close_timeout=timedelta(minutes=5),
        task=partial(test_callable_with_arg, "param1"),
        maximum_attempts=3,
    )
)

So having the ability for them to do something like:

register_queued_worker(
    QueuedWorker(
        name="demo_test_hourly_with_arg",
        start_to_close_timeout=timedelta(minutes=5),
        task=partial(test_callable_with_arg, "param1"),
        maximum_attempts=3,
    )
)

... which would configure the Workflow Id Reuse Policy to Allow Duplicate with Queuing, would be extremely useful.

mjameswh commented 1 year ago

First things first: you mention that these Workflows only get started from Schedules. If that's correct, then the easiest way to serialize Workflow executions so that no more than one is running at any time would be to simply set that Schedule's policy.overlap option to ScheduleOverlapPolicy.BUFFER_ALL (see docs).
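For illustration, roughly what that could look like with the Python SDK (the Schedule ID, workflow name, and task queue below are made-up examples):

```python
from datetime import timedelta

from temporalio.client import (
    Client,
    Schedule,
    ScheduleActionStartWorkflow,
    ScheduleIntervalSpec,
    ScheduleOverlapPolicy,
    SchedulePolicy,
    ScheduleSpec,
)


async def create_buffered_schedule(client: Client) -> None:
    # BUFFER_ALL buffers any run that would overlap and starts it as soon as
    # the current run completes, so runs never execute concurrently.
    await client.create_schedule(
        "offline-maintenance-database-a-hourly",
        Schedule(
            action=ScheduleActionStartWorkflow(
                "ReplaceNodeDatabaseWorkflow",
                id="offline-replace-node-database-a",
                task_queue="ops-automation",
            ),
            spec=ScheduleSpec(
                intervals=[ScheduleIntervalSpec(every=timedelta(hours=1))]
            ),
            policy=SchedulePolicy(overlap=ScheduleOverlapPolicy.BUFFER_ALL),
        ),
    )
```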

Now, I understand this might not be sufficient for your needs, as the Schedules API currently provides no way to inspect buffered executions. I opened a feature request for this here.

Regarding implementing execution queuing yourself, I would generally recommend the approach mentioned by Samar (that is, a single workflow on which you do signalWithStart, with everything happening in that unique workflow or in child workflows it starts), but again, I understand that this doesn't cover some of your needs.

Instead, it may make sense to reverse this pattern: start a different workflow execution for each task, then have each workflow do signalWithStart on a controller Workflow and wait for the controller to signal back when it's OK to proceed. That obviously means the task workflow needs to signal the controller again when it completes. This approach adds some overhead compared to the single-workflow pattern described previously, but that overhead pays off in improved visibility, as queued tasks are now visible in the workflow listing.

For example, you could have something like this:

Workflow Id                            Workflow Type                Status
-------------------------------------  ---------------------------  ---------
replace-database-mydb-20231012-045623  ReplaceNodeDatabaseWorkflow  Completed
replace-database-mydb-20231012-051276  ReplaceNodeDatabaseWorkflow  Running...
replace-database-mydb-20231012-053118  ReplaceNodeDatabaseWorkflow  Running...
replace-database-mydb-20231012-064712  ReplaceNodeDatabaseWorkflow  Running...
replace-database-mydb                  OneAtATimeCoordinator        Running...
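And, as a very rough sketch only, the coordinator side could look something like this in the Python SDK (the OneAtATimeCoordinator name and the acquire / proceed / release signals are made up; continue-as-new and error handling are left out):

```python
from collections import deque
from typing import Deque, Optional

from temporalio import workflow


@workflow.defn
class OneAtATimeCoordinator:
    def __init__(self) -> None:
        self._waiting: Deque[str] = deque()  # task workflow IDs waiting for their turn
        self._holder: Optional[str] = None   # task workflow currently allowed to run

    @workflow.run
    async def run(self) -> None:
        while True:
            # Wait until there is a queued task and nothing is running.
            await workflow.wait_condition(
                lambda: bool(self._waiting) and self._holder is None
            )
            self._holder = self._waiting.popleft()
            # Tell that task workflow it may proceed.
            handle = workflow.get_external_workflow_handle(self._holder)
            await handle.signal("proceed")
            # Wait for it to release before handing out the next turn.
            await workflow.wait_condition(lambda: self._holder is None)

    @workflow.signal
    def acquire(self, task_workflow_id: str) -> None:
        self._waiting.append(task_workflow_id)

    @workflow.signal
    def release(self, task_workflow_id: str) -> None:
        if self._holder == task_workflow_id:
            self._holder = None
```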

This also makes it possible for users to interact directly with the task Workflows, so they can, for example, cancel a queued execution or inspect the result/history of a specific completed task. This pattern also works better with Schedules than the previous suggestion.

To avoid making it harder for your users to write their own workflows, you can easily extract that coordination work (i.e. signalWithStart the coordination workflow, wait for a signal from it, and send it back an unlock signal once the task workflow completes), for example by moving it into a Workflow interceptor, having users wrap their own workflow code in some wrapper function, or using the dynamic Workflow feature.
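For example, the task-side coordination could be hidden in a wrapper along these lines (again just a sketch with hypothetical names, assuming the coordinator workflow above is already running, e.g. started via signalWithStart by the starter or by an activity):

```python
from temporalio import workflow


@workflow.defn
class ReplaceNodeDatabaseWorkflow:
    def __init__(self) -> None:
        self._may_proceed = False

    @workflow.signal
    def proceed(self) -> None:
        # Sent by the coordinator when it is this workflow's turn.
        self._may_proceed = True

    @workflow.run
    async def run(self, coordinator_id: str) -> None:
        me = workflow.info().workflow_id
        coordinator = workflow.get_external_workflow_handle(coordinator_id)
        # Enqueue ourselves and block until the coordinator lets us through.
        await coordinator.signal("acquire", me)
        await workflow.wait_condition(lambda: self._may_proceed)
        try:
            # The user's actual task goes here (e.g. a single activity call).
            ...
        finally:
            # Release the slot so the next queued task workflow can run.
            await coordinator.signal("release", me)
```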

Does that make sense to you?

jsjeannotte commented 1 year ago

First things first: you mention that these Workflows only get started from Schedules

No. Could be on a Schedule, or could be on-demand.

Instead, it may make sense to reverse this pattern: start a different workflow execution for each task, then have each workflow do signalWithStart on a controller Workflow and wait for the controller to signal back when it's OK to proceed. That obviously means the task workflow needs to signal the controller again when it completes. This approach adds some overhead compared to the single-workflow pattern described previously, but that overhead pays off in improved visibility, as queued tasks are now visible in the workflow listing.

That's something I was thinking about this week :) This would indeed help with visibility. And it might be easier to extract as a building block. I'll play with this a bit.

Thanks all! But again, I would still appreciate it if Allow Duplicate with Queueing were supported ;)

jsjeannotte commented 1 year ago

@mjameswh By the way, I'm also from Montreal :)