microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework
MIT License
87 stars 32 forks source link

A good way to migrate a running orchestration instance? #87

Open jundayin opened 2 years ago

jundayin commented 2 years ago

Hi,

Our team is implementing a solution for migrating in-progress orchestration instance from one database to another one and it can be resumed and picked up by the worker after. We believe it would be better to pause the entire orchestration instance before running the SQL command moving the data.

The definition of pausing the migration is quite simple: The state of the orchestration instance will remained still and the worker will not pick it up until we allow it to.

There is no official API available by the SQL Provider for this type of "pause" operation thus we are looking to existing schema and see if we can utilize any of them. For now, LockExpiration in [dt].[Instances] and [dt].[NewTasks] and the VisibleTime in [dt].[NewEvents] seems strong candidates to logically pause the orchestration. We have some PoC that setting LockExpiration and VisibleTime for NewEvents and NewTasks to some far time could keep the worker from picking them up, but we have not tested if it works for Instances as well.

So my question to the owner of SQL Provider is does this approach lead to any unexpected behavior for worker and do you have any suggestion on the migration?

jundayin commented 2 years ago

tagging @cgillum for SQL Provider owner and @mws19901118

cgillum commented 2 years ago

You might be fighting with the system if you try to use LockExpiration for pausing since the system will be constantly trying to lock (and unlock) these instances anyways.

One intention I had for supporting this was to introduce a 'Suspended' runtime status in the database. We actually already check for this status value (which isn't official yet) in at least one place in our stored procedures. The idea is that it will prevent the orchestration from getting picked up and executed by any workers until it's put back into the "Running" status. You may want to play with this approach instead of messing with LockExpiration.

BTW, if you find something that works, we'd love to possibly accept it as a contribution to this project. :) Perhaps there could be a stored procedure for pausing/suspending and another for resuming a paused/suspended instance? If you have a generic way of migrating an instance as well, that may also be very interesting to this project.

FYI @davidmrdavid since we're investigating a feature like this (suspend/resume) for the other storage backends as well.

jundayin commented 2 years ago

Thanks for the advice and I will take a look at it

jundayin commented 2 years ago

We actually did some PoC from our side. From what we observed the current logic of DTF does not block worker scheduling new sub-orchestration/activity when we change the runtime status of one Running orchestration instance to Suspended. Is this behavior by design or we are experimenting with wrong approach (we didn't add timer between tasks inside orchestration but just simply set breakpoint there, I assume worker will continue process the activity regardless, right?)

Say Orchestration Test. It contains activity A and activity B

  1. Worker picks it up Test
  2. Worker schedule activity A
  3. We changed the runtime status of Test to Suspended
  4. What we expected: activity B won't be scheduled. What actually happens: activity B was scheduled and succeed
cgillum commented 2 years ago

This matches my expectation and is the same outcome I'd expect if you terminate an orchestration - changing the status of an orchestration instance doesn't have any effect on any already scheduled activity tasks.

If we want to expand the scope of "Suspended" to also halt any previously scheduled activity tasks, then we'd need to update the dt._LockNextTask stored procedure to do a join on the dt.Instances table to check for a non-suspended status for the orchestration instance that scheduled the task (and verify that this change doesn't introduce any new deadlocks or substantial performance degradation).