microsoft / durabletask-mssql

Microsoft SQL storage provider for Durable Functions and the Durable Task Framework

Adds config for min and delta backoff poll intervals #174

Closed: dmetzgar closed this 1 year ago

dmetzgar commented 1 year ago

PR for #172. Makes the minimum and delta backoff intervals configurable for both activity and orchestration instance polling. In deployments where many pods host TaskHubWorkers, the default 50ms poll interval consumes a significant amount of DTU. This becomes even more pronounced when Azure SQL changes the query plan due to parameter sniffing. In production, averaged over 24 hours, we see 12 polls per second against each of the NewTasks and Instances tables.
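For illustration, here's roughly how I'd expect these settings to be consumed. This is a sketch only: the four polling-interval property names are hypothetical placeholders, not necessarily the final API surface of this PR.

```csharp
using System;
using DurableTask.Core;
using DurableTask.SqlServer;

var settings = new SqlOrchestrationServiceSettings("<connection string>")
{
    // NOTE: hypothetical property names; see the diff for the real ones.
    // The idea is to poll activities far less aggressively than
    // orchestration instances.
    MinActivityPollingInterval = TimeSpan.FromSeconds(1),
    DeltaActivityPollingInterval = TimeSpan.FromSeconds(1),
    MinInstancePollingInterval = TimeSpan.FromMilliseconds(50),
    DeltaInstancePollingInterval = TimeSpan.FromMilliseconds(100),
};

// Standard Durable Task Framework wiring around the SQL provider.
var service = new SqlOrchestrationService(settings);
var worker = new TaskHubWorker(service);
```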

A behavior we've seen that you might be able to confirm, @cgillum, concerns new task polling. It appears that activities scheduled by an orchestration are inserted into the NewTasks table with a lock. The TaskHubWorker running the orchestration instance is then also expected to run the activities, and only after the lock expires can they be picked up by other workers. I suspect this is the behavior because when I put the orchestration and activity into separate TaskHubWorkers, the activities don't get executed. From my point of view, the only times tasks need to be picked up by another worker are when the TaskHubWorker goes down due to a failure or deployment, or when the task is a timer set further into the future than the lock expiration. We have therefore increased the activity polling intervals to be much higher than the instance polling intervals.

dmetzgar commented 1 year ago

@microsoft-github-policy-service agree company="UiPath"

cgillum commented 1 year ago

It appears that the behavior of activities scheduled by an orchestration are inserted to the NewTasks table with a lock.

I'm pretty confident that this is not the case. There should be no lock when rows are added to the NewTasks table. We would much prefer that these activities be load-balanced across multiple workers.

I think this is the behavior because when I put orchestration and activity into separate TaskHubWorkers, the activities don't get executed.

We actually require that all TaskHubWorkers register the exact same set of activities and orchestrations. If you don't do this, you can expect runtime exceptions complaining that an activity or orchestration wasn't found on the worker that didn't register it. I'd like to add a feature that allows splitting them up, but I haven't been able to prioritize it yet.
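To make the requirement concrete, every worker needs registrations along these lines; `MyOrchestration` and `MyActivity` are stand-in types for illustration:

```csharp
using System.Threading.Tasks;
using DurableTask.Core;
using DurableTask.SqlServer;

var settings = new SqlOrchestrationServiceSettings("<connection string>");
var worker = new TaskHubWorker(new SqlOrchestrationService(settings));

// Every worker must register BOTH orchestrations and activities, even if you
// intend it to execute only one kind; otherwise work items routed to it fail
// with "not found" runtime exceptions.
worker.AddTaskOrchestrations(typeof(MyOrchestration));
worker.AddTaskActivities(typeof(MyActivity));
await worker.StartAsync();

class MyOrchestration : TaskOrchestration<string, string>
{
    public override async Task<string> RunTask(OrchestrationContext context, string input)
        => await context.ScheduleTask<string>(typeof(MyActivity), input);
}

class MyActivity : TaskActivity<string, string>
{
    protected override string Execute(TaskContext context, string input) => $"Hello, {input}!";
}
```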

If you're observing that activities typically run on the same worker as the orchestrations that schedule them, it's likely because we reset the backoff polling interval on the local worker when we detect that an orchestration has scheduled activities (or sub-orchestrations). We do this to minimize the latency between scheduling tasks and having them start running. The side effect is that tasks are biased toward running on the same worker that scheduled them. Distribution typically happens when load is higher and the local worker can't fetch and execute tasks as quickly.
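In simplified form (this is an illustrative sketch, not the provider's actual implementation), the per-worker polling loop behaves something like this:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BackoffPoller
{
    // Illustrative values: the 50ms minimum matches the default mentioned above.
    static readonly TimeSpan Min = TimeSpan.FromMilliseconds(50);
    static readonly TimeSpan Delta = TimeSpan.FromMilliseconds(100);
    static readonly TimeSpan Max = TimeSpan.FromSeconds(5);

    TimeSpan current = Min;

    // Called when the local worker detects that an orchestration has scheduled
    // activities or sub-orchestrations, biasing pickup toward this worker.
    public void Reset() => current = Min;

    public async Task PollLoopAsync(Func<Task<bool>> tryDequeueAsync, CancellationToken ct)
    {
        while (!ct.IsCancellationRequested)
        {
            if (await tryDequeueAsync())
            {
                current = Min;  // found work: go back to aggressive polling
            }
            else
            {
                await Task.Delay(current, ct);
                // Empty poll: back off by a fixed delta, up to a cap.
                current = current + Delta > Max ? Max : current + Delta;
            }
        }
    }
}
```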