temporalio / sdk-go

Temporal Go SDK
https://docs.temporal.io/application-development?lang=go
MIT License
523 stars 208 forks source link

Delay in starting a scheduled workflow #928

Open muralisrini opened 1 year ago

muralisrini commented 1 year ago

Expected Behavior

The delay between Scheduled Workflow task and actual start of the workflow should be quite small. It certainly should not be order of a minute.

Actual Behavior

When running multiple workflows in a loop see some of workflows waiting for a long time in scheduled state (order of a minute) before starting. For example from temporals Web UI:

image

That task was scheduled at 17:07:28:35 and ran at 17:08:28:12.

This was seen with a modified Mutex Sample. See Steps below for details please.

Few points:

There is also some discussion on the forum : https://community.temporal.io/t/delay-in-scheduling-a-workflow/5996

Steps to Reproduce the Problem

Setup

  1. Pull https://github.com/muralisrini/samples-go/tree/mutex_add_workflows
  2. cd to the mutex/worker directory and go build
  3. cd to the mutex/starter directory and go build
  4. run ./temporalite start --namespace default
  5. run worker/worker # call this Worker window

Start

  1. run starter/starter # call this Starter window

The starter runs the original starter code with 2 workflows - but in a loop of 8 to make 18 works in total. You should see the "Worker Window" stall for a long time for some workflows. The Web UI should show delays similar to the one in the picture above.

Specifications

muralisrini commented 1 year ago

Anyone take a look at this ?

tsurdilo commented 1 year ago

I believe this is due to your starter explicitly waiting on workflow completion which would block on the lock:

https://github.com/muralisrini/samples-go/blob/mutex_add_workflows/mutex/starter/main.go#L52-L60

If you remove this (start executions async and dont wait for result) you should not see the delay you are explaining in this issue

manetumatt commented 1 year ago

I am seeing this issue as well without the start executions waiting for results at all. The worker seems to stall/stop receiving tasks under certain conditions outlined here. I have observed the following:

I have reproduced this with simple changes to the mutex sample:

starter/main.go https://github.com/manetumatt/samples-go/blob/debug1/mutex/starter/main.go

I also shortened the critical section delay to 500ms:

mutex_workflow.go https://github.com/manetumatt/samples-go/blob/debug1/mutex/mutex_workflow.go#L208

This sample code executes 5 workflows which start processing immediately. The code then sleeps for 10 seconds and then executes a second set of 5 workflows. The second set will not start processing for nearly a minute.

tsurdilo commented 1 year ago

Thanks for update. Can you share the server version you are running against?

manetumatt commented 1 year ago

Thank you for the quick reply. First encountered the issues against v1.18.0. Switched to Temporalite v0.3.0 (server 1.19.0 I think) for local debugging and testing. Saw the worker stall issue against both backends.

from Temporalite: (base) mltmbp:samples mtaylor$ tctl adm cl d { "supportedClients": { "temporal-cli": "\u003c2.0.0", "temporal-go": "\u003c2.0.0", "temporal-java": "\u003c2.0.0", "temporal-php": "\u003c2.0.0", "temporal-server": "\u003c2.0.0", "temporal-typescript": "\u003c2.0.0", "temporal-ui": "\u003c3.0.0" }, "serverVersion": "1.19.0", "membershipInfo": { "currentHost": { "identity": "127.0.0.1:7233"

manetumatt commented 1 year ago

Still seeing this issue with latest version. Anyone had a chance to look at this?

dnr commented 1 year ago

I believe this is fixed by https://github.com/temporalio/temporal/pull/4562, which is included in server 1.21.1 and above. Please try the latest server version and see if you can still reproduce it.