temporalio / sdk-go

Temporal Go SDK
https://docs.temporal.io/application-development?lang=go
MIT License
Weird workflow task failure #813

Open yiminc opened 2 years ago

yiminc commented 2 years ago

We see one weird workflow task failure from the SDK that keeps retrying and eventually succeeds after 13K retry attempts. The workflow logic is that it schedules 16 activities and waits for all of them to complete. The history shows one of the workflow tasks timing out due to the schedule_to_start timeout (the 5s sticky timeout). After that, a new workflow task is scheduled, but it fails with an SDK panic complaining that the activity ID was not found. After about 13K retry attempts, it magically succeeds. The binary checksum is the same before and after the failure.

cretz commented 2 years ago

Have you been able to replicate this? I am afraid the information given is not enough to go on. I can write a workflow that schedules 16 activities and waits for them to complete. I can simulate a workflow task timeout. But I fear those won't replicate the failure. It is really important that we replicate the bug to confirm it exists.

cretz commented 2 years ago

The internal situation where this happened was due to a workflow task timing out before it even started (schedule-to-start sticky timeout), but the workflow task then failed continually across multiple workers, which makes it unlikely to be caused by a cached workflow issue.

Also, an attempted replay with the history of the workflow did not replicate the failure. This was with trimmed history and advanced logs to confirm that the activity ID was found.

cretz commented 2 years ago

Still unable to replicate. Can anyone else?