temporalio / sdk-go

Temporal Go SDK
https://docs.temporal.io/application-development?lang=go
MIT License

Unable to reset workflow with completed child workflow whose child workflow ID is generated by SDK #723

Open yycptt opened 2 years ago

yycptt commented 2 years ago

Expected Behavior

When resetting a workflow that has no pending child executions (reset with a pending child is currently not supported), the parent workflow should continue executing after the reset without any error.

Actual Behavior

After the reset, the parent workflow hits a non-deterministic error during replay. The error occurs while processing the child workflow init event, because the SDK can't find the corresponding child workflow command/state machine.

The root cause is that we use the child workflow ID to find the corresponding command. However, if the child workflow ID is not specified in the child options, the SDK automatically generates one based on the workflow's runID. After a reset, the workflow's runID changes, but the child workflow ID in the workflow history is still based on the original runID. The lookup therefore fails, which surfaces as the non-deterministic error.
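
To make the mismatch concrete, here is a minimal stand-alone sketch that does not use the SDK. The function names, the run IDs, and the "runID_sequence" ID format are all assumptions made for illustration; the SDK's real format may differ.

```go
package main

import "fmt"

// generateChildID mimics an SDK-style default: the parent's run ID plus a
// per-run sequence number. The exact format is an assumption.
func generateChildID(parentRunID string, seq int) string {
	return fmt.Sprintf("%s_%d", parentRunID, seq)
}

// replayFindsCommand reports whether the child ID recorded in history
// matches the ID that the workflow code regenerates during replay.
func replayFindsCommand(recordedChildID, currentRunID string, seq int) bool {
	return recordedChildID == generateChildID(currentRunID, seq)
}

func main() {
	// Hypothetical run IDs: one for the original run, one assigned by reset.
	const originalRunID = "run-aaa"
	const resetRunID = "run-bbb"

	// The child init event in history carries the ID from the original run.
	recorded := generateChildID(originalRunID, 1)

	// Replay in the original run regenerates the same ID.
	fmt.Println(replayFindsCommand(recorded, originalRunID, 1)) // true

	// Replay after reset regenerates a different ID, so the lookup fails.
	fmt.Println(replayFindsCommand(recorded, resetRunID, 1)) // false
}
```

The second lookup returning false corresponds to the "command not found" condition described above.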

Steps to Reproduce the Problem

  1. Run a workflow that starts a child workflow. Do NOT specify the child workflow ID in the child options.
  2. Reset the workflow to an event_id after the child workflow completed event. Any workflow task close event (completed/failed/timed out) will work.
  3. Check the new workflow's history: the workflow task fails with a non-deterministic error.

Specifications

yycptt commented 2 years ago

The server stores a workflow's original runID (in the workflow start event), which doesn't change during reset. So one potential solution is to generate the child workflow ID based on this original runID.
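
A rough sketch of that idea, again stand-alone rather than SDK code. The `runInfo` type, the `reset` helper, and the ID format are hypothetical; they only model a run ID that changes on reset next to an original run ID that does not.

```go
package main

import "fmt"

// runInfo is a hypothetical stand-in for the run metadata a workflow sees.
type runInfo struct {
	RunID         string // changes on every reset
	OriginalRunID string // set at first start and preserved across resets
}

// reset models a reset: the run gets a new RunID but keeps OriginalRunID.
func reset(info runInfo, newRunID string) runInfo {
	return runInfo{RunID: newRunID, OriginalRunID: info.OriginalRunID}
}

// stableChildID derives the default child ID from the original run ID,
// so the value is the same before and after a reset.
func stableChildID(info runInfo, seq int) string {
	return fmt.Sprintf("%s_%d", info.OriginalRunID, seq)
}

func main() {
	first := runInfo{RunID: "run-aaa", OriginalRunID: "run-aaa"}
	afterReset := reset(first, "run-bbb")

	// Both runs regenerate the same child ID, so replay can match history.
	fmt.Println(stableChildID(first, 1) == stableChildID(afterReset, 1)) // true
}
```

This only shows why the generated ID would stay stable; it says nothing about how such a change would interact with other reset behavior.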

But this solution may conflict with some future work related to reset. @yiminc Would you mind providing some insights here? Thanks.

askreet commented 9 months ago

I hit this in #1385. I'm curious if there's any risk in using the OriginalRunId by default in the meantime.