temporalio / temporal

Temporal service
https://docs.temporal.io
MIT License
11.51k stars 823 forks source link

CompleteUpdate message is sometimes not honored when in same WFT completion as ContinueAsNew #6375

Open dandavison opened 1 month ago

dandavison commented 1 month ago

Expected Behavior

CompleteUpdate command should be honored when submitted in the same workflow task completion as ContinueAsNew.

Actual Behavior

Occasionally, there is a server error and it is not honored, and the update is applied to the post-ContinueAsNew run.

I have a workflow that CANs in the same WFT as an update is completed. I ensure that the update is sent in the first WFT and, most of the time, I see what I expect:

  1. worker receives WFT: [doUpdate, startWorkflow]
  2. worker sends WFT completion: [accept&completeUpdate, CAN]
  3. update caller gets successful update response
  4. worker receives WFT after CAN: [startWorkflow]
  5. workflow completes

However, sometimes (~1/20) I see

  1. \<as above>
  2. \
  3. server error when handling UpdateWorkflowExecution: unable to locate current workflow execution (update caller does not get a response)
  4. WFT after CAN contains the update for a second time, in addition to startWorkflow

I assume what's happening is that the server error causes an internal retry (FE => History) of the UpdateWorkflowExecution, and this retry lands on the 2nd run, after CAN. But I haven't yet dug into why the server error happens

    // for close workflow we need to check if it is still the current run
    // since it's possible that the workflowID has a newer run before it's locked 

wcache.GetCurrentRunID(...) != wfContext.GetWorkflowKey().RunID

Steps to Reproduce the Problem

This was discovered while writing tests for sdk-typescript. In lieu of a simpler repro, it can be repro'd as follows

steps to repro ``` git clone git@github.com:temporalio/sdk-typescript.git cd sdk-typescript git fetch origin 1459-server-error-repro git checkout 1459-server-error-repro git submodule update --init --recursive # Build temporal CLI against server 99f15c4dabf9a22763aaa13653c83a832bd0fffa # Look at last commit, modify to use path to your temporal CLI executable rustup update npm install npm run build for i in `seq 1 100`; do (cd packages/test && npm run test -- -m 'unfinished update handler with continue-as-new waiting for all handlers to finish') || break; done ```
Log in happy case ``` [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina tionTypeUpdate","meta":{"updateId":"update-id","identity":"57045@dan-2.local"},"runValidator":true}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb mlzaGVkIg=="}],"randomnessSeed":"3270138216709017763","identity":"57045@dan-2.local","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","at tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604980","nanos":463150000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8" :114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1 ":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne w_waiting_for_all_handlers_to_finish","versioningIntent":0}}] ✅ caller got update result [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17192061066528283883","workflowRunTimeout":{},"workflowTaskTimeout":{"seco nds":"10"},"continuedFromExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3 d23c1fb66","attempt":1,"cronScheduleToScheduleInterval":{"nanos":501851000},"memo":{},"startTime":{"seconds":"1722604980","nanos":938805000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"completeWorkflowExecution":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97," 4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}] ```
Log in unhappy case ``` [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina tionTypeUpdate","meta":{"updateId":"update-id","identity":"57075@dan-2.local"},"runValidator":true}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb mlzaGVkIg=="}],"randomnessSeed":"3129644504916450612","identity":"57075@dan-2.local","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","at tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604985","nanos":101629000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8" :114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1 ":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne w_waiting_for_all_handlers_to_finish","versioningIntent":0}}] time=2024-08-02T09:23:05.598 level=ERROR msg="service failures" operation=UpdateWorkflowExecution wf-namespace=default error="unable to locate current workflow execution" [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina tionTypeUpdate","meta":{"updateId":"update-id","identity":"57075@dan-2.local"},"runValidator":true}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠 job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17700601131409760754","workflowRunTimeout":{},"workflowTaskTimeout":{"seco nds":"10"},"continuedFromExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e7 2c853f631","attempt":1,"cronScheduleToScheduleInterval":{"nanos":566360000},"memo":{},"startTime":{"seconds":"1722604985","nanos":513977000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"completeWorkflowExecution ":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97,"4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}] ```

Specifications

dandavison commented 1 month ago

It seems that there are a total of 3 UpdateWorkflowRequests made to History service.

UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd"  request:{namespace:"default"  workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"}  first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb"  wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED}  request:{meta:{update_id:"update-id"  identity:"66459@dan-2.local"}  input:{header:{}  name:"unfinishedHandlersWorkflowTerminationTypeUpdate"  args:{}}}}
<nil>, Workflow Update was aborted.

UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd"  request:{namespace:"default"  workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"}  first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb"  wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED}  request:{meta:{update_id:"update-id"  identity:"66459@dan-2.local"}  input:{header:{}  name:"unfinishedHandlersWorkflowTerminationTypeUpdate"  args:{}}}}
<nil>, unable to locate current workflow execution

UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd"  request:{namespace:"default"  workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"}  first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb"  wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED}  request:{meta:{update_id:"update-id"  identity:"66459@dan-2.local"}  input:{header:{}  name:"unfinishedHandlersWorkflowTerminationTypeUpdate"  args:{}}}}
<nil>, workflow execution already completed <-- this happens because of the nature of the Typescript test being used.
mfateev commented 1 month ago

Wouldn't the same happen with a signal?

alexshtin commented 1 month ago

It is kinda similar observable effect but probability to face it in real life are different, and I have hard times to understand which one is more likely.

Signals can be delivered twice when event was successfully written into the history by history service, and then it crashed w/o replying to frontend, or frontend service lost connection to history service, and workflow, while receiving and processing this signal replies with CAN, and then frontend, or API caller retries and 2nd attempt lands on 2nd run. Something really bad should happen on the server to trigger it.

Updates are different in a sense that any error with workflow (in the @dandavison's case it was matching tried to start workflow task 2nd time) leads to lost update, and trigger frontend retry. Retries for updates are part of design: server relies on retries to recreate update after those errors. This sounds like much more likely to happen. With one caveat though: updates are mostly delivered with speculative workflow task, and this task is also lost with any error. It means that server wouldn't process update acceptance for the 1st run. Update can go on normal workflow task only if it piggybacks existing workflow task created by something else (this is exactly what happened in @dandavison`s case where update was delivered on first workflow task). This fact, seems to me, significantly reduce chances of facing this issue in real life.