Open dandavison opened 1 month ago
It seems that there are a total of 3 UpdateWorkflowRequests
made to History service.
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"66459@dan-2.local"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, Workflow Update was aborted.
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"66459@dan-2.local"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, unable to locate current workflow execution
UpdateWorkflowExecution
namespace_id:"98d00113-d1ae-4623-8b71-f24f5aac51cd" request:{namespace:"default" workflow_execution:{workflow_id:"f95ace75-8b2e-4c78-933c-8b813e4c7b48"} first_execution_run_id:"d0442481-44d1-451b-9f14-50ad7d63b2bb" wait_policy:{lifecycle_stage:UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_COMPLETED} request:{meta:{update_id:"update-id" identity:"66459@dan-2.local"} input:{header:{} name:"unfinishedHandlersWorkflowTerminationTypeUpdate" args:{}}}}
<nil>, workflow execution already completed <-- this happens because of the nature of the Typescript test being used.
Wouldn't the same happen with a signal?
It is kinda similar observable effect but probability to face it in real life are different, and I have hard times to understand which one is more likely.
Signals can be delivered twice when event was successfully written into the history by history service, and then it crashed w/o replying to frontend, or frontend service lost connection to history service, and workflow, while receiving and processing this signal replies with CAN, and then frontend, or API caller retries and 2nd attempt lands on 2nd run. Something really bad should happen on the server to trigger it.
Updates are different in a sense that any error with workflow (in the @dandavison's case it was matching tried to start workflow task 2nd time) leads to lost update, and trigger frontend retry. Retries for updates are part of design: server relies on retries to recreate update after those errors. This sounds like much more likely to happen. With one caveat though: updates are mostly delivered with speculative workflow task, and this task is also lost with any error. It means that server wouldn't process update acceptance for the 1st run. Update can go on normal workflow task only if it piggybacks existing workflow task created by something else (this is exactly what happened in @dandavison`s case where update was delivered on first workflow task). This fact, seems to me, significantly reduce chances of facing this issue in real life.
Expected Behavior
CompleteUpdate command should be honored when submitted in the same workflow task completion as ContinueAsNew.
Actual Behavior
Occasionally, there is a server error and it is not honored, and the update is applied to the post-ContinueAsNew run.
I have a workflow that CANs in the same WFT as an update is completed. I ensure that the update is sent in the first WFT and, most of the time, I see what I expect:
[doUpdate, startWorkflow]
[accept&completeUpdate, CAN]
[startWorkflow]
However, sometimes (~1/20) I see
unable to locate current workflow execution
(update caller does not get a response)I assume what's happening is that the server error causes an internal retry (FE => History) of the UpdateWorkflowExecution, and this retry lands on the 2nd run, after CAN. But I haven't yet dug into why the server error happens
Steps to Reproduce the Problem
This was discovered while writing tests for
sdk-typescript
. In lieu of a simpler repro, it can be repro'd as followssteps to repro
``` git clone git@github.com:temporalio/sdk-typescript.git cd sdk-typescript git fetch origin 1459-server-error-repro git checkout 1459-server-error-repro git submodule update --init --recursive # Build temporal CLI against server 99f15c4dabf9a22763aaa13653c83a832bd0fffa # Look at last commit, modify to use path to your temporal CLI executable rustup update npm install npm run build for i in `seq 1 100`; do (cd packages/test && npm run test -- -m 'unfinished update handler with continue-as-new waiting for all handlers to finish') || break; done ```Log in happy case
``` [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina tionTypeUpdate","meta":{"updateId":"update-id","identity":"57045@dan-2.local"},"runValidator":true}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb mlzaGVkIg=="}],"randomnessSeed":"3270138216709017763","identity":"57045@dan-2.local","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","at tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604980","nanos":463150000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8" :114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1 ":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne w_waiting_for_all_handlers_to_finish","versioningIntent":0}}] ✅ caller got update result [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟠job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"a12a0761-d96e-4111-973b-1e54027baaed","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17192061066528283883","workflowRunTimeout":{},"workflowTaskTimeout":{"seco nds":"10"},"continuedFromExecutionRunId":"8a25effa-1499-46b6-be24-0a3d23c1fb66","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"8a25effa-1499-46b6-be24-0a3 d23c1fb66","attempt":1,"cronScheduleToScheduleInterval":{"nanos":501851000},"memo":{},"startTime":{"seconds":"1722604980","nanos":938805000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(a12a0761-d96e-4111-973b-1e54027baaed)] 🟢 activationCompletion: [{"completeWorkflowExecution":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97," 4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}] ```Log in unhappy case
``` [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina tionTypeUpdate","meta":{"updateId":"update-id","identity":"57075@dan-2.local"},"runValidator":true}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"ImNvbnRpbnVlLWFzLW5ldyI="},{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"IndhaXQtYWxsLWhhbmRsZXJzLWZpb mlzaGVkIg=="}],"randomnessSeed":"3129644504916450612","identity":"57075@dan-2.local","workflowTaskTimeout":{"seconds":"10"},"lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","at tempt":1,"cronScheduleToScheduleInterval":{},"memo":{},"startTime":{"seconds":"1722604985","nanos":101629000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"updateResponse":{"protoco lInstanceId":"update-id","completed":{"metadata":{"encoding":{"0":106,"1":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":117,"2":112,"3":100,"4":97,"5":116,"6":101,"7":45,"8" :114,"9":101,"10":115,"11":117,"12":108,"13":116,"14":34}}}},{"continueAsNewWorkflowExecution":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","arguments":[{"metadata":{"encoding":{"0":106,"1 ":115,"2":111,"3":110,"4":47,"5":112,"6":108,"7":97,"8":105,"9":110}},"data":{"0":34,"1":114,"2":101,"3":116,"4":117,"5":114,"6":110,"7":34}}],"headers":{},"taskQueue":"unfinished_update_handler_with_continue-as-ne w_waiting_for_all_handlers_to_finish","versioningIntent":0}}] time=2024-08-02T09:23:05.598 level=ERROR msg="service failures" operation=UpdateWorkflowExecution wf-namespace=default error="unable to locate current workflow execution" [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠job: doUpdate {"doUpdate":{"id":"update-id","protocolInstanceId":"update-id","name":"unfinishedHandlersWorkflowTermina tionTypeUpdate","meta":{"updateId":"update-id","identity":"57075@dan-2.local"},"runValidator":true}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟠job: startWorkflow {"startWorkflow":{"workflowType":"runUnfinishedHandlersWorkflowTerminationTypeWorkflow","workflowId ":"f1f37cd3-1c11-4aa7-ba58-f12f8880f594","arguments":[{"metadata":{"encoding":"anNvbi9wbGFpbg=="},"data":"InJldHVybiI="}],"randomnessSeed":"17700601131409760754","workflowRunTimeout":{},"workflowTaskTimeout":{"seco nds":"10"},"continuedFromExecutionRunId":"d47f00b8-1462-4a45-821d-5e72c853f631","continuedInitiator":"CONTINUE_AS_NEW_INITIATOR_WORKFLOW","lastCompletionResult":{},"firstExecutionRunId":"d47f00b8-1462-4a45-821d-5e7 2c853f631","attempt":1,"cronScheduleToScheduleInterval":{"nanos":566360000},"memo":{},"startTime":{"seconds":"1722604985","nanos":513977000}}} [runUnfinishedHandlersWorkflowTerminationTypeWorkflow(f1f37cd3-1c11-4aa7-ba58-f12f8880f594)] 🟢 activationCompletion: [{"updateResponse":{"protocolInstanceId":"update-id","accepted":{}}},{"completeWorkflowExecution ":{"result":{"metadata":{"encoding":{"0":98,"1":105,"2":110,"3":97,"4":114,"5":121,"6":47,"7":110,"8":117,"9":108,"10":108}}}}}] ```Specifications
main
99f15c4dabf9a22763aaa13653c83a832bd0fffa