temporalio / sdk-java

Temporal Java SDK
https://temporal.io
Apache License 2.0

Workflow executions frozen after Temporal exception #1921

Closed brunomascijp closed 10 months ago

brunomascijp commented 11 months ago

Expected Behavior

Workflow executions should make progress, retrying, failing or successfully completing steps.

Actual Behavior

Some of my executions got stuck for hours after the following exception, all of them still in the Running state. We restarted all the workers, and the orchestrator seemed to be working well again.

[Workflow Executor taskQueue="prod", namespace="ns": 77] [] i.temporal.internal.worker.PollerOptions: uncaught exception
java.lang.RuntimeException: Failure processing workflow task. WorkflowId=5b38, RunId=5c9cbad8-8a64-4a84-81bd-64d02474a560, Attempt=473
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:327)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.wrapFailure(WorkflowWorker.java:188)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:98)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.temporal.internal.statemachines.InternalWorkflowTaskException: Failure handling event 28 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=28, CurrentStartedEventId=28}
    at io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:257)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:236)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:208)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:208)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:192)
    at io.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:147)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:132)
    at io.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:97)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:336)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:246)
    at io.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:188)
    at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)
    ... 3 common frames omitted
Caused by: java.lang.RuntimeException: WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]
    at io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:152)
    at io.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:102)
    at io.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:68)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:277)
    at io.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:234)
    ... 13 common frames omitted
Caused by: java.lang.NullPointerException: stackTrace[15]
    at java.base/java.lang.Throwable.setStackTrace(Throwable.java:879)
    at io.temporal.failure.FailureConverter.failureToException(FailureConverter.java:85)
    at io.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)
    at io.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)
    at io.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)
    at io.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)
    at io.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)
    at io.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)
    at io.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)
    at io.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)
    at io.temporal.internal.sync.SyncWorkflowContext$ActivityCallback.lambda$invoke$0(SyncWorkflowContext.java:292)
    at io.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102)
    at io.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:106)
    at io.temporal.worker.ActiveThreadReportingExecutor.lambda$submit$0(ActiveThreadReportingExecutor.java:53)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    ... 3 common frames omitted

In particular, all the stuck executions were in the WorkflowTaskFailed state and, after waiting a few hours, we decided to terminate them:

{
  "message": "Failure handling event 25 of type 'EVENT_TYPE_WORKFLOW_TASK_STARTED' during execution. {WorkflowTaskStartedEventId=25, CurrentStartedEventId=25}",
  "source": "JavaSDK",
  "stackTrace": "io.temporal.internal.statemachines.WorkflowStateMachines.createEventProcessingException(WorkflowStateMachines.java:257)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:236)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:208)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:208)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:192)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:147)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:132)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:97)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:336)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:246)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:188)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)\njava.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\njava.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\njava.base/java.lang.Thread.run(Thread.java:833)\n",
  "cause": {
    "message": "WorkflowTask: failure executing SCHEDULED->WORKFLOW_TASK_STARTED, transition history is [CREATED->WORKFLOW_TASK_SCHEDULED]",
    "source": "JavaSDK",
    "stackTrace": "io.temporal.internal.statemachines.StateMachine.executeTransition(StateMachine.java:152)\nio.temporal.internal.statemachines.StateMachine.handleHistoryEvent(StateMachine.java:102)\nio.temporal.internal.statemachines.EntityStateMachineBase.handleEvent(EntityStateMachineBase.java:68)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleSingleEvent(WorkflowStateMachines.java:277)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEventsBatch(WorkflowStateMachines.java:234)\nio.temporal.internal.statemachines.WorkflowStateMachines.handleEvent(WorkflowStateMachines.java:208)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.applyServerHistory(ReplayWorkflowRunTaskHandler.java:208)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTaskImpl(ReplayWorkflowRunTaskHandler.java:192)\nio.temporal.internal.replay.ReplayWorkflowRunTaskHandler.handleWorkflowTask(ReplayWorkflowRunTaskHandler.java:147)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTaskWithQuery(ReplayWorkflowTaskHandler.java:132)\nio.temporal.internal.replay.ReplayWorkflowTaskHandler.handleWorkflowTask(ReplayWorkflowTaskHandler.java:97)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handleTask(WorkflowWorker.java:336)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:246)\nio.temporal.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:188)\nio.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:93)\njava.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\njava.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\njava.base/java.lang.Thread.run(Thread.java:833)\n",
    "cause": {
      "message": "stackTrace[15]",
      "source": "JavaSDK",
      "stackTrace": "java.base/java.lang.Throwable.setStackTrace(Throwable.java:879)\nio.temporal.failure.FailureConverter.failureToException(FailureConverter.java:85)\nio.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)\nio.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)\nio.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)\nio.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)\nio.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)\nio.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)\nio.temporal.failure.FailureConverter.failureToExceptionImpl(FailureConverter.java:93)\nio.temporal.failure.FailureConverter.failureToException(FailureConverter.java:79)\nio.temporal.internal.sync.SyncWorkflowContext$ActivityCallback.lambda$invoke$0(SyncWorkflowContext.java:292)\nio.temporal.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102)\nio.temporal.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:106)\nio.temporal.worker.ActiveThreadReportingExecutor.lambda$submit$0(ActiveThreadReportingExecutor.java:53)\njava.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)\njava.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\njava.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\njava.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\njava.base/java.lang.Thread.run(Thread.java:833)\n",
      "cause": null,
      "applicationFailureInfo": { "type": "java.lang.NullPointerException", "nonRetryable": false, "details": null }
    },
    "applicationFailureInfo": { "type": "java.lang.RuntimeException", "nonRetryable": false, "details": null }
  },
  "applicationFailureInfo": { "type": "io.temporal.internal.statemachines.InternalWorkflowTaskException", "nonRetryable": false, "details": null }
}

Steps to Reproduce the Problem

Not enough information

Specifications

Quinn-With-Two-Ns commented 11 months ago

You're running an older SDK version. I believe this issue was fixed in PR https://github.com/temporalio/sdk-java/pull/1795. Can you please upgrade to the latest Java SDK release, v1.22.0?
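
For context on the innermost cause in the report, `java.lang.NullPointerException: stackTrace[15]` thrown from `Throwable.setStackTrace`: the JDK rejects a stack trace array that contains a null element and puts the offending index in the exception message. The sketch below reproduces only that JDK behavior in isolation. It is not the SDK's `FailureConverter` code and makes no claim about what the linked PR changed; `StackTraceNullDemo`, `trySetStackTrace`, and `dropNullFrames` are hypothetical names used for illustration.

```java
import java.util.Arrays;
import java.util.Objects;

public class StackTraceNullDemo {

    // Calls Throwable.setStackTrace and returns the NullPointerException
    // message if the JDK rejects the array, or null if the call succeeds.
    public static String trySetStackTrace(StackTraceElement[] frames) {
        Throwable t = new RuntimeException("demo");
        try {
            t.setStackTrace(frames);
            return null;
        } catch (NullPointerException e) {
            return e.getMessage();
        }
    }

    // Hypothetical defensive helper (an assumption, not SDK code):
    // drops null entries so that setStackTrace accepts the array.
    public static StackTraceElement[] dropNullFrames(StackTraceElement[] frames) {
        return Arrays.stream(frames)
                .filter(Objects::nonNull)
                .toArray(StackTraceElement[]::new);
    }

    public static void main(String[] args) {
        StackTraceElement[] frames = {
            new StackTraceElement("com.example.Demo", "run", "Demo.java", 10),
            null, // stands in for a frame that could not be reconstructed
        };

        // The null element at index 1 is rejected.
        System.out.println(trySetStackTrace(frames)); // prints "stackTrace[1]"

        // With null entries removed, the call succeeds.
        System.out.println(trySetStackTrace(dropNullFrames(frames))); // prints "null"
    }
}
```

Read this way, the `stackTrace[15]` message in the report says that the element at index 15 of the array passed to `setStackTrace` was null.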

brunomascijp commented 11 months ago

Will try, thanks!

Quinn-With-Two-Ns commented 10 months ago

Closing since this is not an SDK issue. Feel free to ask general questions on our forum or community Slack.