temporalio / sdk-java

Temporal Java SDK
https://temporal.io
Apache License 2.0

Add workflow deadlock detector #28

Closed mfateev closed 3 years ago

mfateev commented 4 years ago

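Currently, if workflow code mistakenly uses Java synchronization primitives, it can deadlock, because the dispatcher relies on cooperative multithreading. The example stack trace below shows one thread obtaining a lock (0x00000006c815ddb8) through a synchronized method and then blocking another thread from making progress. The stageInProgress handler yields to the dispatcher while still holding the monitor, so the stageCompleted handler blocks on that monitor and never yields.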

The proposal is to implement a deadlock detection feature that fails the decision task with a clear message stating that Java locking primitives are prohibited inside workflow code.

        ""WorkflowStageProgressListener::stageCompleted" signal handler" #78 prio=5 os_prio=0 tid=0x0000560f1fe02800 nid=0x5e waiting for monitor entry [0x00007f142a435000]
        java.lang.Thread.State: BLOCKED (on object monitor)
        at com.somecompany.platform.cadence.workflows.onboarding.HiringWorkFlowImpl.transitionMainTaskToInProgress(HiringWorkFlowImpl.java:327)
        - waiting to lock <0x00000006c815ddb8> (a com.somecompany.platform.cadence.workflows.onboarding.HiringWorkFlowImpl)
        at com.somecompany.platform.cadence.workflows.onboarding.HiringWorkFlowImpl.stageCompleted(HiringWorkFlowImpl.java:282)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.uber.cadence.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.processSignal(POJOWorkflowImplementationFactory.java:312)
        at com.uber.cadence.internal.sync.WorkflowRunnable.processSignal(WorkflowRunnable.java:65)
        at com.uber.cadence.internal.sync.SyncWorkflow.lambda$handleSignal$0(SyncWorkflow.java:106)
        at com.uber.cadence.internal.sync.SyncWorkflow$$Lambda$1231/1900852990.run(Unknown Source)
        at com.uber.cadence.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102)
        at com.uber.cadence.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:85)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
        ""WorkflowStageProgressListener::stageInProgress" signal handler" #77 prio=5 os_prio=0 tid=0x0000560f20a13000 nid=0x5d waiting on condition [0x00007f142a536000]
        java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000006c81a5550> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at com.uber.cadence.internal.sync.WorkflowThreadContext.yield(WorkflowThreadContext.java:76)
        at com.uber.cadence.internal.sync.WorkflowThreadImpl.yield(WorkflowThreadImpl.java:400)
        at com.uber.cadence.internal.sync.WorkflowThread.await(WorkflowThread.java:43)
        at com.uber.cadence.internal.sync.CompletablePromiseImpl.get(CompletablePromiseImpl.java:71)
        at com.uber.cadence.internal.sync.ActivityStubBase.execute(ActivityStubBase.java:42)
        at com.uber.cadence.internal.sync.ActivityStubImpl.execute(ActivityStubImpl.java:26)
        at com.uber.cadence.internal.sync.ActivityInvocationHandler.lambda$getActivityFunc$0(ActivityInvocationHandler.java:51)
        at com.uber.cadence.internal.sync.ActivityInvocationHandler$$Lambda$1192/803084529.apply(Unknown Source)
        at com.uber.cadence.internal.sync.ActivityInvocationHandlerBase.invoke(ActivityInvocationHandlerBase.java:76)
        at com.sun.proxy.$Proxy231.updateTaskStatus(Unknown Source)
        at com.somecompany.platform.cadence.workflows.onboarding.HiringWorkFlowImpl.transitionMainTaskToInProgress(HiringWorkFlowImpl.java:328)
        - locked <0x00000006c815ddb8> (a com.somecompany.platform.cadence.workflows.onboarding.HiringWorkFlowImpl)
        at com.somecompany.platform.cadence.workflows.onboarding.HiringWorkFlowImpl.stageInProgress(HiringWorkFlowImpl.java:317)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.uber.cadence.internal.sync.POJOWorkflowImplementationFactory$POJOWorkflowImplementation.processSignal(POJOWorkflowImplementationFactory.java:312)
        at com.uber.cadence.internal.sync.WorkflowRunnable.processSignal(WorkflowRunnable.java:65)
        at com.uber.cadence.internal.sync.SyncWorkflow.lambda$handleSignal$0(SyncWorkflow.java:106)
        at com.uber.cadence.internal.sync.SyncWorkflow$$Lambda$1231/1900852990.run(Unknown Source)
        at com.uber.cadence.internal.sync.CancellationScopeImpl.run(CancellationScopeImpl.java:102)
        at com.uber.cadence.internal.sync.WorkflowThreadImpl$RunnableWrapper.run(WorkflowThreadImpl.java:85)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
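
To make the proposal concrete, here is a minimal sketch of one way such a check could work. It is not the SDK's implementation. It assumes the dispatcher can bound how long it waits for a workflow thread to yield, and every name in it (DeadlockDetectorSketch, runUntilYield, PotentialDeadlockException, the 200 ms limit) is hypothetical. Yielding is modeled simply as the step returning.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch; none of these names come from the SDK.
class DeadlockDetectorSketch {

    // Hypothetical exception that reuses the stuck thread's stack trace.
    static class PotentialDeadlockException extends RuntimeException {
        PotentialDeadlockException(String message, StackTraceElement[] stuckAt) {
            super(message);
            setStackTrace(stuckAt);
        }
    }

    // Runs one step of workflow code and fails if it does not yield
    // (modeled here as returning) within the timeout.
    static void runUntilYield(Runnable step, long timeoutMillis) {
        AtomicReference<Thread> runner = new AtomicReference<>();
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "workflow-thread");
            t.setDaemon(true); // a stuck thread must not keep the JVM alive
            return t;
        });
        try {
            Future<?> done = executor.submit(() -> {
                runner.set(Thread.currentThread());
                step.run();
            });
            done.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Capture where the workflow thread is stuck for the error report.
            Thread stuck = runner.get();
            StackTraceElement[] trace =
                stuck == null ? new StackTraceElement[0] : stuck.getStackTrace();
            throw new PotentialDeadlockException(
                "Workflow thread did not yield within " + timeoutMillis
                    + " ms; Java locking primitives and blocking calls are"
                    + " prohibited inside workflow code", trace);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
    }

    // Reproduces the pattern from the trace: one thread holds a monitor
    // while parked, and the step under test blocks on the same monitor.
    static String demo() {
        Object lock = new Object();
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> {
            synchronized (lock) {
                lockHeld.countDown();
                try {
                    release.await();
                } catch (InterruptedException ignored) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        holder.setDaemon(true);
        holder.start();
        try {
            lockHeld.await();
            runUntilYield(() -> { synchronized (lock) { } }, 200);
            return "no deadlock detected";
        } catch (PotentialDeadlockException e) {
            return e.getMessage();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        } finally {
            release.countDown(); // let the blocked thread finish
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A thread blocked on a monitor cannot be interrupted, so the sketch only reports the problem and attaches the stuck thread's stack trace; it does not try to recover. The same timeout would also flag a step that blocks on a slow call rather than a lock, since in both cases the thread fails to yield in time.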

mfateev commented 3 years ago

Another user was calling an external service from workflow code to track workflow state transitions. The call took long enough to cause workflow task timeouts, and the root cause took a long time to track down. With the deadlock detector, the problem would have been reported immediately.