uber / cadence-java-client

Java framework for Cadence Workflow Service
https://cadenceworkflow.io
Other
143 stars 106 forks source link

Sticky Decisions failing with IllegalStateException #422

Closed samarabbas closed 4 years ago

samarabbas commented 4 years ago

User reported an issue on slack channel where worker logs show the following error:

2019-12-05 08:41:20.384 ERROR 13176 --- [test-domain": 2] c.u.c.i.r.ReplayDecisionTaskHandler      : Workflow task failure. startedEventId=18, WorkflowID=d1cb88eb-f4e2-487b-b292-089dd1fc3006, RunID=873ae132-4c6a-4c2b-b8d0-20dba2864fa8. If see continuously the workflow might be stuck.
java.lang.IllegalStateException: id=DecisionId{decisionTarget=ACTIVITY, decisionEventId=5}, transitions=[CREATED, handleDecisionTaskStartedEvent, DECISION_SENT, handleInitiatedEvent, INITIATED, handleInitiatedEvent]
    at com.uber.cadence.internal.replay.DecisionStateMachineBase.failStateTransition(DecisionStateMachineBase.java:204)
    at com.uber.cadence.internal.replay.DecisionStateMachineBase.handleInitiatedEvent(DecisionStateMachineBase.java:109)
    at com.uber.cadence.internal.replay.DecisionsHelper.handleActivityTaskScheduled(DecisionsHelper.java:147)
    at com.uber.cadence.internal.replay.ReplayDecider.processEvent(ReplayDecider.java:185)
    at com.uber.cadence.internal.replay.ReplayDecider.decideImpl(ReplayDecider.java:423)
    at com.uber.cadence.internal.replay.ReplayDecider.decide(ReplayDecider.java:359)
    at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.processDecision(ReplayDecisionTaskHandler.java:135)
    at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.handleDecisionTaskImpl(ReplayDecisionTaskHandler.java:115)
    at com.uber.cadence.internal.replay.ReplayDecisionTaskHandler.handleDecisionTask(ReplayDecisionTaskHandler.java:76)
    at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:258)
    at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:230)
    at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Worker execution history shows 2 decision task failures and both of them are sticky decisions when there are more events between DecisionTaskScheduled and DecisionTaskStarted. My guess is this pattern is tripping up java client implementation for handling of sticky decisions. Both the times workflow execution recovered immediately once the decision was scheduled back on global task list.

{
        "eventId": 15,
        "timestamp": 1575553280337708700,
        "eventType": "DecisionTaskScheduled",
        "version": -24,
        "taskId": 6291489,
        "decisionTaskScheduledEventAttributes": {
            "taskList": {
                "name": "Scotts-iMac.local:5792bec8-cc01-4210-9389-a8658aa598a9"
            },
            "startToCloseTimeoutSeconds": 10,
            "attempt": 0
        }
    },
    {
        "eventId": 16,
        "timestamp": 1575553279319512900,
        "eventType": "ActivityTaskStarted",
        "version": -24,
        "taskId": 6291494,
        "activityTaskStartedEventAttributes": {
            "scheduledEventId": 11,
            "identity": "13183@Scotts-iMac.local",
            "requestId": "6c4c2af2-fb84-405c-9e9e-b3ce9a4f186d",
            "attempt": 0
        }
    },
    {
        "eventId": 17,
        "timestamp": 1575553280355000100,
        "eventType": "ActivityTaskCompleted",
        "version": -24,
        "taskId": 6291495,
        "activityTaskCompletedEventAttributes": {
            "result": "eyJ1c2VybmFtZSI6ImRzZiIsIm5hbWUiOiJkc2YgU21pdGgiLCJiYW5rQWNjb3VudCI6bnVsbH0=",
            "scheduledEventId": 11,
            "startedEventId": 16,
            "identity": "13183@Scotts-iMac.local"
        }
    },
    {
        "eventId": 18,
        "timestamp": 1575553280381849000,
        "eventType": "DecisionTaskStarted",
        "version": -24,
        "taskId": 6291498,
        "decisionTaskStartedEventAttributes": {
            "scheduledEventId": 15,
            "identity": "5792bec8-cc01-4210-9389-a8658aa598a9",
            "requestId": "e3410a75-2bee-4804-bbb0-ddd1ad5f2d3f"
        }
    },
    {
        "eventId": 19,
        "timestamp": 1575553280426034900,
        "eventType": "DecisionTaskFailed",
        "version": -24,
        "taskId": 6291501,
        "decisionTaskFailedEventAttributes": {
            "scheduledEventId": 15,
            "startedEventId": 18,
            "cause": "UNHANDLED_DECISION",
            "details": "amF2YS5sYW5nLklsbGVnYWxTdGF0ZUV4Y2VwdGlvbjogaWQ9RGVjaXNpb25JZHtkZWNpc2lvblRhcmdldD1BQ1RJVklUWSwgZGVjaXNpb25FdmVudElkPTV9LCB0cmFuc2l0aW9ucz1bQ1JFQVRFRCwgaGFuZGxlRGVjaXNpb25UYXNrU3RhcnRlZEV2ZW50LCBERUNJU0lPTl9TRU5ULCBoYW5kbGVJbml0aWF0ZWRFdmVudCwgSU5JVElBVEVELCBoYW5kbGVJbml0aWF0ZWRFdmVudF0KCWF0IGNvbS51YmVyLmNhZGVuY2UuaW50ZXJuYWwucmVwbGF5LkRlY2lzaW9uU3RhdGVNYWNoaW5lQmFzZS5mYWlsU3RhdGVUcmFuc2l0aW9uKERlY2lzaW9uU3RhdGVNYWNoaW5lQmFzZS5qYXZhOjIwNCkKCWF0IGNvbS51YmVyLmNhZGVuY2UuaW50ZXJuYWwucmVwbGF5LkRlY2lzaW9uU3RhdGVNYWNoaW5lQmFzZS5oYW5kbGVJbml0aWF0ZWRFdmVudChEZWNpc2lvblN0YXRlTWFjaGluZUJhc2UuamF2YToxMDkpCglhdCBjb20udWJlci5jYWRlbmNlLmludGVybmFsLnJlcGxheS5EZWNpc2lvbnNIZWxwZXIuaGFuZGxlQWN0aXZpdHlUYXNrU2NoZWR1bGVkKERlY2lzaW9uc0hlbHBlci5qYXZhOjE0NykKCWF0IGNvbS51YmVyLmNhZGVuY2UuaW50ZXJuYWwucmVwbGF5LlJlcGxheURlY2lkZXIucHJvY2Vzc0V2ZW50KFJlcGxheURlY2lkZXIuamF2YToxODUpCglhdCBjb20udWJlci5jYWRlbmNlLmludGVybmFsLnJlcGxheS5SZXBsYXlEZWNpZGVyLmRlY2lkZUltcGwoUmVwbGF5RGVjaWRlci5qYXZhOjQyMykKCWF0IGNvbS51YmVyLmNhZGVuY2UuaW50ZXJuYWwucmVwbGF5LlJlcGxheURlY2lkZXIuZGVjaWRlKFJlcGxheURlY2lkZXIuamF2YTozNTkpCglhdCBjb20udWJlci5jYWRlbmNlLmludGVybmFsLnJlcGxheS5SZXBsYXlEZWNpc2lvblRhc2tIYW5kbGVyLnByb2Nlc3NEZWNpc2lvbihSZXBsYXlEZWNpc2lvblRhc2tIYW5kbGVyLmphdmE6MTM1KQoJYXQgY29tLnViZXIuY2FkZW5jZS5pbnRlcm5hbC5yZXBsYXkuUmVwbGF5RGVjaXNpb25UYXNrSGFuZGxlci5oYW5kbGVEZWNpc2lvblRhc2tJbXBsKFJlcGxheURlY2lzaW9uVGFza0hhbmRsZXIuamF2YToxMTUpCglhdCBjb20udWJlci5jYWRlbmNlLmludGVybmFsLnJlcGxheS5SZXBsYXlEZWNpc2lvblRhc2tIYW5kbGVyLmhhbmRsZURlY2lzaW9uVGFzayhSZXBsYXlEZWNpc2lvblRhc2tIYW5kbGVyLmphdmE6NzYpCglhdCBjb20udWJlci5jYWRlbmNlLmludGVybmFsLndvcmtlci5Xb3JrZmxvd1dvcmtlciRUYXNrSGFuZGxlckltcGwuaGFuZGxlKFdvcmtmbG93V29ya2VyLmphdmE6MjU4KQoJYXQgY29tLnViZXIuY2FkZW5jZS5pbnRlcm5hbC53b3JrZXIuV29ya2Zsb3dXb3JrZXIkVGFza0hhbmRsZXJJbXBsLmhhbmRsZShXb3JrZmxvd1dvcmtlci5qYXZhOjIzMCkKCWF0IGNvbS51YmVyLmNhZGVuY2UuaW50ZXJuYWwud29ya2VyLlBvbGxUYXNrRXhlY3V0b3IubGFtYmRhJHByb2Nlc3MkMChQb2xsVGFza0V4ZWN1dG9yLmphdmE6NzEpCglhdCBqYXZhLnV0aWwuY29uY3VycmVudC5UaHJlYWRQb29sRXhlY3V0b3IucnVuV29ya2VyKFRocmVhZFBvb2xFeGVjdXRvci5qYXZhOjExNDIpCglhdCBqYXZhLnV0aWwuY29uY3VycmVudC5UaHJlYWRQb29sRXhlY3V0b3IkV29ya2VyLnJ1bihUaHJlYWRQb29sRXhlY3V0b3IuamF2YTo2MTcpCglhdCBqYXZhLmxhbmcuVGhyZWFkLnJ1bihUaHJlYWQuamF2YTo3NDUpCg==",
            "identity": "13176@Scotts-iMac.local",
            "reason": "",
            "baseRunId": "",
            "newRunId": "",
            "forkEventVersion": 0
        }
    },
meiliang86 commented 4 years ago

It's because of https://github.com/uber/cadence/issues/2895. Closing this client side bug and will track the server issue instead.