treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

When we retry a workflow, handling of `_error` changes depending on the resume option #1431

Open to-lz1 opened 4 years ago

to-lz1 commented 4 years ago

Digdag version: 0.9.42

We have the `_error` parameter and main tasks in our workflow.

example:

timezone: Asia/Tokyo

_error:
  echo>: here is _error configuration.

+test:
  +task1:
    sh>: sleep 2; exit 0
  +task2:
    sh>: sleep 2; exit 1
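
For reference, a minimal way to reproduce this might look like the following, assuming the workflow above is saved as sample_1.dig in a project pushed as sample (names taken from the logs below; adjust for your environment):

digdag push sample
digdag start sample sample_1 --session now
digdag attempts    # note the failed attempt's id to use as [id] in `digdag retry [id]`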

In this example, task2 fails. Please assume that we get a notification from `_error` and then usually retry the attempt. However, when we retry the attempt, whether `_error` is called again differs depending on the option we use.
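
To make "notification" concrete, an `_error` block could, for example, use digdag's mail> operator roughly like the sketch below. The recipient, subject, and body template file are placeholders, and mail> also needs SMTP settings configured separately:

_error:
  # placeholder notification; any operator could be used here
  mail>: error_body.txt
  subject: sample_1 failed
  to: [alerts@example.com]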

If we use --resume

digdag retry [id] --resume --latest-revision  

The workflow fails again, and the `_error` parameter is ignored.

2020-07-04 12:36:30 +0900 [INFO] (XNIO-1 task-50): Starting a new session project id=1 workflow name=sample_1 session_time=2020-07-04T12:21:42+09:00
2020-07-04 12:36:30 +0900 [INFO] (0092@[0:sample]+sample_1+test+task2): sh>: sleep 2; exit 1
2020-07-04 12:36:32 +0900 [ERROR] (0092@[0:sample]+sample_1+test+task2): Task failed with unexpected error: Command failed with code 1
java.lang.RuntimeException: Command failed with code 1
    at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
    at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
    at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
    at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
    at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
    at io.digdag.core.agent.OperatorManager$$Lambda$253/0000000000000000.run(Unknown Source)
    at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:77)
    at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
    at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
    at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
    at io.digdag.core.agent.MultiThreadAgent$$Lambda$249/0000000000000000.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:821)

If we use --resume-from

digdag retry [id] --resume-from task2 --latest-revision  

The workflow fails again, and the operator in the `_error` parameter is called.

2020-07-04 12:40:24 +0900 [INFO] (XNIO-1 task-61): Starting a new session project id=1 workflow name=sample_1 session_time=2020-07-04T12:39:59+09:00
2020-07-04 12:40:24 +0900 [INFO] (0112@[0:sample]+sample_1+test+task2): sh>: sleep 2; exit 1
2020-07-04 12:40:26 +0900 [ERROR] (0112@[0:sample]+sample_1+test+task2): Task failed with unexpected error: Command failed with code 1
java.lang.RuntimeException: Command failed with code 1
    at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
    at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
    at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:315)
    at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:257)
    at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
    at io.digdag.core.agent.OperatorManager$$Lambda$253/0000000000000000.run(Unknown Source)
    at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:77)
    at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
    at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
    at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
    at io.digdag.core.agent.MultiThreadAgent$$Lambda$249/0000000000000000.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:821)
2020-07-04 12:40:26 +0900 [INFO] (0112@[0:sample]+sample_1^failure-alert): type: notify
2020-07-04 12:40:26 +0900 [INFO] (0110@[0:sample]+sample_1^error): echo>: here is _error configuration.
here is _error configuration.

Is this expected behavior? And should we always use --resume-from if we want to receive notifications from the `_error` parameter when retrying our workflows?

hiroyuki-sato commented 4 years ago

Hello @to-lz1, thank you for your report.

I think #825 relates to this issue.

As far as I know, there are some open issues around retry behavior (e.g. #1220), and the core team is trying to fix them.

Fixing them is a little complicated.

@szyn, @yoyama Could you comment when you get a chance?