treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0

Attempt has been stuck in retry loop and could not be killed. #409

Closed by bwtakacy 7 years ago

bwtakacy commented 7 years ago

Hi,

In our environment, one attempt keeps failing and cannot be killed.

I ran digdag kill and it succeeded, but the attempt has still been running since yesterday.

$ digdag attempt 2
2016-12-08 12:48:45 +0900: Digdag v0.8.21
  session id: 2
  attempt id: 2
  uuid: 2453f6e0-791d-4cd8-9680-9a085e8c5e8f
  project: test
  workflow: digdag
  session time: 2016-12-07 06:30:43 +0000
  retry attempt name: 
  params: {"repository_path":"/opt/digdag/plugins/digdag-slack/build/repo"}
  created at: 2016-12-07 15:30:44 +0900
  finished at: 
  kill requested: true
  status: running

In the digdag server log, the following messages appear repeatedly.

2016-12-08 12:39:08.313 +0900 [WARN] (lock-expire-0): 1 task locks are expired. Tasks will be retried.
2016-12-08 12:39:08.794 +0900 [INFO] (0196@+digdag+repeat^error): slack>: error_message.txt
2016-12-08 12:39:08.796 +0900 [ERROR] (task-thread-1): Uncaught exception. Task queue will detect this failure and this task will be retried later.
java.lang.AbstractMethodError: jp.techium.blog.SlackOperatorFactory.newOperator(Ljava/nio/file/Path;Lio/digdag/spi/TaskRequest;)Lio/digdag/spi/Operator;
        at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:290)
        at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:258)
        at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:141)
        at io.digdag.core.agent.ExtractArchiveWorkspaceManager.withExtractedArchive(ExtractArchiveWorkspaceManager.java:53)
        at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:139)
        at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:123)
        at io.digdag.core.agent.MultiThreadAgent.lambda$run$0(MultiThreadAgent.java:95)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

From the stack trace, I found that the failure is caused by a plugin implementation that does not match the digdag server version, and I have since fixed the plugin. However, I could not find a way to stop this retry loop.

Restarting the digdag server does not solve it 😢

frsyuki commented 7 years ago

Good point. AbstractMethodError is an Error, not an Exception, so OperatorManager does not handle it. It should treat such errors the same way it treats exceptions.
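
To illustrate the distinction, here is a minimal, self-contained sketch. It is not digdag's actual code: the class ErrorHandlingSketch and both helper methods are hypothetical names invented for this example, and the lambda only simulates a plugin compiled against a mismatched SPI by throwing AbstractMethodError directly. It shows that a catch (Exception e) block lets the error escape, while a catch (Throwable t) block sees it.

```java
public class ErrorHandlingSketch {
    // Hypothetical handler that only catches Exception.
    // An Error such as AbstractMethodError passes straight through it.
    static String runCatchingException(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (Exception e) {
            return "handled: " + e.getClass().getSimpleName();
        }
    }

    // Hypothetical handler that catches Throwable, so Errors are seen too.
    static String runCatchingThrowable(Runnable task) {
        try {
            task.run();
            return "ok";
        } catch (Throwable t) {
            return "handled: " + t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Simulates a plugin whose factory method is missing at link time.
        Runnable brokenPlugin = () -> {
            throw new AbstractMethodError("newOperator");
        };

        System.out.println(runCatchingThrowable(brokenPlugin));

        try {
            runCatchingException(brokenPlugin);
        } catch (AbstractMethodError e) {
            // The Exception-only handler did not intercept the error.
            System.out.println("escaped: " + e.getClass().getSimpleName());
        }
    }
}
```

Catching Throwable is a broad choice, since it also intercepts errors such as OutOfMemoryError. A real fix might prefer to catch a narrower type such as LinkageError, the superclass of AbstractMethodError, and mark the task as failed instead of leaving it to be retried.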

bwtakacy commented 7 years ago

Hi,

I finally resolved this by deleting and updating records in the digdag server database. Thanks.