treasure-data / digdag

Workload Automation System
https://www.digdag.io/
Apache License 2.0
1.3k stars 221 forks source link

A task using a simple shell script failed when digdag was updated to v0.9.12 #568

Closed takkeybook closed 7 years ago

takkeybook commented 7 years ago

The incident as I wrote in the subject suddenly happens after updating digdag to the latest one using "digdag selfupdate" command. When I use digdag of v0.9.7, it does not happen. There was not error logs, thus I'm not really sure what's going on.

A sample of both workflow file and shell script kicked from it is as follows,

sample.dig

timezone: Asia/Tokyo

schedule:
  cron>: 02 13 * * *

+prepare:
  echo>: start ${session_time}

+step1:
  sh>: scripts/step1.sh

scripts/step1.sh

#!/bin/bash

echo "job 1 start"
date
whoami
sleep 10
echo "job 1 end"

exit 0
cosmok commented 7 years ago

I could reproduce it when the script wasn't executable, the error I got is:

/bin/sh: 1: scripts/step1.sh: Permission denied
2017-05-26 13:21:13 +1000 [ERROR] (0017@[0:default]+sample+step1): Task failed with unexpected error: Command failed with code 126

but, when I made the shell script executable ( chmod a+x scripts/step1.sh ), it ran:

2017-05-26 13:21:51 +1000 [INFO] (0017@[0:default]+sample+step1): sh>: scripts/step1.sh
job 1 start
Friday 26 May  13:21:52 AEST 2017
k7
job 1 end
Success. Task state is saved at /tmp/dig_test/.digdag/status/20170526T000000+0900 directory.
takkeybook commented 7 years ago

In my case, the script is executable. In fact, the corresponding task succeeded when it ran on digdag server of v0.9.7.

takkeybook commented 7 years ago

The problem is not only on digdag v0.9.12, but on v0.9.8 or later.

What I'd like to add is that such the problem does not happen when I run the workflow on the command line. That is, digdag run sample.dig --session "YYYY-MM-DD"

This simple task failure occurs only when you run on the web-ui or the task is kicked by the scheduler.

takkeybook commented 7 years ago

I found out the corresponding error messages, as follows,

Jun  1 12:04:00 ip-10-130-66-122 java: 2017-06-01 12:04:00 +0900 [INFO] (0067@[0:sample]+sample+step1) io.digdag.core.agent.OperatorManager: sh>: scripts/step1.sh
Jun  1 12:04:00 ip-10-130-66-122 java: /bin/sh: line 1: scripts/step1.sh: No such file or directory
Jun  1 12:04:00 ip-10-130-66-122 java: 2017-06-01 12:04:00 +0900 [ERROR] (0067@[0:sample]+sample+step1) io.digdag.core.agent.OperatorManager: Task failed with unexpected error: Command failed with code 127
Jun  1 12:04:00 ip-10-130-66-122 java: java.lang.RuntimeException: Command failed with code 127
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.standards.operator.ShOperatorFactory$ShOperator.runTask(ShOperatorFactory.java:143)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.util.BaseOperator.run(BaseOperator.java:35)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.OperatorManager.callExecutor(OperatorManager.java:312)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.OperatorManager.runWithWorkspace(OperatorManager.java:254)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.OperatorManager.lambda$runWithHeartbeat$2(OperatorManager.java:137)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.LocalWorkspaceManager.withExtractedArchive(LocalWorkspaceManager.java:25)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.OperatorManager.runWithHeartbeat(OperatorManager.java:135)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.OperatorManager.run(OperatorManager.java:119)
Jun  1 12:04:00 ip-10-130-66-122 java: at io.digdag.core.agent.MultiThreadAgent.lambda$null$0(MultiThreadAgent.java:127)
Jun  1 12:04:00 ip-10-130-66-122 java: at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
Jun  1 12:04:00 ip-10-130-66-122 java: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jun  1 12:04:00 ip-10-130-66-122 java: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
Jun  1 12:04:00 ip-10-130-66-122 java: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Jun  1 12:04:00 ip-10-130-66-122 java: at java.lang.Thread.run(Thread.java:748)

It seems to me that external shell script files packed together with the workflow file may not be retrieved from the revision_archives table, or not be put on the supposed directory.

komamitsu commented 7 years ago

Hmm... I can't reproduce this issue.

@takkeybook Could you give us more detailed information?

takkeybook commented 7 years ago

@komamitsu Here is detailed infos.

Like as the above, I cannot encounter the trouble in the case that I use Digdag cli.

komamitsu commented 7 years ago

@takkeybook Thanks for the information. But the issue isn't still reproduced on my side...

Could you try this command and tell me what the archived project includes?

$ digdag download (the project name)
takkeybook commented 7 years ago

@komamitsu I really appreciate your help.

I try the command as you said, and get the following results. Necessary files to execute the tasks are retrieved.

cd WORKDIR
digdag download sample
cd sample
find . -ls
92563817    0 drwxr-xr-x   3 kagimoto_takashi drecom         37  6月  2 15:35 .
101807045    0 drwxr-xr-x   2 kagimoto_takashi drecom         36  6月  2 15:35 ./scripts
101807046    4 -rwxr-xr-x   1 kagimoto_takashi drecom         78  6月  2 15:35 ./scripts/step1.sh
101807047    4 -rwxr-xr-x   1 kagimoto_takashi drecom         78  6月  2 15:35 ./scripts/step2.sh
92563819    4 -rw-r--r--   1 kagimoto_takashi drecom        129  6月  2 15:35 ./sample.dig

Hmmm....What's going on in my environment..

frsyuki commented 7 years ago

@takkeybook why don't you try this workflow to debug?:

+pwd:
  sh>: ""
  shell: ["pwd"]

+ls:
  sh>: "scripts"
  shell: ["ls", "-l"]
takkeybook commented 7 years ago

@frsyuki After creating a workflow file as you mentioned, I got slightly wierd results as follows by executing it with Digdag CLI (not on my server),

Task state is saved at /home/kagimoto_takashi/digdag/debugShell/.digdag/status/20170602T000000+0900 directory.


I encounter the latter one mostly, but the former once in approximately ten times.
takkeybook commented 7 years ago

@komamitsu @frsyuki According to Furuhashi-san's yesterday talk, I possibly figure out what's going on not in your environment, but in mine. To tell the truth, I ran digdag server and digdag scheduler simultaneously. I really misunderstood a function of digdag scheduler. Now I stop the scheduler and then I successfuly execute all tasks described by bash script that I wrote in the above as a sample.

If what I found out is correct, I will close this issue.