marlof / ScORCH

DevOps Orchestration for Obrar deploy and Ansible playbooks
http://www.autoscorch.com
Apache License 2.0

Auto start success, job fails to move #117

Closed marlof closed 4 years ago

marlof commented 5 years ago

After a timed auto start with 2 similar jobs, the task completes, but the mv from "running" to "completed" fails, and so does the subsequent touch of the "completed" file. This stops the job from completing.

marlof commented 5 years ago

190729-080137       [obrar:0] INFO  SUCCESS: CREATECACHE completed taking 36 seconds
190729-080137       [obrar:0] INFO  =====================================================================
190729-080137 Completed task [1/1]
===================
mv: cannot stat '/opt/scorch/jobs/running/Job_ID-107_107.AWS-CACHE.1_AWS-CACHE_PROD_': No such file or directory
AUDIT:FINISH:1564383698
190729-080138 Tasks[ 1] Time[00h 00m 38s] Failures[1]
touch: setting times of '/opt/scorch/jobs/completed/Job_ID-107_107.AWS-CACHE.1_AWS-CACHE_PROD_': No such file or directory
190729-083001 loftusm ran transition accepted ownership of the job

marlof commented 5 years ago

Tracing the job shows that it never enters the running state: it moves directly from starting to failed after completing its tasks.

marlof commented 5 years ago

On closer inspection, the job appears to be created twice: one instance succeeds and the other fails.

190729-090125 Created by:loftusm Tasks[1]
AUDIT:PID:31711
AUDIT:PID:31710
mv: cannot stat '/opt/scorch/jobs/starting/Job_ID-109_109.AWS-CACHE.1_AWS-CACHE_NONPROD_': No such file or directory
===================
===================
190729-090226 Starting task [0/1]
190729-090226 Starting task [0/1]
    echo AUDIT:START:${str_StartTime} >> "${file_Log}" 2>&1;
AUDIT:START:1564387346
    echo AUDIT:START:${str_StartTime} >> "${file_Log}" 2>&1;
AUDIT:START:1564387346
190729-090226 Completed task [0/1]
190729-090226 Completed task [0/1]
===================
===================
190729-090227 Starting task [1/1]
190729-090227 Starting task [1/1]
    obrar CREATECACHE -e NONPROD -f Scorch-31194 >> "${file_Log}" 2>&1;
    obrar CREATECACHE -e NONPROD -f Scorch-31194 >> "${file_Log}" 2>&1;
190729-090227       [obrar:1] ERROR =====================================================================
190729-090227       [obrar:1] ERROR Could not get exclusive lock. Try again.
 If the issue persists remove the lock file /opt/scorch/var/JobID.obrar.lock

190729-090227       [obrar:0] INFO  Loading env file /opt/scorch/projects/common/etc/obrar.env
190729-090227       [obrar:1] ERROR After 0 seconds.
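
The two AUDIT:PID lines and the failed mv out of starting are consistent with two processes picking up the same job file. Below is a minimal sketch of one way to guard against that. It assumes each process claims a job by renaming it and only carries on if its own rename succeeded. This is not taken from the ScORCH source: the function name claim_job and the demo directory are hypothetical, and only the starting/running layout mirrors the paths in the log.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ScORCH: claim a job by renaming it.
# A rename within one filesystem is atomic, so when two processes race
# for the same file only one mv succeeds; the other gets a non-zero status.
claim_job() {
    local src_dir=$1 dst_dir=$2 job=$3
    mv -- "${src_dir}/${job}" "${dst_dir}/${job}" 2>/dev/null
}

# Demo in a throwaway directory (illustrative layout only).
demo_root=$(mktemp -d)
mkdir -p "${demo_root}/starting" "${demo_root}/running"
touch "${demo_root}/starting/Job_ID-demo"

# First attempt: the job file exists, so the claim succeeds.
if claim_job "${demo_root}/starting" "${demo_root}/running" "Job_ID-demo"; then
    first_claim="claimed"
else
    first_claim="skipped"
fi

# Second attempt on the same job: nothing is left to claim, so the
# caller should skip the job instead of running its tasks again.
if claim_job "${demo_root}/starting" "${demo_root}/running" "Job_ID-demo"; then
    second_claim="claimed"
else
    second_claim="skipped"
fi

echo "first=${first_claim} second=${second_claim}"
rm -rf "${demo_root}"
```

With a check like this, the process that loses the race would stop before running the tasks, rather than running them a second time and failing on the lock file.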

marlof commented 5 years ago

A ps listing shows:

loftusm  16060     1  0 09:34 pts/8    00:00:00 /bin/bash /opt/scorch/scorch -background -j /opt/scorch/jobs
loftusm  16061     1  0 09:34 pts/8    00:00:00 /bin/bash /opt/scorch/scorch -a MESSAGE -o SLEEP:240 -s
loftusm  16065 16061  0 09:34 pts/8    00:00:00 /bin/bash /opt/scorch/scorch -background
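
If the two -background processes in the listing (PIDs 16060 and 16065) are both dispatching from the same jobs directory, that would explain each job being picked up twice. Below is a minimal sketch of a single-instance guard that uses mkdir as an atomic lock. It is an assumption for illustration, not how ScORCH currently works; acquire_lock, release_lock and the lock path are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical single-instance guard, not taken from the ScORCH source.
# mkdir either creates the directory or fails because it already exists,
# in one atomic step, so it can act as a simple lock between processes.
acquire_lock() {
    local lock_dir=$1
    if mkdir -- "${lock_dir}" 2>/dev/null; then
        echo "$$" > "${lock_dir}/pid"   # record the owner for debugging
        return 0
    fi
    return 1
}

release_lock() {
    local lock_dir=$1
    rm -rf -- "${lock_dir}"
}

# Demo with a throwaway lock path (illustrative only).
lock_parent=$(mktemp -d)
lock_dir="${lock_parent}/dispatcher.lock"

# The first caller gets the lock; a second caller is turned away
# until the lock has been released.
if acquire_lock "${lock_dir}"; then first="acquired"; else first="busy"; fi
if acquire_lock "${lock_dir}"; then second="acquired"; else second="busy"; fi
release_lock "${lock_dir}"
if acquire_lock "${lock_dir}"; then third="acquired"; else third="busy"; fi
release_lock "${lock_dir}"

echo "first=${first} second=${second} third=${third}"
rm -rf "${lock_parent}"
```

A long-running process would normally release the lock from a trap on EXIT and would need some way to handle a stale lock left behind by a killed process. Both are left out here to keep the sketch short.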