Closed marlof closed 4 years ago
190729-080137 [obrar:0] INFO SUCCESS: CREATECACHE completed taking 36 seconds
190729-080137 [obrar:0] INFO =====================================================================
190729-080137 Completed task [1/1]
===================
mv: cannot stat '/opt/scorch/jobs/running/Job_ID-107_107.AWS-CACHE.1_AWS-CACHE_PROD_': No such file or directory
AUDIT:FINISH:1564383698
190729-080138 Tasks[ 1] Time[00h 00m 38s] Failures[1]
touch: setting times of '/opt/scorch/jobs/completed/Job_ID-107_107.AWS-CACHE.1_AWS-CACHE_PROD_': No such file or directory
190729-083001 loftusm ran transition accepted ownership of the job
Tracing the job shows that it never enters the running state: it moves directly from starting to failed after completing its tasks.
On closer inspection, the job appears to be created twice; one of the jobs succeeds and the other fails:
190729-090125 Created by:loftusm Tasks[1]
AUDIT:PID:31711
AUDIT:PID:31710
mv: cannot stat '/opt/scorch/jobs/starting/Job_ID-109_109.AWS-CACHE.1_AWS-CACHE_NONPROD_': No such file or directory
===================
===================
190729-090226 Starting task [0/1]
190729-090226 Starting task [0/1]
echo AUDIT:START:${str_StartTime} >> "${file_Log}" 2>&1;
AUDIT:START:1564387346
echo AUDIT:START:${str_StartTime} >> "${file_Log}" 2>&1;
AUDIT:START:1564387346
190729-090226 Completed task [0/1]
190729-090226 Completed task [0/1]
===================
===================
190729-090227 Starting task [1/1]
190729-090227 Starting task [1/1]
obrar CREATECACHE -e NONPROD -f Scorch-31194 >> "${file_Log}" 2>&1;
obrar CREATECACHE -e NONPROD -f Scorch-31194 >> "${file_Log}" 2>&1;
190729-090227 [obrar:1] ERROR =====================================================================
190729-090227 [obrar:1] ERROR Could not get exclusive lock. Try again.
If the issue persists remove the lock file /opt/scorch/var/JobID.obrar.lock
190729-090227 [obrar:0] INFO Loading env file /opt/scorch/projects/common/etc/obrar.env
190729-090227 [obrar:1] ERROR After 0 seconds.
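One way to confirm from a job log that two processes ran the same job is to count the distinct `AUDIT:PID:` entries. The following is a minimal sketch, assuming each process that picks up a job writes exactly one `AUDIT:PID:<pid>` line to the job log (inferred from the excerpt above, not confirmed from the scorch source). The function name `count_job_runners` is hypothetical.

```shell
#!/bin/bash
# Hypothetical diagnostic helper; not part of scorch.
# Prints the number of distinct PIDs recorded in AUDIT:PID:<pid> lines.
# Assumption: one AUDIT:PID line per process that ran the job.
count_job_runners() {
    local log_file=$1
    # Split on ":" so $3 is the PID; count each PID only once.
    awk -F: '/^AUDIT:PID:/ && !seen[$3]++ { n++ } END { print n + 0 }' "$log_file"
}
```

For the log above, which contains `AUDIT:PID:31711` and `AUDIT:PID:31710`, the count would be 2. Under the stated assumption, any value greater than 1 suggests the job was picked up more than once.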
The ps output shows:
loftusm 16060 1 0 09:34 pts/8 00:00:00 /bin/bash /opt/scorch/scorch -background -j /opt/scorch/jobs
loftusm 16061 1 0 09:34 pts/8 00:00:00 /bin/bash /opt/scorch/scorch -a MESSAGE -o SLEEP:240 -s
loftusm 16065 16061 0 09:34 pts/8 00:00:00 /bin/bash /opt/scorch/scorch -background
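The ps output lists more than one scorch process started with `-background`. If two dispatchers are polling the same jobs directory, which is an assumption rather than something the output proves, a single-instance guard would prevent the second one from picking up jobs. The sketch below uses `mkdir`, which either creates the directory or fails, as an atomic lock. The function names and the lock path are hypothetical and do not come from scorch.

```shell
#!/bin/bash
# Hypothetical single-instance guard; none of these names come from scorch.
# mkdir creates the directory or fails in one atomic step, so only one
# caller can hold the lock at a time.
acquire_dispatcher_lock() {
    local lock_dir=$1
    if mkdir -- "$lock_dir" 2>/dev/null; then
        echo "$$" > "${lock_dir}/pid"   # record the owner for debugging
        return 0
    fi
    return 1                            # lock already held
}

# Remove the owner record and then the lock directory itself.
release_dispatcher_lock() {
    local lock_dir=$1
    rm -f -- "${lock_dir}/pid"
    rmdir -- "$lock_dir"
}
```

A background dispatcher could call `acquire_dispatcher_lock` at startup, exit if it returns non-zero, and register `release_dispatcher_lock` in an `EXIT` trap. A process killed without running its trap leaves a stale lock that has to be removed by hand.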
After a timed auto start with two similar jobs, the tasks complete, but the mv from "running" to "completed" and the subsequent touch in "completed" both fail, so the job never completes.
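A possible mitigation is to treat the `mv` between state directories as the point where a process claims the job, and to skip the follow-up `touch` when the move fails. A rename within one filesystem is atomic, so when two processes race, only one `mv` succeeds. The following is a minimal sketch under that assumption. `complete_job` is a hypothetical helper, and only the `running` and `completed` directory names are taken from the paths in the log.

```shell
#!/bin/bash
# Hypothetical helper; not the scorch implementation.
# Moves a job from running/ to completed/ and updates its timestamp only
# if THIS process performed the move.
complete_job() {
    local jobs_dir=$1 job=$2
    if mv -- "${jobs_dir}/running/${job}" "${jobs_dir}/completed/${job}" 2>/dev/null; then
        touch -- "${jobs_dir}/completed/${job}"
        return 0
    fi
    # The source is gone: most likely another process already moved it.
    echo "skip: ${job} is not in running/ (already moved by another process?)" >&2
    return 1
}
```

With this shape, the second of two racing processes gets a non-zero return and a single diagnostic line, and does not run `touch` against a job it did not move.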