natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

segfault when waiting for bulk jobs > 20 mins #27

Open reyjul opened 5 years ago

reyjul commented 5 years ago

Hi,

When I run bulk jobs where one of the tasks lasts more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job has finished and then triggers a segfault:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

The same happens if I loop over the job ids with the s.wait() function:

for jobid in joblist:
    s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

However, it works perfectly fine if the jobs finish in the same order as their SLURM_ARRAY_TASK_ID:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((${SLURM_ARRAY_TASK_ID}*300))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

There is no problem if the jobs are much shorter (here at most two minutes):

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*10))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

I came up with this little piece of code to work around the bug:

import time
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
    while s.jobStatus(jobid) == "running":
        time.sleep(10)
    print("job %s done" % jobid)

Yields:

job 135892105_1 done
job 135892105_2 done
job 135892105_3 done
job 135892105_4 done
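
Note that the loop above only waits while a task is reported as running, so a task that is still queued when it is first polled would fall straight through. A slightly more defensive variant of the same workaround (just a sketch, relying on drmaa-python's JobState constants for the terminal states) polls until each task is done or failed:

import time
import drmaa
# drmaa-python's JobState constants; DONE and FAILED are the terminal states
TERMINAL_STATES = {drmaa.JobState.DONE, drmaa.JobState.FAILED}
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
    # poll the status directly instead of calling s.wait()/s.synchronize(),
    # which are the calls that end in the segfault described above
    while s.jobStatus(jobid) not in TERMINAL_STATES:
        time.sleep(10)
    print("job %s done" % jobid)
s.deleteJobTemplate(jt)
s.exit()
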
mkher64 commented 4 years ago

This is a bug in job.c, in slurmdrmaa_find_job_info()/slurmdrmaa_job_on_missing(). Some time after a job finishes, slurm_load_job() returns an error; that is expected Slurm behavior. I'm not sure what on_missing() is supposed to do, but the code continues as though it had filled in job_info, which is not true. I suspect it should throw an error at that point. Alternatively, on_missing() could get the job information from slurmdb. Unfortunately, I'm not familiar enough with this library or with Slurm to do it right now.
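
For reference, the "expected Slurm behavior" part can be observed from outside the library: once a finished job has aged out of the controller's memory (after MinJobAge, 300 seconds by default), slurmctld no longer knows about it and reports an invalid job id, which is presumably the same condition under which slurm_load_job() starts failing inside slurmdrmaa_find_job_info(). A rough check (just a sketch; the job id is taken from the example output above and is only a placeholder):

import subprocess
# Ask the controller about a task that finished a while ago.  Once the
# record has been purged (MinJobAge), scontrol exits non-zero and prints
# an "Invalid job id specified" style error instead of the job record.
jobid = "135892105_1"  # placeholder: any array task that finished well before
result = subprocess.run(["scontrol", "show", "job", jobid],
                        capture_output=True, text=True)
print(result.returncode)
print((result.stderr or result.stdout).strip())

At that point the accounting database (sacct/slurmdb) still has the record, assuming accounting storage is configured, which is why falling back to slurmdb in on_missing(), as suggested above, would in principle work.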