natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

segfault when waiting for bulk jobs > 20 mins #27

Open reyjul opened 5 years ago

reyjul commented 5 years ago

Hi,

When I run bulk jobs where one of the tasks lasts more than 20 minutes, the s.synchronize() function waits for a few minutes after the last job has finished and then triggers a segfault:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

The same happens if I loop over the job ids with the s.wait() function:

for jobid in joblist:
    s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)

However, it works perfectly fine if the jobs finish in the same order as their SLURM_ARRAY_TASK_ID:

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((${SLURM_ARRAY_TASK_ID}*300))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

There is no problem if the jobs are much shorter (here at most two minutes):

import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*10))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
s.synchronize(joblist, drmaa.Session.TIMEOUT_WAIT_FOREVER, True)

I came up with this little piece of code to work around the bug:

import time
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
    while s.jobStatus(jobid) == "running":
        time.sleep(10)
    print("job %s done" % jobid)

Yields:

job 135892105_1 done
job 135892105_2 done
job 135892105_3 done
job 135892105_4 done
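
Note that the loop above only waits while a task is reported as running, so a task that is still queued when it is first polled would fall straight through. A slightly more defensive variant of the same workaround (just a sketch, relying on drmaa-python's JobState constants for the terminal states) polls until each task is done or failed:

import time
import drmaa
# drmaa-python's JobState constants; DONE and FAILED are the terminal states
TERMINAL_STATES = {drmaa.JobState.DONE, drmaa.JobState.FAILED}
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "eval"
jt.args = [ 'sleep $((12/${SLURM_ARRAY_TASK_ID}*100))' ]
joblist = s.runBulkJobs(jt, 1, 4, 1)
for jobid in joblist:
    # poll the status directly instead of calling s.wait()/s.synchronize(),
    # which are the calls that end in the segfault described above
    while s.jobStatus(jobid) not in TERMINAL_STATES:
        time.sleep(10)
    print("job %s done" % jobid)
s.deleteJobTemplate(jt)
s.exit()
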
mkher64 commented 4 years ago

This is a bug in job.c, in slurmdrmaa_find_job_info()/slurmdrmaa_job_on_missing(). Some time after a job finishes, slurm_load_job() returns an error; that is expected Slurm behavior. I'm not sure what on_missing() is supposed to do, but the code continues as though it had filled in job_info, which is not true. I suspect it should throw an error at that point. Alternatively, on_missing() could get the job information from slurmdb. Unfortunately, I'm not familiar enough with this library or with Slurm to do it right now.
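
For reference, the "expected Slurm behavior" part can be observed from outside the library: once a finished job has aged out of the controller's memory (after MinJobAge, 300 seconds by default), slurmctld no longer knows about it and reports an invalid job id, which is presumably the same condition under which slurm_load_job() starts failing inside slurmdrmaa_find_job_info(). A rough check (just a sketch; the job id is taken from the example output above and is only a placeholder):

import subprocess
# Ask the controller about a task that finished a while ago.  Once the
# record has been purged (MinJobAge), scontrol exits non-zero and prints
# an "Invalid job id specified" style error instead of the job record.
jobid = "135892105_1"  # placeholder: any array task that finished well before
result = subprocess.run(["scontrol", "show", "job", jobid],
                        capture_output=True, text=True)
print(result.returncode)
print((result.stderr or result.stdout).strip())

At that point the accounting database (sacct/slurmdb) still has the record, assuming accounting storage is configured, which is why falling back to slurmdb in on_missing(), as suggested above, would in principle work.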