Running Batch Jobs - Githubissues

kaufmann42 commented 8 years ago

Hi I'm trying to run a batch job utilizing the --array slurm option. Wondering if this is possible using drmaa-python. I know there is a runBulkJobs(...), however it seems that this doesn't run an array of jobs. There doesn't seem to be any $SLURM_ARRAY_TASK_ID (or the likes) associated with that run environment.

with drmaa.Session() as s:
                try:
                    # create job template
                    jt = s.createJobTemplate()
                    jt.nativeSpecification='--mem-per-cpu='+self.memory + ' --array=1-3' + ' --time=' + self.time
                    jt.remoteCommand = command
                    print(jt.nativeSpecification)

                    # run job
                    joblist = s.runBulkJobs(jt, 1, 3, 1)

                    # wait for the return value
                    s.synchronize(joblist, self.convertToSeconds(), False)

                    for curjob in joblist:
                        print('Collecting job ' + curjob)
                        retval = s.wait(curjob, drmaa.Session.TIMEOUT_WAIT_FOREVER)
                        print('Job: {0} finished with status {1} and was aborted {2}'.format(retval.jobId, retval.exitStatus, retval.wasAborted))
                        if retval.wasAborted == True:
                            print("Ran out of memeory using: " + self.memory)
                            self.increaseMemory(6000)
                        # if zero exit code then break and job is over
                        elif retval.exitStatus == 0:
                            break
                except drmaa.ExitTimeoutException:
                    print("Ran out of time using: " + self.time)
                    self.increaseTime(6)
                except drmaa.OutOfMemoryException:
                    print("Ran out of memeory using: " + self.memory)
                    self.increaseMemory(6000)

when I try and run this I get a segmentation fault.

OUTPUT

(gdb) run run.py "echo $SLURM_ARRAY_TASK_ID" 01:00:00 100
Starting program: /apps/python/2.7.6/bin/python run.py "echo $SLURM_ARRAY_TASK_ID" 01:00:00 100
[Thread debugging using libthread_db enabled]
--mem-per-cpu=100 -a=1-3 --time=01:00:00

Program received signal SIGSEGV, Segmentation fault.
drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
297 drmaa_base.c: No such file or directory.
    in drmaa_base.c
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-10.el6_4.6.x86_64 libcom_err-1.41.12-18.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64
(gdb) backtrace
#0  drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
#1  0x00002aaab15cecbc in ffi_call_unix64 () at /apps/python/Python-2.7.6/Modules/_ctypes/libffi/src/x86/unix64.S:76
#2  0x00002aaab15ce393 in ffi_call (cif=<value optimized out>, fn=0x2aaab301b290 <drmaa_release_job_ids>, rvalue=<value optimized out>, avalue=0x7fffffffb0d0) at /apps/python/Python-2.7.6/Modules/_ctypes/libffi/src/x86/ffi64.c:522
#3  0x00002aaab15c6006 in _call_function_pointer (pProc=0x2aaab301b290 <drmaa_release_job_ids>, argtuple=0x7fffffffb1a0, flags=4353, argtypes=<value optimized out>, restype=0x2aaaaae5ebd0, checker=0x0) at /apps/python/Python-2.7.6/Modules/_ctypes/callproc.c:836
#4  _ctypes_callproc (pProc=0x2aaab301b290 <drmaa_release_job_ids>, argtuple=0x7fffffffb1a0, flags=4353, argtypes=<value optimized out>, restype=0x2aaaaae5ebd0, checker=0x0) at /apps/python/Python-2.7.6/Modules/_ctypes/callproc.c:1183
#5  0x00002aaab15bdcf3 in PyCFuncPtr_call (self=<value optimized out>, inargs=<value optimized out>, kwds=0x0) at /apps/python/Python-2.7.6/Modules/_ctypes/_ctypes.c:3929
#6  0x00002aaaaaaf79b3 in PyObject_Call (func=0x93b530, arg=<value optimized out>, kw=<value optimized out>) at Objects/abstract.c:2529
#7  0x00002aaaaaba6ad9 in do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4239
#8  call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4044
#9  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#10 0x00002aaaaab1b887 in gen_send_ex (gen=0x971e60, arg=0x0, exc=<value optimized out>) at Objects/genobject.c:84
#11 0x00002aaaaab2f656 in listextend (self=0x981200, b=<value optimized out>) at Objects/listobject.c:872
#12 0x00002aaaaab2fae0 in list_init (self=0x981200, args=<value optimized out>, kw=<value optimized out>) at Objects/listobject.c:2458
#13 0x00002aaaaab5a8a8 in type_call (type=<value optimized out>, args=0x97bf90, kwds=0x0) at Objects/typeobject.c:745
#14 0x00002aaaaaaf79b3 in PyObject_Call (func=0x2aaaaae5b0a0, arg=<value optimized out>, kw=<value optimized out>) at Objects/abstract.c:2529
#15 0x00002aaaaaba6ad9 in do_call (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4239
#16 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4044
#17 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#18 0x00002aaaaaba917e in PyEval_EvalCodeEx (co=0x78f4b0, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=4, kws=0x9addb8, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#19 0x00002aaaaaba7332 in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4117
#20 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#21 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#22 0x00002aaaaaba807e in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#23 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#24 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#25 0x00002aaaaaba807e in fast_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4107
#26 call_function (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:4042
#27 PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at Python/ceval.c:2666
#28 0x00002aaaaaba917e in PyEval_EvalCodeEx (co=0x7743b0, globals=<value optimized out>, locals=<value optimized out>, args=<value optimized out>, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:3253
#29 0x00002aaaaaba9292 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, locals=<value optimized out>) at Python/ceval.c:667
#30 0x00002aaaaabc8e40 in run_mod (fp=0x82a030, filename=<value optimized out>, start=<value optimized out>, globals=0x67e510, locals=0x67e510, closeit=1, flags=0x7fffffffbe20) at Python/pythonrun.c:1370
#31 PyRun_FileExFlags (fp=0x82a030, filename=<value optimized out>, start=<value optimized out>, globals=0x67e510, locals=0x67e510, closeit=1, flags=0x7fffffffbe20) at Python/pythonrun.c:1356
#32 0x00002aaaaabc901f in PyRun_SimpleFileExFlags (fp=0x82a030, filename=0x7fffffffc4bd "run.py", closeit=1, flags=0x7fffffffbe20) at Python/pythonrun.c:948
#33 0x00002aaaaabdeb34 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:640
#34 0x00000039ad21ecdd in __libc_start_main () from /lib64/libc.so.6
#35 0x0000000000400669 in _start ()

gdb debug backtrace gives the following result

aside: I'm also having trouble with it throwing a OutOfMemoryException.. Therefore am forced to assume it was aborted due to memory (not preferable) so advice on what's happening there would be great.

Thanks!

unode commented 7 years ago

I also see the segfault on SLURM. Possibly related to this.

unode commented 7 years ago

TL;DR: In my case this is no longer a problem with drmaa-python.

Tried to debug the code to figure out which call triggered the segfault. At the moment I'm not sure that what I'm seeing is the same error reported above.

The problem seems to be in the call to drmaa_get_next_job_id. The job is submitted and queued, it reaches the while loops, loops once and a jid is collected. Printing jid returns the same id as the submitted job. Yet as soon as the while condition is evaluated a second time it segfaults.

Afterwards I tried compiling the latest libdrmaa for SLURM and with it I could no longer reproduce the segfault. This is also true even when compiling manually the same libdrmaa.so version provided system-wide.

It seems like this is an upstream problem and only seems to affect some releases when a certain combination of conditions is met.

The code used to debug this issue:

#!/usr/bin/env python
from __future__ import print_function
import os
import drmaa

LOGS = "logs/"
if not os.path.isdir(LOGS):
    os.mkdir(LOGS)

s = drmaa.Session()
s.initialize()
print("Supported contact strings:", s.contact)
print("Supported DRM systems:", s.drmsInfo)
print("Supported DRMAA implementations:", s.drmaaImplementation)
print("Version", s.version)

jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["Hello", "world"]
jt.jobName = "testdrmaa"
jt.jobEnvironment = os.environ.copy()
jt.workingDirectory = os.getcwd()

jt.outputPath = ":" + os.path.join(LOGS, "job-%A_%a.out")
jt.errorPath = ":" + os.path.join(LOGS, "job-%A_%a.err")

jt.nativeSpecification = "--ntasks=2 --mem-per-cpu=50 --partition=1day"

print("Submitting", jt.remoteCommand, "with", jt.args, "and logs to", jt.outputPath)
ids = s.runBulkJobs(jt, beginIndex=1, endIndex=10, step=1)
print("Job submitted with ids", ids)

s.deleteJobTemplate(jt)

On a malfunctioning system the message "Submitting..." would be shown and immediately after "Segmentation fault" . When the system is working properly you should instead see "Job submitted with...".

unode commented 7 years ago

For anyone that lands on this post after dealing with segmentation fault errors on SLURM, you might want to ask your cluster administrator to use libdrmaa.so from https://github.com/FredHutch/slurm-drmaa.

It's far from perfect and will still segfault if there are options in the nativeSpecification string that it cannot parse. Regardless I've managed to workaround most of the issues mentioned above with this version.

jakirkham commented 6 years ago

Perhaps worth adding a note in the docs and closing as this is a DRMAA implementation issue.

unode commented 6 years ago

Just adding that https://github.com/natefoo/slurm-drmaa/commit/7b5991efc03ab14fdc9e7af67dc91f6085e4d648 solves this issue. I agree with @jakirkham.

pygridtools / drmaa-python

Running Batch Jobs #38