natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0
48 stars 22 forks source link

segfault when max number of jobs reached #6

Closed unode closed 6 years ago

unode commented 6 years ago

This still requires confirmation and a proper backtrace but when submitting over 10.000 jobs on a cluster with a 10.000 limit a segfault would be seen.

unode commented 6 years ago

This can be triggered by using the same approach described in #5 and:

ids = s.runBulkJobs(jt, beginIndex=1, endIndex=100000, step=1)

resulting in a backtrace

Program received signal SIGSEGV, Segmentation fault.
drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
297     iter_function(job_id, drmaa_job_ids_t)
(gdb) bt
#0  drmaa_release_job_ids (values=0x0) at drmaa_base.c:297
#1  0x00007fffeffed550 in ffi_call_unix64 () at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/unix64.S:76
#2  0x00007fffeffeccf5 in ffi_call (cif=<optimized out>, fn=0x7fffed771770 <drmaa_release_job_ids>, rvalue=<optimized out>, avalue=0x7fffffffc710)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/libffi/src/x86/ffi64.c:525
#3  0x00007fffeffe483c in _call_function_pointer (argcount=1, resmem=0x7fffffffc730, restype=<optimized out>, atypes=<optimized out>, avalues=0x7fffffffc710, pProc=0x7fffed771770 <drmaa_release_job_ids>, flags=4353)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:809
#4  _ctypes_callproc (pProc=0x7fffed771770 <drmaa_release_job_ids>, argtuple=0x7fffffffc7e0, flags=4353, argtypes=<optimized out>, restype=0x7ffff7d61cd0 <_Py_NoneStruct>, checker=0x0)
    at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/callproc.c:1147
#5  0x00007fffeffdcda3 in PyCFuncPtr_call (self=<optimized out>, inargs=<optimized out>, kwds=0x0) at /home/ilan/minonda/conda-bld/python_1494526091235/work/Python-3.6.1/Modules/_ctypes/_ctypes.c:3870
#6  0x00007ffff793fade in _PyObject_FastCallDict (func=0x7fffeea65818, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#7  0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffcb18, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#8  0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#9  0x00007ffff7969e33 in gen_send_ex (gen=0x7fffefd90200, arg=<optimized out>, exc=<optimized out>, closing=<optimized out>) at Objects/genobject.c:189
#10 0x00007ffff7978f16 in listextend (self=0x7fffeea79d48, b=<optimized out>) at Objects/listobject.c:857
#11 0x00007ffff7979398 in list_init (self=0x7fffeea79d48, args=<optimized out>, kw=<optimized out>) at Objects/listobject.c:2316
#12 0x00007ffff79add4c in type_call (type=<optimized out>, args=0x7ffff7e8d470, kwds=0x0) at Objects/typeobject.c:915
#13 0x00007ffff793fade in _PyObject_FastCallDict (func=0x7ffff7d5bb40 <PyList_Type>, args=<optimized out>, nargs=<optimized out>, kwargs=0x0) at Objects/abstract.c:2316
#14 0x00007ffff7a1c2bb in call_function (pp_stack=0x7fffffffce58, oparg=<optimized out>, kwnames=0x0) at Python/ceval.c:4822
#15 0x00007ffff7a1f15d in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3284
#16 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff01fc420, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=1, kwnames=0x7ffff7e9dba0, kwargs=0x7ffff7f8fba8, kwcount=3, kwstep=1, 
    defs=0x0, defcount=0, kwdefs=0x0, closure=0x0, name=0x7ffff7ea3c30, qualname=0x7fffefd8d2b8) at Python/ceval.c:4128
#17 0x00007ffff7a1c48a in fast_function (kwnames=<optimized out>, nargs=1, stack=<optimized out>, func=0x7fffeea8c2f0) at Python/ceval.c:4939
#18 call_function (pp_stack=0x7fffffffd0f8, oparg=<optimized out>, kwnames=<optimized out>) at Python/ceval.c:4819
#19 0x00007ffff7a1e8dd in _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:3300
#20 0x00007ffff7a1aa60 in _PyEval_EvalCodeWithName (_co=0x7ffff7f1b930, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=0, kwnames=0x0, kwargs=0x8, kwcount=0, kwstep=2, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0, name=0x0, qualname=0x0) at Python/ceval.c:4128
#21 0x00007ffff7a1aee3 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, 
    kwdefs=0x0, closure=0x0) at Python/ceval.c:4149
#22 0x00007ffff7a1af2b in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>) at Python/ceval.c:695
#23 0x00007ffff7a4d6c0 in run_mod (arena=0x7ffff7f79180, flags=0x7fffffffd450, locals=0x7ffff7f5cf30, globals=0x7ffff7f5cf30, filename=0x7ffff7ea3830, mod=0x683f58) at Python/pythonrun.c:980
#24 PyRun_FileExFlags (fp=0x64cc30, filename_str=<optimized out>, start=<optimized out>, globals=0x7ffff7f5cf30, locals=0x7ffff7f5cf30, closeit=<optimized out>, flags=0x7fffffffd450) at Python/pythonrun.c:933
#25 0x00007ffff7a4ec83 in PyRun_SimpleFileExFlags (fp=0x64cc30, filename=<optimized out>, closeit=1, flags=0x7fffffffd450) at Python/pythonrun.c:396
#26 0x00007ffff7a6a0b5 in run_file (p_cf=0x7fffffffd450, filename=0x603310 L<error reading variable>, fp=0x64cc30) at Modules/main.c:338
#27 Py_Main (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:810
#28 0x0000000000400c1d in main (argc=2, argv=<optimized out>) at ./Programs/python.c:69

From this backtrace it seems a duplicate of #10

unode commented 6 years ago

This may actually be a bug hiding behind #5 . I guess #5 will have to be fixed first.

unode commented 6 years ago

Now that #5 is fixed, I can confirm that this segfault is still present.

unode commented 6 years ago

This now produces an informative exception and no segfault. Thanks for the fix.