natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

recipe for installing SLURM and friends on Debian 11 #70

Open judith-ipac opened 2 years ago

judith-ipac commented 2 years ago

Hello and apologies if this question is in the wrong place. We are upgrading from Debian 8 to Debian 11. I am a developer with no particular background in system administration or configuration. Several weeks into a cycle of install/google-error-message/install-something-else, I have installed munge, slurm, slurm-drmaa, and bats(!). slurmctld and slurmd are now running, but calls to drmaa_run_job() result in seg faults. (The surrounding C++ code is copied from our Debian 8 host, where drmaa_run_job() runs successfully.) I'll print some debug output below, but what I'm really looking for is start-to-finish step-by-step instructions for configuring, installing, and running whatever it takes to make SLURM usable on Debian 11. Thanks in advance.

Last few steps of debug output from drmaa_run_job:

    d #597f9 [ 40.42] finalizing job constraints
    d #597f9 [ 40.42] set min_cpus to ntasks: 1
    t #597f9 [ 40.42] <- slurmdrmaa_parse_native
    ORA-24550: signal received: [si_signo=11] [si_errno=0] [si_code=1] [si_int=0] [si_ptr=(nil)] [si_addr=0x1656]
    kpedbg_dmp_stack()+394<-kpeDbgCrash()+204<-kpeDbgSignalHandler()+113<-skgesig_sigactionHandler()+258<-sighandler()<-0x00007F06CFEC9B71<-slurm_pack_selected_step()+1286<-slurm_send_node_msg()+505<-slurm_send_recv_msg()+66<-slurm_send_recv_controller_msg()+315<-slurm_submit_batch_job()+119<-slurmdrmaa_session_run_bulk()+518<-slurmdrmaa_session_run_job()+179<-drmaa_run_job()+374<-_ZN19custom_code::submit_jobERKN5boost10filesystem4pathES4_RKNSt7cxx1112basic_stringIcSt11char_traitsIcESaIcEEESC_bb()+4407<-0x0000000000000009<-0x7453705F6D00626F

runscript.sh: line 62: 366577 Segmentation fault

Stack trace from gdb:

           Stack trace of thread 366585:
            #0  0x00007f06d1914fe1 raise (libpthread.so.0 + 0x13fe1)
            #1  0x00007f06c254893f skgesigOSCrash (libclntsh.so + 0x267293f)
            #2  0x00007f06c2c63cdd kpeDbgSignalHandler (libclntsh.so + 0x2d8dcdd)
            #3  0x00007f06c2548c12 skgesig_sigactionHandler (libclntsh.so + 0x2672c12)
            #4  0x00007f06d1915140 __restore_rt (libpthread.so.0 + 0x14140)
            #5  0x00007f06cfec9b71 __strlen_avx2 (libc.so.6 + 0x15fb71)
            #6  0x00007f06d0467cb3 n/a (libslurm.so.36 + 0xf8cb3)
            #7  0x00007f06d047c646 n/a (libslurm.so.36 + 0x10d646)
            #8  0x00007f06d0456cf9 slurm_send_node_msg (libslurm.so.36 + 0xe7cf9)
            #9  0x00007f06d0457f72 slurm_send_recv_msg (libslurm.so.36 + 0xe8f72)
            #10 0x00007f06d04580db slurm_send_recv_controller_msg (libslurm.so.36 + 0xe90db)
            #11 0x00007f06d03b76e7 slurm_submit_batch_job (libslurm.so.36 + 0x486e7)
            #12 0x00007f06d05414f1 slurmdrmaa_session_run_bulk (libdrmaa.so.1 + 0xb4f1)
            #13 0x00007f06d054123b slurmdrmaa_session_run_job (libdrmaa.so.1 + 0xb23b)
            #14 0x00007f06d055c133 drmaa_run_job (libdrmaa.so.1 + 0x26133)
            #15 0x000056442ad0bf37 n/a (XXX + 0xd1f37)
            #16 0x0000000000000009 n/a (n/a + 0x0)

Any advice would be greatly appreciated.
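
For anyone reading the trace above, it can help to list which shared libraries appear in it, in frame order. The following is a minimal sketch in Python, assuming the `#N 0xADDR symbol (module + 0xOFFSET)` frame layout shown in the trace; the names `FRAME_RE`, `modules_in_trace`, and `SAMPLE` are hypothetical and exist only for illustration. `SAMPLE` copies a few frames verbatim from the trace.

```python
import re
from typing import List

# Matches frames such as
#   "#8  0x00007f06d0456cf9 slurm_send_node_msg (libslurm.so.36 + 0xe7cf9)"
# Assumption: symbol and module names contain no whitespace.
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+0x[0-9a-fA-F]+\s+(?P<symbol>\S+)\s+"
    r"\((?P<module>\S+) \+ 0x[0-9a-fA-F]+\)"
)


def modules_in_trace(trace: str) -> List[str]:
    """Return module names in first-seen order, without duplicates."""
    seen: List[str] = []
    for match in FRAME_RE.finditer(trace):
        module = match.group("module")
        if module not in seen:
            seen.append(module)
    return seen


# A few frames copied from the trace in this issue.
SAMPLE = """
#3  0x00007f06c2548c12 skgesig_sigactionHandler (libclntsh.so + 0x2672c12)
#4  0x00007f06d1915140 __restore_rt (libpthread.so.0 + 0x14140)
#5  0x00007f06cfec9b71 __strlen_avx2 (libc.so.6 + 0x15fb71)
#6  0x00007f06d0467cb3 n/a (libslurm.so.36 + 0xf8cb3)
#12 0x00007f06d05414f1 slurmdrmaa_session_run_bulk (libdrmaa.so.1 + 0xb4f1)
"""

if __name__ == "__main__":
    for name in modules_in_trace(SAMPLE):
        print(name)
```

Read this way, the frames above `__restore_rt` belong to the signal handler in `libclntsh.so`, which runs after the signal is delivered. The frame where the fault occurred is `__strlen_avx2`, called from `libslurm.so.36` underneath `slurm_send_node_msg`.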

judith-ipac commented 2 years ago

FWIW, when I run "make check" in the slurm-drmaa-1.1.3 repo, it stalls after the first test suite:

    ============================================================================
    Testsuite summary for FedStage DRMAA utilities library 2.0.1
    TOTAL: 1
    PASS: 1
    SKIP: 0
    XFAIL: 0
    FAIL: 0
    XPASS: 0
    ERROR: 0
    ============================================================================

    make[4]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/drmaa_utils/test'
    make[3]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/drmaa_utils/test'
    make[2]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/drmaa_utils/test'
    make[2]: Entering directory 'ROOTDIR/slurm-drmaa-1.1.3/drmaa_utils'
    make[2]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/drmaa_utils'
    make[1]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/drmaa_utils'
    Making check in slurm_drmaa
    make[1]: Entering directory 'ROOTDIR/slurm-drmaa-1.1.3/slurm_drmaa'
    make[1]: Nothing to be done for 'check'.
    make[1]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/slurm_drmaa'
    Making check in test
    make[1]: Entering directory 'ROOTDIR/slurm-drmaa-1.1.3/test'
    make slurm_ping
    make[2]: Entering directory 'ROOTDIR/slurm-drmaa-1.1.3/test'
    make[2]: 'slurm_ping' is up to date.
    make[2]: Leaving directory 'ROOTDIR/slurm-drmaa-1.1.3/test'
    make check-TESTS
    make[2]: Entering directory 'ROOTDIR/slurm-drmaa-1.1.3/test'
    make[3]: Entering directory 'ROOTDIR/slurm-drmaa-1.1.3/test'

Thanks.