natefoo / slurm-drmaa

DRMAA for Slurm: Implementation of the DRMAA C bindings for Slurm
GNU General Public License v3.0

Running CLI commands without options segfaults. #64

Closed richc-admin-gcai closed 2 years ago

richc-admin-gcai commented 2 years ago

I'm testing slurm-drmaa in a container. Whether I run it inside or outside the container, and whether I build from source or install the Galaxy RPM, the binary segfaults every time I run it.

Am I missing something?

The error is:

[root@f8ddc11bc51e /]# DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so /usr/bin/drmaa-run
Segmentation fault (core dumped)

Backtrace shows:

[root@f8ddc11bc51e /]# export DRMAA_LIBRARY_PATH=/usr/lib64/libdrmaa.so
[root@f8ddc11bc51e /]# gdb /usr/bin/drmaa-run
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/bin/drmaa-run...Reading symbols from /usr/lib/debug/usr/bin/drmaa-run.debug...done.
done.
(gdb) run
Starting program: /usr/bin/drmaa-run
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
254 while (argc >= 0 && argv[0][0] == '-')
(gdb) backtrace
#0 0x00000000004129b6 in parse_args (argc=0, argv=0x7fffffffe7a0) at drmaa_run.c:254
#1 0x00000000004120df in main (argc=1, argv=0x7fffffffe798) at drmaa_run.c:122
(gdb)

My test setup is as follows:

Dockerfile:

$ cat Dockerfile
FROM centos:7

RUN (cd /lib/systemd/system/sysinit.target.wants/; for i in *; do [ $i == systemd-tmpfiles-setup.service ] || rm -f $i; done); \
rm -f /lib/systemd/system/multi-user.target.wants/*;\
rm -f /etc/systemd/system/*.wants/*;\
rm -f /lib/systemd/system/local-fs.target.wants/*; \
rm -f /lib/systemd/system/sockets.target.wants/*udev*; \
rm -f /lib/systemd/system/sockets.target.wants/*initctl*; \
rm -f /lib/systemd/system/basic.target.wants/*;\
rm -f /lib/systemd/system/anaconda.target.wants/*;

VOLUME [ "/sys/fs/cgroup"]

RUN yum-config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
RUN yum-config-manager --add-repo https://depot.galaxyproject.org/yum/galaxy.repo

RUN yum -y install which strace gdb
RUN debuginfo-install -y libgcc-4.8.5-44.el7.x86_64
RUN debuginfo-install -y glibc-2.17-324.el7_9.x86_64
RUN yum -y install slurm-slurmd-20.11.8 slurm-devel-20.11.8glibc-2.17-324.el7_9.x86_64

RUN yum clean all && yum -y update

RUN yum -y install slurm-drmaa slurm-drmaa-debuginfo

RUN yum clean all && \
rm -rf /var/cache/yum

VOLUME [ "/sys/fs/cgroup"]

ENTRYPOINT ['/usr/sbin/init']

This results in a working container. Inside the container I'm running:

[root@f8ddc11bc51e /]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

[root@f8ddc11bc51e7 /]# rpm -qa slurm*
slurm-slurmd-20.11.8-1.el7.x86_64
slurm-drmaa-debuginfo-1.1.2-1.el7.x86_64
slurm-20.11.8-1.el7.x86_64
slurm-devel-20.11.8-1.el7.x86_64
slurm-drmaa-1.1.2-1.el7.x86_64

[root@f8ddc11bc51e /]# yum info slurm-drmaa-1.1.2-1.el7.x86_64
Loaded plugins: fastestmirror, ovl
Loading mirror speeds from cached hostfile

richc-admin-gcai commented 2 years ago

I believe the issue is in these two lines:

drmaa_utils/drmaa_utils/drmaa_run_bulk.c: while (argc >= 0 && argv[0][0] == '-')
drmaa_utils/drmaa_utils/drmaa_run.c: while (argc >= 0 && argv[0][0] == '-')

Shouldn't this be:

while (argc > 0 && argv[0][0] == '-')

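When argc is 0, the `argc >= 0` test still passes, so the loop goes on to read argv[0][0]. The backtrace shows parse_args being called with the program name already skipped (argc=0, argv one slot past main's argv), so argv[0] there is the NULL pointer that terminates the argument list, and dereferencing it to check for '-' causes the segfault.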
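A minimal sketch of the corrected guard follows. It is a simplified stand-in, not the actual parse_args from drmaa_utils: count_leading_options is a hypothetical helper that only counts leading options, and it assumes the caller has already skipped the program name, as the backtrace suggests main does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of slurm-drmaa): counts the leading
 * arguments that start with '-'. Assumes the caller passes
 * (argc - 1, argv + 1) from main(), i.e. the program name is skipped.
 */
static int count_leading_options(int argc, char **argv)
{
    int n = 0;

    assert(argc >= 0);

    /*
     * "argc > 0" is evaluated first and short-circuits the && when no
     * arguments remain. With ">= 0" and argc == 0, argv[0] would be the
     * NULL terminator of the argument vector, and argv[0][0] would
     * dereference a null pointer.
     */
    while (argc > 0 && argv[0][0] == '-') {
        n++;
        argc--;
        argv++;
    }
    return n;
}
```

Called with argc == 0, the loop body never runs and argv is never dereferenced, which is the case that crashed when drmaa-run was started without arguments.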

If I make that change then the binaries throw the expected error:

[root@f8ddc11bc51e slurm-drmaa-1.1.2]# ./drmaa-run-bulk
F #9472 [ 0.00] * syntax error
F #9472 [ 0.00] | drmaa-run-bulk {start} {end} {step} {command}

[root@f8ddc11bc51e slurm-drmaa-1.1.2]# ./drmaa-run
F #9473 [ 0.00] * Failed to submit a job: drmaa_remote_command not set for job template

natefoo commented 2 years ago

Your analysis looks correct to me. I'll commit a fix and include it in the next release of slurm-drmaa. Thanks!