kaufmann42 opened this issue 8 years ago
TL;DR: In my case this is no longer a problem with drmaa-python.
Tried to debug the code to figure out which call triggered the segfault. At the moment I'm not sure that what I'm seeing is the same error reported above.
The problem seems to be in the call to drmaa_get_next_job_id.
The job is submitted and queued; execution reaches the while loop, iterates once, and a jid is collected. Printing jid shows the same id as the submitted job, yet as soon as the while condition is evaluated a second time it segfaults.
Afterwards I tried compiling the latest libdrmaa for SLURM, and with it I could no longer reproduce the segfault. This was true even when manually compiling the same libdrmaa.so version that is provided system-wide.
This looks like an upstream problem that only affects some releases when a certain combination of conditions is met.
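If you want to test a locally compiled libdrmaa without replacing the system-wide one, drmaa-python honours the DRMAA_LIBRARY_PATH environment variable, which must be set before the module is imported (the library is loaded at import time). A minimal sketch, where the library path is only an example for your own build:
import os
# Point drmaa-python at the locally compiled library before importing it;
# the path below is an example, adjust it to wherever your build landed.
os.environ["DRMAA_LIBRARY_PATH"] = "/opt/slurm-drmaa/lib/libdrmaa.so.1"
import drmaa
s = drmaa.Session()
s.initialize()
print("DRMAA implementation:", s.drmaaImplementation)
s.exit()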
The code used to debug this issue:
#!/usr/bin/env python
from __future__ import print_function
import os
import drmaa
LOGS = "logs/"
if not os.path.isdir(LOGS):
os.mkdir(LOGS)
s = drmaa.Session()
s.initialize()
print("Supported contact strings:", s.contact)
print("Supported DRM systems:", s.drmsInfo)
print("Supported DRMAA implementations:", s.drmaaImplementation)
print("Version", s.version)
jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["Hello", "world"]
jt.jobName = "testdrmaa"
jt.jobEnvironment = os.environ.copy()
jt.workingDirectory = os.getcwd()
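# The leading ':' is DRMAA output/error path syntax (an optional hostname goes before the colon);
# %A and %a are SLURM's array job id / task index filename placeholders.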
jt.outputPath = ":" + os.path.join(LOGS, "job-%A_%a.out")
jt.errorPath = ":" + os.path.join(LOGS, "job-%A_%a.err")
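# nativeSpecification is handed to the SLURM DRMAA library as raw sbatch-style options;
# some libdrmaa builds segfault on options they cannot parse.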
jt.nativeSpecification = "--ntasks=2 --mem-per-cpu=50 --partition=1day"
print("Submitting", jt.remoteCommand, "with", jt.args, "and logs to", jt.outputPath)
ids = s.runBulkJobs(jt, beginIndex=1, endIndex=10, step=1)
print("Job submitted with ids", ids)
s.deleteJobTemplate(jt)
s.exit()
On a malfunctioning system the "Submitting..." message is printed and is immediately followed by "Segmentation fault". On a properly working system you should instead see "Job submitted with...".
For anyone who lands on this post after dealing with segmentation faults on SLURM: you might want to ask your cluster administrator to use libdrmaa.so
from https://github.com/FredHutch/slurm-drmaa.
It's far from perfect and will still segfault if there are options in the nativeSpecification string that it cannot parse, but I've managed to work around most of the issues mentioned above with this version.
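As a quick sanity check after swapping in that build, a minimal single-job submission along these lines (a sketch; keep nativeSpecification trimmed to options you know the parser handles) should print a job id instead of segfaulting:
import drmaa
s = drmaa.Session()
s.initialize()
print("DRMAA implementation:", s.drmaaImplementation)
jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["ok"]
# Keep this minimal; unsupported options may still crash the parser.
jt.nativeSpecification = "--ntasks=1 --mem-per-cpu=50"
print("Submitted job", s.runJob(jt))
s.deleteJobTemplate(jt)
s.exit()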
Perhaps it's worth adding a note in the docs and closing this, since it is a DRMAA implementation issue.
Just adding that https://github.com/natefoo/slurm-drmaa/commit/7b5991efc03ab14fdc9e7af67dc91f6085e4d648 solves this issue. I agree with @jakirkham.
Hi, I'm trying to run a batch job using the SLURM --array option and wondering whether this is possible with drmaa-python. I know there is runBulkJobs(...), but it doesn't seem to run an array of jobs: there doesn't appear to be any $SLURM_ARRAY_TASK_ID (or the like) in the run environment.
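(For reference, a minimal sketch of how DRMAA itself exposes a per-task index for bulk jobs, via the JobTemplate.PARAMETRIC_INDEX placeholder; whether SLURM_ARRAY_TASK_ID is also exported to the job environment depends on the libdrmaa build.)
import os
import drmaa
s = drmaa.Session()
s.initialize()
jt = s.createJobTemplate()
jt.remoteCommand = "/usr/bin/echo"
jt.args = ["hello from a bulk task"]
# PARAMETRIC_INDEX is the DRMAA placeholder for the task index (1..10 here),
# so each task writes to its own output file.
jt.outputPath = ":" + os.path.join(os.getcwd(), "task." + drmaa.JobTemplate.PARAMETRIC_INDEX + ".out")
ids = s.runBulkJobs(jt, 1, 10, 1)
print("Bulk job ids:", ids)
s.deleteJobTemplate(jt)
s.exit()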
When I try to run my script I get a segmentation fault.
OUTPUT
A gdb backtrace gives the following result:
Aside: I'm also having trouble with it throwing an OutOfMemoryException, so I'm forced to assume the job was aborted due to memory (not preferable); advice on what's happening there would be great.
Thanks!