EricR86 closed this issue 5 years ago.
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
This is likely not a Segway-specific bug, nor even a DRMAA-for-Slurm bug, since neither has any mention of a string containing "unknown" or "signal". This is more likely cluster- or OS-specific.
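For anyone digging further: a minimal sketch, assuming Segway waits on jobs through the Python drmaa bindings, of how a JobInfo returned by Session.wait() is typically classified. This is illustrative, not Segway's actual reporting code; if a DRM hands back a JobInfo where neither hasExited nor hasSignal is set, the caller has nothing better than "unknown" to report:

def describe_exit(job_info):
    """Classify a drmaa.JobInfo returned by drmaa.Session.wait().

    Illustrative only; not Segway's actual reporting code.
    """
    if job_info.wasAborted:
        # The job never ran to completion (e.g. cancelled while queued).
        return "aborted"
    if job_info.hasExited:
        # Normal termination: exitStatus holds the process exit code.
        return "exit %d" % job_info.exitStatus
    if job_info.hasSignal:
        # Killed by a signal, e.g. SIGKILL from the out-of-memory killer.
        return "signal %s" % job_info.terminatedSignal
    # Neither flag set: the DRM reported no usable termination status.
    return "unknown"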
Original comment by Rachel Chan (Bitbucket: rcwchan).
@ericr86 But aren't the Slurm host machines just running CentOS 7?
Also, sacct says the job had a '0:0' exit status:
rachelc@mordorlogin1: log$ sacct -j 105932
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
105932       vit1365.2+ hoffmangr+ hoffmangr+          1  COMPLETED      0:0
105932.batch      batch            hoffmangr+          1  COMPLETED      0:0
A job which Segway reports as having a zero exit status looks identical:
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
105820       vit448.20+ hoffmangr+ hoffmangr+          1  COMPLETED      0:0
105820.batch      batch            hoffmangr+          1  COMPLETED      0:0
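For reference, Slurm reports ExitCode as <exit code>:<signal>, so 0:0 means a zero exit code and no terminating signal. Assuming a sacct recent enough to support the DerivedExitCode field, the job's derived code can also be requested explicitly, to rule out a mismatch between the job and its batch step:

sacct -j 105932 --format=JobID,State,ExitCode,DerivedExitCode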
Original comment by Rachel Chan (Bitbucket: rcwchan).
I also want to point out that when comparing the jobs.identify.tab files from SGE and from SLURM, it becomes apparent that some columns are not being reported properly:
SGE:
5529340 vit412.20190421-0305_3.identifydir.82380ade643f11e9ab5b5254004fdc0a gmtkViterbi 10 1993960 1323479040.0000 51.0422 0
SLURM:
137679 vit404.20190508-1026_3.identifydir.bc5d517c71ac11e998855254009ae54a gmtkViterbi 10 1993960 0 0 unknown signal?!
Specifically, it seems that the final three columns (maxvmem, cpu, and exit_status) are all being reported incorrectly.
Note that when running locally, I think only maxvmem (and maybe exit_status?) is reported incorrectly:
vit34.20190508-1026_3.identifydir.eaeed436729511e9857152540005a5cf gmtkViterbi 10 1966430 0 73.25283 0
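One possible explanation (an assumption on my part; I have not checked this against Segway's source): if those columns are filled in from the DRMAA resourceUsage dictionary, a DRM that uses different key names, or omits them entirely, would produce zeros. A hypothetical sketch:

def accounting_fields(job_info):
    """Pull maxvmem/cpu out of a drmaa.JobInfo's resourceUsage dict.

    Illustrative only. resourceUsage is a DRM-specific mapping of string
    keys to string values; "maxvmem" and "cpu" are the names SGE uses.
    If slurm-drmaa reports different names (or none at all), these fall
    back to 0, matching the zeros in the SLURM row above.
    """
    usage = job_info.resourceUsage or {}
    maxvmem = float(usage.get("maxvmem", 0))
    cpu = float(usage.get("cpu", 0))
    return maxvmem, cpu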
Original report (Bitbucket issue) by Rachel Chan (Bitbucket: rcwchan).
I had a Viterbi job fail on SLURM for an unknown reason, with empty error/output messages.
identifydir/jobs.identify.tab reports an "unknown signal?!" exit status, so it seems that returned nonzero error codes are not parsed properly in SLURM.