hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Error codes on SLURM #134

Closed EricR86 closed 5 years ago

EricR86 commented 5 years ago

Original report (BitBucket issue) by Rachel Chan (Bitbucket: rcwchan).


I had a Viterbi job fail on SLURM for an unknown reason, with empty error and output messages. identifydir/jobs.identify.tab reports:

105825  vit443.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105826  vit442.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105827  vit441.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105820  vit448.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105821  vit447.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105822  vit446.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105823  vit445.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105828  vit440.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105829  vit439.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105587  vit401.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105586  vit402.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105585  vit403.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105584  vit404.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105583  vit405.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105582  vit406.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105581  vit407.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
105580  vit408.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!

So it seems that returned nonzero error codes are not being parsed properly on SLURM.
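One way to list the affected jobs is to scan the file for rows whose last column is not an integer. This is only a minimal sketch: `find_bad_exit_rows` is a hypothetical helper, not part of Segway, and it assumes eight whitespace-separated columns per row (job ID first, job name second, exit status last) with no header line.

```python
def find_bad_exit_rows(text, n_columns=8):
    """Return (job_name, exit_status) for rows whose exit status is not an integer.

    Assumption: each row has n_columns whitespace-separated fields, with the
    job name in the second field and the exit status in the last one.
    """
    bad = []
    for line in text.splitlines():
        # Cap the number of splits so a status that contains a space
        # (such as "unknown signal?!") stays in a single field.
        fields = line.split(None, n_columns - 1)
        if len(fields) < n_columns:
            continue  # blank or incomplete line
        job_name, exit_status = fields[1], fields[-1].strip()
        try:
            int(exit_status)
        except ValueError:
            bad.append((job_name, exit_status))
    return bad


# Two rows copied from the report above, used as sample input.
TAB_SAMPLE = """\
105820  vit448.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1983410 0   0   0
105587  vit401.20190507-1553_3.identifydir.e6ceabaa710b11e9b532525400626261 gmtkViterbi 10  1993960 0   0   unknown signal?!
"""

for name, status in find_bad_exit_rows(TAB_SAMPLE):
    print(name, repr(status))
```

If the real file starts with a header line, that line would need to be skipped first, since its last field is not an integer either.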

EricR86 commented 5 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


This is likely not a Segway-specific bug, nor even a DRMAA for Slurm bug, since neither codebase contains any string with “unknown” or “signal”. It is more likely to be cluster- or OS-specific.

EricR86 commented 5 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


@ericr86 But aren’t the SLURM host machines just running CentOS 7?

Also, sacct says the job had a ‘0:0’ exit status:

rachelc@mordorlogin1: log$ sacct -j 105932
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
105932       vit1365.2+ hoffmangr+ hoffmangr+          1  COMPLETED      0:0 
105932.batch      batch            hoffmangr+          1  COMPLETED      0:0

A job that Segway reports as having a 0 exit status looks identical:

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
105820       vit448.20+ hoffmangr+ hoffmangr+          1  COMPLETED      0:0 
105820.batch      batch            hoffmangr+          1  COMPLETED      0:0 
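For reference, Slurm documents the ExitCode field as the exit code, a colon, and then the number of the signal that terminated the process (0 if none). A minimal sketch of reading that field from output like the above follows; `parse_sacct_exit_codes` is a hypothetical helper, and it assumes the table layout shown here, with ExitCode as the last column.

```python
def parse_sacct_exit_codes(text):
    """Map each JobID to an (exit_code, signal) tuple from sacct table output.

    Assumption: ExitCode is the last whitespace-separated field on each row.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        code, sep, signal = fields[-1].partition(":")
        # Header and separator lines have no "<int>:<int>" in the last field.
        if not (sep and code.isdigit() and signal.isdigit()):
            continue
        result[fields[0]] = (int(code), int(signal))
    return result


# Output copied from the sacct call above, used as sample input.
SACCT_SAMPLE = """\
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
105932       vit1365.2+ hoffmangr+ hoffmangr+          1  COMPLETED      0:0
105932.batch      batch            hoffmangr+          1  COMPLETED      0:0
"""

print(parse_sacct_exit_codes(SACCT_SAMPLE))
```

Under these assumptions, both steps of job 105932 come back as exit code 0 with signal 0, which is the same as for the job Segway recorded normally.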

EricR86 commented 5 years ago

Original comment by Rachel Chan (Bitbucket: rcwchan).


I also want to point out that when comparing jobs.identify.tab files from SGE and from SLURM, it becomes apparent that some columns are not being reported properly:

SGE:
5529340 vit412.20190421-0305_3.identifydir.82380ade643f11e9ab5b5254004fdc0a gmtkViterbi 10  1993960 1323479040.0000 51.0422 0

SLURM:
137679  vit404.20190508-1026_3.identifydir.bc5d517c71ac11e998855254009ae54a gmtkViterbi 10  1993960 0   0   unknown signal?!

Specifically, it seems like the final 3 columns (maxvmem, cpu, and exit_status) are all being reported incorrectly.

Note that when running locally, I think only maxvmem (and maybe exit_status?) is reported incorrectly:

vit34.20190508-1026_3.identifydir.eaeed436729511e9857152540005a5cf  gmtkViterbi 10  1966430 0   73.25283    0
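To put the three rows side by side, the last three columns can be pulled out relative to the program name, because the local row has no leading job ID. This is a rough sketch only: `usage_columns` is a hypothetical helper, and it assumes that exactly two columns sit between the program name and maxvmem, as in the rows quoted above.

```python
def usage_columns(row, prog="gmtkViterbi"):
    """Extract maxvmem, cpu and exit_status from one jobs.identify.tab row.

    Assumption: the program name is followed by two other columns and then
    by maxvmem, cpu and exit_status, in that order.
    """
    _, _, tail = row.partition(prog)
    # At most 4 splits, so an exit status containing a space stays whole.
    fields = tail.split(None, 4)
    maxvmem, cpu, exit_status = fields[2:5]
    return {"maxvmem": maxvmem, "cpu": cpu, "exit_status": exit_status.strip()}


# Rows copied from the comparison above.
ROWS = {
    "SGE": "5529340 vit412.20190421-0305_3.identifydir.82380ade643f11e9ab5b5254004fdc0a gmtkViterbi 10  1993960 1323479040.0000 51.0422 0",
    "SLURM": "137679  vit404.20190508-1026_3.identifydir.bc5d517c71ac11e998855254009ae54a gmtkViterbi 10  1993960 0   0   unknown signal?!",
    "local": "vit34.20190508-1026_3.identifydir.eaeed436729511e9857152540005a5cf  gmtkViterbi 10  1966430 0   73.25283    0",
}

for system, row in ROWS.items():
    print(system, usage_columns(row))
```

Anchoring on the program name rather than on a fixed column index keeps the same code usable for rows with and without a job ID.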

EricR86 commented 5 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


For now, this should be resolved in PR #110.