radical-cybertools / radical.saga

A Light-Weight Access Layer for Distributed Computing Infrastructure and Reference Implementation of the SAGA Python Language Bindings.
http://radical-cybertools.github.io/saga-python/

PBS Pro pbsnodes error #427

Closed mscook closed 5 years ago

mscook commented 9 years ago

Hi,

I'm using this example, modified slightly for our HPC resource: http://saga-python.readthedocs.org/en/latest/adaptors/saga.adaptor.pbsjob.html

Here is the output with SAGA_VERBOSE=5 (initial lines deliberately omitted):

$)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] Got initial shell prompt (6) (
$)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] running command shell:         exec /bin/sh -i
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   47] ( stty -echo ; unset HISTFILE ; exec /bin/sh -i\n)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [    1] ($)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [  100] ( unset PROMPT_COMMAND ;  unset HISTFILE ; PS1='PROMPT-$?->'; PS2=''; export PS1 PS2 2>&1 >/dev/null\n)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   10] (PROMPT-0->)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] got new shell prompt
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:52 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: which qdel
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   11] (which qdel\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   37] (/opt/pbs/default/bin/qdel\nPROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: which qsub
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   11] (which qsub\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   21] (/usr/local/bin/qsub\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   10] (PROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: qsub --version
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   15] (qsub --version\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   99] (pbs_version = PBSPro_11.3.0.121723\n/usr/local/bin/qsub: line 26: [: too many arguments\nPROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: which pbsnodes
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   15] (which pbsnodes\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   31] (/opt/pbs/default/bin/pbsnodes\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   10] (PROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: pbsnodes --version
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   19] (pbsnodes --version\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   46] (pbs_version = PBSPro_11.3.0.121723\nPROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: which qstat
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   12] (which qstat\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   28] (/opt/pbs/default/bin/qstat\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   10] (PROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: qstat --version
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   16] (qstat --version\n)
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   46] (pbs_version = PBSPro_11.3.0.121723\nPROMPT-0->)
2015:06:02 15:51:53 22576  MainThread   saga.PBSJobService    : [INFO    ] Found PBS tools: {'qdel': {'path': '/opt/pbs/default/bin/qdel', 'version': '?'}, 'qsub': {'path': '/usr/local/bin/qsub', 'version': 'pbs_version = PBSPro_11.3.0.121723\n/usr/local/bin/qsub: line 26: [: too many arguments\n'}, 'pbsnodes': {'path': '/opt/pbs/default/bin/pbsnodes', 'version': 'pbs_version = PBSPro_11.3.0.121723\n'}, 'qstat': {'path': '/opt/pbs/default/bin/qstat', 'version': 'pbs_version = PBSPro_11.3.0.121723\n'}}
2015:06:02 15:51:53 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: which aprun
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   12] (which aprun\n)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1024] (which: no aprun in (/home/uqms ... mpt/mpt-2.10/bin:/home/uqmstan)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1024] (1/bin:/usr/local/bin:/usr/bin: ... Q/Science/SCMB/Beatson/bin/vel)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1024] (vet/1.2.07/contrib/observed-in ... QC/0.10.1:/work2/UQ/Science/SC)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1023] (MB/Beatson/bin/FastQScreen/0.2 ... ipeline/0.3.2.1/scripts:/work2)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1024] (/UQ/Science/SCMB/Beatson/bin/g ... 2/UQ/Science/SCMB/Beatson/bin/)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1024] (sff_extract/0.3.0:/work2/UQ/Sc ... UQ/Science/SCMB/Beatson/bin/CO)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [ 1024] (NTIGuator/2.7.3:/work2/UQ/Scie ... n:/work2/UQ/Science/SCMB/Beats)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [  663] (on/bin/mlstBLAST:/work2/UQ/Sci ... eatson/bin/pilercr/1.06/bin)\n)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   10] (PROMPT-1->)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] flush: [    4] [     ] (flush pty read cache)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] run_sync: unset GREP_OPTIONS; /opt/pbs/default/bin/pbsnodes -a | grep -E "(np|pcpu)[[:blank:]]*="
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] write: [    4] [   88] (unset GREP_OPTIONS; /opt/pbs/default/bin/pbsnodes -a | grep -E "(np|pcpu)[[:blank:]]*="\n)
2015:06:02 15:51:54 22576  MainThread   saga.PTYShell         : [DEBUG   ] read : [    4] [   10] (PROMPT-1->)
2015:06:02 15:51:54 22576  MainThread   saga.PBSJobService    : [ERROR   ] Error running pbsnodes:
Traceback (most recent call last):
  File "test.py", line 111, in <module>
    sys.exit(main())
  File "test.py", line 104, in main
    print "An exception occured: (%s) %s " % (ex.type, (str(ex)))
  File "/Users/mscook/.Virtualenvs/Banzai/lib/python2.7/site-packages/saga/exceptions.py", line 226, in get_type
    return self._type
AttributeError: 'NoSuccess' object has no attribute '_type'
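
The traceback shows the error handler itself failing: reading `ex.type` raises `AttributeError`, which hides the underlying "Error running pbsnodes" message. The sketch below reproduces that failure mode with a hypothetical stand-in class (`NoSuccessStandIn` is not saga's real exception class, and its internals are an assumption) and shows a plain-Python way to print the exception class without relying on a custom attribute.

```python
class NoSuccessStandIn(Exception):
    """Hypothetical stand-in, NOT saga's real exception class."""

    def __init__(self, message):
        # Deliberately never sets self._type, mimicking the failure mode.
        super().__init__(message)

    @property
    def type(self):
        # Raises AttributeError because _type was never assigned.
        return self._type


def describe(ex):
    # type(ex).__name__ is plain Python and needs no custom attribute.
    return "An exception occurred: (%s) %s" % (type(ex).__name__, ex)


try:
    raise NoSuccessStandIn("Error running pbsnodes")
except NoSuccessStandIn as ex:
    try:
        ex.type
        failed = False
    except AttributeError:
        failed = True
    message = describe(ex)

print(failed)
print(message)
```

Printing with `type(ex).__name__` in the `except` block of `test.py` would at least surface the original error text while the attribute problem persists.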

marksantcroos commented 9 years ago

That should not happen :) Can you please paste the output of qstat --version and pbsnodes -a here?

mscook commented 9 years ago

(I've only been using saga-python for a few hours) so it might be me...

qstat --version

pbs_version = PBSPro_11.3.0.121723

and the bottom of

pbsnodes -a

b01a07
     Mom = b01a07.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = <various>
     pcpus = 16
     jobs = 1417589[2].paroo3/0, 1417589[2].paroo3/1, 1417589[2].paroo3/2, 1417589[2].paroo3/3, 1417589[2].paroo3/0, 1417724.paroo3/1, 1417724.paroo3/2, 1417724.paroo3/3, 1417724.paroo3/3
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b01a07
     resources_available.mem = 24602260kb
     resources_available.ncpus = 8
     resources_available.NodeType = medium
     resources_available.router = b01a07,med-04,ib-04,barrine
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 19938
     resources_available.vmem = 32794252kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241526648kb
     resources_assigned.mem = 23068672kb
     resources_assigned.ncpus = 8
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b02b35
     Mom = b02b35.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = <various>
     pcpus = <various>
     jobs = 1416574.paroo3/1, 1416574.paroo3/2, 1416574.paroo3/3, 1416574.paroo3/3, 1417623[1109].paroo3/0, 1417623[1034].paroo3/1
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b02b35
     resources_available.mem = 24600820kb
     resources_available.ncpus = 8
     resources_available.NodeType = medium
     resources_available.router = "b02b35,ib-04,barrine"
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 17984
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241524724kb
     resources_assigned.mem = 14680064kb
     resources_assigned.ncpus = 6
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b10a13
     Mom = b10a13.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = free
     pcpus = <various>
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b10a13
     resources_available.mem = 74243324kb
     resources_available.ncpus = 8
     resources_available.NodeType = large
     resources_available.router = "b10a13,ib-05,barrine"
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 40320
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241522968kb
     resources_assigned.mem = 0kb
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b07b35
     Mom = b07b35.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = free
     pcpus = <various>
     jobs = 1417623[884].paroo3/0, 1417071[9].paroo3/1, 1417623[1015].paroo3/2, 1417623[1015].paroo3/0, 1417071[9].paroo3/0, 1417627.paroo3/1
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b07b35
     resources_available.mem = 24600820kb
     resources_available.ncpus = 8
     resources_available.NodeType = medium
     resources_available.router = "b07b35,ib-05,barrine"
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 6854
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241462844kb
     resources_assigned.mem = 24117248kb
     resources_assigned.ncpus = 4
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b10b18
     Mom = b10b18.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = state-unknown,offline
     pcpus = 1
     comment = EACHAM Windows Cluster :20130312 09:51 dannys
     resv_enable = True
     sharing = default_shared
     resources_available.host = b10b18
     resources_available.mem = 0kb
     resources_available.ncpus = 0
     resources_available.NodeType = large
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 40320
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_assigned.mem = 0kb
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b10a06
     Mom = b10a06.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = free
     pcpus = <various>
     jobs = 1417387[3].paroo3/0, 1417387[3].paroo3/1, 1417387[3].paroo3/2, 1417387[3].paroo3/0
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b10a06
     resources_available.mem = 74243316kb
     resources_available.ncpus = 8
     resources_available.NodeType = large
     resources_available.router = "b10a06,ib-05,barrine"
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 8424
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 240796124kb
     resources_assigned.mem = 52428800kb
     resources_assigned.ncpus = 3
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b01a29
     Mom = b01a29.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = <various>
     pcpus = 16
     jobs = 1417589[3].paroo3/0, 1417589[3].paroo3/1, 1417589[3].paroo3/2, 1417589[3].paroo3/3, 1417589[3].paroo3/0, 1417725.paroo3/1, 1417725.paroo3/2, 1417725.paroo3/3, 1417725.paroo3/3
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b01a29
     resources_available.mem = 24602260kb
     resources_available.ncpus = 8
     resources_available.NodeType = medium
     resources_available.router = b01a29,med-04,ib-04,barrine
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 40320
     resources_available.vmem = 32794252kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241529652kb
     resources_assigned.mem = 23068672kb
     resources_assigned.ncpus = 8
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b07b02
     Mom = b07b02.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = down
     pcpus = <various>
     comment = node down: communication closed
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b07b02
     resources_available.mem = 24600820kb
     resources_available.ncpus = 8
     resources_available.NodeType = medium
     resources_available.router = "b07b02,ib-05,barrine"
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 40320
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241481804kb
     resources_assigned.mem = 0kb
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b10a04
     Mom = b10a04.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = free
     pcpus = <various>
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b10a04
     resources_available.mem = 74243324kb
     resources_available.ncpus = 8
     resources_available.NodeType = large
     resources_available.router = b10a04,larg,ib-04,barrine
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 40320
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241524956kb
     resources_assigned.mem = 0kb
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

b02b33
     Mom = b02b33.barrine.hpcu.uq.edu.au
     ntype = PBS
     state = free
     pcpus = <various>
     jobs = 1417335[7].paroo3/0, 1417335[7].paroo3/0, 1417623[1097].paroo3/1, 1417623[1076].paroo3/2
     resv_enable = True
     sharing = <various>
     resources_available.arch = linux_cpuset
     resources_available.host = b02b33
     resources_available.mem = 24600820kb
     resources_available.ncpus = 8
     resources_available.NodeType = medium
     resources_available.router = b02b33,med-04,ib-04,barrine
     resources_available.schedclass = flex,normal,large,constrain,reserv
     resources_available.schedmins = 40320
     resources_available.vmem = 0kb
     resources_available.vnode = <various>
     resources_available.accelerator_memory = 0kb
     resources_available.naccelerators = 0
     resources_available.netwins = 0
     resources_available.scratch = 241528360kb
     resources_assigned.mem = 23068672kb
     resources_assigned.ncpus = 3
     resources_assigned.vmem = 0kb
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.netwins = 0

marksantcroos commented 9 years ago

OK, I see what the problem is. I will get back to you. In the meantime you might want to look into installing from source/branch instead of PyPI (if you haven't already), so that you can test the change once I commit it.
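
The problem is not spelled out in this thread. One reading that is consistent with the log and the `pbsnodes -a` output above (an assumption, not a confirmed diagnosis) is that the logged filter `grep -E "(np|pcpu)[[:blank:]]*="` matches nothing on this system: the nodes report `pcpus = ...`, and the trailing `s` sits between `pcpu` and the `=`. The log shows the command returning no output and the prompt `PROMPT-1->`, which fits grep exiting with status 1 on no match. A minimal Python sketch of the same pattern, with `[[:blank:]]` written as `[ \t]` because Python's `re` has no POSIX classes:

```python
import re

# Same pattern as in the logged grep -E call; [[:blank:]] becomes [ \t].
LOGGED_PATTERN = re.compile(r"(np|pcpu)[ \t]*=")

# Hypothetical widened pattern, shown only for comparison. It is NOT
# taken from the project and is not claimed to be the actual fix.
WIDENED_PATTERN = re.compile(r"(np|pcpus?)[ \t]*=")

# Lines copied from the pbsnodes -a output in this thread, plus one
# Torque-style "np = 8" line added purely for contrast (an assumption).
lines = [
    "     pcpus = 16",
    "     pcpus = <various>",
    "     resources_available.ncpus = 8",
    "     np = 8",
]


def matching_lines(pattern, text_lines):
    """Return the lines that the pattern matches anywhere, like grep."""
    return [line for line in text_lines if pattern.search(line)]


logged_matches = matching_lines(LOGGED_PATTERN, lines)
widened_matches = matching_lines(WIDENED_PATTERN, lines)

print(logged_matches)
print(widened_matches)
```

With the logged pattern only the contrast line `np = 8` matches; none of the lines taken from this cluster do. Note that even a widened pattern would still have to cope with non-numeric values such as `pcpus = <various>`.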

andre-merzky commented 8 years ago

The exception._type problem has been fixed. Mark, any update on the PBS layer problem?

alexsalex commented 5 years ago

> The exception._type problem has been fixed.

What was the issue? How did you fix it?

I have the same issue in PBS Pro 14.1.2.

andre-merzky commented 5 years ago

Hi @alexsalex, the original thread is somewhat outdated by now. Would you mind opening a new issue with a description of the problem you face? Thank you!

vivek-bala commented 5 years ago

Closing this ticket as there has been no update on this thread.