radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

removing topology restriction #95

Closed euhruska closed 6 years ago

euhruska commented 6 years ago

To reduce queue time, I want to remove the topology restriction on Blue Waters. This requires adding the following line to the EnTK PBS script:

#PBS -l flags=commtransparent

Vivek mentioned a method going into RP for tweaks. Is there a simpler way?

andre-merzky commented 6 years ago

Interesting - given that RP mostly targets small units which, by definition, have no large scale inter node communication, that flag would make sense for RP in general I would think. I wasn't aware that this exists.

Please use RP release 0.5.10 and the SAGA branch feature/issue_grlsd_95, and give it a try. Thanks!

euhruska commented 6 years ago

got error:

2018-09-29 12:16:39,338: radical.entk.task_manager.0000: MainProcess                     : heartbeat      : ERROR   : Heartbeat failed with error: unsupported operand type(s) for +: 'float' and 'str'
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 146, in _heartbeat
    mq_connection.sleep(self._hb_interval)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 771, in sleep
    deadline = time.time() + duration
TypeError: unsupported operand type(s) for +: 'float' and 'str'
Exception in thread heartbeat:
Traceback (most recent call last):
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 146, in _heartbeat
    mq_connection.sleep(self._hb_interval)
  File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 771, in sleep
    deadline = time.time() + duration
TypeError: unsupported operand type(s) for +: 'float' and 'str'

radical-stack:

  python               : 2.7.14
  pythonpath           :
  virtualenv           : extasy16

  radical.analytics    : v0.50.0-10-g76b5950@devel
  radical.entk         : 0.7.6-0.7.6@devel
  radical.pilot        : 0.50.9-v0.50.9-9-ga4ae131d@devel
  radical.utils        : 0.50.1-v0.50.1-3-g2b7f6c6@devel
  saga                 : 0.50.0-v0.50.0-1-gdeb47812@feature-issue_grlsd_95

andre-merzky commented 6 years ago

This seems to be an unrelated issue in the Pika/RabbitMQ layer. @vivek-bala, any idea what's up?
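For readers hitting the same traceback: the failing line in pika is `deadline = time.time() + duration`, so the error means the value passed to `sleep()` (here `self._hb_interval`) was a string rather than a number. The thread does not say where that string came from. The sketch below only reproduces the failure and shows one possible guard; the helper names `deadline_after` and `as_seconds` are hypothetical and are not part of EnTK or pika.

```python
import time


def deadline_after(duration):
    # Same expression as the failing line in the traceback.
    return time.time() + duration


def as_seconds(value):
    # Hypothetical guard: coerce an interval that may have been read
    # as text (for example from a config file or environment variable).
    return float(value)


try:
    deadline_after("30")  # a string interval reproduces the TypeError
except TypeError as exc:
    error_message = str(exc)

# Converting first makes the same call succeed.
fixed_deadline = deadline_after(as_seconds("30"))
```

Whether the conversion belongs in EnTK or in the code that sets the interval is a design decision this sketch does not settle.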

euhruska commented 6 years ago

How do I actually check whether the submitted PBS script is correct?

andre-merzky commented 6 years ago

export RADICAL_SAGA_VERBOSE=DEBUG
export RADICAL_SAGA_LOG_TGT=rs.log

The resulting log should contain the PBS script somewhere - see here

euhruska commented 6 years ago

My rs.log is useless because several EnTK runs execute at the same time and all overwrite the same file. It would be nice if the PBS script were kept somewhere more accessible. Where is the PBS script on the remote system?

andre-merzky commented 6 years ago

With export RADICAL_SAGA_LOG_TGT=rs_123.log, you can choose a different file name for each run, or you could run a separate test. Also, the writes to the logfile should be fairly atomic, so you have a decent chance of finding the script: search for the string Generated PBS script.

The script is not logged separately, and is also not kept on the target side.
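The two suggestions above (one log file per run, then a search for the marker string) can be combined in a small helper. This is a minimal sketch: the environment variable names and the marker `Generated PBS script` come from this thread, while the function names, the `run_id` argument, and the number of context lines are assumptions made for illustration. It also assumes the variables are set before the RADICAL modules set up their logging.

```python
import os


def configure_saga_log(run_id):
    # Point each run at its own log file so that concurrent runs
    # do not overwrite each other (run_id is a hypothetical label).
    log_name = "rs_{}.log".format(run_id)
    os.environ["RADICAL_SAGA_VERBOSE"] = "DEBUG"
    os.environ["RADICAL_SAGA_LOG_TGT"] = log_name
    return log_name


def find_pbs_script(lines, marker="Generated PBS script", context=20):
    # Return the first line containing the marker plus up to
    # `context` following lines; an empty list if it is absent.
    lines = [line.rstrip("\n") for line in lines]
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index:index + context + 1]
    return []
```

After a run, something like `with open("rs_123.log") as fh: print("\n".join(find_pbs_script(fh)))` would print the logged script, provided the marker appears in the log.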

euhruska commented 6 years ago

Got this in the SAGA log target:

2018-09-30 15:14:40,063: radical.saga.cpi    : pmgr.0000.launching.0           : Thread-3       : ERROR   : Error running job via 'qsub':
ERROR: Multiple node types 'nodes=10:ppn=16:xk:flags=commtransparent' is requested in a single 'nodes=...' statement,
if you are requesting for multiple node types, please use a '+' sign between node types.
Example: "qsub -l nodes=23:ppn=16:xk+40:ppn=32:xe"

qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.

andre-merzky commented 6 years ago

This is unexpected, in that this seems unrelated to the additional flag. I am not sure what we are missing here. @euhruska, you had a contact at NCSA who recommended using that flag, right? Do you mind asking them about the expected qsub command formatting? Alternatively, would you please forward me the email address, and I'll check with them. Thanks!

euhruska commented 6 years ago

I just sent an email, but maybe it's supposed to be literally a new line only having this: #PBS -l flags=commtransparent

euhruska commented 6 years ago

answer:

submit it as a separate property and not after the resource "xk" specifier. 
qsub -I -l nodes=10:ppn=16:xk -l flags=commtransparent

euhruska commented 6 years ago

any update?

andre-merzky commented 6 years ago

I fixed the branch, please do give it another try!
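For reference, the rejected request appended the flag to the node specification (`nodes=10:ppn=16:xk:flags=commtransparent`), while the form recommended by NCSA passes it as a separate `-l` resource. The sketch below illustrates that formatting rule only; it is not the code in the SAGA branch, and the function names are hypothetical.

```python
def qsub_resource_args(nodes, ppn, node_type, flags=None):
    # Node specification and flags go into separate "-l" options,
    # as in: qsub -l nodes=10:ppn=16:xk -l flags=commtransparent
    args = ["-l", "nodes={}:ppn={}:{}".format(nodes, ppn, node_type)]
    if flags:
        args += ["-l", "flags={}".format(flags)]
    return args


def pbs_directives(nodes, ppn, node_type, flags=None):
    # The same resources written as "#PBS" lines for a batch script.
    args = qsub_resource_args(nodes, ppn, node_type, flags)
    pairs = zip(args[0::2], args[1::2])
    return ["#PBS {} {}".format(opt, value) for opt, value in pairs]
```

For example, `pbs_directives(10, 16, "xk", "commtransparent")` yields the two lines `#PBS -l nodes=10:ppn=16:xk` and `#PBS -l flags=commtransparent`.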

euhruska commented 6 years ago

no issues encountered, works