euhruska closed 6 years ago
Interesting - given that RP mostly targets small units which, by definition, have no large scale inter node communication, that flag would make sense for RP in general I would think. I wasn't aware that this exists.
Please use RP release 0.5.10 and the SAGA branch feature/issue_grlsd_95, and give it a try. Thanks!
got error:
2018-09-29 12:16:39,338: radical.entk.task_manager.0000: MainProcess : heartbeat : ERROR : Heartbeat failed with error: unsupported operand type(s) for +: 'float' and 'str'
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 146, in _heartbeat
mq_connection.sleep(self._hb_interval)
File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 771, in sleep
deadline = time.time() + duration
TypeError: unsupported operand type(s) for +: 'float' and 'str'
Exception in thread heartbeat:
Traceback (most recent call last):
File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/radical/entk/execman/base/task_manager.py", line 146, in _heartbeat
mq_connection.sleep(self._hb_interval)
File "/scratch1/eh22/conda/envs/extasy16/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 771, in sleep
deadline = time.time() + duration
TypeError: unsupported operand type(s) for +: 'float' and 'str'
radical-stack:
python : 2.7.14
pythonpath :
virtualenv : extasy16
radical.analytics : v0.50.0-10-g76b5950@devel
radical.entk : 0.7.6-0.7.6@devel
radical.pilot : 0.50.9-v0.50.9-9-ga4ae131d@devel
radical.utils : 0.50.1-v0.50.1-3-g2b7f6c6@devel
saga : 0.50.0-v0.50.0-1-gdeb47812@feature-issue_grlsd_95
this seems to be an unrelated issue on the Pika/RabbitMQ layer. @vivek-bala , any idea what's up?
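For reference, the traceback boils down to `time.time() + duration` being called with a string. The sketch below reproduces that and shows one defensive fix. It assumes the heartbeat interval reached `sleep()` as a string (for example, read from a configuration value without conversion); the thread does not confirm that cause, and `compute_deadline`, `as_seconds` and `raw_interval` are hypothetical names, not EnTK or Pika code.

```python
import time


def compute_deadline(duration):
    """Mimic the failing line from the traceback: time.time() + duration."""
    return time.time() + duration


def as_seconds(value, default=30.0):
    """Coerce an interval that may arrive as a string into a float.

    Falls back to `default` (an arbitrary value chosen for this sketch)
    when the input cannot be converted.
    """
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


# Hypothetical input: an interval that was never converted from text.
raw_interval = "10"

try:
    compute_deadline(raw_interval)
except TypeError as exc:
    # Same class of error as in the log above.
    print("reproduced:", exc)

# Converting before use avoids the error.
deadline = compute_deadline(as_seconds(raw_interval))
print("deadline is %.1f seconds from now" % (deadline - time.time()))
```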
how do I actually check the submitted pbs script if it's correct?
export RADICAL_SAGA_VERBOSE=DEBUG
export RADICAL_SAGA_LOG_TGT=rs.log
The resulting log should contain the PBS script somewhere - see here
my rs.log is useless, since several entk runs execute at the same time and all overwrite the same file. It would be nice if the pbs script were kept in a more easily accessible place. Where is the pbs script on the remote system?
With export RADICAL_SAGA_LOG_TGT=rs_123.log, you can choose a different file name for each run, or you could run a separate test. But also, the writes to the logfile should be fairly atomic, so you have a decent chance of finding the script: search for the string Generated PBS script.
The script is not logged separately, and is also not kept on the target side.
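Searching a large, shared log for that marker can be scripted. The following is a minimal sketch, assuming the script text follows the marker line in the log; the number of lines to print (`context`) is a guess, and `find_after_marker` is a hypothetical helper, not part of SAGA.

```python
import sys


def find_after_marker(lines, marker="Generated PBS script", context=30):
    """Return, for each line containing `marker`, that line plus the
    `context` lines that follow it."""
    lines = [line.rstrip("\n") for line in lines]
    hits = []
    for i, line in enumerate(lines):
        if marker in line:
            hits.append(lines[i:i + context + 1])
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_pbs.py rs_123.log
    with open(sys.argv[1]) as fh:
        for block in find_after_marker(fh):
            print("\n".join(block))
            print("-" * 40)
```

If several runs wrote to the same file, each match is printed separately, so the submission of interest can be picked out by its timestamp.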
got in saga log tgt
2018-09-30 15:14:40,063: radical.saga.cpi : pmgr.0000.launching.0 : Thread-3 : ERROR : Error running job via 'qsub':
ERROR: Multiple node types 'nodes=10:ppn=16:xk:flags=commtransparent' is requested in a single 'nodes=...' statement,
if you are requesting for multiple node types, please use a '+' sign between node types.
Example: "qsub -l nodes=23:ppn=16:xk+40:ppn=32:xe"
qsub: Your job has been administratively rejected by the queueing system.
qsub: There may be a more detailed explanation prior to this notice.
This is unexpected, in that this seems unrelated to the additional flag. I am not sure what we are missing here. @euhruska , you had a contact at NCSA who recommended to use that flag, right? Do you mind asking him about the expected qsub command formatting? Alternatively, would you please forward me the email address, and I'll check with them. Thanks!
I just sent an email, but maybe it's supposed to be literally a new line only having this:
#PBS -l flags=commtransparent
answer:
submit it as a separate property and not after the resource "xk" specifier.
qsub -I -l nodes=10:ppn=16:xk -l flags=commtransparent
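To illustrate the difference between the rejected and the accepted form, here is a small sketch that assembles the resource arguments with the flag as its own `-l` option rather than appended to the `nodes=` specification. `build_qsub_resources` is a hypothetical helper written for this example; it is not the code in the SAGA PBS adaptor.

```python
def build_qsub_resources(nodes, ppn, node_type=None, flags=None):
    """Build qsub resource arguments, keeping `flags` in a separate -l option."""
    spec = "nodes=%d:ppn=%d" % (nodes, ppn)
    if node_type:
        spec += ":" + node_type
    args = ["-l", spec]
    if flags:
        # A separate property, not another ':'-suffix on the nodes spec.
        args += ["-l", "flags=" + flags]
    return args


cmd = ["qsub", "-I"] + build_qsub_resources(10, 16, "xk", "commtransparent")
print(" ".join(cmd))
# prints: qsub -I -l nodes=10:ppn=16:xk -l flags=commtransparent
```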
any update?
I fixed the branch, please do give it another try!
no issues encountered, works
To reduce queue time by removing the topology restriction on Blue Waters, the following line has to be added to the entk pbs script:
#PBS -l flags=commtransparent
Vivek mentioned a method going into RP for tweaks. Is there a simpler way?