radical-cybertools / radical.benchmark

Use RCT to benchmark HTC applications on HPC resources

Test RP from DTN #2

mturilli opened this issue 7 years ago

mturilli commented 7 years ago

I did the following:

ssh <username>@dtn.ccs.ornl.gov
module load python/2.7.12 py-virtualenv/15.0.1 py-setuptools/25.2.0
virtualenv ve/test
. ve/test/bin/activate
pip install radical.pilot
git clone https://github.com/radical-cybertools/radical.pilot.git
cd radical.pilot/examples/
export RADICAL_PILOT_DBURL='mongodb://radical:<pswd>@172.31.252.169:27017/htcbenchmark'
./00_getting_started.py ornl.titan
new session: [rp.session.dtn38.mturilli1.017422.0003]                          \
database   : [mongodb://radical:2r4d1c4l@172.31.252.169:27017/htcbenchmark]  err
Traceback (most recent call last):
  File "./00_getting_started.py", line 36, in <module>
    session = rp.Session()
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/session.py", line 264, in __init__
    % (dburl, ex))  
RuntimeError: Couldn't create new session (database URL 'mongodb://radical:2r4d1c4l@172.31.252.169:27017/htcbenchmark' incorrect?): timed out

Tried to connect directly to the MongoDB container:

python
Python 2.7.12 (default, Feb  1 2017, 13:57:05) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pymongo
>>> mongo = pymongo.MongoClient(host='172.31.252.169', port=27017)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/pymongo/mongo_client.py", line 377, in __init__
    raise ConnectionFailure(str(e))
pymongo.errors.ConnectionFailure: timed out

This is likely a firewall problem.
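
A quick way to separate a network-level block from a MongoDB-level problem is a raw TCP probe: a timeout at this layer means the packets never reach the service at all, whereas a refused connection would mean the host is reachable but nothing listens on the port. A minimal sketch (not from the original session; it assumes only the standard library and the host/port from the URL above):

import socket

# Plain TCP handshake against the MongoDB service endpoint.
try:
    sock = socket.create_connection(('172.31.252.169', 27017), timeout=5)
    sock.close()
    print('TCP connect ok - the problem is above the network layer')
except socket.timeout:
    print('TCP connect timed out - consistent with a firewall/routing block')
except socket.error as e:
    print('TCP connect failed: %s' % e)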

mturilli commented 7 years ago

From the dtn, I did the following:

cd ~/src
wget https://github.com/openshift/origin/releases/download/v3.7.0-alpha.1/openshift-origin-client-tools-v3.7.0-alpha.1-fdbd3dc-linux-64bit.tar.gz 
tar xvfz openshift-origin-client-tools-v3.7.0-alpha.1-fdbd3dc-linux-64bit.tar.gz 
cd
cp -p src/openshift-origin-client-tools-v3.7.0-alpha.1-fdbd3dc-linux-64bit/oc bin/
export PATH=$PATH:~/bin
oc login https://openshift.ccs.ornl.gov:8443
oc project radical-benchmark
oc get all

That returned details of all the pods we are running on OpenShift:

NAME                        REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfigs/mongodb   1          1         1         config,image(mongodb:2.6)

NAME                 READY     STATUS    RESTARTS   AGE
po/mongodb-1-x4mq1   1/1       Running   0          1h

NAME           DESIRED   CURRENT   READY     AGE
rc/mongodb-1   1         1         1         1h

NAME          CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
svc/mongodb   172.31.252.169   <none>        27017/TCP   1h

The service's CLUSTER-IP (172.31.252.169) is presumably routable only inside the OpenShift cluster, which would explain the timeouts above. I then used oc to forward local port 27017 to the pod running mongodb:

oc port-forward mongodb-1-x4mq1 27017:27017 &
Forwarding from 127.0.0.1:27017 -> 27017
Forwarding from [::1]:27017 -> 27017
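
With the tunnel in place, the pymongo check that timed out before should now succeed against 127.0.0.1. A sanity check along these lines (not part of the original transcript) confirms the forward works; oc prints "Handling connection for 27017" for each relayed connection:

import pymongo

# Same check as before, now through the local end of the oc tunnel; with
# the era's pymongo 2.x the client connects eagerly, so constructing it
# is already a connectivity test.
mongo = pymongo.MongoClient(host='127.0.0.1', port=27017)
print(mongo)   # no ConnectionFailure this time - the tunnel is working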

Then I used localhost as the MongoDB endpoint for RP:

export RADICAL_PILOT_DBURL='mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark'

and ran the first RP example:

./00_getting_started.py ornl.titan_aprun
new session: [rp.session.dtn38.mturilli1.017422.0004]                          \
database   : [mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark]Handling connection for 27017
Handling connection for 27017
        ok
create pilot manager                                                          ok
submit 1 pilot(s)
        .                                                                     ok
--------------
RADICAL Utils -- Stacktrace [116803] [MainThread]

mturill+ 116803 110030 13 12:39 pts/3    00:00:00  |           \_ python ./00_getting_started.py ornl.titan_aprun
mturill+ 116836 116803  0 12:39 pts/3    00:00:00  |               \_ rp.control.pubsub.bridge.0000.child
mturill+ 116842 116803  0 12:39 pts/3    00:00:00  |               \_ rp.state.pubsub.bridge.0000.child
mturill+ 116848 116803  0 12:39 pts/3    00:00:00  |               \_ rp.log.pubsub.bridge.0000.child
mturill+ 116854 116803  1 12:39 pts/3    00:00:00  |               \_ rp.update.0.child
mturill+ 116876 116803  0 12:39 pts/3    00:00:00  |               \_ rp.pmgr.launching.queue.bridge.0000.child
mturill+ 116882 116803  3 12:39 pts/3    00:00:00  |               \_ rp.pmgr.launching.0.child
mturill+ 116918 116803  0 12:39 pts/4    00:00:00  |               \_ /bin/bash -i
mturill+ 116951 116803  0 12:39 pts/3    00:00:00  |               \_ rp.umgr.reschedule.pubsub.bridge.0000.child
mturill+ 116957 116803  0 12:39 pts/3    00:00:00  |               \_ rp.umgr.staging.input.queue.bridge.0000.child
mturill+ 116963 116803  0 12:39 pts/3    00:00:00  |               \_ rp.umgr.staging.output.queue.bridge.0000.child
mturill+ 116969 116803  0 12:39 pts/3    00:00:00  |               \_ rp.umgr.unschedule.pubsub.bridge.0000.child
Traceback (most recent call last):
  File "./00_getting_started.py", line 73, in <module>
    umgr = rp.UnitManager(session=session)
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 120, in __init__
    self.start(spawn=False)
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/utils/process.py", line 491, in start
    self._ru_initialize()
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/utils/process.py", line 805, in _ru_initialize
    self.ru_initialize_common()
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 530, in ru_initialize_common
    self._log)
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 152, in start_bridges
    bridge = rpu_Queue(session, bname, rpu_QUEUE_BRIDGE, bcfg_clone)
  File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 192, in __init__
    self._pqueue = mp.Queue()
  File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/__init__.py", line 218, in Queue
    return Queue(maxsize)
  File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/queues.py", line 68, in __init__
    self._wlock = Lock()
  File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/synchronize.py", line 147, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1)
  File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/synchronize.py", line 75, in __init__
    sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 28] No space left on device

--------------
closing session rp.session.dtn38.mturilli1.017422.0004                         \
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
                                                                              ok
Handling connection for 27017
+ rp.session.dtn38.mturilli1.017422.0004 (json)
Handling connection for 27017
- pilot.0000 (profiles)
Handling connection for 27017
- pilot.0000 (logfiles)
session lifetime: 2.4s                                                        ok

I.e., we use too many processes (and thus semaphores) for a DTN. Back to the drawing board: try using Titan's head node and see whether it is beefy enough for RP...
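
For context: errno 28 here most likely does not refer to disk. multiprocessing.Lock() allocates a POSIX semaphore (via _multiprocessing.SemLock, backed by shared memory), so "No space left on device" points at the node's shared-memory/semaphore resources being exhausted by RP's many bridge processes. A minimal probe, assuming only the stock multiprocessing module:

import multiprocessing

# Allocating a single Lock exercises the same SemLock path that RP's
# queue bridges hit; OSError 28 here means POSIX semaphores / shared
# memory are exhausted on the node, not the filesystem.
try:
    lock = multiprocessing.Lock()
    print('semaphore allocation ok')
except OSError as e:
    print('semaphore allocation failed: %s' % e)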

mturilli commented 7 years ago

To run a test from Titan's head node (bad boy!), I did the following:

ssh <username>@titan.ccs.ornl.gov
module load python
rm -r ve/test            # shared fs between head node and DTNs, which run a different Python version
virtualenv ve/test
. ~/ve/test/bin/activate
pip install radical.pilot
export PATH=$PATH:~/bin  # shared fs, so I can reuse oc
oc login https://openshift.ccs.ornl.gov:8443
oc port-forward mongodb-1-x4mq1 27017:27017 &
export RADICAL_PILOT_DBURL='mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark'
export LD_PRELOAD=/lib64/librt.so.1
cd github/radical.pilot/examples/
export RADICAL_SAGA_PTY_VERBOSE="DEBUG"
export RADICAL_PROFILE="True"
export RADICAL_PILOT_PROFILE="True"
export RADICAL_VERBOSE="DEBUG"
export RADICAL_LOG_TGT="/ccs/home/mturilli1/radical_rct.log"
./00_getting_started.py ornl.titan_aprun

I got the following:

================================================================================
 Getting Started (RP version 0.46.2)                                            
================================================================================

new session: [rp.session.titan-ext5.mturilli1.017422.0006]                     \
database   : [mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark]Handling connection for 27017
Handling connection for 27017
        ok
read config                                                                   ok

--------------------------------------------------------------------------------
submit pilots                                                                   

create pilot manager                                                          ok
create pilot description [ornl.titan_aprun:64]                                ok
submit 1 pilot(s)
        .                                                                     ok

--------------------------------------------------------------------------------
submit units                                                                    

create unit manager                                                           ok
add 1 pilot(s)                                                                ok
create 5 unit description(s)
        .....                                                                 ok
submit 5 unit(s)
        .....                                                                 ok

--------------------------------------------------------------------------------
gather results                                                                  

wait for 5 unit(s)
        -----                                                                 ok

--------------------------------------------------------------------------------
finalize                                                                        

closing session rp.session.titan-ext5.mturilli1.017422.0006                    \
close unit manager                                                            ok
close pilot manager                                                            \
wait for 1 pilot(s)
                                                                         timeout
Handling connection for 27017
                                                                              ok
Handling connection for 27017
+ rp.session.titan-ext5.mturilli1.017422.0006 (json)
Handling connection for 27017
+ pilot.0000 (profiles)
Handling connection for 27017
+ pilot.0000 (logfiles)
session lifetime: 49.2s                                                       ok

--------------------------------------------------------------------------------

Inspection of the logs shows that the pilot fails due to a known bug. Ready to be handed over to Andre.