mturilli opened this issue 7 years ago
From the DTN, I did the following:
cd ~/src
wget https://github.com/openshift/origin/releases/download/v3.7.0-alpha.1/openshift-origin-client-tools-v3.7.0-alpha.1-fdbd3dc-linux-64bit.tar.gz
tar xvfz openshift-origin-client-tools-v3.7.0-alpha.1-fdbd3dc-linux-64bit.tar.gz
cd
cp -p src/openshift-origin-client-tools-v3.7.0-alpha.1-fdbd3dc-linux-64bit/oc bin/
export PATH=$PATH:~/bin
oc login https://openshift.ccs.ornl.gov:8443
oc project radical-benchmark
oc get all
That returned details of all the resources we are running on OpenShift:
NAME                        REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfigs/mongodb   1          1         1         config,image(mongodb:2.6)

NAME                 READY   STATUS    RESTARTS   AGE
po/mongodb-1-x4mq1   1/1     Running   0          1h

NAME           DESIRED   CURRENT   READY   AGE
rc/mongodb-1   1         1         1       1h

NAME          CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
svc/mongodb   172.31.252.169   <none>        27017/TCP   1h
I then used oc to forward the local port 27017 to the corresponding port of the pod on which we run mongodb:
oc port-forward mongodb-1-x4mq1 27017:27017 &
Forwarding from 127.0.0.1:27017 -> 27017
Forwarding from [::1]:27017 -> 27017
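As a quick sanity check (not part of the original steps), one can verify that the forwarded port really reaches the mongodb pod before pointing RP at it. A minimal pymongo sketch, assuming pymongo is available in the virtualenv and reusing the same credentials as the DBURL below:

#!/usr/bin/env python
# Sanity check: confirm the oc port-forward reaches the mongodb pod.
# Uses the same URL that RADICAL_PILOT_DBURL is set to below.
import pymongo

url    = 'mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark'
client = pymongo.MongoClient(url)

# 'ping' is a cheap server command; it fails quickly if the tunnel is down
client['htcbenchmark'].command('ping')
print('mongodb reachable through the forwarded port')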
I then used localhost as the mongodb endpoint for RP:
export RADICAL_PILOT_DBURL='mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark'
and ran the first example with RP:
./00_getting_started.py ornl.titan_aprun
new session: [rp.session.dtn38.mturilli1.017422.0004] \
database : [mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark]Handling connection for 27017
Handling connection for 27017
ok
create pilot manager ok
submit 1 pilot(s)
. ok
--------------
RADICAL Utils -- Stacktrace [116803] [MainThread]
mturill+ 116803 110030 13 12:39 pts/3 00:00:00 | \_ python ./00_getting_started.py ornl.titan_aprun
mturill+ 116836 116803 0 12:39 pts/3 00:00:00 | \_ rp.control.pubsub.bridge.0000.child
mturill+ 116842 116803 0 12:39 pts/3 00:00:00 | \_ rp.state.pubsub.bridge.0000.child
mturill+ 116848 116803 0 12:39 pts/3 00:00:00 | \_ rp.log.pubsub.bridge.0000.child
mturill+ 116854 116803 1 12:39 pts/3 00:00:00 | \_ rp.update.0.child
mturill+ 116876 116803 0 12:39 pts/3 00:00:00 | \_ rp.pmgr.launching.queue.bridge.0000.child
mturill+ 116882 116803 3 12:39 pts/3 00:00:00 | \_ rp.pmgr.launching.0.child
mturill+ 116918 116803 0 12:39 pts/4 00:00:00 | \_ /bin/bash -i
mturill+ 116951 116803 0 12:39 pts/3 00:00:00 | \_ rp.umgr.reschedule.pubsub.bridge.0000.child
mturill+ 116957 116803 0 12:39 pts/3 00:00:00 | \_ rp.umgr.staging.input.queue.bridge.0000.child
mturill+ 116963 116803 0 12:39 pts/3 00:00:00 | \_ rp.umgr.staging.output.queue.bridge.0000.child
mturill+ 116969 116803 0 12:39 pts/3 00:00:00 | \_ rp.umgr.unschedule.pubsub.bridge.0000.child
Traceback (most recent call last):
File "./00_getting_started.py", line 73, in <module>
umgr = rp.UnitManager(session=session)
File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 120, in __init__
self.start(spawn=False)
File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/utils/process.py", line 491, in start
self._ru_initialize()
File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/utils/process.py", line 805, in _ru_initialize
self.ru_initialize_common()
File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 530, in ru_initialize_common
self._log)
File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 152, in start_bridges
bridge = rpu_Queue(session, bname, rpu_QUEUE_BRIDGE, bcfg_clone)
File "/autofs/nccs-svm1_home1/mturilli1/ve/test/lib/python2.7/site-packages/radical/pilot/utils/queue.py", line 192, in __init__
self._pqueue = mp.Queue()
File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/__init__.py", line 218, in Queue
return Queue(maxsize)
File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/queues.py", line 68, in __init__
self._wlock = Lock()
File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/synchronize.py", line 147, in __init__
SemLock.__init__(self, SEMAPHORE, 1, 1)
File "/autofs/nccs-svm1_sw/dtn-rhel7/.spack/opt/spack/20170112/linux-rhel7-x86_64/gcc-4.8.5/python-2.7.12-fr6qk4ysignpfff3rzsdg5dibwc7smbg/lib/python2.7/multiprocessing/synchronize.py", line 75, in __init__
sl = self._semlock = _multiprocessing.SemLock(kind, value, maxvalue)
OSError: [Errno 28] No space left on device
--------------
closing session rp.session.dtn38.mturilli1.017422.0004 \
close pilot manager \
wait for 1 pilot(s)
timeout
ok
Handling connection for 27017
+ rp.session.dtn38.mturilli1.017422.0004 (json)
Handling connection for 27017
- pilot.0000 (profiles)
Handling connection for 27017
- pilot.0000 (logfiles)
session lifetime: 2.4s ok
I.e., we use too many processes for a DTN. Back to the drawing board: try the Titan head node and see whether it is beefy enough for RP...
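The OSError above comes from _multiprocessing.SemLock, i.e. from creating a POSIX semaphore, which typically fails with ENOSPC when the node's shared-memory/semaphore resources are exhausted. A small diagnostic sketch (my addition, not something RP ships) to gauge how much headroom a DTN login node leaves before that limit is hit:

#!/usr/bin/env python
# Diagnostic sketch: allocate multiprocessing locks (one POSIX semaphore
# each, the same primitive that failed in the traceback above) until the
# node runs out of resources.
import errno
import multiprocessing as mp

locks   = []
MAX_TRY = 100000                      # hard cap so the probe always terminates
try:
    for _ in range(MAX_TRY):
        locks.append(mp.Lock())       # each Lock() creates one SemLock
    print('allocated %d semaphores without hitting a limit' % len(locks))
except OSError as e:
    if e.errno == errno.ENOSPC:
        print('ENOSPC after %d semaphores' % len(locks))
    else:
        raise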
To run a test from Titan's head node (bad boy!), I did the following:
ssh <username>@titan.ccs.ornl.gov
module load python
rm -r ve/test (shared fs between head node and DTNs, different version of Python)
virtualenv ve/test
. ~/ve/test/bin/activate
pip install radical.pilot
export PATH=$PATH:~/bin (shared fs so I can reuse oc)
oc login https://openshift.ccs.ornl.gov:8443
oc port-forward mongodb-1-x4mq1 27017:27017 &
export RADICAL_PILOT_DBURL='mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark'
export LD_PRELOAD=/lib64/librt.so.1
cd github/radical.pilot/examples/
export RADICAL_SAGA_PTY_VERBOSE="DEBUG"
export RADICAL_PROFILE="True"
export RADICAL_PILOT_PROFILE="True"
export RADICAL_VERBOSE="DEBUG"
export RADICAL_LOG_TGT="/ccs/home/mturilli1/radical_rct.log"
./00_getting_started.py ornl.titan_aprun
I got the following:
================================================================================
Getting Started (RP version 0.46.2)
================================================================================
new session: [rp.session.titan-ext5.mturilli1.017422.0006] \
database : [mongodb://radical:2r4d1c4l@127.0.0.1:27017/htcbenchmark]Handling connection for 27017
Handling connection for 27017
ok
read config ok
--------------------------------------------------------------------------------
submit pilots
create pilot manager ok
create pilot description [ornl.titan_aprun:64] ok
submit 1 pilot(s)
. ok
--------------------------------------------------------------------------------
submit units
create unit manager ok
add 1 pilot(s) ok
create 5 unit description(s)
..... ok
submit 5 unit(s)
..... ok
--------------------------------------------------------------------------------
gather results
wait for 5 unit(s)
----- ok
--------------------------------------------------------------------------------
finalize
closing session rp.session.titan-ext5.mturilli1.017422.0006 \
close unit manager ok
close pilot manager \
wait for 1 pilot(s)
timeout
Handling connection for 27017
ok
Handling connection for 27017
+ rp.session.titan-ext5.mturilli1.017422.0006 (json)
Handling connection for 27017
+ pilot.0000 (profiles)
Handling connection for 27017
+ pilot.0000 (logfiles)
session lifetime: 49.2s ok
--------------------------------------------------------------------------------
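For context while reading the output above, the example roughly performs the following RP calls. This is a sketch from memory of the 0.46-era API (pilot runtime and unit executable are placeholder values); the script in github/radical.pilot/examples/ is the authoritative version:

#!/usr/bin/env python
# Approximate structure of 00_getting_started.py (0.46-era RP API).
import radical.pilot as rp

session = rp.Session()                      # reads RADICAL_PILOT_DBURL
try:
    pmgr = rp.PilotManager(session=session)

    pdesc          = rp.ComputePilotDescription()
    pdesc.resource = 'ornl.titan_aprun'     # resource label from the command line
    pdesc.cores    = 64                     # matches '[ornl.titan_aprun:64]' above
    pdesc.runtime  = 30                     # minutes (placeholder)
    pilot = pmgr.submit_pilots(pdesc)

    umgr = rp.UnitManager(session=session)
    umgr.add_pilots(pilot)

    cuds = []
    for _ in range(5):                      # '5 unit description(s)' above
        cud            = rp.ComputeUnitDescription()
        cud.executable = '/bin/date'        # placeholder workload
        cuds.append(cud)

    umgr.submit_units(cuds)
    umgr.wait_units()
finally:
    session.close()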
Inspection of the logs shows that the pilot fails due to a known bug. Ready to be handed over to Andre.
I did the following: tried to connect directly to the mongodb container. Likely this is a problem with a firewall.
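To tell a firewall drop from a plain "connection refused", a bare TCP probe with a short timeout is usually enough. A minimal sketch; the default host below is a placeholder and should be replaced with whatever endpoint the mongodb container is expected to expose:

#!/usr/bin/env python
# Reachability probe: a timeout suggests packets are being dropped
# (firewall), a refusal suggests the port is reachable but closed.
import socket
import sys

host = sys.argv[1] if len(sys.argv) > 1 else 'openshift.ccs.ornl.gov'   # placeholder
port = int(sys.argv[2]) if len(sys.argv) > 2 else 27017

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5)
try:
    s.connect((host, port))
    print('%s:%d accepts TCP connections' % (host, port))
except socket.timeout:
    print('%s:%d timed out -- consistent with a firewall dropping traffic' % (host, port))
except socket.error as e:
    print('%s:%d refused/unreachable: %s' % (host, port, e))
finally:
    s.close()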