radical-collaboration / extasy-grlsd

Repository to hold the input data and scripts for the ExTASY gromacs-lsdmap work

run out of memory on bluewaters #44

Closed euhruska closed 6 years ago

euhruska commented 6 years ago

I increased the number of units to 1000 and I get the following error message:

radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017558.0002/pilot.0000/unit.001000/unit.001000.sh: fork: No space left on device

Local logfile: rp.session.leonardo.rice.edu.eh22.017558.0002.zip Remote logfile (I zipped only one unit to reduce the size from 2G): rp.session.leonardo.rice.edu.eh22.017558.0002-remote.zip

I found the same error message on page 5 of this PDF, which claims there is a memory limit on the MOM node on bluewaters: https://bluewaters.ncsa.illinois.edu/c/document_library/get_file?uuid=7013c401-80ba-4c52-b377-50d2fa4da8e1&groupId=10157

My question now is: am I able to launch 1000 units on bluewaters if each generates 16M of data?

andre-merzky commented 6 years ago

This is likely not a memory issue, but a process limit. It seems you are using the resource tag ncsa.bw_aprun, is that correct? Please give ncsa.bw a try, which will use the ORTE backend.
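
A quick way to tell a process limit apart from a genuine lack of memory or disk space is to compare the per-user limit with the number of processes currently running on the node that spawns the units. The sketch below uses only standard tools (ulimit, ps, id); the function name report_process_usage is made up for illustration and is not part of RADICAL-Pilot.

```shell
#!/bin/bash
# Sketch: compare the per-user process limit with the current process count.
# report_process_usage is a hypothetical helper, not an RP command.
report_process_usage() {
    local limit count
    limit=$(ulimit -u 2>/dev/null)      # a number, or "unlimited"
    count=$(ps -u "$(id -u)" -o pid= 2>/dev/null | wc -l | tr -d ' ')
    echo "max user processes: ${limit:-unknown}"
    echo "current processes : ${count}"
}

report_process_usage
```

If the count is close to the limit, fork failures are to be expected regardless of free memory. On Linux the limit also covers threads, so the real headroom can be smaller than the process count suggests.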

euhruska commented 6 years ago

I got an error in bootstrap_1.out (nothing in bootstrap_1.err):

################################################################################
## Searching for available TCP port for tunnel in range 23000..23100.
## Found available port: 23000
0.0557,tunnel_setup_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
PYTHON: /sw/bw/bwpy/mnt/bin/python
PIP   : /sw/bw/bwpy/mnt/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/../cacert.pem
0.1232,ve_setup_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
virtenv_create   : TRUE
virtenv_update   : FALSE
rp install sources:  radical.utils-0.47/ saga-python-0.47/ radical.pilot-0.47.1/
rp install target : SANDBOX
rp install lock   : FALSE
virtenv /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1 exists
2.7280,ve_activate_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
PYTHON: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python
PIP   : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/../cacert.pem
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
PYTHON INTERPRETER: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python
PYTHON_VERSION    :
VE_MOD_PREFIX     :
PIP installer     : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/../cacert.pem
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
PIP version       :
activated virtenv
VIRTENV      : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1
VE_MOD_PREFIX: ///////
RP_MOD_PREFIX: ///////
PYTHONPATH   : ///////:/opt/xalt/0.7.6/sles11.3/libexec:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/lib64/py
2.9590,ve_activate_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
do not update virtenv /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1
2.9863,rp_install_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
Using RADICAL-Pilot install sources ' radical.utils-0.47/ saga-python-0.47/ radical.pilot-0.47.1/'
VE_MOD_PREFIX: ///////
VIRTENV      : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1
SANDBOX      : /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000
VE_LOC_PREFIX:
using local install tree
PYTHONPATH: ///////::/opt/xalt/0.7.6/sles11.3/libexec:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/lib64/py
rp_install: ///////
radicalmod: ////////radical/
mkdir: cannot create directory `////////radical//': Read-only file system
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/bootstrap_1.sh: line 1225: ////////radical//__init__.py: No such file or directory
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/bootstrap_1.sh: line 1226: ////////radical//__init__.py: No such file or directory
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/bootstrap_1.sh: line 1227: ////////radical//__init__.py: No such file or directory
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/bootstrap_1.sh: line 1228: ////////radical//__init__.py: No such file or directory
created radical namespace in ////////radical//__init__.py

# -------------------------------------------------------------------
#
# update radical.utils-0.47/ via pip
# cmd: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/../cacert.pem install  --src '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/src' --build '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/build' --install-option='--prefix=/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install' --no-deps radical.utils-0.47/
#
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
#
# ERROR
# no fallback command available
#
# -------------------------------------------------------------------
Couldn't install radical.utils-0.47/! Lets see how far we get ...
purge install source at radical.utils-0.47/

# -------------------------------------------------------------------
#
# update saga-python-0.47/ via pip
# cmd: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/../cacert.pem install  --src '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/src' --build '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/build' --install-option='--prefix=/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install' --no-deps saga-python-0.47/
#
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
#
# ERROR
# no fallback command available
#
Couldn't install saga-python-0.47/! Lets see how far we get ...
purge install source at saga-python-0.47/

# -------------------------------------------------------------------
#
# update radical.pilot-0.47.1/ via pip
# cmd: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/../cacert.pem install  --src '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/src' --build '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/build' --install-option='--prefix=/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install' --no-deps radical.pilot-0.47.1/
#
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
#
# ERROR
# no fallback command available
#
# -------------------------------------------------------------------
Couldn't install radical.pilot-0.47.1/! Lets see how far we get ...
purge install source at radical.pilot-0.47.1/
4.6689,rp_install_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
4.6797,ve_setup_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
4.6905,ve_activate_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
which: no radical-pilot-agent in (/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017560.0001/pilot.0000/rp_install/bin:/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin:/mnt/bwpy/single/bin:/mnt/bwpy/single/usr/bin:/sw/bw/bwpy/mnt/bin:/opt/bwpy/bin:/opt/cray/pmi/5.0.10-1.0000.11050.179.3.gem/bin:/opt/gcc/4.9.3/bin:/sw/xe/darshan/3.1.3/darshan-3.1.3/bin:/sw/EasyBuild/software/gnuplot/5.0.5/bin:/sw/admin/scripts:/sw/user/scripts:/opt/xalt/0.7.6/sles11.3/libexec:/opt/xalt/0.7.6/sles11.3/bin:/opt/moab/9.0.2/sbin:/opt/torque/6.0.4/sbin:/opt/torque/6.0.4/bin:/opt/cray/mpt/7.5.0/gni/bin:/opt/cray/craype/2.5.8/bin:/opt/cray/llm/default/bin:/opt/cray/llm/default/etc:/opt/cray/xpmem/0.1-2.0502.64982.5.3.gem/bin:/opt/cray/ugni/6.0-1.0502.10863.8.28.gem/bin:/opt/cray/udreg/2.3.2-1.0502.10518.2.17.gem/bin:/opt/cray/lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.24.1-1.0502.21704.63.1/sbin:/opt/cray/lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.24.1-1.0502.21704.63.1/bin:/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/sbin:/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/bin:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/bin:/opt/cray/nodestat/2.2-1.0502.60539.1.31.gem/bin:/opt/modules/3.2.10.5/bin:/opt/moab/9.0.2/bin:/u/sciteam/hruska/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:.:/usr/lib/qt3/bin:/opt/cray/bin)
verify python viability: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python .../mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
 failed
python installation (/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python) is not usable - abort
kill: no process ID specified
Try `kill --help' for more information.
vivek-bala commented 6 years ago

Can you remove /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1 and try again?

Is this with ncsa.bw ? @andre-merzky : doesn't using orte mean that the execution kernels that use mpi need to be recompiled with openmpi that RP uses? or is that not the case anymore?

euhruska commented 6 years ago

failed with

# unpacking virtualenv tgz
# cmd: tar zxmf 'virtualenv-1.9.tar.gz'
#
#
# SUCCESS
#
# -------------------------------------------------------------------

# -------------------------------------------------------------------
#
# Create virtualenv
# cmd: /sw/bw/bwpy/mnt/bin/python virtualenv-1.9/virtualenv.py /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1
#
New python executable in /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/bin/python
Installing setuptools............done.
Installing pip...............done.
#
# ERROR
# no fallback command available
#
# -------------------------------------------------------------------
ERROR: Couldn't create virtualenv
Error on virtenv creation -- abort
removed `/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1.lock'
kill: no process ID specified
Try `kill --help' for more information.
vivek-bala commented 6 years ago

Hmmm. The logs might be similar to the ones you posted initially. But just in case, could you upload the client and remote logs again?

andre-merzky commented 6 years ago

Can you please remove /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw.0.47.1/ and try again? Seems like a binary incompatible python update got activated on BW which screwed up the VE...

vivek-bala commented 6 years ago

@andre-merzky that's what I suggested above :) The last failure reported by Eugen is just after that. Heads-up on a similar ticket from Srinivas.

andre-merzky commented 6 years ago

Ah, sorry, I missed that - thanks! Let's see where removing the VE gets us, I'll check Srinivas' ticket after that.

euhruska commented 6 years ago

still fails rp.session.leonardo.rice.edu.eh22.017560.0004-remote.zip rp.session.leonardo.rice.edu.eh22.017560.0004.zip

euhruska commented 6 years ago

now even 'ncsa.bw_aprun' generates the same error

andre-merzky commented 6 years ago

Quick note that this is being worked on, see radical-cybertools/radical.pilot/issues/1546. The culprit seems to be a mixture of a BW python update and apache-libcloud not liking our version of setuptools anymore.

euhruska commented 6 years ago

any progress?

andre-merzky commented 6 years ago

The fix (or rather workaround) waits for confirmation from Srinivas. If you have the time, can you give the instructions in radical-cybertools/radical.pilot#1546 a try?

euhruska commented 6 years ago

https://github.com/radical-cybertools/radical.pilot/issues/1546 reports that it works, but I still get the same error as before, bootstrap_1.out fails, see: rp.session.leonardo.rice.edu.eh22.017571.0001-remote.zip

andre-merzky commented 6 years ago

This is after you patched python2.7 in /scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin? Can you please run these commands and send the output:

$ cd  /scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin
$ cat python2.7
$ module load bwpy
$ source activate
$ ./python -V

Thanks!

euhruska commented 6 years ago

I got some missing libraries, how do I load them?

>>>which python
/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/python
>>>./python -V
./python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
andre-merzky commented 6 years ago

Hmm, but those are different commands :-) Did you create the new python2.7 script?

euhruska commented 6 years ago

Are 0.47 and 0.47.1 different? In https://github.com/radical-cybertools/radical.pilot/issues/1546 I see these two versions mixed. I did:

source ve.ncsa.bw_aprun.0.47/bin/activate
cd ve.ncsa.bw_aprun.0.47/bin/
mv python2.7 python2.7-exe
pwd
cat > python2.7
#!/bin/bash

exec /sw/bw/bwpy/mnt/bin/bwpy-environ -- /u/sciteam/hruska/scratch/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/python2.7-exe "$@"
^C
chmod 0755 python2.7
andre-merzky commented 6 years ago

You will need to re-create and then patch the VE for the RP version you intend to use. The above commands are indeed the commands to patch the VE.

euhruska commented 6 years ago

hm, did it for 0.47.1, which python gives /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin/python but still ./python -V gives ./python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
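
To see which shared libraries the interpreter cannot resolve, ldd can be run on the binary. A minimal sketch, assuming ldd is available on the login node; missing_libs is a hypothetical helper name, and the example call uses sh only so that the snippet runs anywhere.

```shell
#!/bin/bash
# missing_libs BINARY
# Print the shared libraries the dynamic linker cannot resolve for BINARY.
# Prints nothing when everything resolves (or when ldd is unavailable).
missing_libs() {
    ldd "$1" 2>/dev/null | grep 'not found' || true
}

# Example call; replace the argument with the VE interpreter, e.g. ./python
missing_libs "$(command -v sh)"
```

Run against the real interpreter binary in the VE, this would be expected to list libpython2.7.so.1.0, matching the error above. Comparing the output before and after module load bwpy may show whether the module alone makes the library visible.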

andre-merzky commented 6 years ago

Could you please make the VE on BW readable? I'd like to have a look, if you don't mind...

euhruska commented 6 years ago

here: /scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1

andre-merzky commented 6 years ago

sorry:

 h2ologin4  merzky  //scratch/sciteam  2   $ l -d hruska/
drwx------ 20 hruska PRAC_bamm 4096 Feb  2 16:27 hruska//
euhruska commented 6 years ago

better now?

andre-merzky commented 6 years ago

better, yes, but:

 h2ologin4  merzky  …/sciteam/hruska/radical.pilot.sandbox   $ cd ve.ncsa.bw.0.47.1/
-bash: cd: ve.ncsa.bw.0.47.1/: Permission denied

:-)

euhruska commented 6 years ago

I changed the permissions on the "aprun" one: ve.ncsa.bw_aprun.0.47.1

andre-merzky commented 6 years ago

Thanks - that explains things... Seems like your VE was different from what Srinivas and I got. We ended up with this link chain:

python -> python2 -> python2.7

where the last one was the binary which then got swapped out by the patch. You seem to have the opposite:

python2.7 -> python

and thus the fix did not do much. I have no idea why that was different.

Please try the following:

$ rm python2.7-exe python2
$ mv python python2.7-exe
$ ln -s python2.7 python2
$ ln -s python2 python

but also, the script in python2.7 misses the python executable. Please change from

exec /sw/bw/bwpy/mnt/bin/bwpy-environ -- /u/sciteam/hruska/scratch/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin "$@"

to

exec /sw/bw/bwpy/mnt/bin/bwpy-environ -- /u/sciteam/hruska/scratch/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin/python2.7-exe "$@"
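
Put together, the two steps above give a chain python -> python2 -> python2.7 in which python2.7 is a small wrapper script. The sketch below rebuilds that layout in a scratch directory so it can be tried safely; fake-environ and the echo-only python2.7-exe are stand-ins invented for the example, replacing /sw/bw/bwpy/mnt/bin/bwpy-environ and the real interpreter binary.

```shell
#!/bin/bash
# Sketch: rebuild the link chain python -> python2 -> python2.7 in a scratch
# directory, where python2.7 is a wrapper script around the real binary.
demo=$(mktemp -d)
cd "$demo" || exit 1

# Stand-in for the real interpreter binary (hypothetical; it only echoes its args).
printf '#!/bin/sh\necho "python2.7-exe called with: $*"\n' > python2.7-exe
chmod 0755 python2.7-exe

# Stand-in for bwpy-environ (assumption: drop the "--" and exec the rest).
printf '#!/bin/sh\n[ "$1" = "--" ] && shift\nexec "$@"\n' > fake-environ
chmod 0755 fake-environ

# The wrapper: full path to python2.7-exe, and "$@" in double quotes.
cat > python2.7 <<EOF
#!/bin/sh
exec "$demo/fake-environ" -- "$demo/python2.7-exe" "\$@"
EOF
chmod 0755 python2.7

# The symlink chain that ends at the wrapper.
ln -s python2.7 python2
ln -s python2 python

./python -V
```

In the real VE, the wrapper would call bwpy-environ and the full path to python2.7-exe, as in the corrected exec line above.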
euhruska commented 6 years ago

looks ok now

euhruska commented 6 years ago

now it fails in bootstrap_1.out with:

purge install source at radical.pilot-0.47.1/
18.3512,rp_install_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
18.3618,ve_setup_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
18.3722,ve_activate_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
verify python viability: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin/python ... ok
verify module viability: saga            ...Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017571.0002/pilot.0000/rp_install/lib/python2.7/site-packages/saga/__init__.py", line 8, in <module>
    import radical.utils        as ru
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017571.0002/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/__init__.py", line 11, in <module>
    from .plugin_manager import PluginManager
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017571.0002/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/plugin_manager.py", line 14, in <module>
    from .logger import get_logger
  File "/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017571.0002/pilot.0000/rp_install/lib/python2.7/site-packages/radical/utils/logger.py", line 118, in <module>
    import colorama
ImportError: No module named colorama
 failed
python installation cannot load module saga - abort
andre-merzky commented 6 years ago

This I don't understand: the VE should have installed colorama if you used the script from https://github.com/radical-cybertools/radical.pilot/blob/devel/bin/radical-pilot-create-static-ve - line 6 lists colorama as a dependency. Can you please check if you used the right script, and if that worked without an error message? Can you confirm that it used the virtualenv you patched?

euhruska commented 6 years ago

yes, says verify python viability: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1/bin/python ... ok and

VIRTENV : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47.1 (normalized)
andre-merzky commented 6 years ago

And is colorama available in that ve? (python -c "import colorama")
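
The same check can be run for several modules at once. A minimal sketch; check_modules is a hypothetical helper, and the module list in the usage comment is only an example based on the imports in the traceback above.

```shell
#!/bin/bash
# check_modules INTERPRETER MODULE...
# Try "import MODULE" with the given interpreter for each module.
# Prints one status line per module; returns non-zero if any import fails.
check_modules() {
    local py=$1 rc=0 mod
    shift
    for mod in "$@"; do
        if "$py" -c "import $mod" >/dev/null 2>&1; then
            printf '%-8s%s\n' ok "$mod"
        else
            printf '%-8s%s\n' MISSING "$mod"
            rc=1
        fi
    done
    return $rc
}

# Usage (after activating the VE), for example:
#   check_modules /path/to/ve/bin/python colorama radical.utils saga
```

A MISSING line for colorama would confirm that the VE itself lacks the package, rather than the pilot's rp_install tree.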

euhruska commented 6 years ago

no

andre-merzky commented 6 years ago

Eugen,

I updated the VE creation script to do all the patching. Please download it from https://github.com/radical-cybertools/radical.pilot/blob/dbb141745330590d4bb30190dfcfeee2d3bcc07c/bin/radicalpilot-create-static-ve . Remove all the VEs from the radical.pilot.sandbox dir (they won't work anymore anyway), and create the one you need with

radicalpilot-create-static-ve "/path/to/ve" bw

You may want to verify that the resulting ve is valid, with

$ module load bwpy
$ source "/path/to/ve/bin/activate"
$ which python
$ python -V

If that gives the expected results, a pilot agent should be able to use that VE.

Let me know how it goes!

euhruska commented 6 years ago

fails with radicalpilot-create-static-ve: line 94: exec: bwpy-environment: not found

euhruska commented 6 years ago

any idea why?

andre-merzky commented 6 years ago

Yes - please try again with https://raw.githubusercontent.com/radical-cybertools/radical.pilot/e849fcf33b3b7d6b50976507d60e4613b1002fbe/bin/radicalpilot-create-static-ve - thanks!

euhruska commented 6 years ago

Creating the environment works now, but when I run extasy I get "Cannot mount ext3 image on /dev/loop0". Details:

# -------------------------------------------------------------------
# Touching output tarballs
# -------------------------------------------------------------------
create gtod
build gtod with cc... success
0.0081,bootstrap_1_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
VIRTENV : /scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47
VIRTENV : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47 (normalized)
PYTHON: /sw/bw/bwpy/mnt/bin/python
PIP   : /sw/bw/bwpy/mnt/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/../cacert.pem
0.0723,ve_setup_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
virtenv_create   : TRUE
virtenv_update   : FALSE
rp install sources:  radical.utils-0.47-v0.47-4-gcca43d5-devel/ saga-python-0.47-v0.46-53-gb342c0c3-feature-gpu/ radical.pilot-0.47-0.47-118-gf66e2f6d-feature-gpu/
rp install target : SANDBOX
rp install lock   : FALSE
virtenv /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47 exists
0.9033,ve_activate_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
PYTHON: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/python
PIP   : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/../cacert.pem
Error: Cannot mount ext3 image on /dev/loop0 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
Error: Error disassociating image from loop device: Device or resource busy!
Error: Cannot mount ext3 image on /dev/loop1 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
PYTHON INTERPRETER: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/python
PYTHON_VERSION    :
VE_MOD_PREFIX     :
PIP installer     : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/../cacert.pem
Error: Cannot mount ext3 image on /dev/loop0 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
PIP version       :
activated virtenv
VIRTENV      : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47
VE_MOD_PREFIX: ///////
RP_MOD_PREFIX: ///////
PYTHONPATH   : ///////:/opt/xalt/0.7.6/sles11.3/libexec:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/lib64/py
1.9275,ve_activate_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
do not update virtenv /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47
1.9390,rp_install_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
Using RADICAL-Pilot install sources ' radical.utils-0.47-v0.47-4-gcca43d5-devel/ saga-python-0.47-v0.46-53-gb342c0c3-feature-gpu/ radical.pilot-0.47-0.47-118-gf66e2f6d-feature-gpu/'
VE_MOD_PREFIX: ///////
VIRTENV      : /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47
SANDBOX      : /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000
VE_LOC_PREFIX:
using local install tree
PYTHONPATH: ///////::/opt/xalt/0.7.6/sles11.3/libexec:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/lib64/py
rp_install: ///////
radicalmod: ////////radical/
mkdir: cannot create directory `////////radical//': Read-only file system
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/bootstrap_1.sh: line 1225: ////////radical//__init__.py: No such file or directory
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/bootstrap_1.sh: line 1226: ////////radical//__init__.py: No such file or directory
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/bootstrap_1.sh: line 1227: ////////radical//__init__.py: No such file or directory
/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/bootstrap_1.sh: line 1228: ////////radical//__init__.py: No such file or directory
created radical namespace in ////////radical//__init__.py

# -------------------------------------------------------------------
#
# update radical.utils-0.47-v0.47-4-gcca43d5-devel/ via pip
# cmd: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/../cacert.pem install  --src '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/src' --build '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/build' --install-option='--prefix=/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install' --no-deps radical.utils-0.47-v0.47-4-gcca43d5-devel/
#
Error: Cannot mount ext3 image on /dev/loop0 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
Error: Error disassociating image from loop device: Device or resource busy!
#
# ERROR
# no fallback command available
#
# -------------------------------------------------------------------
Couldn't install radical.utils-0.47-v0.47-4-gcca43d5-devel/! Lets see how far we get ...
purge install source at radical.utils-0.47-v0.47-4-gcca43d5-devel/

# -------------------------------------------------------------------
#
# update saga-python-0.47-v0.46-53-gb342c0c3-feature-gpu/ via pip
# cmd: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/../cacert.pem install  --src '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/src' --build '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/build' --install-option='--prefix=/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install' --no-deps saga-python-0.47-v0.46-53-gb342c0c3-feature-gpu/
#
Error: Cannot mount ext3 image on /dev/loop0 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
#
# ERROR
# no fallback command available
#
# -------------------------------------------------------------------
Couldn't install saga-python-0.47-v0.46-53-gb342c0c3-feature-gpu/! Lets see how far we get ...
purge install source at saga-python-0.47-v0.46-53-gb342c0c3-feature-gpu/

# -------------------------------------------------------------------
#
# update radical.pilot-0.47-0.47-118-gf66e2f6d-feature-gpu/ via pip
# cmd: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/pip --cert /scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/../cacert.pem install  --src '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/src' --build '/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/build' --install-option='--prefix=/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install' --no-deps radical.pilot-0.47-0.47-118-gf66e2f6d-feature-gpu/
#
Error: Cannot mount ext3 image on /dev/loop0 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
Error: Error disassociating image from loop device: Device or resource busy!
#
# ERROR
# no fallback command available
#
# -------------------------------------------------------------------
Couldn't install radical.pilot-0.47-0.47-118-gf66e2f6d-feature-gpu/! Lets see how far we get ...
purge install source at radical.pilot-0.47-0.47-118-gf66e2f6d-feature-gpu/
3.6720,rp_install_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
3.6838,ve_setup_stop,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
3.6952,ve_activate_start,bootstrap_1,MainThread,pilot.0000,PMGR_ACTIVE_PENDING,
which: no radical-pilot-agent in (/scratch/sciteam/hruska/radical.pilot.sandbox/rp.session.leonardo.rice.edu.eh22.017575.0001/pilot.0000/rp_install/bin:/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin:/mnt/bwpy/single/bin:/mnt/bwpy/single/usr/bin:/sw/bw/bwpy/mnt/bin:/opt/bwpy/bin:/opt/cray/pmi/5.0.10-1.0000.11050.179.3.gem/bin:/opt/gcc/4.9.3/bin:/sw/xe/darshan/3.1.3/darshan-3.1.3/bin:/sw/EasyBuild/software/gnuplot/5.0.5/bin:/sw/admin/scripts:/sw/user/scripts:/opt/xalt/0.7.6/sles11.3/libexec:/opt/xalt/0.7.6/sles11.3/bin:/opt/moab/9.0.2/sbin:/opt/torque/6.0.4/sbin:/opt/torque/6.0.4/bin:/opt/cray/mpt/7.5.0/gni/bin:/opt/cray/craype/2.5.8/bin:/opt/cray/llm/default/bin:/opt/cray/llm/default/etc:/opt/cray/xpmem/0.1-2.0502.64982.5.3.gem/bin:/opt/cray/ugni/6.0-1.0502.10863.8.28.gem/bin:/opt/cray/udreg/2.3.2-1.0502.10518.2.17.gem/bin:/opt/cray/lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.24.1-1.0502.21704.63.1/sbin:/opt/cray/lustre-cray_gem_s/2.5_3.0.101_0.46.1_1.0502.8871.24.1-1.0502.21704.63.1/bin:/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/sbin:/opt/cray/alps/5.2.4-2.0502.9774.31.12.gem/bin:/opt/cray/sdb/1.1-1.0502.63652.4.27.gem/bin:/opt/cray/nodestat/2.2-1.0502.60539.1.31.gem/bin:/opt/modules/3.2.10.5/bin:/opt/moab/9.0.2/bin:/u/sciteam/hruska/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:.:/usr/lib/qt3/bin:/opt/cray/bin)
verify python viability: /mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/python ...Error: Cannot mount ext3 image on /dev/loop0 (/mnt/a/sw/xe_xk_cle5.2UP02_pe2.3.0/images/bwpy/bwpy-0.3.2-20180213.img): Invalid argument!
 failed
python installation (/mnt/c/scratch/sciteam/hruska/radical.pilot.sandbox/ve.ncsa.bw_aprun.0.47/bin/python) is not usable - abort
andre-merzky commented 6 years ago

Oh for christ sake... I will have to open a BW ticket for this one I'm afraid...

andre-merzky commented 6 years ago

From BW support: "Can you get them to try again? I accidentally forgot to switch back to 20180201 while updating that image, so the image was momentarily invalid. They may have run it at just the wrong time." So, please do try again :-)

euhruska commented 6 years ago

bootstrap is ok