radical-cybertools / ExTASY


CoCo-Amber workflow on Archer only works if I use the e290 account #194

Closed CharlieLaughton closed 8 years ago

CharlieLaughton commented 9 years ago

Under ExTASY-0.2, I can run the example Amber-CoCo workflow on Archer if I use my e290 account, but not if I try to use my e280 account. In this latter case there seems to be a problem at an early stage of launching the radical pilot instance on Archer. I find in the radical.pilot.sandbox/.....-pilot.0000/ subdirectory an agent.err file which contains:

charlie@eslogin003:~/work/radical.pilot.sandbox/rp.session.marple.pharm.nottingam.ac.uk.charlie.016707.0004-pilot.0000> cat agent.err
which: no pip in (/opt/cray/llm/default/bin:/opt/cray/llm/default/etc:/opt/craylustre-cray_ari_s/2.4_3.0.80_0.5.1_1.0501.7664.16.1-1.0501.18401.34.1/sbin:/optcray/lustre-cray_ari_s/2.4_3.0.80_0.5.1_1.0501.7664.16.1-1.0501.18401.34.1/bin:opt/cray/MySQL/5.0.64-1.0000.7096.23.2/sbin:/opt/cray/MySQL/5.0.64-1.0000.7096.3.2/bin:/opt/cray/alps/5.1.1-2.0501.8471.1.1.ari/sbin:/opt/cray/alps/5.1.1-2.051.8471.1.1.ari/bin:/opt/cray/sdb/1.0-1.0501.48084.4.48.ari/bin:/opt/cray/nodestt/2.2-1.0501.47138.1.78.ari/bin:/usr/local/packages/cse/quickstart/1.0:/home/y0/y07/cse/nano/2.2.6/bin:/usr/local/packages/cse/serialJobs:/usr/local/packages/se/bolt/0.6/bin:/usr/local/packages/cse/checkDisk:/usr/local/packages/cse/checkueue:/usr/local/packages/cse/checkScript:/usr/local/packages/cse/budgets:/work/07/y07/cse/python/2.7.6-ucs4/bin:/opt/cray/mpt/7.1.1/gni/bin:/opt/pbs/12.2.401.41761/bin:/opt/cray/atp/1.7.5/bin:/opt/cray/rca/1.0.0-2.0501.48090.7.46.ari/bin/opt/cray/alps/5.1.1-2.0501.8507.1.1.ari/sbin:/opt/cray/alps/5.1.1-2.0501.8507..1.ari/bin:/opt/cray/dvs/2.4_0.9.0-1.0501.1672.2.122.ari/bin:/opt/cray/csa/3.0.-1_2.0501.47112.1.91.ari/sbin:/opt/cray/csa/3.0.0-1_2.0501.47112.1.91.ari/bin:/pt/cray/job/1.5.5-0.1_2.0501.48066.2.43.ari/bin:/opt/cray/xpmem/0.1-2.0501.4842.3.3.ari/bin:/opt/cray/dmapp/7.0.1-1.0501.8315.8.4.ari/bin:/opt/cray/pmi/5.0.6-.0000.10439.140.2.ari/bin:/opt/cray/ugni/5.0-1.0501.8253.10.22.ari/bin:/opt/cra/udreg/2.3.2-1.0501.7914.1.13.ari/bin:/opt/cray/cce/8.3.7/cray-binutils/x86_64-nknown-linux-gnu/bin:/opt/cray/cce/8.3.7/craylibs/x86-64/bin:/opt/cray/cce/8.3./cftn/bin:/opt/cray/cce/8.3.7/CC/bin:/opt/cray/craype/2.2.1/bin:/opt/cray/switc/1.0-1.0501.47124.1.93.ari/bin:/opt/cray/eslogin/eswrap/1.1.0-1.010400.915.0/bi:/opt/modules/
This is a private computing facility. Access to this system is limited to those
who have been granted access by the operating service provider on behalf of the
issuing authority and use is restricted to the purposes for which access was
granted. All access and usage are governed by the terms and conditions of acces
agreed to by all registered users and are thus subject to the provisions of the
Computer Misuse Act, 1990 under which unauthorised use is a criminal offence.

If you are not authorised to use this service you must disconnect immediately.

rm: cannot remove `/fs4/e290/shared/shared_pilot_ve_20150429.lock': No such file or directory
rm: cannot remove `/fs4/e290/shared/shared_pilot_ve_20150429.lock': No such file or directory
rm: cannot remove `/fs4/e290/shared/shared_pilot_ve_20150429.lock': No such file or directory
rm: cannot remove `/fs4/e290/shared/shared_pilot_ve_20150429.lock': No such file or directory
rm: cannot remove `/fs4/e290/shared/shared_pilot_ve_20150429.lock': No such file or directory
rm: cannot remove `/fs4/e290/shared/shared_pilot_ve_20150429.lock': No such file or directory
=>> PBS: job killed: walltime 1210 exceeded limit 1200
kill: 8414: No such process
kill: 26453: No such process
default_bootstrapper.sh: line 116: 26453 Terminated              sleep 1

Why is radical pilot trying to do anything in /fs4/e290 when I am not logged in under the e290 group and so have no access to these folders and files? Has something been hardwired that shouldn't be?

marksantcroos commented 9 years ago

> Why is radical pilot trying to do anything in /fs4/e290 when I am not logged in under the e290 group and so have no access to these folders and files? Has something been hardwired that shouldn't be?

Because of network limitations on ARCHER, we use a pre-installed virtual environment there.

We might not have tried that with users who are not in e290, but fundamentally it shouldn't be an issue as long as read access is open.
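
Whether read access is open can be checked from the affected account before launching a pilot. The following is a minimal sketch, not part of RADICAL-Pilot: the helper names `can_use_shared_ve` and `can_write_lock` are hypothetical, and the directory is taken from the paths in the agent.err log above.

```python
import os

def can_use_shared_ve(path):
    """Return True if the current user can list and enter `path`.

    Using a directory read-only needs the read bit (to list entries)
    and the execute bit (to traverse into it).
    """
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)

def can_write_lock(path):
    """Return True if the current user could create a file (e.g. a lock) in `path`."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)

if __name__ == "__main__":
    shared_dir = "/fs4/e290/shared"  # assumed location, taken from the agent.err log
    print("readable:", can_use_shared_ve(shared_dir))
    print("writable:", can_write_lock(shared_dir))
```

Run from the e280 account, a result of readable but not writable would suggest that using the environment is fine but that creating files such as a lock in that directory is not.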

There is also https://github.com/radical-cybertools/radical.pilot/issues/616, which might bite you here actually. Andre?

CharlieLaughton commented 9 years ago

My understanding is that by default the /fs4/e290/shared folder will be accessible to anyone in the e290 group, but not anyone else. I’m not sure if making it world readable/writeable would be an issue for Archer security policies – any thoughts, Iain/Elena?

andre-merzky commented 9 years ago

Ah, radical-cybertools/radical.pilot#616 might indeed be a problem. I'll try to address this later today :/

ibethune commented 9 years ago

I can confirm that other users can see the /fs4/e290/shared folder (also please remember to use the alias /work instead of /fs4). From my user ibethune (in group z01):

ibethune@eslogin005:~> ls -l /work/e290/shared
total 48
dr-xr-sr-x  6 marksant e290 4096 Jun 30  2014 shared_pilot_ve_20140630
dr-xr-sr-x  5 marksant e290 4096 Aug 29  2014 shared_pilot_ve_20140703
drwxr-sr-x  3 marksant e290 4096 Sep  4 13:01 shared_pilot_ve_20141216
dr-xr-sr-x  6 merzky   e290 4096 Feb 17  2015 shared_pilot_ve_20150217
dr-xr-sr-x  7 merzky   e290 4096 Feb 27  2015 shared_pilot_ve_20150225
drwxr-sr-x  6 merzky   e290 4096 Mar 25  2015 shared_pilot_ve_20150325
drwxr-sr-x  6 merzky   e290 4096 Apr 29 12:12 shared_pilot_ve_20150429
drwxr-sr-x  6 marksant e290 4096 Sep  4 13:47 shared_pilot_ve_20150904
drwxr-sr-x 11 merzky   e290 4096 Sep  9 13:39 shared_pilot_ve_20150909
drwxr-sr-x  6 marksant e290 4096 Sep 16 13:05 shared_pilot_ve_20150916
drwxr-sr-x  6 marksant e290 4096 Sep 24 12:39 shared_pilot_ve_20150924
drwxr-sr-x  5 merzky   e290 4096 Feb 17  2015 ve
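
The mode strings in this listing can be decoded mechanically. As a small illustrative sketch (the helper `describe_mode` is hypothetical), the last three characters give the rights of users outside the owner and the e290 group:

```python
import stat

def describe_mode(mode_string):
    """Summarise what 'other' users may do, given an `ls -l` mode string."""
    other = mode_string[7:10]  # last three characters: other-user bits
    return {
        "read": other[0] == "r",
        "write": other[1] == "w",
        "enter": other[2] in ("x", "t"),
    }

# Mode of shared_pilot_ve_20150429 in the listing above
perms = describe_mode("drwxr-sr-x")
print(perms)  # {'read': True, 'write': False, 'enter': True}

# Cross-check: directory + setgid + 0755 renders as the same string
assert stat.filemode(stat.S_IFDIR | stat.S_ISGID | 0o755) == "drwxr-sr-x"
```

So users outside e290 can list and enter these virtual environment directories but cannot modify them. The listing does not show the mode of /work/e290/shared itself.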

Any update on the multiple concurrent usage problem https://github.com/radical-cybertools/radical.pilot/issues/616?
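
For reference, one generic way to guard a shared resource against concurrent use is an atomically created lock file with a timeout. The sketch below is only an illustration under that assumption: it is not RADICAL-Pilot's bootstrapper logic, and the names `acquire_lock` and `release_lock` are hypothetical.

```python
import os
import time

def acquire_lock(lock_path, timeout=60.0, poll=1.0):
    """Try to create `lock_path` atomically; give up after `timeout` seconds.

    Returns True when the lock was acquired, False on timeout.
    A PermissionError from an unwritable directory propagates immediately
    instead of being retried until the batch job runs out of walltime.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll)

def release_lock(lock_path):
    """Remove the lock; ignore the case where it is already gone."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

The point of the design is to fail fast and visibly: a missing lock on release is tolerated, while a directory that cannot be written to raises an error straight away. Exclusive-create semantics can vary on network filesystems, so this would need checking on the target system.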

marksantcroos commented 9 years ago

On 01 Oct 2015, at 16:58, ibethune wrote:

> I can confirm that other users can see the /fs4/e290/shared folder

Thanks for checking. Based on the structure of the project trees, I assume that is intentional, to facilitate sharing across projects? (There is also a /home/e290/e290/shared for project-only sharing.)

> (also please remember to use the alias /work instead of /fs4).

Yeah, we do already: https://github.com/radical-cybertools/radical.pilot/blob/devel/src/radical/pilot/configs/resource_epsrc.json#L62

> Any update on the multiple concurrent usage problem radical-cybertools/radical.pilot#616?

I'll let Andre answer that.

andre-merzky commented 9 years ago

I finally got around to addressing this -- apologies for the delay. This is scheduled for merge now, and should make it into the upcoming release candidate (radical-cybertools/radical.pilot/pull/784).

andre-merzky commented 9 years ago

radical-cybertools/radical.pilot#784 has been merged into RP devel -- please give it a try and let us know if any problems persist. We don't have a non-e290 account to test the actual setup :/ Thanks!

ibethune commented 8 years ago

All working from both y14 and e290 (also running multiple jobs at the same time from different accounts). Please close!