radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Support for gridmuc.lrz.de #306

Closed mturilli closed 10 years ago

mturilli commented 10 years ago

What is the status of the support for gridmuc.lrz.de? Any documentation I can access?

Many thanks!

marksantcroos commented 10 years ago

On 28 Aug 2014, at 23:13 , mturilli notifications@github.com wrote:

What is the status of the support for gridmuc.lrz.de? Any documentation I can access?

It should just work [tm] :-)

The main points to note: you need to use the MongoDB instance running at mongodb://ec2-184-72-89-141.compute-1.amazonaws.com:24242/, and you need gsissh access to the machine.

Also, I don't think we currently have many compute hours there, as the allocation was only meant to get RP going.

andre-merzky commented 10 years ago

Shantenu mentioned that we should have 1M hours now, but that is to be confirmed with Helmut (who is on leave).

andre-merzky commented 10 years ago

I see the following error on pilot startup:

2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] Bootstrap command line: /bin/bash ['-l', 'default_bootstrapper.sh', "-n radicalpilot -s 5411722820a64139d3dcea2f -p 5411722a20a64139d3dcea31 -t 5 -d 10 -c 8 -v 0.19 -m ec2-184-72-89-141.compute-1.amazonaws.com:24242  -e 'source /etc/profile'  -e 'source /etc/profile.d/modules.sh'  -e 'echo module purge'  -e 'echo module load lrz'  -e 'module load python/2.7.6'  -e 'module unload mpi.ibm'  -e 'module load mpi.intel'  -e 'source /home/hpc/pr87be/di29sut/pilotve/bin/activate'  -g /home/hpc/pr87be/di29sut/pilotve  -l LOADL  -j SSH  -k MPIEXEC  -f login03 "]
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] Submitting SAGA job with description: {'Executable': '/bin/bash', 'WorkingDirectory': '/home/hpc/pr87be/di29suh/radical.pilot.sandbox/pilot-5411722a20a64139d3dcea31', 'Queue': 'test', 'WallTimeLimit': 5, 'Arguments': ['-l', 'default_bootstrapper.sh', "-n radicalpilot -s 5411722820a64139d3dcea2f -p 5411722a20a64139d3dcea31 -t 5 -d 10 -c 8 -v 0.19 -m ec2-184-72-89-141.compute-1.amazonaws.com:24242  -e 'source /etc/profile'  -e 'source /etc/profile.d/modules.sh'  -e 'echo module purge'  -e 'echo module load lrz'  -e 'module load python/2.7.6'  -e 'module unload mpi.ibm'  -e 'module load mpi.intel'  -e 'source /home/hpc/pr87be/di29sut/pilotve/bin/activate'  -g /home/hpc/pr87be/di29sut/pilotve  -l LOADL  -j SSH  -k MPIEXEC  -f login03 "], 'Error': 'AGENT.STDERR', 'Output': 'AGENT.STDOUT', 'TotalCPUCount': 8}
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [  206] ((test -d /home/hpc/pr87be/di29suh/radical.pilot.sandbox/pilot-5411722a20a64139d3dcea31 && echo -n 0) || (mkdir -p /home/hpc/pr87be/di29suh/radical.pilot.sandbox/pilot-5411722a20a64139d3dcea31 && echo -n 1)\n)
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [    1] (0)
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-0->)
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   52] ((test -d  && echo -n 0) || (mkdir -p  && echo -n 1)\n)
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [    1] (0)
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-0->)
2014:09:11 09:58:16 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   52] ((test -d  && echo -n 0) || (mkdir -p  && echo -n 1)\n)
2014:09:11 09:58:17 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [    1] (0)
2014:09:11 09:58:17 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-0->)
2014:09:11 09:58:17 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [ 1953] (SCRIPTFILE=`mktemp -t SAGA-Pyt ... IPTFILE && rm -f $SCRIPTFILE\n)
2014:09:11 09:58:17 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [  419] (\nWARNING: Number of Tasks per Node = 8\nWARNING: Note that, this is less than the number of Cores per Node (16)\nWARNING: and you will be accounted for 16 cores per node in either case!\nWARNING: Please see DOCUMENTATION for Node Allocation Policy:\nWARNING:        http://www.lrz.de/services/compute/supermuc/loadleveler .\nWARNING: \nWARNING: Ignore this WARNING if you use a reasonable number of Threads per task!\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [  247] (\nINFO: Project: pr87be\nINFO: Project's Expiration Date:    2015-12-31\nINFO: Budget:                     Total [cpuh]        Used [cpuh]      Credit [cpuh]\nINFO:                                  1000000          499 (0%)       999501 (100%)\n\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   95] (llsubmit: Processed command file through Submit Filter: "/lrz/loadl/filter/submit_filter.pl".\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   57] (llsubmit: The job "srv03-ib.473250" has been submitted.\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-0->)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   58] (/usr/bin/llq -j srv03-ib.473250 -r %st %dd %cc %jt %c %Xs\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   50] (llq: There is currently no job status to report.\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-0->)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   51] (cat $HOME/.saga/adaptors/loadl_job/srv03-ib.473250\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   99] (cat: /home/hpc/pr87be/di29suh/.saga/adaptors/loadl_job/srv03-ib.473250: No such file or directory\n)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-1->)
2014:09:11 09:58:22 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   51] (cat $HOME/.saga/adaptors/loadl_job/srv03-ib.473250\n)
2014:09:11 09:58:23 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [  109] (cat: /home/hpc/pr87be/di29suh/.saga/adaptors/loadl_job/srv03-ib.473250: No such file or directory\nPROMPT-1->)
2014:09:11 09:58:25 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   51] (cat $HOME/.saga/adaptors/loadl_job/srv03-ib.473250\n)
2014:09:11 09:58:25 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   99] (cat: /home/hpc/pr87be/di29suh/.saga/adaptors/loadl_job/srv03-ib.473250: No such file or directory\n)
2014:09:11 09:58:25 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [   10] (PROMPT-1->)
2014:09:11 09:58:29 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   51] (cat $HOME/.saga/adaptors/loadl_job/srv03-ib.473250\n)
2014:09:11 09:58:29 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [  109] (cat: /home/hpc/pr87be/di29suh/.saga/adaptors/loadl_job/srv03-ib.473250: No such file or directory\nPROMPT-1->)
2014:09:11 09:58:37 radical.pilot.MainProcess: [DEBUG   ] write: [   16] [   51] (cat $HOME/.saga/adaptors/loadl_job/srv03-ib.473250\n)
2014:09:11 09:58:37 radical.pilot.MainProcess: [DEBUG   ] read : [   16] [  158] (hostname: i19r01a03\nqsub_time: Thu Sep 11 09:58:16 2014\nstart_time: Thu Sep 11 11:58:26 2014\nexit_status: 1\nend_time: Thu Sep 11 11:58:27 2014\nPROMPT-0->)
2014:09:11 09:58:37 radical.pilot.MainProcess: [ERROR   ] Pilot launching failed: SAGA Job state was FAILED.
Traceback (most recent call last):
  File "/home/merzky/matteos_experiments/ve/local/lib/python2.7/site-packages/radical/pilot/controller/pilot_launcher_worker.py", line 463, in run
    raise Exception("SAGA Job state was FAILED.")
Exception: SAGA Job state was FAILED.

This looks like a SAGA-level error, but I'll leave it here for the moment...
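For readers following the log: the last `cat` of the job info file returns `key: value` lines, including `exit_status: 1`, right before the pilot is reported as FAILED. The following is a minimal sketch of how such a record could be parsed and checked. The function names `parse_job_info` and `job_failed` are hypothetical, and treating a non-zero `exit_status` as failure is an assumption, not a description of what the SAGA adaptor actually does.

```python
def parse_job_info(text):
    """Parse 'key: value' lines into a dict; lines without ': ' are skipped."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            info[key.strip()] = value.strip()
    return info


def job_failed(info):
    """Assumption: a recorded, non-zero exit status means the job failed."""
    status = info.get("exit_status")
    return status is not None and int(status) != 0


# Record copied from the log above (without the trailing shell prompt).
sample = (
    "hostname: i19r01a03\n"
    "qsub_time: Thu Sep 11 09:58:16 2014\n"
    "start_time: Thu Sep 11 11:58:26 2014\n"
    "exit_status: 1\n"
    "end_time: Thu Sep 11 11:58:27 2014\n"
)
info = parse_job_info(sample)
print(info["exit_status"], job_failed(info))
```

Splitting on the first `": "` keeps the timestamps intact, since the colons inside `09:58:16` are not followed by a space.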

andre-merzky commented 10 years ago

Uhm, sooo, I removed the rm $SCRIPTFILE from the SAGA adaptor, to get some debugging going, and now the job submission succeeds. Hmm -- is there a race between us removing that script and llsubmit using it?
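The debugging change described here can be sketched as follows. This is not the SAGA adaptor's code: `submit_script` and its `keep_script` flag are hypothetical, and a small Python one-liner stands in for `llsubmit`. Whether LoadLeveler really reads the script after `llsubmit` has returned is not established in this thread; if it does, only keeping the file (as with `keep_script=True`) would avoid the race.

```python
import os
import subprocess
import sys
import tempfile


def submit_script(script_text, submit_cmd, keep_script=False):
    """Write script_text to a temp file and pass its path to submit_cmd.

    The file is removed only after submit_cmd has returned, and not at all
    when keep_script is True (useful for debugging a failed submission).
    """
    fd, path = tempfile.mkstemp(prefix="job-", suffix=".sh")
    with os.fdopen(fd, "w") as handle:
        handle.write(script_text)
    try:
        result = subprocess.run(
            submit_cmd + [path], capture_output=True, text=True
        )
    finally:
        if not keep_script:
            os.remove(path)
    return result.returncode, result.stdout, path


# Stand-in for the real submit command: it just prints the script it was given.
fake_submit = [sys.executable, "-c", "import sys; print(open(sys.argv[1]).read())"]

code, out, script_path = submit_script("echo hello\n", fake_submit, keep_script=True)
print(code, os.path.exists(script_path))
os.remove(script_path)  # clean up the script kept for inspection
```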

Either way, things look better, but I don't seem to have access to the global_virtenv:

di29suh@login05:~/radical.pilot.sandbox/pilot-5411741f20a6413d2a6a52e3> cat *ERR

DISK GROUP QUOTAS for home and project file systems:
Filesystem                                 Quota     Used Space     Free Space
/home/hpc/pr87be          ($HOME)        100.0GB    2.3GB ( 2%)   97.7GB (98%)
/gpfs/work/pr87be         ($WORK)        900.0GB    4.5MB ( 0%)  900.0GB (100%)
-------------------------------------------------------------------------------

Executing LRZ User Prolog ...
The mpi4py module is meant only for VERCE users. For all other users, please use the system default anaconda python module.

This python module is meant for VERCE users. Normal LRZ users should use the default python module using this command "module load python".
default_bootstrapper.sh: line 315: /home/hpc/pr87be/di29sut/pilotve/bin/activate: Permission denied

I am not sure what to make of the mpi4py error, but that won't be a showstopper AFAICS. Mark, would you mind some chmod magic (chmod -R a+rX /home/hpc/pr87be/di29sut/pilotve; chmod a+rX /home/hpc/pr87be/di29sut)? Thanks!
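Why two `chmod` calls: reading a file as another user requires read permission on the file itself and search (execute) permission on every directory leading to it, which is why the home directory needs `a+rX` as well as the virtualenv. A minimal sketch of such a check follows. `world_access_problems` is a hypothetical helper, and it is a simplification: it inspects only the "other" permission bits and ignores group membership and ACLs.

```python
import os
import stat
import tempfile


def world_access_problems(path):
    """List path components that block access for users outside owner/group."""
    path = os.path.abspath(path)
    problems = []
    current = path
    # Every parent directory up to the root must be searchable (o+x).
    while True:
        parent = os.path.dirname(current)
        if parent == current:
            break
        if not os.stat(parent).st_mode & stat.S_IXOTH:
            problems.append((parent, "missing o+x"))
        current = parent
    # The target itself must be readable (o+r).
    if not os.stat(path).st_mode & stat.S_IROTH:
        problems.append((path, "missing o+r"))
    return problems


# Demo on a throwaway directory that mimics a private virtualenv.
with tempfile.TemporaryDirectory() as venv_dir:
    activate = os.path.join(venv_dir, "activate")
    open(activate, "w").close()
    os.chmod(venv_dir, 0o700)
    os.chmod(activate, 0o600)
    for item, reason in world_access_problems(activate):
        print(item, reason)
```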

marksantcroos commented 10 years ago

On 11 Sep 2014, at 6:13 , Andre Merzky notifications@github.com wrote:

I am not sure what to make of the mpi4py error, but that won't be a showstopper AFAICS.

Indeed, it's not.

Mark, would you mind some chmod magic (chmod -R a+rX /home/hpc/pr87be/di29sut/pilotve; chmod a+rX /home/hpc/pr87be/di29sut)? Thanks!

Done.

The example works for me, so it should work for you now too [tm]. (I simplified/corrected it a bit after your last unification.)

andre-merzky commented 10 years ago

Works, thanks! :)

marksantcroos commented 10 years ago

Both code and documentation are groomed: http://radicalpilot.readthedocs.org/en/latest/machconf.html#supermuc-lrz-de