radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

agent failing to start on india #35

Closed marksantcroos closed 10 years ago

marksantcroos commented 10 years ago

$ python demo_milestone_02.py
2014:01:21 10:06:58 MainThread sinon.logger : [INFO ] loading sinon version: 0.1.3-20-gbfc3e3e
2014:01:21 10:06:59 MainThread sinon.logger : [INFO ] Created new Session {'session_uid': '52de8d12a85378cead404187', 'database_url': 'mongodb://ec2-184-72-89-141.compute-1.amazonaws.com:27017/'}.
2014:01:21 10:06:59 MainThread sinon.logger : [INFO ] Added credential {'UserID': 'marksant', 'UserPass': None, 'Type': 'SSH', 'UserKey': None} to session 52de8d12a85378cead404187.
2014:01:21 10:06:59 MainThread sinon.logger : [INFO ] Created new PilotManager {'type': 'PilotManager', 'uid': '52de8d13a85378cead404189'}.

And then it stops.

I see on india that the job has completed already:

$ qstat -f 1326243
Job Id: 1326243.i136
    Job_Name = SAGA-Python-PBSJobScript.f25824
    Job_Owner = marksant@i136
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = i136
    Checkpoint = u
    ctime = Tue Jan 21 10:07:04 2014
    Error_Path = i136://tmp/sinon/pilot-52de8d13a85378cead40418a/STDERR
    exec_host = i82/7+i82/6+i82/5+i82/4+i82/3+i82/2+i82/1+i82/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Jan 21 10:07:13 2014
    Output_Path = i136://tmp/sinon/pilot-52de8d13a85378cead40418a/STDOUT
    Priority = 0
    qtime = Tue Jan 21 10:07:04 2014
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=8
    Resource_List.walltime = 00:10:00
    session_id = 14392
    Variable_List = PBS_O_QUEUE=batch,
        PBS_O_HOME=/N/u/marksant,
        PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=marksant,
        PBS_O_PATH=/opt/xcat/bin:/opt/xcat/sbin:/N/soft/moab/5.4.0/sbin:/N/soft/moab/5.4.0/bin:/opt/torque/sbin:/opt/torque/bin:/usr/kerberos/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/openssh-4.6p1/bin:/N/u/marksant/bin:/N/u/marksant/bin,
        PBS_O_MAIL=/var/spool/mail/marksant,
        PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=i136,
        PBS_SERVER=i136,
        PBS_O_WORKDIR=/N/u/marksant,
        MODULE_VERSION_STACK=3.2.8,
        MANPATH=/opt/xcat/share/man:/N/soft/moab/5.4.0/man:/opt/torque/man:/usr/share/man,
        HOSTNAME=i136,
        SHELL=/bin/bash,
        TERM=vt100,
        HISTSIZE=1000,
        SSH_CLIENT=69.125.59.160 64256 22,
        SSH_TTY=/dev/pts/15,
        PERL_BADLANG=0,
        USER=marksant,
        LSCOLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:.cmd=01;32:.exe=01;32:.com=01;32:.btm=01;32:.bat=01;32:.sh=01;32:.csh=01;32:.tar=01;31:.tgz=01;31:.arj=01;31:.taz=01;31:.lzh=01;31:.zip=01;31:.z=01;31:.Z=01;31:.gz=01;31:.bz2=01;31:.bz=01;31:.tz=01;31:.rpm=01;31:.cpio=01;31:.jpg=01;35:.gif=01;35:.bmp=01;35:.xbm=01;35:.xpm=01;35:.png=01;35:_.tif=01;35:,
        LD_LIBRARY_PATH=/N/soft/moab/5.4.0/lib:/opt/torque/lib,
        XCATROOT=/opt/xcat,
        PATH=/opt/xcat/bin:/opt/xcat/sbin:/N/soft/moab/5.4.0/sbin:/N/soft/moab/5.4.0/bin:/opt/torque/sbin:/opt/torque/bin:/usr/kerberos/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/openssh-4.6p1/bin:/N/u/marksant/bin:/N/u/marksant/bin,
        MAIL=/var/spool/mail/marksant,
        MODULE_VERSION=3.2.8,
        PWD=/N/u/marksant,
        INPUTRC=/etc/inputrc,
        LMFILES=/opt/Modules/3.2.8/modulefiles/tools/torque/2.5.5:/opt/Modules/3.2.8/modulefiles/tools/moab/5.4.0,
        LANG=en_US.UTF-8,
        MODULEPATH=/opt/Modules/3.2.8/modulefiles/applications:/opt/Modules/3.2.8/modulefiles/compilers:/opt/Modules/3.2.8/modulefiles/debuggers:/opt/Modules/3.2.8/modulefiles/libraries:/opt/Modules/3.2.8/modulefiles/tools,
        LOADEDMODULES=torque/2.5.5:moab/5.4.0,
        PS1=PROMPT-$?->,
        XCATHOST=im1:3001,
        PS2=,
        SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,
        HOME=/N/u/marksant,
        SHLVL=1,
        LOGNAME=marksant,
        CVS_RSH=ssh,
        SSH_CONNECTION=69.125.59.160 64256 149.165.146.136 22,
        MODULESHOME=/opt/Modules/3.2.8,
        LESSOPEN=|/usr/bin/lesspipe.sh %s,
        INCLUDE=/N/soft/moab/5.4.0/include:/opt/torque/include,
        G_BROKEN_FILENAMES=1,
        module=() { eval /opt/Modules/$MODULE_VERSION/bin/modulecmd bash $* \ },
        _=/opt/torque/bin/qsub
    etime = Tue Jan 21 10:07:04 2014
    exit_status = 127
    submit_args = /tmp/SAGA-Python-PBSJobScript.f25824
    start_time = Tue Jan 21 10:07:11 2014
    Walltime.Remaining = 559
    start_count = 1
    fault_tolerant = False
    comp_time = Tue Jan 21 10:07:13 2014
    submit_host = i136
    init_work_dir = /N/u/marksant

In /tmp/sinon/pilot-52de8d13a85378cead40418a/STDERR I see:

/var/spool/torque/mom_priv/jobs/1326243.i136.SC: line 11: ./bootstrap-and-run-agent: No such file or directory
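The qstat record above already carries the two fields that identify this failure: job_state = C and exit_status = 127, the status a shell returns when it cannot find the command it was asked to run. As a rough sketch only (this is not code from sinon or SAGA-Python; the function names parse_qstat_full and describe_failure are made up for illustration), a client could read those fields from `qstat -f` output like this:

```python
import re

# An attribute line looks like "    job_state = C". Wrapped continuation
# lines (e.g. the rest of Variable_List) do not contain "name = value".
_ATTR = re.compile(r"^\s*([A-Za-z_][\w.]*) = (.*)$")


def parse_qstat_full(text):
    """Turn `qstat -f` style output into a dict of attribute -> value."""
    attrs = {}
    last_key = None
    for raw in text.splitlines():
        match = _ATTR.match(raw)
        if match:
            last_key = match.group(1)
            attrs[last_key] = match.group(2).strip()
        elif last_key is not None and raw.strip():
            # Continuation of the previous attribute's wrapped value.
            attrs[last_key] += raw.strip()
    return attrs


def describe_failure(attrs):
    """Return a short message for a completed, failed job, else None."""
    if attrs.get("job_state") != "C":
        return None
    status = int(attrs.get("exit_status", "0"))
    if status == 0:
        return None
    if status == 127:
        return "exit status 127: the shell could not find the command to run"
    return "exit status %d" % status
```

Exit status 127 alone does not say that the working directory was missing on the compute node; it only narrows the problem down to a command that was not found where the job script expected it.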

This is all on the devel branch.

oleweidner commented 10 years ago

Mark,

/tmp/ is not a shared filesystem on India. Pilots won’t bootstrap there. Please set the working directory for the agent to a shared fs, e.g. your home directory.

In the meantime, I will think about a way to catch this error, although this is not quite trivial.

Thanks! Ole

On Jan 21, 2014, at 10:12 , Mark Santcroos notifications@github.com wrote:

[DEBUG ] Created agent directory 'sftp://india.futuregrid.org//tmp/sinon/pilot-52de8d13a85378cead40418a/'

marksantcroos commented 10 years ago

Hi,

On 21 Jan 2014, at 13:51 , Ole Weidner notifications@github.com wrote:

/tmp/ is not a shared filesystem on India. Pilots won’t bootstrap there. Please set the working directory for the agent to a shared fs, e.g. your home directory.

Ok, hmm, true, thanks.

This split between config-file and inline configuration is not really helpful if you ask me, as I was kind of “assuming” that this kind of issue was dealt with there. Of course it was technically my error, but I think it's worth taking this experience into account.

In the meantime, I will think about a way to catch this error, although this is not quite trivial.

Can you elaborate why this is difficult? I wouldn’t work on anything else before this is solved as it is so essential.

Gr,

Mark

andre-merzky commented 10 years ago

Ole,

you may want to simply forbid /tmp as pilot working dir when submitting via PBS or other batch systems: /tmp is basically never shared. For other dirs that are neither home- nor scratch-based, it might be worth issuing an explicit warning, but accepting whatever the user sets.

My $0.02, Andre.
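The rule suggested above (reject /tmp for batch submission, warn about other directories outside the known shared areas, accept everything else) could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name check_pilot_workdir, the BATCH_SCHEMES set, and the shared_prefixes parameter are hypothetical and not part of the RADICAL-Pilot or SAGA-Python API.

```python
import posixpath
import warnings

# Hypothetical set of URL schemes that submit through a batch system.
BATCH_SCHEMES = {"pbs", "ssh+pbs"}


def _is_under(path, prefix):
    """True if path equals prefix or lies below it."""
    prefix = prefix.rstrip("/") or "/"
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


def check_pilot_workdir(scheme, workdir, shared_prefixes=("/home", "/scratch")):
    """Validate an absolute pilot working directory for a submission scheme.

    shared_prefixes is an assumed, site-specific list of directories that
    are visible on both the login node and the compute nodes.
    """
    # Normalise and collapse leading slashes ("//tmp/x" -> "/tmp/x").
    path = "/" + posixpath.normpath(workdir).lstrip("/")
    if scheme not in BATCH_SCHEMES:
        return path  # e.g. plain ssh: the agent runs where the dir was created
    if _is_under(path, "/tmp"):
        raise ValueError("%s is node-local and cannot be used as pilot "
                         "working directory with %s" % (path, scheme))
    if not any(_is_under(path, p) for p in shared_prefixes):
        warnings.warn("%s is not below a known shared filesystem" % path)
    return path
```

With these assumptions, check_pilot_workdir("ssh+pbs", "/tmp/sinon/pilot-52de8d13a85378cead40418a/") raises a ValueError at submission time, while the same directory with scheme "ssh" is accepted.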

andre-merzky commented 10 years ago

The approach of defining shared file systems makes sense, but shouldn't it be applied only when submitting to the batch system, and not on, for example, ssh submission?

oleweidner commented 10 years ago

yes. that's why it's an optional parameter in the resource configuration file.

andre-merzky commented 10 years ago

Ah, so you consider ssh://india a different resource than ssh+pbs://india, right? I see...
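To illustrate the point of the last two comments: if ssh://india and ssh+pbs://india are separate resource entries, the shared-filesystem list can simply be left out of the entry where it does not apply. The sketch below is only a guess at what that might look like. The dictionary layout, the key name shared_filesystems, the value /N/u, and the function shared_filesystems_for are assumptions, not the actual resource configuration file format.

```python
from urllib.parse import urlparse

# Hypothetical resource configuration: one entry per access scheme + host.
RESOURCE_CONFIGS = {
    # Plain ssh: no batch system, so the optional parameter is omitted.
    "ssh://india.futuregrid.org": {},
    # PBS via ssh: the agent starts on a compute node, so the working
    # directory has to be on a filesystem listed here (assumed value).
    "ssh+pbs://india.futuregrid.org": {"shared_filesystems": ["/N/u"]},
}


def shared_filesystems_for(resource_url):
    """Return the configured shared filesystems, or None if not set."""
    parsed = urlparse(resource_url)
    key = "%s://%s" % (parsed.scheme, parsed.hostname)
    config = RESOURCE_CONFIGS.get(key)
    if config is None:
        raise KeyError("no resource configuration for %s" % key)
    return config.get("shared_filesystems")
```

A caller would skip the working-directory check whenever the function returns None, which is the case for the ssh entry here.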