Closed marksantcroos closed 10 years ago
Mark,
/tmp/ is not a shared filesystem on India. Pilots won’t bootstrap there. Please set the working directory for the agent to a shared fs, e.g. your home directory.
In the meantime, I will think about a way to catch this error, although this is not quite trivial.
Thanks! Ole
On Jan 21, 2014, at 10:12 , Mark Santcroos notifications@github.com wrote:
[DEBUG ] Created agent directory 'sftp://india.futuregrid.org//tmp/sinon/pilot-52de8d13a85378cead40418a/'
Hi,
On 21 Jan 2014, at 13:51 , Ole Weidner notifications@github.com wrote:
/tmp/ is not a shared filesystem on India. Pilots won’t bootstrap there. Please set the working directory for the agent to a shared fs, e.g. your home directory.
Ok, hmm, true, thanks.
This config-file vs. inline-config split is not really helpful, if you ask me: I was kind of “assuming” that this kind of issue was dealt with there. Of course it was technically my error, but I think it’s worth taking this experience into account.
In the meantime, I will think about a way to catch this error, although this is not quite trivial.
Can you elaborate on why this is difficult? I wouldn’t work on anything else before this is solved, as it is so essential.
Gr,
Mark
Ole,
you may want to simply forbid /tmp as the pilot working dir when submitting via PBS or other batch systems: /tmp is basically never shared. For other dirs that are neither home- nor scratch-based, it might be worth issuing an explicit warning while still accepting whatever the user sets.
My $0.02, Andre.
The approach of defining shared file systems makes sense, but shouldn't it be applied only when submitting to the batch system, and not, for example, on ssh submission?
Yes. That's why it's an optional parameter in the resource configuration file.
Ah, so you consider ssh://india a different resource than ssh+pbs://india, right? I see...
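A minimal sketch of the check discussed above, not sinon's actual implementation. The function name `check_workdir`, the `shared_fs` parameter, and the `BATCH_SCHEMES` set are all hypothetical, and `/N/u` in the usage note is only taken from the home directory visible in the job environment. The idea: skip the check for plain ssh, reject /tmp for batch submission, and only warn for anything outside the declared shared filesystems.

```python
import posixpath
import warnings
from urllib.parse import urlparse

# Hypothetical: URL schemes that submit through a batch system.
BATCH_SCHEMES = {"pbs", "pbs+ssh", "ssh+pbs"}


def _normalize(path):
    """Normalize a POSIX path; normpath keeps exactly two leading slashes."""
    path = posixpath.normpath(path)
    return path[1:] if path.startswith("//") else path


def _is_under(path, root):
    """True if path equals root or lies below it."""
    root = _normalize(root)
    return path == root or path.startswith(root.rstrip("/") + "/")


def check_workdir(resource_url, workdir, shared_fs=None):
    """Validate a pilot working directory (assumed to be an absolute path).

    shared_fs is an optional list of directories known to be shared between
    the login node and the compute nodes of the resource.
    """
    scheme = urlparse(resource_url).scheme
    if scheme not in BATCH_SCHEMES:
        return True  # e.g. plain ssh://: the agent runs where the files are

    path = _normalize(workdir)
    if _is_under(path, "/tmp"):
        raise ValueError(
            "%s is node-local; pick a shared directory for batch jobs" % workdir
        )
    if shared_fs and not any(_is_under(path, root) for root in shared_fs):
        # Accept whatever the user sets, but say so explicitly.
        warnings.warn("%s is not on a known shared filesystem" % workdir)
    return True
```

With this, `check_workdir("ssh+pbs://india.futuregrid.org", "/tmp/sinon", shared_fs=["/N/u"])` raises, while the same directory with `ssh://india.futuregrid.org` passes unchecked.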
$ python demo_milestone_02.py
2014:01:21 10:06:58 MainThread sinon.logger : [INFO ] loading sinon version: 0.1.3-20-gbfc3e3e
2014:01:21 10:06:59 MainThread sinon.logger : [INFO ] Created new Session {'session_uid': '52de8d12a85378cead404187', 'database_url': 'mongodb://ec2-184-72-89-141.compute-1.amazonaws.com:27017/'}.
2014:01:21 10:06:59 MainThread sinon.logger : [INFO ] Added credential {'UserID': 'marksant', 'UserPass': None, 'Type': 'SSH', 'UserKey': None} to session 52de8d12a85378cead404187.
2014:01:21 10:06:59 MainThread sinon.logger : [INFO ] Created new PilotManager {'type': 'PilotManager', 'uid': '52de8d13a85378cead404189'}.
And then it stops.
I see on india that the job has completed already:

$ qstat -f 1326243
Job Id: 1326243.i136
    Job_Name = SAGA-Python-PBSJobScript.f25824
    Job_Owner = marksant@i136
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = i136
    Checkpoint = u
    ctime = Tue Jan 21 10:07:04 2014
    Error_Path = i136://tmp/sinon/pilot-52de8d13a85378cead40418a/STDERR
    exec_host = i82/7+i82/6+i82/5+i82/4+i82/3+i82/2+i82/1+i82/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Jan 21 10:07:13 2014
    Output_Path = i136://tmp/sinon/pilot-52de8d13a85378cead40418a/STDOUT
    Priority = 0
    qtime = Tue Jan 21 10:07:04 2014
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=8
    Resource_List.walltime = 00:10:00
    session_id = 14392
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/N/u/marksant,
        PBS_O_LANG=en_US.UTF-8,PBS_O_LOGNAME=marksant,
        PBS_O_PATH=/opt/xcat/bin:/opt/xcat/sbin:/N/soft/moab/5.4.0/sbin:/N/so
        ft/moab/5.4.0/bin:/opt/torque/sbin:/opt/torque/bin:/usr/kerberos/bin:/
        usr/bin:/bin:/usr/sbin:/sbin:/opt/openssh-4.6p1/bin:/N/u/marksant/bin:
        /N/u/marksant/bin,PBS_O_MAIL=/var/spool/mail/marksant,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=i136,PBS_SERVER=i136,
        PBS_O_WORKDIR=/N/u/marksant,MODULE_VERSION_STACK=3.2.8,
        MANPATH=/opt/xcat/share/man:/N/soft/moab/5.4.0/man:/opt/torque/man:/u
        sr/share/man,HOSTNAME=i136,SHELL=/bin/bash,TERM=vt100,HISTSIZE=1000,
        SSH_CLIENT=69.125.59.160 64256 22,SSH_TTY=/dev/pts/15,PERL_BADLANG=0,
        USER=marksant,
        LSCOLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:bd=40;33;01
        :cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=01;32:.cmd=01;32:.exe=
        01;32:.com=01;32:.btm=01;32:.bat=01;32:.sh=01;32:.csh=01;32:.tar
        =01;31:.tgz=01;31:.arj=01;31:.taz=01;31:.lzh=01;31:.zip=01;31:.z
        =01;31:.Z=01;31:.gz=01;31:.bz2=01;31:.bz=01;31:.tz=01;31:.rpm=01
        ;31:.cpio=01;31:.jpg=01;35:.gif=01;35:.bmp=01;35:.xbm=01;35:.xpm
        =01;35:.png=01;35:_.tif=01;35:,
        LD_LIBRARY_PATH=/N/soft/moab/5.4.0/lib:/opt/torque/lib,
        XCATROOT=/opt/xcat,
        PATH=/opt/xcat/bin:/opt/xcat/sbin:/N/soft/moab/5.4.0/sbin:/N/soft/moa
        b/5.4.0/bin:/opt/torque/sbin:/opt/torque/bin:/usr/kerberos/bin:/usr/bi
        n:/bin:/usr/sbin:/sbin:/opt/openssh-4.6p1/bin:/N/u/marksant/bin:/N/u/m
        arksant/bin,MAIL=/var/spool/mail/marksant,MODULE_VERSION=3.2.8,
        PWD=/N/u/marksant,INPUTRC=/etc/inputrc,
        LMFILES=/opt/Modules/3.2.8/modulefiles/tools/torque/2.5.5:/opt/Modu
        les/3.2.8/modulefiles/tools/moab/5.4.0,LANG=en_US.UTF-8,
        MODULEPATH=/opt/Modules/3.2.8/modulefiles/applications:/opt/Modules/3
        .2.8/modulefiles/compilers:/opt/Modules/3.2.8/modulefiles/debuggers:/o
        pt/Modules/3.2.8/modulefiles/libraries:/opt/Modules/3.2.8/modulefiles/
        tools,LOADEDMODULES=torque/2.5.5:moab/5.4.0,PS1=PROMPT-$?->,
        XCATHOST=im1:3001,PS2=,
        SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,
        HOME=/N/u/marksant,SHLVL=1,LOGNAME=marksant,CVS_RSH=ssh,
        SSH_CONNECTION=69.125.59.160 64256 149.165.146.136 22,
        MODULESHOME=/opt/Modules/3.2.8,LESSOPEN=|/usr/bin/lesspipe.sh %s,
        INCLUDE=/N/soft/moab/5.4.0/include:/opt/torque/include,
        G_BROKEN_FILENAMES=1,
        module=() { eval /opt/Modules/$MODULE_VERSION/bin/modulecmd bash $*
        \ },_=/opt/torque/bin/qsub
    etime = Tue Jan 21 10:07:04 2014
    exit_status = 127
    submit_args = /tmp/SAGA-Python-PBSJobScript.f25824
    start_time = Tue Jan 21 10:07:11 2014
    Walltime.Remaining = 559
    start_count = 1
    fault_tolerant = False
    comp_time = Tue Jan 21 10:07:13 2014
    submit_host = i136
    init_work_dir = /N/u/marksant

In /tmp/sinon/pilot-52de8d13a85378cead40418a/STDERR I see:
/var/spool/torque/mom_priv/jobs/1326243.i136.SC: line 11: ./bootstrap-and-run-agent: No such file or directory
This is all on the devel branch.
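One possible way to surface this kind of silent failure on the client side, sketched under assumptions: the helper names `job_outcome` and `bootstrap_failed` are made up for illustration, and the parsing assumes the `name = value` layout that `qstat -f` prints for a completed job. It does not reflect how sinon or SAGA-Python actually monitor jobs. In a POSIX shell, exit status 127 conventionally means "command not found", which matches the missing ./bootstrap-and-run-agent reported in STDERR.

```python
import re


def job_outcome(qstat_text):
    """Extract (job_state, exit_status) from `qstat -f` output.

    Either element is None when the attribute is absent, e.g. exit_status
    is only reported once the job has completed.
    """
    state = re.search(r"^\s*job_state = (\w+)\s*$", qstat_text, re.MULTILINE)
    status = re.search(r"^\s*exit_status = (-?\d+)\s*$", qstat_text, re.MULTILINE)
    return (
        state.group(1) if state else None,
        int(status.group(1)) if status else None,
    )


def bootstrap_failed(qstat_text):
    """True if the job completed (state C) with a non-zero exit status."""
    state, status = job_outcome(qstat_text)
    return state == "C" and status is not None and status != 0
```

A client that polls the job and sees `bootstrap_failed(...)` return True could then report the failure, and ideally the contents of STDERR, instead of just stopping.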