ibethune closed this issue 9 years ago
Gather re-ran this morning. Logs are at:
https://gist.github.com/ibethune/c49df16c211c0ed3ff4e
The pilot simply sat for 38 mins after starting and died. Nothing in the logs to suggest why.
Waiting on access to the pilot directory.
There is some strange stuff in the agent logs (relating to locks etc.). See /work/e290/e290/gshannon/radical.pilot.sandbox/rp.session.taggart.pharm.nottingham.ac.uk.gareths.016552.0000-pilot.0000
Over to someone from RADICAL for a look.
agent.err reads as follows:
Permission denied (publickey,keyboard-interactive).^M
=>> PBS: job killed: walltime 1202 exceeded limit 1200
kill: 13812: No such process
kill: 27805: No such process
pilot_bootstrapper.sh: line 102: 27805 Terminated sleep 1
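For triage, a short script can flag the lines in agent.err that usually matter. The following is only a sketch: the two patterns are taken from the excerpt above, while the labels and the function name `scan_agent_err` are made up for illustration and are not part of RADICAL-Pilot.

```python
import re

# Hypothetical signature table. The patterns come from the agent.err excerpt
# quoted in this thread; the labels are illustrative only.
SIGNATURES = [
    (re.compile(r"Permission denied \(publickey"), "ssh authentication failed"),
    (re.compile(r"PBS: job killed: walltime (\d+) exceeded limit (\d+)"),
     "walltime exceeded"),
]


def scan_agent_err(text):
    """Return (line_number, label, line) for every line matching a signature."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, label in SIGNATURES:
            if pattern.search(line):
                hits.append((number, label, line))
                break  # one label per line is enough for triage
    return hits
```

Calling `scan_agent_err(open("agent.err").read())` on the file above would flag the first two lines and skip the `kill:` noise.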
Gareth has re-run, and from my analysis it looks like we have the same original problem with connecting to the DB. He is not running the absolute latest code since he installed from master, but the only thing on devel is the gromacs CU changes, so it shouldn't matter for this issue.
I have pasted the output and log to https://gist.github.com/ibethune/f033e16f7c3f85b6ea35
The pilot directory is public readable on ARCHER: /work/e290/e290/gshannon/radical.pilot.sandbox/rp.session.taggart.pharm.nottingham.ac.uk.gareths.016567.0000-pilot.0000
The sequence of events is that Gareth's job (which requested 1 node for 1 hour) was:
queued at: Tue May 12 17:41:15 2015
started at: Wed May 13 00:06:23 2015
finished at: Wed May 13 00:07:25 2015
- i.e. it only ran for ~1 minute
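To make the "~1 minute" figure concrete, the three timestamps can be subtracted directly. This is a minimal sketch using only the values quoted above; the format string and variable names are my own, and it assumes an English locale for the day and month abbreviations.

```python
from datetime import datetime

# Format of the scheduler timestamps quoted above, e.g. "Tue May 12 17:41:15 2015".
FMT = "%a %b %d %H:%M:%S %Y"


def parse(stamp):
    """Parse one of the quoted timestamps into a datetime."""
    return datetime.strptime(stamp, FMT)


queued = parse("Tue May 12 17:41:15 2015")
started = parse("Wed May 13 00:06:23 2015")
finished = parse("Wed May 13 00:07:25 2015")

queue_wait = started - queued   # time spent waiting in the queue
run_time = finished - started   # time the pilot actually ran

print("queue wait:", queue_wait)  # 6:25:08
print("run time:  ", run_time)    # 0:01:02
```

So the job waited about six and a half hours in the queue and then ran for 62 seconds of its requested hour.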
The agent.err file shows the failure to connect to the mongodb.
Elena and I have both run jobs in the last day or so - please can someone look and see why this is causing a problem for Gareth!
In agent.err I see:
Permission denied (publickey,keyboard-interactive).
Did Gareth set up his ssh keys correctly?
Instructions are here: http://radicalpilot.readthedocs.org/en/latest/faq.html#q-i-see-the-error-permission-denied-publickey-keyboard-interactive-in-agent-stderr-or-stderr
Gareth confirmed he did not have keys set up for intra-node SSH. He is fixing this and will run again (hopefully with success).
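For anyone hitting the same error, a quick way to test passwordless SSH before submitting a pilot is sketched below. It is an assumption-laden check, not the procedure from the FAQ: the target host (`localhost` by default) and the helper names are hypothetical, and the host the agent actually connects to on your machine may differ. `BatchMode=yes` makes ssh fail instead of prompting for a password, so a non-zero exit means key-based login is not working.

```python
import subprocess


def build_ssh_check(host="localhost", timeout_s=5):
    """Build an ssh command that fails fast instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                  # never prompt; fail if keys do not work
        "-o", f"ConnectTimeout={timeout_s}",    # give up quickly on unreachable hosts
        host,
        "true",                                 # no-op remote command
    ]


def can_ssh_without_password(host="localhost", timeout_s=5):
    """Return (ok, stderr). ok is True only if key-based login succeeded."""
    try:
        result = subprocess.run(
            build_ssh_check(host, timeout_s),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        return False, "ssh client not found on PATH"
    return result.returncode == 0, result.stderr.strip()
```

Run `can_ssh_without_password()` from a login node; if it returns `False` with "Permission denied (publickey,keyboard-interactive)." in the second element, the keys are not set up and the FAQ instructions linked above apply.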
Gareth reports all is now working as expected!
Great! Not sure why that was not caught earlier, but hopefully the documentation will help prevent that in the future.
As experienced by Gareth during the beta testing sessions, first his connection to the mongodb instance failed partway through an extasy job, then subsequently could not be established at all. It was working fine for everyone else.
I don't believe we have the full log for the partial job, although the pilot directory on ARCHER is:
/work/e290/e290/gshannon/radical.pilot.sandbox/pilot-551bf659be5d6a3d9d9d8bf7
For a job which failed immediately on startup, see
/work/e290/e290/gshannon/radical.pilot.sandbox/pilot-551bfd3ebe5d6a60ffdb1f40
The extasy.log from this job is posted here:
https://gist.github.com/ibethune/1844d65e240022599f1c