radical-cybertools / ExTASY

MDEnsemble

Connection to MongoDB failures #167

Closed ibethune closed 9 years ago

ibethune commented 9 years ago

As Gareth experienced during the beta testing sessions, his connection to the MongoDB instance first failed partway through an ExTASY job, and subsequently could not be established at all. It was working fine for everyone else.

I don't believe we have the full log for the partial job, although the pilot directory on ARCHER is:

/work/e290/e290/gshannon/radical.pilot.sandbox/pilot-551bf659be5d6a3d9d9d8bf7

For a job which failed immediately on startup, see

/work/e290/e290/gshannon/radical.pilot.sandbox/pilot-551bfd3ebe5d6a60ffdb1f40

The extasy.log from this job is posted here:

https://gist.github.com/ibethune/1844d65e240022599f1c

ibethune commented 9 years ago

Gareth re-ran this morning. Logs are at:

https://gist.github.com/ibethune/c49df16c211c0ed3ff4e

The pilot simply sat for 38 mins after starting and died. Nothing in the logs to suggest why.

Waiting on access to the pilot directory.

ibethune commented 9 years ago

There is some strange stuff in the agent logs (relating to locks etc.). See /work/e290/e290/gshannon/radical.pilot.sandbox/rp.session.taggart.pharm.nottingham.ac.uk.gareths.016552.0000-pilot.0000

Over to someone from RADICAL for a look.

vivek-bala commented 9 years ago

agent.err reads as follows:

Permission denied (publickey,keyboard-interactive).
=>> PBS: job killed: walltime 1202 exceeded limit 1200
kill: 13812: No such process
kill: 27805: No such process
pilot_bootstrapper.sh: line 102: 27805 Terminated              sleep 1

ibethune commented 9 years ago

Gareth has re-run and, from my analysis, it looks like we have the same original problem with connecting to the DB. He is not running the absolute latest code, since he installed from master, but the only thing on devel is the gromacs CU changes, so it shouldn't matter for this issue.

I have pasted the output and log to https://gist.github.com/ibethune/f033e16f7c3f85b6ea35

The pilot directory is public readable on ARCHER: /work/e290/e290/gshannon/radical.pilot.sandbox/rp.session.taggart.pharm.nottingham.ac.uk.gareths.016567.0000-pilot.0000

The sequence of events is that Gareth's job (which requested 1 node for 1 hour) was:

queued at: Tue May 12 17:41:15 2015
started at: Wed May 13 00:06:23 2015
finished at: Wed May 13 00:07:25 2015

i.e. it only ran for ~1 minute.
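For reference, the queue wait and run time can be worked out from the three timestamps above. This is just a minimal sketch using the Python standard library; the helper name `elapsed_seconds` is made up for illustration, and the format string assumes the English day and month names shown in the scheduler output.

```python
from datetime import datetime

# Timestamp layout as printed above, e.g. "Wed May 13 00:06:23 2015".
FMT = "%a %b %d %H:%M:%S %Y"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two timestamps in FMT layout."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


queued = "Tue May 12 17:41:15 2015"
started = "Wed May 13 00:06:23 2015"
finished = "Wed May 13 00:07:25 2015"

wait_s = elapsed_seconds(queued, started)   # time spent in the queue
run_s = elapsed_seconds(started, finished)  # time the job actually ran

print(f"queued for {wait_s:.0f} s, ran for {run_s:.0f} s")
```

So the job waited roughly six and a half hours in the queue and then ran for about a minute of the requested hour.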

The agent.err file shows the failure to connect to the mongodb.
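When triaging a file like agent.err, it may help to scan for known failure signatures first. The sketch below is an assumption on my part, not part of RADICAL-Pilot: the `SIGNATURES` table and the `classify` helper are hypothetical, and the two patterns are taken only from the lines quoted earlier in this thread.

```python
import re

# Hypothetical signature table: (compiled pattern, short label).
# Both patterns come from the agent.err lines quoted in this thread.
SIGNATURES = [
    (re.compile(r"Permission denied \(publickey"), "ssh-auth"),
    (re.compile(r"PBS: job killed: walltime (\d+) exceeded limit (\d+)"), "walltime"),
]


def classify(lines):
    """Return (label, line) pairs for every line matching a known signature."""
    findings = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop trailing CR/LF left by the terminal
        for pattern, label in SIGNATURES:
            if pattern.search(line):
                findings.append((label, line))
    return findings


# Sample input copied from the agent.err excerpt posted above.
sample = [
    "Permission denied (publickey,keyboard-interactive).\r",
    "=>> PBS: job killed: walltime 1202 exceeded limit 1200",
    "kill: 13812: No such process",
]

for label, line in classify(sample):
    print(f"{label}: {line}")
```

To run it against a real file, something like `classify(open("agent.err"))` should work, since the function only iterates over lines.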

Elena and I have both run jobs in the last day or so - please can someone look and see why this is causing a problem for Gareth!

marksantcroos commented 9 years ago

In agent.err I see:

Permission denied (publickey,keyboard-interactive).

Did Gareth set up his SSH keys correctly?

Instructions are here: http://radicalpilot.readthedocs.org/en/latest/faq.html#q-i-see-the-error-permission-denied-publickey-keyboard-interactive-in-agent-stderr-or-stderr

ibethune commented 9 years ago

Gareth confirmed he did not have keys set up for intra-node SSH. He is fixing this and will run again (hopefully with success).
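A quick way to check key-based SSH before submitting a job might look like the sketch below. It relies on the standard OpenSSH client options `BatchMode=yes` (fail instead of prompting for a password) and `ConnectTimeout`; the function names and the host passed in are hypothetical, and this is not something ExTASY or RADICAL-Pilot provides.

```python
import subprocess


def ssh_check_command(host, timeout_s=5):
    """Build an ssh command that fails rather than prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                # never fall back to interactive auth
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        host,
        "true",                               # trivial remote command
    ]


def can_ssh_without_password(host, timeout_s=5):
    """Return True if key-based SSH to `host` succeeds non-interactively."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout_s),
            capture_output=True,
            text=True,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh binary missing, or the connection hung past the timeout
        return False
    return result.returncode == 0
```

Calling `can_ssh_without_password("<hostname>")` from a login node, with the hostname of the target machine substituted in, should return False when keys are not set up, which is the situation that produces the "Permission denied (publickey,keyboard-interactive)" message in agent.err.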

ibethune commented 9 years ago

Gareth reports all is now working as expected!

marksantcroos commented 9 years ago

Great! Not sure why that was not caught earlier, but hopefully the documentation will help prevent it in the future.