radical-cybertools / radical.repex.at

This is the github location for RepEx developed by the RADICAL team in conjunction with the York Lab.

Repex failing archer #66

Closed vivek-bala closed 8 years ago

vivek-bala commented 8 years ago

Verbose log: https://gist.github.com/vivek-bala/7735f8c8045375493f18

It seemed to be fine in the morning: the job was submitted to the queue and was pending. I reran it just now, and it fails with saga.exceptions.NoSuccess: Could not detect shell prompt.

andre-merzky commented 8 years ago

Can you please try again? This looks like a connection timeout... Can you log into archer otherwise? Please try with:

/usr/bin/env TERM=vt100 /usr/bin/ssh -t -o IdentityFile=/home/vivek/.ssh/id_rsa -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_vivek_%h_%p.vb224.ctrl -o TCPKeepAlive=no -o ServerAliveInterval=10 -o ServerAliveCountMax=20 vb224@login.archer.ac.uk

preferably while the application is hanging and prints those HELLO_%d_SAGA log messages...

vivek-bala commented 8 years ago

> Can you please try again? This looks like a connection timeout...

This is repeatable. I get this every time now. I will try the command you mentioned.

> Can you log into archer otherwise?

Yes.

vivek-bala commented 8 years ago

I get the same error again: saga.exceptions.NoSuccess: Could not detect shell prompt

On executing the following command,

$ /usr/bin/env TERM=vt100 /usr/bin/ssh -t -o IdentityFile=/home/vivek/.ssh/id_rsa -o ControlMaster=auto -o ControlPath=/tmp/saga_ssh_vivek_%h_%p.vb224.ctrl -o TCPKeepAlive=no -o ServerAliveInterval=10 -o ServerAliveCountMax=20 vb224@login.archer.ac.uk

I don't get any output/message. It just hangs. This is from the VM. I was able to log into Archer during this time.

andre-merzky commented 8 years ago

That hang is exactly what then hangs in SAGA as well :/ I am not sure what to do about it, though: missing ssh connectivity is not something we can fix or work around in the radical stack :/

vivek-bala commented 8 years ago

I don't understand the difference between what saga does (assuming the above command is what saga executes) and a normal ssh into the machine (which was successful). I am puzzled as to why one works and the other does not.

marksantcroos commented 8 years ago

Can you remove any /tmp/saga_ssh_* files and try again?
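For reference, that cleanup could be sketched in Python roughly as follows. This is only an illustration: the pattern `/tmp/saga_ssh_*` is inferred from the ControlPath in the ssh command earlier in this thread, and `remove_control_sockets` is a hypothetical helper, not part of SAGA. It lists the matches first and deletes them only when asked to.

```python
import glob
import os


def remove_control_sockets(pattern="/tmp/saga_ssh_*", dry_run=True):
    """Return the paths matching `pattern`; delete them unless dry_run is set.

    The default pattern is an assumption based on the ControlPath used in
    this thread. Adjust it if your control sockets live elsewhere.
    """
    matches = sorted(glob.glob(pattern))
    if not dry_run:
        for path in matches:
            os.remove(path)  # works for regular files and Unix sockets alike
    return matches


if __name__ == "__main__":
    # Show what would be removed without touching anything.
    for path in remove_control_sockets():
        print("would remove:", path)
```

Calling `remove_control_sockets(dry_run=False)` performs the actual removal. Any ssh session still multiplexed over a removed socket would lose its control channel, so this is best done while no job is running.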

vivek-bala commented 8 years ago

OK, that worked! The job is now in the queue.

vivek-bala commented 8 years ago

@marksantcroos What exactly was happening?

andre-merzky commented 8 years ago

The ssh master channel either died or got stuck. Removing that file prompted ssh to create a new one...
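To see which file that is for a given connection, the ControlPath template from the ssh command in this thread can be expanded by hand. The sketch below assumes only the OpenSSH tokens `%h` (remote host), `%p` (remote port), `%r` (remote user) and `%%` are in play; `expand_control_path` is a hypothetical helper, and port 22 in the usage note is an assumption (the ssh default), since the command does not set a port.

```python
import re


def expand_control_path(template, host, port, user):
    """Expand a subset of OpenSSH ControlPath tokens (%h, %p, %r, %%).

    Other tokens that OpenSSH supports are deliberately not handled here
    and are left in the result unchanged.
    """
    values = {"%h": host, "%p": str(port), "%r": user, "%%": "%"}
    return re.sub(r"%[hpr%]", lambda m: values[m.group(0)], template)


if __name__ == "__main__":
    print(expand_control_path(
        "/tmp/saga_ssh_vivek_%h_%p.vb224.ctrl",
        host="login.archer.ac.uk",
        port=22,  # assumed default ssh port
        user="vb224",
    ))
```

Under those assumptions this prints `/tmp/saga_ssh_vivek_login.archer.ac.uk_22.vb224.ctrl`, which is the socket ssh would try to reuse while a master for that host exists.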

andre-merzky commented 8 years ago

I am not exactly sure how to avoid or detect that situation. It is not the first time this has happened, but it is also not very frequent and thus hard to reproduce. We probably won't touch that code path right now, but eventually it deserves some recovery mechanism.
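One possible shape for such a recovery step, as a rough sketch and not what SAGA currently does: ask the master whether it is alive with `ssh -O check`, bounded by a timeout, and remove the control socket if the check fails or hangs. The function names, the 5-second timeout, and the injectable `run` parameter (there only to make the logic testable without a real ssh connection) are all assumptions for illustration. `control_path` is expected to be the already expanded socket path.

```python
import os
import subprocess


def master_is_alive(control_path, target, timeout=5.0, run=subprocess.run):
    """Return True if `ssh -O check` confirms a live master within `timeout`."""
    cmd = ["ssh", "-o", f"ControlPath={control_path}", "-O", "check", target]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # the check itself hung: treat the master as stuck
    return result.returncode == 0


def drop_stale_master(control_path, target, timeout=5.0, run=subprocess.run):
    """Remove the control socket if its master does not respond.

    Returns True if the socket was removed, False if there was nothing to
    remove or the master answered the check.
    """
    if not os.path.exists(control_path):
        return False
    if master_is_alive(control_path, target, timeout, run):
        return False
    os.remove(control_path)
    return True
```

A caller could run `drop_stale_master(path, "vb224@login.archer.ac.uk")` before opening a new connection, so that ssh creates a fresh master instead of waiting on a dead one. Whether a stuck master always fails or times out on `-O check` has not been verified here, which is why the timeout is treated as a failure as well.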

antonst commented 8 years ago

Is this issue still relevant for the devel branch? If not, I will close this ticket.

antonst commented 8 years ago

Closed due to lack of response.