radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

CU fails on trestles with 'Permission denied' #478

Closed mturilli closed 9 years ago

mturilli commented 9 years ago

Some or all of the CUs submitted to pilots on trestles fail. The pilot itself works properly, and no errors are reported in the pilot LOG* files. The STDERR of each failing unit contains only the following lines:

Permission denied, please try again.
Received disconnect from 10.1.254.212: 2: Too many authentication failures for mturilli

STDOUT is empty.

Here is a summary of the pilot description:

xsede.trestles:
    Allocation; None -> RP default : unc100
    Queue; None -> RP default      : None
    Number of cores                : 683
    Walltime in minutes            : 365
    Stop once the workflow is done : True

ssh to localhost works as expected:

[mturilli@trestles-login2 ~]$ ssh localhost
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Last login: Sat Jan 24 05:15:19 2015 from host231-181-dynamic.16-79-r.retail.telecomitalia.it
Rocks 6.1 (Emerald Boa)
Profile built 14:25 08-Dec-2013

Kickstarted 15:06 08-Dec-2013
Trestles Login Node
****************************************************************************
[mturilli@trestles-login2 ~]$

andre-merzky commented 9 years ago

Mark, do you have any idea in which direction to investigate? It looks like an ssh-level problem, though, doesn't it (trestles uses ssh as its non-MPI launch method)?

antonst commented 9 years ago

Is this similar to "CUs are failing on Trestles" (#124)? I posted some instructions there; I hope they help.

Thanks, Antons

andre-merzky commented 9 years ago

Antons, thanks! That does indeed look very similar. Matteo, would you please follow that procedure to set up another keypair?

mturilli commented 9 years ago

Thank you Antons.

I believe my setup is already consistent with the instructions Antons offered. Here are some details. Naming seems fine:

[mturilli@trestles-login1 .ssh]$ ls
authorized_keys  id_rsa  id_rsa.pub  known_hosts

Permissions are fine.

$ ls -al
total 24
drwx------  2 mturilli unc100 4096 Aug 18 09:09 .
drwx------ 10 mturilli unc100 4096 Jan 14 10:22 ..
-rw-------  1 mturilli unc100  821 Feb  6  2014 authorized_keys
-rw-------  1 mturilli unc100 1679 Feb  6  2014 id_rsa
-rw-r--r--  1 mturilli unc100  415 Feb  6  2014 id_rsa.pub
-rw-r--r--  1 mturilli unc100  640 Jan 24 05:15 known_hosts
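The same check can be scripted so it is easy to repeat on another node. The following is only a minimal sketch: `check_ssh_permissions` is a hypothetical helper, not part of RADICAL-Pilot, and its rules are deliberately conservative (no group/other access to the directory or the private key, no group/other write access to `authorized_keys`), which may be stricter than what a given OpenSSH configuration actually enforces.

```python
import os
import stat
from pathlib import Path


def check_ssh_permissions(ssh_dir):
    """Return a list of (path, problem) tuples for suspicious modes.

    Hypothetical helper with conservative rules; an empty list means
    nothing suspicious was found.
    """
    ssh_dir = Path(ssh_dir)
    problems = []

    # The directory itself should not be accessible by group or others.
    if stat.S_IMODE(ssh_dir.stat().st_mode) & 0o077:
        problems.append((str(ssh_dir), "directory accessible by group/other"))

    # The private key should not be accessible by group or others.
    private_key = ssh_dir / "id_rsa"
    if private_key.exists():
        if stat.S_IMODE(private_key.stat().st_mode) & 0o077:
            problems.append((str(private_key), "private key accessible by group/other"))

    # authorized_keys should not be writable by group or others.
    authorized = ssh_dir / "authorized_keys"
    if authorized.exists():
        if stat.S_IMODE(authorized.stat().st_mode) & 0o022:
            problems.append((str(authorized), "authorized_keys writable by group/other"))

    return problems


if __name__ == "__main__":
    for path, problem in check_ssh_permissions(os.path.expanduser("~/.ssh")):
        print(f"{path}: {problem}")
```

With the modes shown in the listing above (700 on the directory, 600 on `id_rsa` and `authorized_keys`), this sketch would report nothing.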

This is the local pub key:

[mturilli@trestles-login1 .ssh]$ cat id_rsa.pub 
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC2RPMsXCUpBL72IpoXhkrTq3jUvPmENiNqQlhb4v5+a72OYWj4Lz6ecfknSDNvOPpMlQ5iokC1ftIKhLEQzWouE2vfpgAy29m9p8KmCY/NsFXPv9zYFU7Cfe7YlKB3S0PrV3NoxmK0PsEZlqispFXGRWzY4d7MBZUSAHH4rYpsCNjleddPz5yieG/fR1oGaubp3waaDc0SZcRkEc8+WY/YDwqOssD8xlZZdWnZoPWBEAtr6sUlKFs+NUR61CZfw5lAkzlEZIjHTf7wuHV+fiH0U4l8DJSEODYogk7TwsEZ/2hafBTidKZPfB1GOPfJb+Dhew7ZHVxFczipOzCoZGKn mturilli@trestles-login1.sdsc.edu

And this is indeed the key in the authorized keys file:

$ cat authorized_keys 
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC2RPMsXCUpBL72IpoXhkrTq3jUvPmENiNqQlhb4v5+a72OYWj4Lz6ecfknSDNvOPpMlQ5iokC1ftIKhLEQzWouE2vfpgAy29m9p8KmCY/NsFXPv9zYFU7Cfe7YlKB3S0PrV3NoxmK0PsEZlqispFXGRWzY4d7MBZUSAHH4rYpsCNjleddPz5yieG/fR1oGaubp3waaDc0SZcRkEc8+WY/YDwqOssD8xlZZdWnZoPWBEAtr6sUlKFs+NUR61CZfw5lAkzlEZIjHTf7wuHV+fiH0U4l8DJSEODYogk7TwsEZ/2hafBTidKZPfB1GOPfJb+Dhew7ZHVxFczipOzCoZGKn mturilli@trestles-login1.sdsc.edu
ssh-rsa 
[...]
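Comparing two long key strings by eye is error-prone, so the comparison can be done programmatically. This is a sketch under stated assumptions: `key_blob` and `is_authorized` are hypothetical names, only the base64 key material is compared (the trailing comment is ignored), and option strings that contain spaces inside quotes are not handled.

```python
from pathlib import Path


def key_blob(line):
    """Return the base64 key material of a public-key line, or None.

    The key type is located first so that leading authorized_keys
    options (without embedded spaces) are skipped.
    """
    fields = line.split()
    for i, field in enumerate(fields[:-1]):
        if field.startswith(("ssh-", "ecdsa-")):
            return fields[i + 1]
    return None


def is_authorized(pub_key_text, authorized_keys_text):
    """True if the public key appears among the authorized_keys entries."""
    wanted = key_blob(pub_key_text)
    if wanted is None:
        return False
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if key_blob(line) == wanted:
            return True
    return False


if __name__ == "__main__":
    ssh_dir = Path.home() / ".ssh"
    pub = (ssh_dir / "id_rsa.pub").read_text()
    auth = (ssh_dir / "authorized_keys").read_text()
    print("key is authorized:", is_authorized(pub, auth))
```

A match only shows that the files are consistent on the node where the script runs; it says nothing about what the compute nodes see.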

But:

[mturilli@trestles-login1 .ssh]$ ssh 10.1.254.212
Password: 

That said, I am not sure it should work like that; it probably should not.

antonst commented 9 years ago

Have you tried with mturilli@10.1.254.212? If you provide the password, does it log in? You might also need to wait for "some time" for the keys to get propagated, I guess. What is the output of: ssh -vv mturilli@10.1.254.212?

Thanks, Antons
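For repeated tests, the verbose ssh call can be wrapped so that it never falls back to a password prompt. The sketch below assumes OpenSSH's client: `BatchMode=yes` makes ssh fail instead of prompting, and `IdentitiesOnly=yes` restricts the offered keys to the one given with `-i`. The latter is relevant because "Too many authentication failures" is commonly seen when a client offers more keys than the server allows attempts; whether that is the cause here is not established in this thread. The function names are hypothetical.

```python
import subprocess


def ssh_probe_command(user, host, identity_file, verbose=True):
    """Build an ssh command line that tests key-based login non-interactively."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",        # fail instead of asking for a password
        "-o", "IdentitiesOnly=yes",   # offer only the key named with -i
        "-o", "ConnectTimeout=10",
        "-i", identity_file,
    ]
    if verbose:
        cmd.append("-vv")             # same verbosity as suggested above
    cmd += [f"{user}@{host}", "true"]  # run a no-op command remotely
    return cmd


def probe(user, host, identity_file):
    """Run the probe and return (exit_code, stderr_text)."""
    result = subprocess.run(
        ssh_probe_command(user, host, identity_file),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr


if __name__ == "__main__":
    # Values taken from this thread; the identity path is an assumption.
    code, err = probe("mturilli", "10.1.254.212", "~/.ssh/id_rsa")
    print("exit code:", code)
    print(err)
```

An exit code of 0 means the key-based login worked; any other value means the debug output on stderr is worth reading.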

mturilli commented 9 years ago

The keys were already there, so I would rule out a propagation issue. I have no password on trestles, only my own key, which is different from the one pasted in my previous message (that one was set up by trestles 'itself'). Note that CUs do not always fail, most of the time at the moment.


antonst commented 9 years ago

Thank you for your reply Matteo. Interesting.

> CUs do not always fail, most of the time at the moment.

I would guess that CUs fail only on nodes where the keys are not configured, and not on all nodes. Is there a way to make RP messages more informative, e.g. from where to where the ssh attempt was made?

Thanks
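Until RP reports this itself, the target of each failed attempt can at least be recovered from the unit STDERR files, since the disconnect message quoted at the top of this issue names the remote address. A rough sketch, assuming the message format shown there and IPv4 addresses or hostnames (an IPv6 address would confuse the pattern); `failing_hosts` is a hypothetical name, and the source node of the attempt is not in that message, so it cannot be recovered this way.

```python
import re
import sys
from collections import Counter

# Matches e.g. "Received disconnect from 10.1.254.212: 2: Too many ..."
DISCONNECT = re.compile(r"Received disconnect from (\S+?):")


def failing_hosts(stderr_texts):
    """Count how often each remote host appears in ssh disconnect messages."""
    counts = Counter()
    for text in stderr_texts:
        counts.update(DISCONNECT.findall(text))
    return counts


if __name__ == "__main__":
    # Usage: pass the STDERR files of the failed units as arguments.
    texts = []
    for name in sys.argv[1:]:
        with open(name) as handle:
            texts.append(handle.read())
    for host, count in failing_hosts(texts).most_common():
        print(f"{host}\t{count}")
```

If the failures cluster on a few addresses, that would support the guess that only some nodes are misconfigured.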

andre-merzky commented 9 years ago

> Is there a way to make RP messages more informative, e.g. from where to where the ssh attempt was made?

That will be hidden somewhere in the logs -- but the node the agent is running on, and the nodes used for the CUs, will change every time, so I am not sure how digging that information out would help us. But I agree, it is interesting that the ssh setup seems OK for some nodes but not for others. Let's wait until Mark chimes in; otherwise I'd probably suggest opening a trestles ticket...

mturilli commented 9 years ago

It is now working. I am going to close this ticket, considering it a temporary failure of trestles rather than of radical.pilot.