Closed by ickc 8 months ago
It works ok for me, I haven't been able to reproduce your problems. It looks like the job starts, but for some reason gets terminated on the WNs. On the WN:
unhandled job exit: pid=23882, status=0
Process exited, pid=23895, status=255
Got SIGTERM. Performing graceful shutdown.
ShutdownGraceful all jobs.
Process exited, pid=23880, signal=15
The authentication errors have me puzzled. I wonder if you've set up a specific ssh client config in your environment that interferes with the condor ssh commands. The authentication fails and it can't connect to the WN, so the job gets killed. Too bad that HTCondor is so good at cleaning up after itself; it would be useful to get hold of the sshd logs it creates on the WNs.
The problem is stateful: I don't always experience it, but this also isn't the first time I've run into it. My guess is that I got temporarily banned, this time because of playing around with sshd.sh when trying to get MPICH to work.
This issue will be left open for now, but is not actionable at the moment.
This is likely some sort of auto-banning that will be lifted automatically after some time (given that it occurred when I was setting up SSH between processes).
We will revisit this if it continues to be a problem.
I ran into the same problem again over the last 2 days:
$ cat example.ini
RequestMemory=32999
RequestCpus=16
queue
$ condor_submit -i example.ini
Submitting job(s).
1 job(s) submitted to cluster 807.
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535
make: *** [submit] Error 255
Edit: this time I wasn't testing ssh connections within mpich jobs in the parallel universe. In fact, I haven't used Blackett over the weekend.
I think that problem is caused by your ssh configuration and ssh-agent. In your config you're using:
Host *
AddKeysToAgent yes
This means that every time you run an interactive condor job, or use condor_ssh_to_job, the temporary key used by htcondor is added to the active agent. Every time htcondor starts ssh, it tries all the keys in your agent in order (it prefers them over the command line option), all of which fail and count as a failed login attempt. Once you have six keys in your agent you get disconnected because of too many failed login attempts. There are several ways you can avoid this:
- Use AddKeysToAgent yes only on specific hosts, or unset it for htcondor ssh jobs.
- Add IdentitiesOnly yes to your config (keys would still be added to the agent if the above is still used on all hosts).
- Unset SSH_AUTH_SOCK before running an interactive htcondor job or using condor_ssh_to_job.
The hostnames HTCondor uses in its ssh commands are all prefixed with "condor-job." (trailing dot included). You can use this in your ssh config:
Host condor-job.*
AddKeysToAgent no
IdentitiesOnly yes
Host *
AddKeysToAgent yes
Here's the recipe to reproduce the problem:
$ ssh -o ForwardAgent=no <user>@<schedd> # prevent forwarding of any local ssh-agents when logging in
-bash-4.2$ cat .ssh/config
Host *
AddKeysToAgent yes
-bash-4.2$ eval `ssh-agent`
Agent pid 2409083
-bash-4.2$ ssh-add -l
The agent has no identities.
# submitting 6 jobs works fine
-bash-4.2$ for i in `seq 1 6`; do echo "exit" | condor_submit -i test.submit; done
Submitting job(s).
1 job(s) submitted to cluster 1788.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1789.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_2@wn2!
Submitting job(s).
1 job(s) submitted to cluster 1790.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1791.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1792.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1793.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
# all interactive jobs fail now
-bash-4.2$ condor_submit -i test.submit
Submitting job(s).
1 job(s) submitted to cluster 1794.
Waiting for job to start...
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Authentication failed.
-bash-4.2$ ssh-add -l
2048 SHA256:Y/kvlzVCBMQ/XsU+Tjz0/soNfMgEC2RrKsyOyfj1F4E (RSA)
2048 SHA256:sAUJ6w3i/k//a5CcvKYOLOh+GSRvZZuonMPmV1miLP0 (RSA)
2048 SHA256:6ouggpK9SRvUsvHnVJ7CgfijtLt3DilWHwtPtVdrKHE (RSA)
2048 SHA256:3ofKCm8+iBXKVys/NxYyi/1Ky03lNFhEoegbPl8Vn58 (RSA)
2048 SHA256:QKwAbXTUXnyXLWwzSUvmbmpw40FN/d5RGQ0220dTEic (RSA)
2048 SHA256:EM3v9ACdvMUIXGdt3/r4+YEFIj2Sri5i86gjTxHFzBU (RSA)
-bash-4.2$ ssh-add -D
All identities removed.
# it starts working again when identities have been removed from agent
-bash-4.2$ condor_submit -i test.submit
Submitting job(s).
1 job(s) submitted to cluster 1799.
Welcome to slot1_3@wn1!
bash-4.2$ logout
Connection to condor-job.wn1 closed.
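The recipe above suggests a simple pre-flight check: count the identities the agent currently holds before requesting an interactive job. The following is a minimal sketch under stated assumptions. The function names `count_identities` and `agent_is_overloaded` and the `MAX_KEYS` variable are hypothetical, and the default of six is the figure quoted earlier in this thread, not a value taken from HTCondor or OpenSSH documentation.

```shell
#!/bin/sh
# Count identities in `ssh-add -l` style output read from stdin.
# Identity lines start with the key size in bits, e.g. "2048 SHA256:... (RSA)";
# the "The agent has no identities." message does not, so it counts as zero.
count_identities() {
    grep -c '^[0-9][0-9]* ' || true
}

# Hypothetical check: exit 0 when the agent holds at least MAX_KEYS identities.
# The default of 6 is an assumption based on the explanation in this thread.
agent_is_overloaded() {
    max_keys=${MAX_KEYS:-6}
    n=$(ssh-add -l 2>/dev/null | count_identities)
    [ "$n" -ge "$max_keys" ]
}
```

A possible use is `agent_is_overloaded && echo "agent holds many keys; interactive jobs may fail"`. Whether to clear the agent with `ssh-add -D`, as in the recipe, is left to the user because that also removes their own keys.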
Thank you for your thorough analysis. This issue indeed highlights an interesting side-effect between the two systems.
I am exploring ways to enhance robustness against individual user SSH configurations. My understanding of the order of precedence for SSH options is as follows:
1. command-line options
2. the user's configuration file (~/.ssh/config)
3. the system-wide configuration file (/etc/ssh/ssh_config)
Given this hierarchy, system-wide configurations are unable to override user-specific settings. With respect to CLI options, it seems IdentitiesOnly is not available as a dedicated command-line flag. Moreover, it's uncertain whether HTCondor allows for modifications to these settings. If it is possible, passing -o IdentitiesOnly=yes via the CLI might help prevent failures when dropping into an interactive node due to excessive key attempts. What are your thoughts on the feasibility of this approach?
Documented in 11207bb40f015fc4f72e2209f8e19571be17c6fa and f3cd79c859134f6ac1ac153ce4398f6b1650506e. We can close if the CLI option cannot be set automatically by HTCondor.
You can set additional command line options with condor_ssh_to_job because it has a -ssh option that allows you to specify the ssh command to which condor_ssh_to_job will add its own options. There's no way to set this when submitting an interactive job though.
If you submit a normal sleep job, not an interactive one, then you can use condor_ssh_to_job -ssh 'ssh -o IdentitiesOnly=yes' <job id>.
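As a rough sketch of that workflow, the cluster id can be read from the condor_submit output shown earlier in this thread and passed on to condor_ssh_to_job. The helper names `parse_cluster_id` and `ssh_to_new_job` are hypothetical, and the sketch assumes a single-process job (process id 0) that is already running by the time condor_ssh_to_job is called; it does not wait for the job to start.

```shell
#!/bin/sh
# Extract the cluster id from condor_submit output read from stdin,
# e.g. "1 job(s) submitted to cluster 807." yields "807".
parse_cluster_id() {
    sed -n 's/^[0-9][0-9]* job(s) submitted to cluster \([0-9][0-9]*\)\.$/\1/p'
}

# Hypothetical helper: submit a normal (non-interactive) job described by the
# submit file in $1, then attach to it with IdentitiesOnly set.
# Assumption: the job has started before condor_ssh_to_job runs.
ssh_to_new_job() {
    cluster=$(condor_submit "$1" | parse_cluster_id)
    [ -n "$cluster" ] || return 1
    condor_ssh_to_job -ssh 'ssh -o IdentitiesOnly=yes' "$cluster.0"
}
```

In practice there will usually be a delay between submission and job start, so running the two steps by hand may be more reliable than the combined helper.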
I think it's best to just document this, then users can fix their own configurations if they want to use AddKeysToAgent on the submission machine. If you really want to make it easier for users then you can provide them with a wrapper script that removes the SSH_AUTH_SOCK environment variable before submitting the interactive job.
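Such a wrapper might look like the sketch below. It is only an illustration: the function name `condor_submit_noagent` and the `CONDOR_SUBMIT` override variable are hypothetical (the latter exists purely so the function can be tried without a real schedd), and the only HTCondor call assumed is `condor_submit -i`, as used throughout this thread.

```shell
#!/bin/sh
# Hypothetical wrapper around `condor_submit -i`.
# CONDOR_SUBMIT is an assumed override hook (not an HTCondor setting);
# it defaults to the real condor_submit command.
condor_submit_noagent() {
    (
        # Unset only inside this subshell, so the calling shell keeps its agent.
        unset SSH_AUTH_SOCK
        "${CONDOR_SUBMIT:-condor_submit}" -i "$@"
    )
}
```

Running `condor_submit_noagent test.submit` would then start the interactive job with no agent visible to ssh, so no agent keys are offered and the temporary key is not added to the agent.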
Feel free to report this to the HTCondor developers, but considering the huge number of ssh options, they are probably not willing to start adding additional command line options to prevent all possible errors that could be caused by a user configuration on the submission machine.
Thanks. I'll consider pinging the HTCondor developers or opening a PR later.
Hi, @rwf14f,
I’ve been experiencing this issue for a few days: "Too many authentication failures" occurs when I request an interactive node.
I’m not sure if this is related to my recent testing of sshd.sh in the parallel universe in #12.