simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

SSH Config Issue: Excessive Key Accumulation in SSH Agent with HTCondor Jobs Leading to Authentication Failures #13

Closed ickc closed 8 months ago

ickc commented 11 months ago

Hi, @rwf14f,

I've been experiencing this issue for a few days: "Too many authentication failures" occurs when I request an interactive node.

❯ cat example.ini
RequestMemory=32999
RequestCpus=16
queue
❯ condor_submit -i example.ini
Submitting job(s).
1 job(s) submitted to cluster 463.
Waiting for job to start...
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535

I’m not sure if this is related to testing sshd.sh in parallel universe recently in #12.

rwf14f commented 11 months ago

It works ok for me, I haven't been able to reproduce your problems. It looks like the job starts, but for some reason gets terminated on the WNs. On the WN:

unhandled job exit: pid=23882, status=0
Process exited, pid=23895, status=255
Got SIGTERM. Performing graceful shutdown.
ShutdownGraceful all jobs.
Process exited, pid=23880, signal=15

The authentication errors have me puzzled. I wonder if you've set up a specific ssh client config in your environment that interferes with the condor ssh commands. The authentication fails, so it can't connect to the WN, and as a result the job gets killed. Too bad that HTCondor is so good at cleaning up after itself; it would be useful to get hold of the sshd logs it creates on the WNs.

ickc commented 11 months ago

The problem is stateful: I don't always experience it, but this also isn't the first time I've run into it. My guess is that I got temporarily banned, this time because I was playing around with sshd.sh while trying to get MPICH to work.

This issue will be left open for now, but is not actionable at the moment.

ickc commented 9 months ago

This is likely to be some sort of auto-banning that would be lifted automatically after some time. (Given that it occurred when I was setting up SSH between processes.)

We will revisit this if it continues to be a problem.

ickc commented 8 months ago

I ran into the same problem again over the last 2 days:

$ cat example.ini 
RequestMemory=32999
RequestCpus=16
queue
$ condor_submit -i example.ini
Submitting job(s).
1 job(s) submitted to cluster 807.
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535
make: *** [submit] Error 255

Edit: this time I wasn't testing ssh connections within mpich jobs in parallel universe. In fact I haven't used Blackett over the weekend.

rwf14f commented 8 months ago

I think that problem is caused by your ssh configuration and ssh-agent. In your config you're using:

Host *
  AddKeysToAgent yes

This means that every time you run an interactive condor job or use condor_ssh_to_job, the temporary key used by HTCondor is added to the active agent. Every time HTCondor starts ssh, it tries all the keys in your agent in order (ssh prefers them over the key given on the command line). Each of them fails and counts as a failed login attempt. Once you have six keys in your agent, you get disconnected because of too many failed login attempts. There are several ways you can avoid this:

The hostnames HTCondor uses in its ssh commands are all prefixed with "condor-job.", so you can match them in your ssh config:

Host condor-job.*
    AddKeysToAgent no
    IdentitiesOnly yes
Host *
    AddKeysToAgent yes
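
Before changing any configuration, it can help to confirm that accumulated keys are actually the cause. The following is a minimal sketch, not part of HTCondor: the function names agent_key_count and check_agent_keys and the MAX_KEYS variable are hypothetical, and the default of six is simply the number of keys at which the failure shows up in this thread (the real limit depends on the server).

```shell
#!/bin/sh
# Assumption: six keys is the point of failure reported in this thread;
# override MAX_KEYS if your server allows a different number of attempts.
MAX_KEYS=${MAX_KEYS:-6}

# Print the number of identities held by the current ssh-agent.
# "ssh-add -l" exits non-zero when the agent is empty or unreachable,
# in which case 0 is reported.
agent_key_count() {
    if keys=$(ssh-add -l 2>/dev/null); then
        printf '%s\n' "$keys" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Return non-zero (and warn on stderr) if the agent holds enough keys
# to use up all login attempts before HTCondor's own key is tried.
check_agent_keys() {
    n=$(agent_key_count)
    if [ "$n" -ge "$MAX_KEYS" ]; then
        echo "ssh-agent holds $n keys; interactive jobs may fail" >&2
        return 1
    fi
}
```

Sourcing this and running check_agent_keys before condor_submit -i gives an early warning. Note that clearing the agent with ssh-add -D removes all identities, including any of your own keys, not just the temporary HTCondor ones.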

Here's the recipe to reproduce the problem:

$ ssh -o ForwardAgent=no <user>@<schedd>  # prevent forwarding of any local ssh-agents when logging in

-bash-4.2$ cat .ssh/config 
Host *
    AddKeysToAgent yes

-bash-4.2$ eval `ssh-agent`
Agent pid 2409083

-bash-4.2$ ssh-add -l
The agent has no identities.

# submitting 6 jobs works fine
-bash-4.2$ for i in `seq 1 6`; do echo "exit" | condor_submit -i test.submit; done
Submitting job(s).
1 job(s) submitted to cluster 1788.
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1789.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_2@wn2!
Submitting job(s).
1 job(s) submitted to cluster 1790.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1791.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1792.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!
Submitting job(s).
1 job(s) submitted to cluster 1793.
Waiting for job to start...
Pseudo-terminal will not be allocated because stdin is not a terminal.
Welcome to slot1_5@wn1!

# all interactive jobs fail now
-bash-4.2$ condor_submit -i test.submit
Submitting job(s).
1 job(s) submitted to cluster 1794.
Waiting for job to start...
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Authentication failed.

-bash-4.2$ ssh-add -l
2048 SHA256:Y/kvlzVCBMQ/XsU+Tjz0/soNfMgEC2RrKsyOyfj1F4E  (RSA)
2048 SHA256:sAUJ6w3i/k//a5CcvKYOLOh+GSRvZZuonMPmV1miLP0  (RSA)
2048 SHA256:6ouggpK9SRvUsvHnVJ7CgfijtLt3DilWHwtPtVdrKHE  (RSA)
2048 SHA256:3ofKCm8+iBXKVys/NxYyi/1Ky03lNFhEoegbPl8Vn58  (RSA)
2048 SHA256:QKwAbXTUXnyXLWwzSUvmbmpw40FN/d5RGQ0220dTEic  (RSA)
2048 SHA256:EM3v9ACdvMUIXGdt3/r4+YEFIj2Sri5i86gjTxHFzBU  (RSA)

-bash-4.2$ ssh-add -D
All identities removed.

# it starts working again when identities have been removed from agent
-bash-4.2$ condor_submit -i test.submit
Submitting job(s).
1 job(s) submitted to cluster 1799.
Welcome to slot1_3@wn1!
bash-4.2$ logout
Connection to condor-job.wn1 closed.

ickc commented 8 months ago

Thank you for your thorough analysis. This issue indeed highlights an interesting side-effect between the two systems.

I am exploring ways to enhance robustness against individual user SSH configurations. My understanding of the order of precedence for SSH options is as follows:

  1. Command Line Interface (CLI) options
  2. User-specific configuration
  3. System-wide configuration

Given this hierarchy, system-wide configuration cannot override user-specific settings. On the command line, IdentitiesOnly has no dedicated flag and has to be passed through -o. Moreover, it is unclear whether HTCondor allows its ssh options to be modified. If it does, passing -o IdentitiesOnly=yes on the command line might prevent interactive jobs from failing because of excessive key attempts. What are your thoughts on the feasibility of this approach?

ickc commented 8 months ago

Documented in 11207bb40f015fc4f72e2209f8e19571be17c6fa and f3cd79c859134f6ac1ac153ce4398f6b1650506e. We can close if the CLI option cannot be set automatically by HTCondor.

rwf14f commented 8 months ago

You can set additional command line options with condor_ssh_to_job because it has a -ssh option that lets you specify the ssh command, to which condor_ssh_to_job then adds its own options. There's no way to set this when submitting an interactive job, though. If you submit a normal sleep job instead of an interactive one, you can then use condor_ssh_to_job -ssh 'ssh -o IdentitiesOnly=yes' <job id>.

I think it's best to just document this, then users can fix their own configurations if they want to use AddKeysToAgent on the submission machine. If you really want to make it easier for users then you can provide them with a wrapper script that removes the SSH_AUTH_SOCK environment variable before submitting the interactive job.
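
A wrapper along those lines could look like the sketch below. The function names without_agent and condor_submit_interactive are hypothetical; the only thing taken from the discussion is the idea of hiding SSH_AUTH_SOCK from the submit command, so that ssh cannot reach the agent and falls back to the key HTCondor passes to it.

```shell
#!/bin/sh
# Run any command with the ssh-agent socket hidden from it.
# The subshell keeps SSH_AUTH_SOCK intact in the calling shell.
without_agent() {
    (
        unset SSH_AUTH_SOCK
        "$@"
    )
}

# Hypothetical convenience wrapper for interactive submissions.
# Usage: condor_submit_interactive example.ini
condor_submit_interactive() {
    without_agent condor_submit -i "$@"
}
```

Because the variable is only unset inside a subshell, the user's agent stays usable for ordinary ssh sessions started from the same shell afterwards.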

Feel free to report this to the HTCondor developers, but considering the huge number of ssh options, they are probably not willing to start adding additional command line options to prevent all possible errors that could be caused by a user configuration on the submission machine.

ickc commented 8 months ago

Thanks. I'll consider pinging the HTCondor developers or opening a PR later.