simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Setting up ssh connections between worker nodes in parallel universe #12

Open ickc opened 11 months ago

ickc commented 11 months ago

@rwf14f, I'm trying to establish ssh connections between worker nodes in the parallel universe, which is a prerequisite to bootstrapping MPICH.

The following is a Minimal Working Example (MWE) demonstrating the problem. The key is that it follows the example of mp1script, which uses the sshd.sh provided by HTCondor to bootstrap a "contact" file between the processes, and then runs a command of the form ssh -p PORT -i KEY_FILE USER@HOSTNAME date, which fails with Host key verification failed.. Several variations of this example were tried, including simpler variants such as ssh HOSTNAME date.

Is it a configuration issue at Blackett?

MWE:

In mpi.ini,

universe = parallel
executable = mp1script
machine_count = 2
should_transfer_files = yes
when_to_transfer_output = ON_EXIT_OR_EVICT
request_cpus   = 2
request_memory = 1G
request_disk   = 10G

log = mpi.log
output = mpi-$(Node).out
error = mpi-$(Node).err
stream_error = True
stream_output = True

queue

In a modified mp1script,

#!/bin/sh

# modified from /usr/share/doc/condor-9.0.17/examples/mp1script

_CONDOR_PROCNO=$_CONDOR_PROCNO
_CONDOR_NPROCS=$_CONDOR_NPROCS

CONDOR_SSH=`condor_config_val libexec`
CONDOR_SSH=$CONDOR_SSH/condor_ssh

SSHD_SH=`condor_config_val libexec`
SSHD_SH=$SSHD_SH/sshd.sh
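# sshd.sh (sourced below) starts a per-job sshd on this node and, via
# condor_chirp, appends this node's contact line to the shared contact file.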

. $SSHD_SH $_CONDOR_PROCNO $_CONDOR_NPROCS 

# If not the head node, just sleep forever, to let the
# sshds run
if [ $_CONDOR_PROCNO -ne 0 ]
then
        wait
        sshd_cleanup
        exit 0
fi

export P4_RSHCOMMAND=$CONDOR_SSH

CONDOR_CONTACT_FILE=$_CONDOR_SCRATCH_DIR/contact
export CONDOR_CONTACT_FILE

echo "Created the following contact file:"
cat $CONDOR_CONTACT_FILE

# The second field in the contact file is the machine name
# that condor_ssh knows how to use
sort -n -k 1 < $CONDOR_CONTACT_FILE | awk '{print $2}' > machines

export idkey
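# idkey (the path to the ssh identity file) is set by sshd.sh when it is
# sourced above; exporting it lets the awk script below read it via ENVIRON.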
awk '{
    print "ssh -p " $3 " -i " ENVIRON["idkey"] " " $4 "@" $2 " date";
}' $CONDOR_CONTACT_FILE > ssh.bash
echo "Trying to reach other hosts via ssh..."
cat ssh.bash
bash ./ssh.bash
status=$?

sshd_cleanup
rm -f machines

# exit with the status of ssh.bash, not of rm
exit $status

Submitting the job:

condor_submit mpi.ini

which results in the error:

Host key verification failed.
Host key verification failed.
rwf14f commented 11 months ago

In theory you should be able to use

echo "Trying to reach other hosts via ssh..."
for h in $(<machines); do
    $CONDOR_SSH $h date
done

This doesn't work either, though, because condor_ssh expects a different format for the contacts file. I currently don't know when and where the file is created, but in your case it starts with a number (probably the proc number), followed by the hostname, whereas condor_ssh expects it to start with the hostname. It could be a bug somewhere in our HTCondor version. Unfortunately, condor_ssh has the location of the contacts file semi-hardcoded ($_CONDOR_SCRATCH_DIR/contacts), so you can't just give it a different $CONDOR_CONTACT_FILE. You could hack your way around this, but it's probably not worth it.

Anyway, have a look at /usr/libexec/condor/condor_ssh to find out which options you need to set to get this working.
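For reference, the usual way around Host key verification failed. is to relax host-key checking on the client side. A hypothetical invocation along those lines (check condor_ssh for the exact options it actually passes):

ssh -p "$PORT" -i "$KEY_FILE" \
    -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null \
    "$USER@$HOSTNAME" date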

ickc commented 11 months ago

Thanks for the pointer @rwf14f; I got further after that. There's another related issue, originating from the fact that MPICH3+ expects ssh HOST COMMAND ... to work without any way to customize the flags at all. So the recommended way to configure ssh when using MPICH3+ is via ssh_config: [mpich-discuss] how to set ssh port in mpich

So is there any way I can configure ssh on the worker nodes? $HOME/.ssh/config could work, but first there's the issue that worker nodes have no HOME (#4), and even if I set HOME in the job script, is the sshd reading that file under the current configuration (as defined in /etc/ssh/sshd_config, which I cannot read)?
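For concreteness, the kind of client-side configuration the MPICH list suggests would look like this ($HOME/.ssh/config is read by the ssh client; the Port and IdentityFile values below are placeholders):

# hypothetical $HOME/.ssh/config
Host *
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    Port 4444
    IdentityFile ~/.ssh/condor_id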

rwf14f commented 11 months ago

Does MPICH3+ honour the PATH? If it does, you could write an ssh wrapper script that calls /usr/bin/ssh with the appropriate options, put it into a directory, and add that directory to the PATH.
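Something like this sketch (untested; the options shown are placeholders for whatever this setup actually needs):

#!/bin/sh
# ssh wrapper: forward to the real ssh with the extra options needed here
exec /usr/bin/ssh \
    -o StrictHostKeyChecking=no \
    -o UserKnownHostsFile=/dev/null \
    "$@"

Save it as, say, sshbin/ssh, make it executable, and prepend its directory with export PATH="$PWD/sshbin:$PATH".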

ickc commented 11 months ago

Thanks for the pointer. It turns out the path is hard-coded to /usr/bin/ssh, but MPICH provides an option, -launcher-exec, to override it: https://github.com/search?q=repo%3Apmodels%2Fmpich+%2Fusr%2Fbin%2Fssh&type=code
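For example, something like this sketch (the wrapper path and program name are placeholders):

mpiexec -launcher ssh -launcher-exec "$PWD/sshbin/ssh" \
    -machinefile machines -n "$_CONDOR_NPROCS" ./a.out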

A quick question: was either /usr/libexec/condor/sshd.sh or /usr/libexec/condor/condor_ssh changed/updated? I'm a bit confused between my memory and the conversation here, as at one point we said the contact file generated by the former isn't what the latter expects. But I just checked, and in the first:

echo "$_CONDOR_PROCNO $hostname $PORT $user $currentDir $thisrun"  |
        $CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact 

whereas in the second:

#proc=`echo $line | awk '{print $1}'`
host=`echo $line | awk '{print $2}'`
port=`echo $line | awk '{print $3}'`
username=`echo $line | awk '{print $4}'`
dir=`echo $line | awk '{print $5}'`

so it now seems that the two agree.
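For reference, a contact line therefore has the form (made-up values):

1 wn123.example.org 4444 someuser /pool/condor/dir_12345 12345

matching $_CONDOR_PROCNO $hostname $PORT $user $currentDir $thisrun above.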

rwf14f commented 11 months ago

Ah, the usage of condor_ssh says:

Usage: condor_ssh hostname command arg1 arg2

so I was passing in the hostname as the first parameter, but the script expects it to be the proc number from the contacts file:

proc=$1
...
line=`grep "^$proc " $contact`

if [ $? -ne 0 ]
then
    echo Proc $proc is not in contact file $contact
    exit 1
fi
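So passing proc numbers instead should work, e.g. (untested; proc 0 is the head node running the loop itself):

for p in $(seq 1 $(( _CONDOR_NPROCS - 1 ))); do
    $CONDOR_SSH $p date
done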
ickc commented 11 months ago

It is quite confusing that HTCondor does that, but even the OpenMPI wrapper script does exactly the same (the parameter is called host but is actually the proc number).

I got it past the ssh stage a few days ago, but MPICH has other errors that I have yet to figure out. I'm about to go on annual leave, so I'll probably revisit this in about a month.