ickc opened 11 months ago
In theory you should be able to use
echo "Trying to reach other hosts via ssh..."
for h in $(<machines); do
$CONDOR_SSH $h date
done
This doesn't work either though, because condor_ssh expects a different format of the contacts file. I currently don't know when and where the file is created, but in your case it starts with a number (probably the proc number), followed by the hostname, whereas condor_ssh expects it to start with the hostname. It could be a bug in our HTCondor version somewhere. Unfortunately, condor_ssh has the location of the contacts file semi-hardcoded ($_CONDOR_SCRATCH_DIR/contacts), so you can't just give it a different $CONDOR_CONTACT_FILE. You could hack your way around this, but it's probably not worth it.
Anyway, have a look at /usr/libexec/condor/condor_ssh to find out which options you need to set to get this working.
Thanks for the pointer @rwf14f, I got it further after that. There's another related issue, originating from the fact that MPICH3+ expects ssh HOST COMMAND ... to work without any way to customize the flags at all. So the recommended way to configure ssh when using MPICH3+ is to use ssh_config: [mpich-discuss] how to set ssh port in mpich
So is there any way I can configure ssh on the worker node? $HOME/.ssh/config could work, but first there's the issue that worker nodes have no HOME (#4), and even if I set my HOME in the job script, is the sshd reading that file under the current configuration (as defined in /etc/ssh/sshd_config, which I cannot read)?
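For reference, the kind of per-user config the mpich-discuss thread suggests would look roughly like this; this is only a sketch, and the host pattern, port, and key path below are placeholders, not values from this cluster:
# Hypothetical $HOME/.ssh/config sketch; all values are placeholders
Host wn*
    # sshd.sh starts an sshd on a non-standard port; it would need to match here
    Port 4444
    IdentityFile ~/.ssh/mpi_job_key
    # avoid "Host key verification failed" for throwaway per-job host keys
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
(The ssh client is what reads this file, so it would have to be visible on the node that initiates the connection.)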
Does MPICH3+ honour the PATH? If it does, you could write an ssh wrapper script that calls /usr/bin/ssh with the appropriate options, put it into a directory and add that directory to the PATH.
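A minimal sketch of such a wrapper, assuming the port and key file are exported by the job script (MPI_SSH_PORT and MPI_SSH_KEY are hypothetical variable names, not anything set by HTCondor):
#!/bin/sh
# ssh wrapper sketch: forwards to the real ssh with the options MPICH cannot pass itself.
# MPI_SSH_PORT and MPI_SSH_KEY are placeholders exported by the job script.
exec /usr/bin/ssh -p "${MPI_SSH_PORT:-22}" -i "$MPI_SSH_KEY" \
    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null "$@"
The wrapper would then go into a directory on the PATH (or, as it turns out below, be passed to MPICH explicitly).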
Thanks for the pointer. It turns out ssh is hard-coded to /usr/bin/ssh, but MPICH provides an option -launcher-exec to override it: https://github.com/search?q=repo%3Apmodels%2Fmpich+%2Fusr%2Fbin%2Fssh&type=code
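So a launch along these lines should work; this is only a sketch, and the wrapper path, host file name, and program name are placeholders:
# Point Hydra at the wrapper instead of the hard-coded /usr/bin/ssh
mpiexec -launcher ssh -launcher-exec "$_CONDOR_SCRATCH_DIR/ssh_wrapper" \
    -f machines -n "$_CONDOR_NPROCS" ./my_mpi_program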
A quick question: was either /usr/libexec/condor/sshd.sh or /usr/libexec/condor/condor_ssh changed/updated? I am a bit confused between my memory and the conversation here, as at one point we said the contact file generated by the first isn't in the format expected by the second. But I just checked, and in the first:
echo "$_CONDOR_PROCNO $hostname $PORT $user $currentDir $thisrun" |
$CONDOR_CHIRP put -mode cwa - $_CONDOR_REMOTE_SPOOL_DIR/contact
whereas in the second:
#proc=`echo $line | awk '{print $1}'`
host=`echo $line | awk '{print $2}'`
port=`echo $line | awk '{print $3}'`
username=`echo $line | awk '{print $4}'`
dir=`echo $line | awk '{print $5}'`
so it now seems that both agree.
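In other words, each line of the contact file should look roughly like this (all values below are made up for illustration):
# fields: proc  hostname  port  user  working-dir  run-id
0 wn1.example.org 4444 someuser /pool/condor/dir_12345 1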
Ah, the usage of condor_ssh says:
Usage: condor_ssh hostname command arg1 arg2
so I was passing in the hostname as the first parameter, but the script expects it to be the proc number from the contacts file:
proc=$1
...
line=`grep "^$proc " $contact`
if [ $? -ne 0 ]
then
echo Proc $proc is not in contact file $contact
exit 1
fi
It is quite confusing that HTCondor does that, but even the OpenMPI wrapper script does exactly the same (the argument is called host but is actually the process number).
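So the loop from the start of this thread would need to iterate over proc numbers rather than hostnames, something like the following sketch, assuming the contacts file is already in place under $_CONDOR_SCRATCH_DIR:
echo "Trying to reach other hosts via condor_ssh..."
# condor_ssh wants the proc number (first field of the contacts file), not the hostname
for p in $(awk '{print $1}' "$_CONDOR_SCRATCH_DIR/contacts"); do
    $CONDOR_SSH "$p" date
done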
I got it past the ssh stage a few days ago, but MPICH has other errors that I have yet to figure out. I'll be on annual leave soon, so I'll probably revisit this in about a month.
@rwf14f, I'm trying to establish ssh connections between worker nodes in the parallel universe, which is a prerequisite to bootstrap MPICH.
The following is a Minimal Working Example (MWE) demonstrating the problem. The key is that it follows the example of mp1script, which uses the sshd.sh provided by HTCondor to bootstrap a "contact" file between the processes, and then runs a command performing ssh -p PORT -i KEY_FILE USER@HOSTNAME date to show the error Host key verification failed. Different variations of this example were tried, including simpler variants such as ssh HOSTNAME date. Is it a configuration issue at Blackett?
MWE:
In mpi.ini:
In a modified mp1script:
Submitting the job:
which results in the error: