Open Michaelvll opened 2 months ago
If you want to ssh from the head to the other workers and have it work for mpirun, it's sufficient to enable ssh-agent. No need to set up or copy keys. I have an example of this for running nccl-test here with mpirun, and I've also pasted an example below, for convenience, of ssh'ing between hosts. Might be sufficient to just include this in the docs/examples?
Andrews-MacBook-Air:skypilot asai$ eval $(ssh-agent -s)
Andrews-MacBook-Air:skypilot asai$ ssh-add ~/.ssh/sky-key
Andrews-MacBook-Air:skypilot asai$ sky launch -c test --num-nodes 2 --cloud gcp 'echo "$SKYPILOT_NODE_IPS"'
(worker1, rank=1, pid=3398, ip=10.128.0.12) 10.128.0.8
(worker1, rank=1, pid=3398, ip=10.128.0.12) 10.128.0.12
(head, rank=0, pid=3877) 10.128.0.8
(head, rank=0, pid=3877) 10.128.0.12
Andrews-MacBook-Air:skypilot asai$ ssh test # get onto head node
(base) gcpuser@test-ebd1-head-op1wrzgz-compute:~$ ssh 10.128.0.12 # ssh to worker via private IP
Right now this doesn't work if you do sky jobs launch, since the controller doesn't have ssh-agent running. However, it looks like if you just run ssh-agent on the job controller, it works the same way:
(sky) Andrews-MacBook-Air:skypilot asai$ sky jobs launch -c test 'echo "$SKYPILOT_NODE_IPS"; sleep 1000000' --num-nodes 2 --cloud gcp
(worker1, rank=1, pid=3343, ip=10.128.0.43) 10.128.0.42
(worker1, rank=1, pid=3343, ip=10.128.0.43) 10.128.0.43
(head, rank=0, pid=3762) 10.128.0.42
(head, rank=0, pid=3762) 10.128.0.43
(sky) Andrews-MacBook-Air:skypilot asai$ ssh sky-jobs-controller-ebd16671 # access job controller
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute$ eval $(ssh-agent -s)
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute$ ssh-add ~/.ssh/sky-key
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute:~$ ssh test-1 # access head node of job
(base) gcpuser@test-1-ebd1-head-48o88sx0-compute:~$ ssh 10.128.0.43 # access other worker
(base) gcpuser@test-1-ebd1-worker-9udnab2g-compute:~$
This is awesome! Thanks for mentioning this @asaiacai. The ssh-agent should work well in the interactive case, but it might not be sufficient for examples that require SSH access in the run section of the task, as the run section is detached from the SSH connection.
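One quick way to see whether a given shell (an interactive session or a task's run section) can reach an agent at all is to look at SSH_AUTH_SOCK, the standard environment variable that points at the agent's socket. A minimal sketch, assuming the setup relies on that variable being present; the function name agent_visible is made up for illustration:

```shell
# Hypothetical helper: succeed only if this shell can see an ssh-agent socket.
# SSH_AUTH_SOCK is the standard variable ssh uses to locate the agent.
agent_visible() {
  [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ]
}

if agent_visible; then
  echo "ssh-agent socket found at $SSH_AUTH_SOCK"
else
  echo "no ssh-agent socket visible in this shell"
fi
```

If the socket is visible, `ssh-add -l` lists the keys the agent currently holds, which helps tell "no agent" apart from "agent without the key".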
@Michaelvll it also works for mpirun tasks defined via run. I just tested that this works on the latest SkyPilot commit, bd383e912a55f0afbd9cc3c239771dbbf3dcb900, using the same task definition example you have in #3693 but omitting the mount of ~/.ssh/sky-key. Output is shown here.
Note that if we used sky jobs launch this probably won't work, but maybe it would work if we just started ssh-agent on the job controller by default?
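For a run section that depends on this, it may help to confirm the workers are reachable before calling mpirun. A rough sketch, assuming the head node's IP is the first line of SKYPILOT_NODE_IPS (as the output earlier in the thread suggests); the names worker_ips and check_workers are hypothetical, not part of SkyPilot:

```shell
# Hypothetical helper: print every IP in a newline-separated list except the
# first one. Assumes the first line is the head node.
worker_ips() {
  printf '%s\n' "$1" | tail -n +2
}

# Hypothetical check, meant to be called from the head node inside `run`.
# BatchMode makes ssh fail instead of prompting when no usable key is offered.
check_workers() {
  for ip in $(worker_ips "$SKYPILOT_NODE_IPS"); do
    ssh -o BatchMode=yes "$ip" hostname || echo "cannot reach $ip" >&2
  done
}
```

Calling check_workers first makes a missing agent show up as a clear "cannot reach" message rather than an mpirun launch failure.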
We currently do not set up the SSH connection from the head node to the workers, which is required for MPI workloads.
One way to do so is to set up another public/private key pair for SSH for each cluster's head and worker nodes.
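As an illustration of the key-pair approach, here is a minimal sketch using plain ssh-keygen. It is not how SkyPilot does or would do it; the function name, key path, and key comment are all placeholders:

```shell
# Hypothetical helper: create a passphrase-less ed25519 key pair at the given
# path. ssh-keygen asks before overwriting, so use a path that does not exist.
make_cluster_key() {
  ssh-keygen -q -t ed25519 -N "" -C "cluster-internal" -f "$1"
}
```

The private key would stay on the head node, and the matching `.pub` file would need to be appended to `~/.ssh/authorized_keys` on each worker for the head to log in.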
Version & Commit info:
sky -v: PLEASE_FILL_IN
sky -c: PLEASE_FILL_IN