skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Core] Allow ssh access from head to worker nodes #3690

Open Michaelvll opened 2 months ago

Michaelvll commented 2 months ago

We currently do not set up SSH access from the head node to the worker nodes, which is required for MPI workloads.

One way to do so is to set up a dedicated public/private SSH key pair for each cluster's head and worker nodes.
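A minimal sketch of that per-cluster key-pair idea (file paths here are illustrative, not SkyPilot's actual provisioning code):

```shell
# Generate a dedicated, passphrase-less key pair for intra-cluster SSH.
ssh-keygen -t ed25519 -N '' -f /tmp/sky_cluster_key -q

# On each node, the public key would be appended to authorized_keys so the
# head can reach workers; a demo file stands in for ~/.ssh/authorized_keys.
cat /tmp/sky_cluster_key.pub >> /tmp/demo_authorized_keys
```

The private key would then be placed only on the head node, and the public half authorized on every worker.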

Version & Commit info:

asaiacai commented 2 months ago

If you want to ssh from the head node to the other workers and have it work for mpirun, it's sufficient to enable ssh-agent. No need to set up or copy keys. I have an example of this for running nccl-test with mpirun here, and I've also pasted an example of ssh'ing between hosts below for convenience. Might be sufficient to just include this in the docs/examples?

Andrews-MacBook-Air:skypilot asai$ eval $(ssh-agent -s)
Andrews-MacBook-Air:skypilot asai$ ssh-add ~/.ssh/sky-key
Andrews-MacBook-Air:skypilot asai$ sky launch -c test --num-nodes 2 --cloud gcp 'echo "$SKYPILOT_NODE_IPS"'
(worker1, rank=1, pid=3398, ip=10.128.0.12) 10.128.0.8
(worker1, rank=1, pid=3398, ip=10.128.0.12) 10.128.0.12
(head, rank=0, pid=3877) 10.128.0.8
(head, rank=0, pid=3877) 10.128.0.12
Andrews-MacBook-Air:skypilot asai$ ssh test # get onto head node 
(base) gcpuser@test-ebd1-head-op1wrzgz-compute:~$  ssh 10.128.0.12 # ssh to worker via private IP

Right now this doesn't work with sky jobs launch, since the controller doesn't have ssh-agent running. However, it looks like if you just run ssh-agent on the jobs controller, it works the same way:

(sky) Andrews-MacBook-Air:skypilot asai$ sky jobs launch -c test 'echo "$SKYPILOT_NODE_IPS"; sleep 1000000' --num-nodes 2 --cloud gcp
(worker1, rank=1, pid=3343, ip=10.128.0.43) 10.128.0.42
(worker1, rank=1, pid=3343, ip=10.128.0.43) 10.128.0.43
(head, rank=0, pid=3762) 10.128.0.42
(head, rank=0, pid=3762) 10.128.0.43
(sky) Andrews-MacBook-Air:skypilot asai$ ssh sky-jobs-controller-ebd16671 # access job controller
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute$ eval $(ssh-agent -s)
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute$ ssh-add ~/.ssh/sky-key
(base) gcpuser@sky-jobs-controller-ebd16671-ebd1-head-duxhhecj-compute:~$ ssh test-1 # access head node of job
(base) gcpuser@test-1-ebd1-head-48o88sx0-compute:~$ ssh 10.128.0.43 # access other worker
(base) gcpuser@test-1-ebd1-worker-9udnab2g-compute:~$
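The agent bootstrap used in the transcripts above can be condensed into a few lines (a throwaway key stands in for ~/.ssh/sky-key so the snippet is self-contained):

```shell
# Create a stand-in key; on a real cluster ~/.ssh/sky-key already exists.
ssh-keygen -t ed25519 -N '' -f /tmp/demo_sky_key -q

# Start the agent and export SSH_AUTH_SOCK/SSH_AGENT_PID into this shell.
eval "$(ssh-agent -s)" > /dev/null

# Load the key so subsequent ssh hops (head -> worker) can forward it.
ssh-add /tmp/demo_sky_key 2> /dev/null

ssh-add -l               # list loaded identities
ssh-agent -k > /dev/null # clean up the demo agent
```

With agent forwarding enabled (ssh -A, or ForwardAgent in ssh config), each hop reuses the loaded identity without any key ever being copied to the nodes.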
Michaelvll commented 2 months ago

> If you want to ssh from the head to the other workers and have it work for mpirun, it's sufficient to enable ssh-agent. No need to set up or copy keys. [...]

This is awesome! Thanks for mentioning this @asaiacai. ssh-agent should work well in the interactive case, but it might not be sufficient for examples that require SSH access in the run section of the task, as the run section is detached from the SSH connection.
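The concern can be illustrated in a couple of lines: a detached process does not inherit the interactive session's agent socket, so agent-based auth is unavailable there.

```shell
# Simulate a detached run section, which lacks the session's agent socket.
unset SSH_AUTH_SOCK

if [ -z "${SSH_AUTH_SOCK:-}" ]; then
  # Without a socket, ssh cannot use the forwarded agent and falls back
  # to whatever on-disk keys exist on the node.
  echo "no agent socket"
fi
```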

asaiacai commented 2 months ago

@Michaelvll it also works for mpirun tasks defined via run. I just tested that this works on the latest SkyPilot commit, bd383e912a55f0afbd9cc3c239771dbbf3dcb900, using the same task definition example you have in #3693 but omitting the ~/.ssh/sky-key file mount. Output is shown here.

Note that this probably won't work with sky jobs launch, but it might if we just started ssh-agent on the jobs controller by default?
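For reference, a hypothetical task sketch of what the run-section variant could look like (the mpirun invocation and use of `paste` are illustrative; num_nodes and SKYPILOT_NODE_IPS are real SkyPilot features, and ~/.ssh/sky-key is assumed to exist on the head node):

```yaml
num_nodes: 2

run: |
  # Start an agent inside the detached run session and load the cluster key.
  eval "$(ssh-agent -s)"
  ssh-add ~/.ssh/sky-key
  # Illustrative mpirun call across the node IPs SkyPilot provides.
  mpirun -np 2 --host "$(echo "$SKYPILOT_NODE_IPS" | paste -sd, -)" hostname
```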