skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

AWS stuck on `Waiting for SSH access` #2689

Open mukundt opened 1 year ago

mukundt commented 1 year ago

Hey! I've been able to successfully deploy on GCP, but when I try AWS, SkyPilot times out after 600s:

I 10-10 16:33:21 provisioner.py:73] Launching on AWS us-west-2 (us-west-2a,us-west-2b,us-west-2c)
⠋ Launching - Waiting for SSH access
E 10-10 16:43:32 provisioner.py:491] *** Failed setting up cluster. ***
RuntimeError: Failed to SSH to xxx.xxx.xxx after timeout 600s.

Is something wrong with my AWS config? sky check returns AWS: enabled.

Michaelvll commented 1 year ago

Thanks for reporting the issue @mukundt! Could you share the skypilot version you are using as well as the yaml file used for launching the VM?

mukundt commented 1 year ago

@Michaelvll using skypilot==0.4.0.

Launch command: sky launch -c a10g finetune_skypilot_run.yaml -y --gpus A10G:1 --cloud aws

YAML:

resources:
  disk_size: 800
  ports:
    - 8000

file_mounts:
  /sky_data: .

setup: |
  conda activate train
  if [ $? -ne 0 ]; then
    conda create -n train python=3.9 -y
    conda activate train
  fi

  rm -rf axolotl
  git clone https://github.com/OpenAccess-AI-Collective/axolotl
  cd axolotl

  pip install wandb
  pip install vllm
  pip install fschat
  pip install packaging
  pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  pip install -e '.[deepspeed]'
  pip install flash-attn==2.3.0 --no-build-isolation

run: |
  conda activate train
  cd axolotl
  echo 'Starting training...'
  accelerate launch -m axolotl.cli.train /sky_data/refinetune_7b_axolotl_config.yml

Michaelvll commented 1 year ago

Hey @mukundt, thanks for sharing the yaml file. Would you like to try the latest SkyPilot master branch or pip uninstall skypilot; pip install -U skypilot-nightly to see if that solves the issue?

PoornaSaiNagendra commented 10 months ago

Hi @Michaelvll, I've encountered the same issue with SkyPilot on AWS and attempted the solution you mentioned by installing skypilot-nightly, but the problem persists. Any additional suggestions or insights would be greatly appreciated.

Michaelvll commented 10 months ago

Hi @PoornaSaiNagendra, thanks for the question! Could you share the provision.log as mentioned in the output of your sky launch?

Also, it would be nice to try to check the AWS console to find the public IP of the cluster, and ssh <public-ip> -i ~/.ssh/sky-key to see if you can connect to the cluster to understand where the error comes from. : )

nshaposh commented 9 months ago

Anything on this? Just experienced the same issue...

Michaelvll commented 9 months ago

Thanks for asking @nshaposh! Could you share the yaml and the provision.log that causes the issue, so we can take a deeper look into the issue?

nshaposh commented 9 months ago

provision.log

There seems to be a problem propagating the SSH keys to the instance, or something similar. When I check from the AWS console, the instance is there, although it is not associated with any SSH keys.

Here is the YAML (I can't attach it). It is the YAML for llm-llama-tuner; I only modified the instance type and cloud:

# llm-tuner.yaml

resources:
  accelerators: A10G:1  # 1x NVIDIA A10 GPU, about US$ 0.6 / hr on Lambda Cloud. Run `sky show-gpus` for supported GPU types, and `sky show-gpus [GPU_NAME]` for the detailed information of a GPU type.
  cloud: aws  # Optional; if left out, SkyPilot will automatically pick the cheapest cloud.

file_mounts:
  # Mount a persisted cloud storage that will be used as the data directory.
  # (to store training datasets and trained models)
  # See https://skypilot.readthedocs.io/en/latest/reference/storage.html for details.
  /data:
    name: epigen-llm-tuner-data  # Make sure this name is unique or you own this bucket. If it does not exist, SkyPilot will try to create a bucket with this name.
    store: s3  # Could be either of [s3, gcs]

    mode: MOUNT

# Clone the LLaMA-LoRA Tuner repo and install its dependencies.
setup: |
  conda create -q python=3.8 -n llm-tuner -y
  conda activate llm-tuner

  # Clone the LLaMA-LoRA Tuner repo and install its dependencies
  [ ! -d llm_tuner ] && git clone https://github.com/zetavg/LLaMA-LoRA-Tuner.git llm_tuner
  echo 'Installing dependencies...'
  pip install -r llm_tuner/requirements.lock.txt

  # Optional: install wandb to enable logging to Weights & Biases
  pip install wandb

  # Optional: patch bitsandbytes to workaround error "libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats"
  BITSANDBYTES_LOCATION="$(pip show bitsandbytes | grep 'Location' | awk '{print $2}')/bitsandbytes"
  [ -f "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so" ] && [ ! -f "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so.bak" ] && [ -f "$BITSANDBYTES_LOCATION/libbitsandbytes_cuda121.so" ] && echo 'Patching bitsandbytes for GPU support...' && mv "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so" "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so.bak" && cp "$BITSANDBYTES_LOCATION/libbitsandbytes_cuda121.so" "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so"
  conda install -q cudatoolkit -y

  echo 'Dependencies installed.'

  # Optional: Install and setup Cloudflare Tunnel to expose the app to the internet with a custom domain name
  [ -f /data/secrets/cloudflared_tunnel_token.txt ] && echo "Installing Cloudflare" && curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb && sudo dpkg -i cloudflared.deb && sudo cloudflared service uninstall || : && sudo cloudflared service install "$(cat /data/secrets/cloudflared_tunnel_token.txt | tr -d '\n')"

  # Optional: pre-download models
  echo "Pre-downloading base models so that you won't have to wait for long once the app is ready..."
  python llm_tuner/download_base_model.py --base_model_names='decapoda-research/llama-7b-hf,nomic-ai/gpt4all-j'

# Start the app. `hf_access_token`, `wandb_api_key` and `wandb_project` are optional.
run: |
  conda activate llm-tuner
  python llm_tuner/app.py \
    --data_dir='/data' \
    --hf_access_token="$([ -f /data/secrets/hf_access_token.txt ] && cat /data/secrets/hf_access_token.txt | tr -d '\n')" \
    --wandb_api_key="$([ -f /data/secrets/wandb_api_key.txt ] && cat /data/secrets/wandb_api_key.txt | tr -d '\n')" \
    --wandb_project='llm-tuner' \
    --timezone='Atlantic/Reykjavik' \
    --base_model='decapoda-research/llama-7b-hf' \
    --base_model_choices='decapoda-research/llama-7b-hf,nomic-ai/gpt4all-j,databricks/dolly-v2-7b' \
    --share

Michaelvll commented 9 months ago

Thanks for sharing this @nshaposh! We don't use the SSH key metadata on AWS. Instead, cloud-init adds the key directly to ~/.ssh/authorized_keys on the instance created by SkyPilot. https://github.com/skypilot-org/skypilot/blob/57cfa7cc2598210c86df307a9dee09216d96f151/sky/templates/aws-ray.yml.j2#L103-L109

Do you recall making any modification to your ~/.ssh folder, e.g. something that could cause SkyPilot's private and public keys, ~/.ssh/sky-key and ~/.ssh/sky-key.pub, to no longer match? Could you run the ssh command shown in the provision.log directly in your terminal, with -vvv added, to see what the problem is?

If it is permission denied, could you try moving the two files to another place, then sky launch another cluster and see if that fixes the problem?
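
For readers who want to test the key-mismatch hypothesis before moving any files, here is a minimal sketch. It assumes `ssh-keygen` is on the PATH and that the private key has no passphrase; the helper names `key_body` and `keys_match` are made up for this illustration and are not part of SkyPilot.

```python
import subprocess
from pathlib import Path


def key_body(public_key_line: str) -> tuple:
    """Return (key type, base64 blob) from an OpenSSH public key line.

    The trailing comment field (often user@host) is ignored, because it
    does not affect authentication.
    """
    fields = public_key_line.strip().split()
    if len(fields) < 2:
        raise ValueError("not an OpenSSH public key line")
    return fields[0], fields[1]


def keys_match(private_key: Path, public_key: Path) -> bool:
    """Derive the public key from the private key and compare with the .pub file."""
    # `ssh-keygen -y -f <file>` prints the public key that belongs to a private key.
    derived = subprocess.run(
        ["ssh-keygen", "-y", "-f", str(private_key)],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return key_body(derived) == key_body(public_key.read_text())
```

Calling `keys_match(Path.home() / ".ssh" / "sky-key", Path.home() / ".ssh" / "sky-key.pub")` returns `False` if the two files have drifted apart.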

nshaposh commented 9 months ago

@Michaelvll Here is ssh output:

root@82068e909dca:/app# ssh -vvv -i ~/.ssh/sky-key ec2-user@34.200.237.170 
OpenSSH_9.2p1 Debian-2+deb12u1, OpenSSL 3.0.11 19 Sep 2023
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 34.200.237.170 is address
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/root/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/root/.ssh/known_hosts2'
debug3: ssh_connect_direct: entering
debug1: Connecting to 34.200.237.170 [34.200.237.170] port 22.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 34.200.237.170 port 22: Connection refused
ssh: connect to host 34.200.237.170 port 22: Connection refused

Michaelvll commented 9 months ago

Hmm, this is weird. It seems to be a firewall issue. Could you check whether the security group assigned to the VM has port 22 open to the public? It would also be good to make sure that your AWS account has enough permissions to create security groups with the right rules: https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/aws.html

If the problem keeps happening, please feel free to join our slack channel: http://slack.skypilot.co so we can help in an interactive way or even start a debugging session with you : )
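
A quick way to tell a refused connection from a silently dropped one is to probe the port directly, without involving SSH keys. The sketch below uses only the Python standard library; `probe_tcp` is a hypothetical helper name. The interpretation of the outcomes is general networking behaviour rather than anything SkyPilot-specific: AWS security groups drop disallowed traffic silently, so a blocked port 22 usually shows up as a timeout, while "refused" means something answered but nothing was listening on that port.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Classify a TCP connection attempt as open, refused, timeout or unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer (or a device in between) actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer at all: packets are probably being dropped on the way.
        return "timeout"
    except OSError:
        # Anything else, e.g. no route to host or a name resolution failure.
        return "unreachable"
```

For the instance in this thread, the call would be `probe_tcp("34.200.237.170")`.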

nshaposh commented 9 months ago

@Michaelvll I have noticed that the subnet where this instance was placed doesn't have a route table. Maybe this is the reason. How can I control, from the SkyPilot side, which VPC the cluster should use?

nshaposh commented 9 months ago

@Michaelvll When I specified the correct VPC in ~/.sky/config.yaml, everything worked. Thank you!
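
The exact config used here was not posted. As a sketch, assuming the `aws.vpc_name` key from SkyPilot's config reference is the setting that was used, the file content can be generated as follows. The helper names and the example VPC name `my-vpc` are placeholders, not values from this thread.

```python
import re
from pathlib import Path


def render_aws_vpc_config(vpc_name: str) -> str:
    """Return YAML text that asks SkyPilot to provision AWS clusters in one VPC."""
    # Keep to plain names so the value needs no YAML quoting.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", vpc_name):
        raise ValueError("use a plain VPC name (letters, digits, '.', '_', '-')")
    return f"aws:\n  vpc_name: {vpc_name}\n"


def write_config(vpc_name: str, path: Path) -> None:
    """Write the config, refusing to overwrite an existing file."""
    if path.exists():
        raise FileExistsError(f"{path} already exists; merge the aws section by hand")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_aws_vpc_config(vpc_name))
```

`write_config("my-vpc", Path.home() / ".sky" / "config.yaml")` produces a two-line file with an `aws:` section containing `vpc_name: my-vpc`.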

Michaelvll commented 9 months ago

Great to hear it works @nshaposh! Thanks for digging into this.

Just to confirm, did the VPC without a route table already exist in your AWS account? Had you manually configured SkyPilot to use that VPC before?

If it was not specified, we should definitely fix this by automatically using a VPC that has a route table. cc'ing @concretevitamin.

nshaposh commented 9 months ago

@Michaelvll On my first run I didn't have any VPC specified, and I didn't have a ~/.sky/config.yaml file. I guess SkyPilot chose a VPC from the list of available VPCs in my AWS account. We have several VPCs. I believe I saw somewhere in the docs that in this case SkyPilot should choose the default VPC. In my case, SkyPilot picked a VPC that is not the default, and it had a subnet without a route table attached. As a result, the created instance didn't have access to the internet.

To summarize, it looks like the problem is a bit on both sides. We have to put our AWS account in order (or, preferably, stop using this crap completely), AND SkyPilot (maybe) did something it wasn't supposed to, namely, used a non-default VPC. Maybe this is something for you guys to look at.

Again I am happy that we got it resolved. You have a very useful framework, which we will definitely keep using.

Cheers, Nick
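
The subnet condition described above can be checked offline against the output of the EC2 `describe_subnets` and `describe_route_tables` calls. This is a sketch of such a check, not SkyPilot's own subnet selection logic. It assumes dictionaries shaped like the boto3 responses (`SubnetId`, `VpcId`, `MapPublicIpOnLaunch`, `Associations`, `Routes`), and the function names are made up for this example. A subnet with no explicit association falls back to its VPC's main route table, so "no route table" in the console typically means the main table applies.

```python
from typing import Optional


def effective_route_table(subnet: dict, route_tables: list) -> Optional[dict]:
    """Pick the route table that applies to a subnet.

    An explicit subnet association wins; otherwise the main table of the
    subnet's VPC applies.
    """
    main = None
    for table in route_tables:
        if table.get("VpcId") != subnet.get("VpcId"):
            continue
        for assoc in table.get("Associations", []):
            if assoc.get("SubnetId") == subnet["SubnetId"]:
                return table
            if assoc.get("Main"):
                main = table
    return main


def has_internet_route(table: Optional[dict]) -> bool:
    """True if the table has an active default route through an internet gateway."""
    if table is None:
        return False
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("GatewayId", "").startswith("igw-")
        and route.get("State", "active") == "active"
        for route in table.get("Routes", [])
    )


def subnet_problems(subnet: dict, route_tables: list) -> list:
    """List reasons why an instance in this subnet may be unreachable over public SSH."""
    problems = []
    if not subnet.get("MapPublicIpOnLaunch", False):
        problems.append("subnet does not auto-assign public IPs")
    if not has_internet_route(effective_route_table(subnet, route_tables)):
        problems.append("no default route to an internet gateway")
    return problems
```

An empty list means the subnet passes both checks; each string in a non-empty list names one thing to fix or one reason to pick a different VPC.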

concretevitamin commented 9 months ago

Thanks for the report @nshaposh! We'll look into it on our part. Which version (sky -v) or commit (sky --commit) are you using?

concretevitamin commented 9 months ago

Tried to reproduce this with master 3ffef36a23:

Then, tried sky launch again and it correctly errored out:

...
I 02-03 09:09:21 provisioner.py:79] Launching on AWS ap-southeast-1 (ap-southeast-1b)
E 02-03 09:09:23 provisioner.py:94] Failed to configure 'sky-7ec8-zongheng' on AWS Region(name='ap-southeast-1') (ap-southeast-1b) with the following error:
E 02-03 09:09:23 provisioner.py:94] RuntimeError: SKYPILOT_ERROR_NO_NODES_LAUNCHED: No usable subnets found, try manually creating an instance in your specified region to populate the list of subnets and trying this again. Note that the subnet must map public IPs on instance launch unless you set `use_internal_ips: true` in the `provider` config.
W 02-03 09:09:24 cloud_vm_ray_backend.py:2061] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in ap-southeast-1b. Try changing resource requirements or use another zone.
...

To help us dig deeper, the version/commit info, as well as what the subnet/route table looked like, would be very useful @nshaposh!