mukundt opened 1 year ago
Thanks for reporting the issue @mukundt! Could you share the skypilot version you are using as well as the yaml file used for launching the VM?
@Michaelvll using skypilot==0.4.0.
Launch command: sky launch -c a10g finetune_skypilot_run.yaml -y --gpus A10G:1 --cloud aws
YAML:

resources:
  disk_size: 800
  ports:
    - 8000

file_mounts:
  /sky_data: .

setup: |
  conda activate train
  if [ $? -ne 0 ]; then
    conda create -n train python=3.9 -y
    conda activate train
  fi
  rm -rf axolotl
  git clone https://github.com/OpenAccess-AI-Collective/axolotl
  cd axolotl
  pip install wandb
  pip install vllm
  pip install fschat
  pip install packaging
  pip install torch==2.0.1+cu118 torchvision==0.15.2+cu118 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
  pip install -e '.[deepspeed]'
  pip install flash-attn==2.3.0 --no-build-isolation

run: |
  conda activate train
  cd axolotl
  echo 'Starting training...'
  accelerate launch -m axolotl.cli.train /sky_data/refinetune_7b_axolotl_config.yml
Hey @mukundt, thanks for sharing the yaml file. Would you like to try the latest SkyPilot master branch, or pip uninstall skypilot; pip install -U skypilot-nightly, to see if that solves the issue?
Hi @Michaelvll, I've encountered the same issue with SkyPilot on AWS and attempted the solution you mentioned by installing skypilot-nightly, but the problem persists. Any additional suggestions or insights would be greatly appreciated.
Hi @PoornaSaiNagendra, thanks for the question! Could you share the provision.log as mentioned in the output of your sky launch?
Also, it would be nice to check the AWS console to find the public IP of the cluster, and try ssh <public-ip> -i ~/.ssh/sky-key to see whether you can connect to the cluster and figure out where the error comes from. :)
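If you prefer the CLI to the console, a command like the following lists the Name tag and public IP of running instances (the region below is a placeholder; adjust it to where the cluster was launched):

aws ec2 describe-instances --region us-east-1 \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].[Tags[?Key==`Name`]|[0].Value, PublicIpAddress]' \
  --output table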
Anything on this? Just experienced the same issue...
Thanks for asking @nshaposh! Could you share the yaml and the provision.log that cause the issue, so we can take a deeper look?
There seems to be a problem propagating the ssh keys to the instance or something... When I check the AWS console, the instance is there, although it is not associated with any ssh keys.
Here is the yaml (can't attach it). It is the yaml for llm-llama-tuner; I just modified the instance type and cloud:
# llm-tuner.yaml

resources:
  accelerators: A10G:1  # 1x NVIDIA A10 GPU, about US$ 0.6 / hr on Lambda Cloud. Run `sky show-gpus` for supported GPU types, and `sky show-gpus [GPU_NAME]` for the detailed information of a GPU type.
  cloud: aws  # Optional; if left out, SkyPilot will automatically pick the cheapest cloud.

file_mounts:
  # Mount a persisted cloud storage that will be used as the data directory
  # (to store training datasets and trained models).
  # See https://skypilot.readthedocs.io/en/latest/reference/storage.html for details.
  /data:
    name: epigen-llm-tuner-data  # Make sure this name is unique or you own this bucket. If it does not exist, SkyPilot will try to create a bucket with this name.
    store: s3  # Could be either of [s3, gcs]
    mode: MOUNT

# Clone the LLaMA-LoRA Tuner repo and install its dependencies.
setup: |
  conda create -q python=3.8 -n llm-tuner -y
  conda activate llm-tuner

  # Clone the LLaMA-LoRA Tuner repo and install its dependencies
  [ ! -d llm_tuner ] && git clone https://github.com/zetavg/LLaMA-LoRA-Tuner.git llm_tuner
  echo 'Installing dependencies...'
  pip install -r llm_tuner/requirements.lock.txt

  # Optional: install wandb to enable logging to Weights & Biases
  pip install wandb

  # Optional: patch bitsandbytes to work around the error "libbitsandbytes_cpu.so: undefined symbol: cget_col_row_stats"
  BITSANDBYTES_LOCATION="$(pip show bitsandbytes | grep 'Location' | awk '{print $2}')/bitsandbytes"
  [ -f "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so" ] && [ ! -f "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so.bak" ] && [ -f "$BITSANDBYTES_LOCATION/libbitsandbytes_cuda121.so" ] && echo 'Patching bitsandbytes for GPU support...' && mv "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so" "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so.bak" && cp "$BITSANDBYTES_LOCATION/libbitsandbytes_cuda121.so" "$BITSANDBYTES_LOCATION/libbitsandbytes_cpu.so"
  conda install -q cudatoolkit -y
  echo 'Dependencies installed.'

  # Optional: install and set up Cloudflare Tunnel to expose the app to the internet with a custom domain name
  [ -f /data/secrets/cloudflared_tunnel_token.txt ] && echo "Installing Cloudflare" && curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb && sudo dpkg -i cloudflared.deb && sudo cloudflared service uninstall || : && sudo cloudflared service install "$(cat /data/secrets/cloudflared_tunnel_token.txt | tr -d '\n')"

  # Optional: pre-download models
  echo "Pre-downloading base models so that you won't have to wait for long once the app is ready..."
  python llm_tuner/download_base_model.py --base_model_names='decapoda-research/llama-7b-hf,nomic-ai/gpt4all-j'

# Start the app. `hf_access_token`, `wandb_api_key` and `wandb_project` are optional.
run: |
  conda activate llm-tuner
  python llm_tuner/app.py \
    --data_dir='/data' \
    --hf_access_token="$([ -f /data/secrets/hf_access_token.txt ] && cat /data/secrets/hf_access_token.txt | tr -d '\n')" \
    --wandb_api_key="$([ -f /data/secrets/wandb_api_key.txt ] && cat /data/secrets/wandb_api_key.txt | tr -d '\n')" \
    --wandb_project='llm-tuner' \
    --timezone='Atlantic/Reykjavik' \
    --base_model='decapoda-research/llama-7b-hf' \
    --base_model_choices='decapoda-research/llama-7b-hf,nomic-ai/gpt4all-j,databricks/dolly-v2-7b' \
    --share
Thanks for sharing this @nshaposh! We don't use the ssh key metadata on AWS; instead, SkyPilot directly adds the key to ~/.ssh/authorized_keys on the instance it creates, via cloud-init: https://github.com/skypilot-org/skypilot/blob/57cfa7cc2598210c86df307a9dee09216d96f151/sky/templates/aws-ray.yml.j2#L103-L109
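For illustration, a minimal cloud-init sketch of that idea (this is not the exact content of the template linked above; the user name and key are placeholders):

#cloud-config
# Append a public key to the default user's authorized_keys at first boot.
write_files:
  - path: /tmp/sky-key.pub
    content: |
      ssh-rsa AAAA... placeholder-public-key
runcmd:
  - mkdir -p /home/ec2-user/.ssh
  - cat /tmp/sky-key.pub >> /home/ec2-user/.ssh/authorized_keys
  - chown -R ec2-user:ec2-user /home/ec2-user/.ssh
  - chmod 600 /home/ec2-user/.ssh/authorized_keys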
Do you recall any modification made to your ~/.ssh folder, e.g. something that could cause SkyPilot's private and public keys ~/.ssh/sky-key and ~/.ssh/sky-key.pub to no longer match? Could you also try the ssh command shown in the provision.log directly in your terminal, with -vvv added, to see what the problem is?
If it is permission denied, could you try moving the two files to another place, then sky launch another cluster and see if that fixes the problem?
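As a sketch of that last suggestion (the cluster name is a placeholder, and the yaml is the one shared above):

# Move the existing SkyPilot key pair aside; a fresh pair should be generated on the next launch.
mv ~/.ssh/sky-key ~/.ssh/sky-key.bak
mv ~/.ssh/sky-key.pub ~/.ssh/sky-key.pub.bak
# Launch a new cluster and watch whether ssh provisioning succeeds this time.
sky launch -c key-test llm-tuner.yaml -y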
@Michaelvll Here is ssh output:
root@82068e909dca:/app# ssh -vvv -i ~/.ssh/sky-key ec2-user@34.200.237.170
OpenSSH_9.2p1 Debian-2+deb12u1, OpenSSL 3.0.11 19 Sep 2023
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug2: resolve_canonicalize: hostname 34.200.237.170 is address
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/root/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/root/.ssh/known_hosts2'
debug3: ssh_connect_direct: entering
debug1: Connecting to 34.200.237.170 [34.200.237.170] port 22.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 34.200.237.170 port 22: Connection refused
ssh: connect to host 34.200.237.170 port 22: Connection refused
Hmm, this is weird. It seems to be some firewall issue. Could you check whether the security group assigned to the VM has port 22 open to the public? It would also be good to make sure that your AWS account has enough permissions to create security groups with the right rules: https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/aws.html
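One way to check this from the CLI (the instance and security group IDs below are placeholders):

# Find the security group(s) attached to the instance.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].SecurityGroups' --output table
# Inspect the ingress rules; look for TCP port 22 open to 0.0.0.0/0 (or your IP).
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions' --output json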
If the problem keeps happening, please feel free to join our slack channel: http://slack.skypilot.co so we can help in an interactive way or even start a debugging session with you : )
@Michaelvll I have noticed that the subnet where this instance was placed doesn't have a route table. Maybe this is the reason. How can I control, from the SkyPilot side, which VPC the cluster should use?
@Michaelvll When I specified the correct VPC in ~/.sky/config.yaml, everything worked. Thank you!
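For reference, the relevant entry in ~/.sky/config.yaml looks roughly like this (the VPC name is a placeholder):

# ~/.sky/config.yaml
aws:
  vpc_name: my-correct-vpc  # name of the VPC whose subnets have a working route table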
Great to hear it works @nshaposh! Thanks for digging into this.
Just to confirm, did the VPC without a route table already exist in your AWS account? Did you manually configure SkyPilot to use that VPC before?
If it was not specified, we should definitely fix this by automatically using a VPC that has a route table. cc'ing @concretevitamin.
@Michaelvll On my first run I didn't have any VPC specified and I didn't have a ~/.sky/config.yaml file. I guess SkyPilot chose a VPC from the list of available VPCs in my AWS account. We have several VPCs. I believe I saw somewhere in the docs that in this case SkyPilot should choose the default VPC. In my case, SkyPilot picked a VPC that is not the default, and it had a subnet without a route table attached. As a result, the created instance didn't have access to the internet.
To summarize, it looks like the problem is a bit on both sides. We have to put our AWS in order (or preferably stop using this crap completely), AND SkyPilot (maybe) did something it wasn't supposed to, namely used a non-default VPC. Maybe this is something for you guys to look at.
Again, I am happy that we got it resolved. You have a very useful framework, which we will definitely keep using.
Cheers, Nick
Thanks for the report @nshaposh! We'll look into it on our part. Which version (sky -v) or commit (sky --commit) are you using?
Tried to reproduce this with master 3ffef36a23:
- created a tmp VPC with 1 public and 1 private subnet
- set aws.vpc_name in config.yaml to use this VPC
- verified sky launch can launch a VM in this VPC
Then, tried sky launch again and it correctly errored out:
...
I 02-03 09:09:21 provisioner.py:79] Launching on AWS ap-southeast-1 (ap-southeast-1b)
E 02-03 09:09:23 provisioner.py:94] Failed to configure 'sky-7ec8-zongheng' on AWS Region(name='ap-southeast-1') (ap-southeast-1b) with the following error:
E 02-03 09:09:23 provisioner.py:94] RuntimeError: SKYPILOT_ERROR_NO_NODES_LAUNCHED: No usable subnets found, try manually creating an instance in your specified region to populate the list of subnets and trying this again. Note that the subnet must map public IPs on instance launch unless you set `use_internal_ips: true` in the `provider` config.
W 02-03 09:09:24 cloud_vm_ray_backend.py:2061] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in ap-southeast-1b. Try changing resource requirements or use another zone.
...
To help us dig more, the version/commit info, as well as what the subnet/route table looked like, would be very useful @nshaposh!
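If helpful, something like the following dumps the route tables for a VPC (the VPC ID is a placeholder):

aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-0123456789abcdef0" \
  --query 'RouteTables[].{Id:RouteTableId,Assocs:Associations,Routes:Routes}' \
  --output json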
Hey! I've been able to successfully deploy on GCP, but when I try AWS, SkyPilot times out after 600s:
Is something wrong with my AWS config?
sky check returns AWS: enabled.