ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

S3 bucket model download fails silently if the cluster doesn't have the right permissions #55

Open architkulkarni opened 11 months ago

architkulkarni commented 11 months ago

Reproduction: Add this to the Llama-2-7b model YAML, or probably any model YAML:

  s3_mirror_config:
    bucket_uri: s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/archit__kulkarni/ft_llms_with_deepspeed/meta-llama/Llama-2-7b-hf/demo-gsm-7b/

Any bucket for which your cluster doesn't have permissions should also reproduce the issue.

When you run aviary run with that model YAML, the following messages are printed. They are misleading because they report a successful download:

(ServeReplica:meta-llama--Llama-2-7b-chat-hf:meta-llama--Llama-2-7b-chat-hf pid=3309, ip=172.31.60.146) [INFO 2023-09-14 15:05:23,682] utils.py: 63  Downloading meta-llama/Llama-2-7b-chat-hf from s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/archit__kulkarni/ft_llms_with_deepspeed/meta-llama/Llama-2-7b-hf/demo-gsm-7b/ to /home/ray/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/0000000000000000000000000000000000000000
[...]
(ServeReplica:meta-llama--Llama-2-7b-chat-hf:meta-llama--Llama-2-7b-chat-hf pid=3309, ip=172.31.60.146) [INFO 2023-09-14 15:05:24,200] utils.py: 184  Done downloading the model from bucket!

In fact, the folder is empty:

(base) ray@aviary-raycluster-z48t9-worker-gpu-group-m78lw:~$ ls /home/ray/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/0000000000000000000000000000000000000000/
(base) ray@aviary-raycluster-z48t9-worker-gpu-group-m78lw:~$

From what I understand from @kouroshHakha and @Yard1 (feel free to correct me): The folder is not supposed to be empty. Because it's empty, when you serve the model you're actually serving some other model (the "base model" or "chat model", I don't remember exactly), instead of the model from the specified S3 bucket.
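One way to stop this from failing silently would be to check the target folder right after the download step and raise if nothing landed there. The following is only a minimal sketch of that idea, not RayLLM's actual code: the names ensure_download_nonempty and EmptyModelDirectoryError are hypothetical, and it assumes the download step is handed the same local snapshot path that appears in the log above.

```python
from pathlib import Path


class EmptyModelDirectoryError(RuntimeError):
    """Hypothetical error: the download left no files in the target folder."""


def ensure_download_nonempty(local_dir: str) -> int:
    """Return the number of files under local_dir; raise if there are none.

    Hypothetical helper, meant to run right after the bucket download
    and before the "Done downloading" message is logged.
    """
    root = Path(local_dir)
    if not root.is_dir():
        raise EmptyModelDirectoryError(
            f"{local_dir} does not exist or is not a directory"
        )
    # Count regular files anywhere below the snapshot folder.
    file_count = sum(1 for p in root.rglob("*") if p.is_file())
    if file_count == 0:
        raise EmptyModelDirectoryError(
            f"No files were downloaded into {local_dir}; "
            "check that the cluster has permission to read the bucket"
        )
    return file_count
```

With a check like this, the replica would fail at startup with a clear message instead of quietly falling back to a different model.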

architkulkarni commented 11 months ago

To verify that this was a permissions issue, we ran the following from the worker node:

aws s3 sync s3://anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/archit__kulkarni/ft_llms_with_deepspeed/meta-llama/Llama-2-7b-hf/demo-gsm-7b/ /home/ray/.cache/huggingface/hub/models--meta-llama--Llama-2-7b-chat-hf/snapshots/0000000000000000000000000000000000000000/
fatal error: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

After giving the worker node the correct permissions, we manually re-ran the s3 sync command and then re-ran aviary run. My understanding is that aviary run reuses the cached data in the 000... folder. With that in place, the model started producing the correct fine-tuned output.
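If the download is done by shelling out to a sync command like the one above, the error could be surfaced by checking the exit status and including stderr in the exception. This is a sketch under assumptions: I don't know whether RayLLM actually invokes the CLI this way, run_checked is a hypothetical helper, and it assumes the command exits with a non-zero status on a fatal error such as the AccessDenied one shown above.

```python
import subprocess
from typing import Sequence


def run_checked(command: Sequence[str]) -> str:
    """Run a command and return its stdout; raise if it exits non-zero.

    Hypothetical helper: the captured stderr is included in the
    exception so a permissions error is not swallowed.
    """
    result = subprocess.run(list(command), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command {list(command)!r} failed with exit code "
            f"{result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout
```

For example, run_checked(["aws", "s3", "sync", bucket_uri, local_dir]), where bucket_uri and local_dir are placeholders for the values from the model YAML and the snapshot path, would raise with the AccessDenied text instead of continuing to the "Done downloading" log line.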