ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0
1.22k stars · 89 forks

Error downloading and running model on clean deploy #9

Closed PicoCreator closed 1 year ago

PicoCreator commented 1 year ago

The following error is encountered when trying to run aviary run --model ./models/amazon--LightGPT.yaml after following the README setup with these steps:

# Setup AWS env vars

# Perform the aviary cluster setup
git clone https://github.com/ray-project/aviary.git
cd aviary
ray up deploy/ray/aviary-cluster.yaml
ray attach deploy/ray/aviary-cluster.yaml

# The command with error
aviary run --model ./models/amazon--LightGPT.yaml

The relevant error line is believed to be the following:

...
RuntimeError: Deployment default_amazon--LightGPT is UNHEALTHY: The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment 
constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
ray::ServeReplica:default_amazon--LightGPT.is_initialized() (pid=1259, ip=172.31.76.164, actor_id=74666bc8c5fe4e8e2f51f68801000000, 
repr=<ray.serve._private.replica.ServeReplica:default_amazon--LightGPT object at 0x7f5193312f50>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 338, in is_initialized
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 330, in is_initialized
    metadata = await self.reconfigure(deployment_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 347, in reconfigure
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 344, in reconfigure
    await self.replica.reconfigure(deployment_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 631, in reconfigure
    await reconfigure_method(self.deployment_config.user_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/server/app.py", line 97, in reconfigure
    await self.rollover(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 268, in rollover
    self.new_worker_group = await self._create_worker_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 340, in _create_worker_group
    await asyncio.gather(
  File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(OSError): ray::PredictionWorker.init_model() (pid=9567, ip=172.31.37.30, actor_id=872d1d4babacb73015b0edde01000000, 
repr=PredictionWorker:amazon/LightGPT)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 176, in init_model
    self.generator = init_model(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/utils.py", line 83, in inner
    ret = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 67, in init_model
    pipeline = get_pipeline_cls_by_name(pipeline_name).from_initializer(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/pipelines/_base.py", line 79, in from_initializer
    model, tokenizer = initializer.load(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/base.py", line 57, in load
    model = self.load_model(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/deepspeed.py", line 132, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2387, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory 
/home/ray/.cache/huggingface/hub/models--amazon--LightGPT/snapshots/ee9e7bc83ff435561d0bacfdf8dd2eeb6a5c6f9f.

Full error log is attached:

aviary-error.log

Yard1 commented 1 year ago

Thanks for the report! I believe I've found the issue: the logic in the initializer incorrectly uses an empty folder created by the download-from-S3 path, which should not have been entered in the first place. We'll get a fix out after the weekend.

In the meantime, you should be able to simply comment out this line - https://github.com/ray-project/aviary/blob/ac62571102ddd7d588da27c2aaff6e0454af8c61/aviary/backend/llm/initializers/hf_transformers/base.py#L87 - so that model_id is passed to from_pretrained.