ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0
1.22k stars · 89 forks

Error downloading and running model on clean deploy #9

Closed PicoCreator closed 1 year ago

PicoCreator commented 1 year ago

The following error is encountered when trying to run aviary run --model ./models/amazon--LightGPT.yaml after following the README setup with these steps:

# Setup AWS env vars

# Perform the aviary cluster setup
git clone https://github.com/ray-project/aviary.git
cd aviary
ray up deploy/ray/aviary-cluster.yaml
ray attach deploy/ray/aviary-cluster.yaml

# The command with error
aviary run --model ./models/amazon--LightGPT.yaml

The relevant error line is believed to be the following:

...
RuntimeError: Deployment default_amazon--LightGPT is UNHEALTHY: The Deployment failed to start 3 times in a row. This may be due to a problem with the deployment 
constructor or the initial health check failing. See controller logs for details. Retrying after 1 seconds. Error:
ray::ServeReplica:default_amazon--LightGPT.is_initialized() (pid=1259, ip=172.31.76.164, actor_id=74666bc8c5fe4e8e2f51f68801000000, 
repr=<ray.serve._private.replica.ServeReplica:default_amazon--LightGPT object at 0x7f5193312f50>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 338, in is_initialized
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 330, in is_initialized
    metadata = await self.reconfigure(deployment_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 347, in reconfigure
    raise RuntimeError(traceback.format_exc()) from None
RuntimeError: Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 344, in reconfigure
    await self.replica.reconfigure(deployment_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/replica.py", line 631, in reconfigure
    await reconfigure_method(self.deployment_config.user_config)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/server/app.py", line 97, in reconfigure
    await self.rollover(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 268, in rollover
    self.new_worker_group = await self._create_worker_group(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 340, in _create_worker_group
    await asyncio.gather(
  File "/home/ray/anaconda3/lib/python3.10/asyncio/tasks.py", line 650, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(OSError): ray::PredictionWorker.init_model() (pid=9567, ip=172.31.37.30, actor_id=872d1d4babacb73015b0edde01000000, 
repr=PredictionWorker:amazon/LightGPT)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 176, in init_model
    self.generator = init_model(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/utils.py", line 83, in inner
    ret = func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/predictor.py", line 67, in init_model
    pipeline = get_pipeline_cls_by_name(pipeline_name).from_initializer(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/pipelines/_base.py", line 79, in from_initializer
    model, tokenizer = initializer.load(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/base.py", line 57, in load
    model = self.load_model(model_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/aviary/backend/llm/initializers/hf_transformers/deepspeed.py", line 132, in load_model
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 467, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2387, in from_pretrained
    raise EnvironmentError(
OSError: Error no file named pytorch_model.bin, tf_model.h5, model.ckpt.index or flax_model.msgpack found in directory 
/home/ray/.cache/huggingface/hub/models--amazon--LightGPT/snapshots/ee9e7bc83ff435561d0bacfdf8dd2eeb6a5c6f9f.

Full error log is attached:

aviary-error.log

Yard1 commented 1 year ago

Thanks for the report! I believe I've found the issue: the logic in the initializer incorrectly uses an empty folder created by the download-from-S3 path, which should not have been entered in the first place. We'll get a fix out after the weekend.

In the meantime, you should be able to simply comment out this line - https://github.com/ray-project/aviary/blob/ac62571102ddd7d588da27c2aaff6e0454af8c61/aviary/backend/llm/initializers/hf_transformers/base.py#L87 - so that model_id is passed to from_pretrained.