ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

Issues serving other models from HF #76

Closed kenthua closed 10 months ago

kenthua commented 10 months ago

The example models meta-llama/Llama-2-7b-chat-hf and amazon/LightGPT load and serve without issue.

However, whenever I try other models with the following RayCluster config:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: aviary
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 2}"'
      dashboard-host: '0.0.0.0'
      block: 'true'
    # pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: anyscale/aviary:0.3.1
          resources:
            limits:
              cpu: 2
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
  workerGroupSpecs:
  # the number of pod replicas in this worker group
  - replicas: 1
    minReplicas: 0
    maxReplicas: 1
    # logical group name; here called gpu-group
    groupName: gpu-group
    rayStartParams:
      block: 'true'
      resources: '"{\"accelerator_type_cpu\": 8, \"accelerator_type_t4\": 2}"'
    # pod template
    template:
      spec:
        containers:
        - name: llm
          image: anyscale/aviary:0.3.1
          env:
          - name: HUGGING_FACE_HUB_TOKEN
            value: ${HF_API_TOKEN}
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "8"
              memory: "20G"
              nvidia.com/gpu: 2
            requests:
              cpu: "8"
              memory: "20G"
              nvidia.com/gpu: 2
        # Please ensure the following taint has been applied to the GPU node in the cluster.
        tolerations:
          - key: "ray.io/node-type"
            operator: "Equal"
            value: "worker"
            effect: "NoSchedule"
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-tesla-t4
aviary run --model model.yaml
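One thing worth double-checking in the cluster spec above is the doubly escaped `resources` string in `rayStartParams`: it is a quoted JSON string inside a YAML string. The sketch below is illustrative only (`ray start` does the equivalent parsing internally); it shows what the escaped value actually decodes to:

```python
import json

# The value as written in the RayCluster YAML (inside single quotes):
yaml_value = '"{\\"accelerator_type_cpu\\": 8, \\"accelerator_type_t4\\": 2}"'

# First json.loads strips the outer quotes and unescapes,
# yielding a plain JSON object string; the second parses it.
inner = json.loads(yaml_value)   # '{"accelerator_type_cpu": 8, "accelerator_type_t4": 2}'
resources = json.loads(inner)

print(resources)  # -> {'accelerator_type_cpu': 8, 'accelerator_type_t4': 2}
```

If either layer of quoting is dropped, the custom resources (including `accelerator_type_t4`) are silently not registered, and deployments that require them cannot be scheduled.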

I end up with the following error when trying to load those models:

(ServeController pid=1548)   File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
(ServeController pid=1548)     torch._C._cuda_init()
(ServeController pid=1548) RuntimeError: No CUDA GPUs are available
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> 
>>> print(torch.cuda.is_available())
True
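The `No CUDA GPUs are available` error in the ServeController alongside `torch.cuda.is_available()` returning `True` in a REPL is not necessarily a contradiction: Ray sets `CUDA_VISIBLE_DEVICES` per worker process according to the GPUs that process was allocated, so a process allocated zero GPUs sees none even on a GPU node. A simplified, stdlib-only sketch of that visibility rule (`visible_gpu_ids` is a hypothetical helper for illustration, not a Ray API):

```python
import os

def visible_gpu_ids(env=os.environ):
    """Mimic how CUDA libraries interpret CUDA_VISIBLE_DEVICES (simplified)."""
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # unset: every GPU on the machine is visible
    return [d for d in raw.split(",") if d.strip()]

# A plain REPL on the node: the variable is unset, so all GPUs are visible.
print(visible_gpu_ids({}))                               # None

# A Ray worker allocated zero GPUs gets CUDA_VISIBLE_DEVICES="",
# so torch.cuda.is_available() is False there despite the node's GPUs.
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": ""}))     # []

# A worker allocated both T4s would see something like "0,1".
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))  # ['0', '1']
```

So the error usually means the deployment was not allocated the GPU resources it asked for (for example, because the `accelerator_type_t4` custom resource was not registered on the worker), rather than a driver-level CUDA problem.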
mahmedk commented 10 months ago

It works with ray-llm (import it instead of aviary), as shown here:

https://github.com/ray-project/ray-llm/blob/c2a22afce74676301ad796719431b437d15305c9/serve_configs/meta-llama--Llama-2-7b-chat-hf.yaml#L4

I was able to deploy and test mistralai/Mistral-7B-v0.1.
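For reference, the serve config at that link follows roughly this shape (the model file path is assumed from the repo layout; the point is the `rayllm.backend` import path rather than the old `aviary` one):

```yaml
# sketch of a RayLLM serve config; exact fields may differ by version
applications:
- name: router
  route_prefix: /
  import_path: rayllm.backend:router_application   # not aviary.backend:...
  args:
    models:
      - "./models/continuous_batching/meta-llama--Llama-2-7b-chat-hf.yaml"
```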

akshay-anyscale commented 10 months ago

Hi @kenthua, did the fix described by @mahmedk work for you?

kenthua commented 10 months ago

I didn't get a chance to test on T4s as in my original post, so it may have something to do with that.

With L4s on 0.4.0 I was able to load and serve mistralai/Mistral-7B-v0.1 - this one works.

With the same setup and tiiuae/falcon-7b, I get the stack trace below:

(ServeController pid=1174)     engine_args, engine_configs = ray.get(
(ServeController pid=1174) ray.exceptions.RaySystemError: System error: No module named 'transformers_modules'
(ServeController pid=1174) traceback: Traceback (most recent call last):
(ServeController pid=1174) WARNING 2023-11-01 14:31:59,123 controller 1174 application_state.py:663 - The deployments ['VLLMDeployment:tiiuae--falcon-7b'] are UNHEALTHY.
kenthua commented 10 months ago

I haven't had a chance to test further, but it likely needs some libraries added under applications.runtime_env.pip?

kenthua commented 10 months ago

Disregard the library reference. Using the meta-llama 7b model config as the base, setting trust_remote_code to false, and reducing the tokens to 2048 for both batched and total allowed the model to spin up.

This was tested with 0.4.0.
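For anyone hitting the same `transformers_modules` error, the working settings described above might look like the following model-config fragment. The field names are assumed from the RayLLM 0.4.0 model YAMLs and may differ in your version; treat this as a sketch, not a verified config:

```yaml
# hypothetical fragment of a falcon-7b model config,
# using the meta-llama 7b config as the base (as described above)
engine_config:
  model_id: tiiuae/falcon-7b
  type: VLLMEngine
  max_total_tokens: 2048            # reduced from the base config
  engine_kwargs:
    trust_remote_code: false        # avoids the transformers_modules import
    max_num_batched_tokens: 2048    # reduced to match max_total_tokens
```

Note the trade-off: falcon-7b's custom modeling code is skipped with `trust_remote_code: false`, so this works only where transformers has native support for the architecture.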