ray-project / ray-llm

RayLLM - LLMs on Ray
https://aviary.anyscale.com
Apache License 2.0

Loading 13B model across 6 GPUs (Distributed Inference) #113

Closed: WinsonSou closed this issue 6 months ago

WinsonSou commented 6 months ago

Hello folks,

I have a 5-worker-node Kubernetes cluster, 3 of which are GPU nodes; each GPU node has 2x NVIDIA T4 GPUs (16 GB GPU memory). I have installed KubeRay and instantiated a RayCluster with 3 worker pods, each requesting nvidia.com/gpu: 2. Here's my RayCluster deployment manifest:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayllm
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 2}"'
      dashboard-host: '0.0.0.0'
    #pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: anyscale/ray-llm:latest
          resources:
            limits:
              cpu: 2
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  # the number of pod replicas in this worker group
  - replicas: 3
    minReplicas: 0
    maxReplicas: 3
    # logical group name; in this case gpu-group
    groupName: gpu-group
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 28, \"accelerator_type_t4\": 2}"'
    # pod template
    template:
      spec:
        containers:
        - name: llm
          image: anyscale/ray-llm:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "30"
              memory: "100G"
              nvidia.com/gpu: 2
            requests:
              cpu: "28"
              memory: "50G"
              nvidia.com/gpu: 2
        # Please ensure the following taint has been applied to the GPU node in the cluster.
        tolerations:
          - key: "ray.io/node-type"
            operator: "Equal"
            value: "worker"
            effect: "NoSchedule"

Once that's done, following the RayLLM guide, I exec -it into the head pod to run the inference job.

Here's the model config that I'm using:

deployment_config:
  autoscaling_config:
    min_replicas: 4
    initial_replicas: 4
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 16
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 48
  ray_actor_options:
    resources:
      accelerator_type_t4: 0.01
engine_config:
  model_id: TheBloke/Llama-2-13B-chat-AWQ
  hf_model_id: TheBloke/Llama-2-13B-chat-AWQ
  type: VLLMEngine
  engine_kwargs:
    quantization: awq
    max_num_batched_tokens: 12288
    max_num_seqs: 48
  max_total_tokens: 4096
  generation:
    prompt_format:
      system: "<<SYS>>\n{instruction}\n<</SYS>>\n\n"
      assistant: " {instruction} </s><s>"
      trailing_assistant: ""
      user: "[INST] {system}{instruction} [/INST]"
      system_in_user: true
      default_system_message: ""
    stopping_sequences: ["<unk>"]
scaling_config:
  num_workers: 1
  num_gpus_per_worker: 1
  num_cpus_per_worker: 2
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type_t4: 0.01
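
For completeness, a model config like the one above isn't served directly; it is referenced from a Ray Serve application config and launched from the head pod with serve run. Below is a minimal sketch, assuming the model YAML above is saved as ./models/TheBloke--Llama-2-13B-chat-AWQ.yaml (an illustrative file name) and using the import_path from the serve_configs examples shipped with ray-llm; adjust to your version.

# serve_config.yaml (illustrative name)
applications:
- name: ray-llm
  route_prefix: /
  import_path: rayllm.backend:router_application
  args:
    models:
      - "./models/TheBloke--Llama-2-13B-chat-AWQ.yaml"  # the model config shown above

It is then started inside the head pod with serve run serve_config.yaml.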

However, it seems the model isn't loaded across all 6 GPUs. Setting the following doesn't seem to work; looking at the logs, it doesn't seem to be picking up tensor_parallel_size in the vLLM engine arguments.

deployment_config:
  autoscaling_config:
    min_replicas: 4 #<---- THIS
    initial_replicas: 4 #<---- THIS
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 16
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 0.5
    downscale_delay_s: 300.0
    upscale_delay_s: 15.0
  max_concurrent_queries: 48
  ray_actor_options:
    resources:
      accelerator_type_t4: 0.01
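
(Side note: these two settings control different things. autoscaling_config.min_replicas / initial_replicas set how many Ray Serve replicas of the deployment exist, each loading its own copy of the model, while scaling_config.num_workers sets how many workers, and therefore GPUs, each replica uses for tensor parallelism. An annotated sketch, with values copied from the configs in this issue:)

# Illustration only -- the two scaling knobs in a RayLLM model config
deployment_config:
  autoscaling_config:
    min_replicas: 4        # number of Ray Serve replicas; adds data-parallel copies, does not shard the model
scaling_config:
  num_workers: 4           # workers per replica; observed in this thread to map to vLLM tensor_parallel_size
  num_gpus_per_worker: 1   # GPUs per worker, so each replica needs num_workers x 1 GPUs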

Setting the following also doesn't seem to work. Looking at the logs, it does pick up tensor_parallel_size=4 in the vLLM engine arguments, but I get an error (shown after the config):

scaling_config:
  num_workers: 4 #<---- THIS
  num_gpus_per_worker: 1 #<---- THIS
  num_cpus_per_worker: 2
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type_t4: 0.01

(autoscaler +7s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'accelerator_type_t4': 0.05, 'CPU': 9.0, 'GPU': 4.0}). Add suitable node types to this cluster to resolve this issue.

Here is the output of ray status:

(base) ray@rayllm-head-frv6r:~$ ray status
======== Autoscaler status: 2024-01-07 21:14:38.338939 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_27d999790f86406e114afdf2df5f560bcd3cd853f215eb4ceeca53b3
 1 node_3c625f8032a84660687fb1c2281068240de7b208f0ca5aa0d2e32ccf
 1 node_d85f4cf921417420bb5818b9349e0b457010195db54fc0f2e6d328bb
 1 node_ac36fc573828eb9c1d1133f5c74e671f04db92caf265f0d2816f15b2
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/92.0 CPU
 0.0/6.0 GPU
 0.0/86.0 accelerator_type_cpu
 0.0/6.0 accelerator_type_t4
 0B/287.40GiB memory
 0B/85.89GiB object_store_memory
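
For context on the autoscaler error above, here is a hand-computed breakdown (not tool output). STRICT_PACK requires every bundle of the placement group to land on a single node, and with num_workers: 4 that single-node request exceeds what any 2-GPU node can offer; the extra 1 CPU and 0.01 accelerator_type_t4 are assumed to come from the Serve replica actor itself.

# Illustration: the single-node demand implied by STRICT_PACK vs. one GPU node
requested_on_one_node:
  CPU: 9.0                   # 4 workers x num_cpus_per_worker (2) + 1 CPU for the replica actor (assumed default)
  GPU: 4.0                   # 4 workers x num_gpus_per_worker (1)
  accelerator_type_t4: 0.05  # 4 x 0.01 (resources_per_worker) + 0.01 (ray_actor_options)
available_per_gpu_node:
  CPU: 28
  GPU: 2                     # each worker pod exposes 2x T4
  accelerator_type_t4: 2
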
WinsonSou commented 6 months ago

Solved in Slack: use PACK instead of STRICT_PACK.
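
For reference, a minimal sketch of the adjusted scaling_config, keeping num_workers: 4 from the attempt above; the thread does not spell out the final values, so treat this as illustrative rather than the confirmed working config:

scaling_config:
  num_workers: 4              # matches tensor_parallel_size=4 seen in the logs
  num_gpus_per_worker: 1
  num_cpus_per_worker: 2
  placement_strategy: "PACK"  # lets the 1-GPU bundles spread across the 2-GPU nodes
  resources_per_worker:
    accelerator_type_t4: 0.01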

SamComber commented 4 months ago

@WinsonSou - what were your final config settings? I'm having the same issue here. Did you need to adjust num_workers?