Hello folks,
I have a 5-worker-node Kubernetes cluster with 3 GPU nodes.
Each GPU node has 2x NVIDIA T4 GPUs (16 GB VRAM).
I have installed KubeRay and used it to instantiate a RayCluster with 3 worker pods, each requesting nvidia.com/gpu: 2.
Here's my RayCluster manifest:
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayllm
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 2}"'
      dashboard-host: '0.0.0.0'
    # Pod template
    template:
      spec:
        containers:
        - name: ray-head
          image: anyscale/ray-llm:latest
          resources:
            limits:
              cpu: 2
              memory: 8Gi
            requests:
              cpu: 2
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
  workerGroupSpecs:
  # The pod replicas in this group are typed "worker"
  - replicas: 3
    minReplicas: 0
    maxReplicas: 3
    # Logical group name; here it's gpu-group, but any descriptive name works
    groupName: gpu-group
    rayStartParams:
      resources: '"{\"accelerator_type_cpu\": 28, \"accelerator_type_t4\": 2}"'
    # Pod template
    template:
      spec:
        containers:
        - name: llm
          image: anyscale/ray-llm:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:
              cpu: "30"
              memory: "100G"
              nvidia.com/gpu: 2
            requests:
              cpu: "28"
              memory: "50G"
              nvidia.com/gpu: 2
        # Please ensure the following taint has been applied to the GPU nodes in the cluster.
        tolerations:
        - key: "ray.io/node-type"
          operator: "Equal"
          value: "worker"
          effect: "NoSchedule"
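As a sanity check after the pods come up, I run a quick script from inside the head pod to confirm that Ray has registered all six GPUs (a minimal sketch using the standard Ray API):

import ray

# Inside the head pod, "auto" attaches to the running cluster.
ray.init(address="auto")

# With 3 worker pods x 2 T4s each, I expect GPU: 6.0 and
# accelerator_type_t4: 6.0 in the totals printed below.
print(ray.cluster_resources())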
Once that's done, following the RayLLM guide, I exec -it into the head pod to run the inference job.
However, it seems that the model isn't loaded across all 6 GPUs.
My first attempt didn't work: looking at the logs, the vLLM engine doesn't seem to be picking up tensor_parallel_size from the engine arguments.
Setting the following also doesn't seem to work: looking at the logs, it does use tensor_parallel_size=4 from the vLLM engine arguments, but I get an error (shown after the config).
scaling_config:
  num_workers: 4          # <---- THIS
  num_gpus_per_worker: 1  # <---- THIS
  num_cpus_per_worker: 2
  placement_strategy: "STRICT_PACK"
  resources_per_worker:
    accelerator_type_t4: 0.01
(autoscaler +7s) Error: No available node types can fulfill resource request defaultdict(<class 'float'>, {'accelerator_type_t4': 0.05, 'CPU': 9.0, 'GPU': 4.0}). Add suitable node types to this cluster to resolve this issue.
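My reading of the defaultdict in that error: four worker bundles of {GPU: 1, CPU: 2, accelerator_type_t4: 0.01} plus one CPU-only coordinator bundle, which sums to exactly {GPU: 4.0, CPU: 9.0, accelerator_type_t4: 0.05}. The same request can be reproduced with a bare placement group from the head pod (a sketch; the bundle shapes are my guess):

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Four GPU worker bundles plus one CPU-only coordinator bundle;
# STRICT_PACK requires every bundle to land on a single node.
bundles = [{"GPU": 1, "CPU": 2, "accelerator_type_t4": 0.01}] * 4
bundles.append({"CPU": 1, "accelerator_type_t4": 0.01})
pg = placement_group(bundles, strategy="STRICT_PACK")

# Raises GetTimeoutError if the group can never be scheduled.
ray.get(pg.ready(), timeout=30)

I expect this to time out on my cluster, since STRICT_PACK needs all four GPU bundles on one node and each of my nodes only has 2 GPUs.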
Here is the output of ray status:
(base) ray@rayllm-head-frv6r:~$ ray status
======== Autoscaler status: 2024-01-07 21:14:38.338939 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_27d999790f86406e114afdf2df5f560bcd3cd853f215eb4ceeca53b3
1 node_3c625f8032a84660687fb1c2281068240de7b208f0ca5aa0d2e32ccf
1 node_d85f4cf921417420bb5818b9349e0b457010195db54fc0f2e6d328bb
1 node_ac36fc573828eb9c1d1133f5c74e671f04db92caf265f0d2816f15b2
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/92.0 CPU
0.0/6.0 GPU
0.0/86.0 accelerator_type_cpu
0.0/6.0 accelerator_type_t4
0B/287.40GiB memory
0B/85.89GiB object_store_memory
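Since ray status only shows cluster-wide totals, I also dump per-node resources to confirm how the GPUs are spread (a short sketch using ray.nodes()):

import ray

ray.init(address="auto")

# Each GPU worker pod should report GPU: 2.0 and
# accelerator_type_t4: 2.0; no single node exposes 4 GPUs.
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"])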