I tried to deploy an LLM on 4 nodes, with each node having 4 GPUs. But you have to make sure your cluster supports IB.
What do you mean by IB?
So were you able to deploy this model using tensor parallelism, with 16 GPUs split across nodes?
My objective is to be able to deploy larger models with my current infrastructure.
Yes, the whole Llama2-70B model was split across 4 nodes and 16 GPUs. You just need to initialize each node with Ray.
Wejoncy, by chance do you have any guidance on how to start each node with Ray? A tutorial to follow on Kubernetes/OpenShift?
I need to talk with my infrastructure team about this possibility. Since these nodes are not exclusive to GPU workloads, using InfiniBand might be a problem; they're NVIDIA bare-metal servers.
Hi @Ryojikn
Can I ask why you want to run the model across multiple nodes? My experience tells me it would be super slow without IB (InfiniBand) support, as vLLM doesn't support pipeline parallelism (PP) yet.
The steps are:
1. Set the environment on all nodes (install vLLM and Ray on every node).
2. On the head node: ray start --head --port 6379
3. On each of the other nodes: ray start --address="head_node_IP:6379"
4. On the head node, launch vLLM with tensor_parallel_size=$GPU_COUNT --engine-use-ray (a command-level sketch follows below).
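For concreteness, here is a minimal command-level sketch of those steps, assuming a 4-node cluster with 4 GPUs per node and the plain vllm.entrypoints.api_server entrypoint; the head-node address, model name, and GPU count are placeholder examples, not part of the original instructions:

# 1. On every node: install vLLM (this pulls in Ray as a dependency).
pip install vllm

# 2. On the head node only: start the Ray head process.
ray start --head --port 6379

# 3. On each worker node: join the Ray cluster (replace HEAD_NODE_IP with the head node's address).
ray start --address="HEAD_NODE_IP:6379"

# 4. On the head node only: launch the vLLM server. tensor-parallel-size is the
#    total number of GPUs across the cluster (here 4 nodes x 4 GPUs = 16).
python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 16 \
    --engine-use-ray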
My expectation is actually to deploy a larger model on the infrastructure we already bought, while maintaining vLLM's latency and throughput.
However, from what you're mentioning, it won't be possible 😞 I honestly expected it was just a lack of knowledge on my part and nothing else.
Anyway, you can give it a try and test the latency/throughput first. There may be other solutions to get you the performance you need.
Coming here late, I am finding it extremely difficult to make any progress with k8s + Ray + vLLM because of the lack of documentation. In the above steps you say "set the environment on all nodes"; can you elaborate, @wejoncy?
Here is what I've done so far:
Followed the RayCluster quickstart guide to deploy Ray in my cluster (https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html). Is Ray now running, or do I have to run ray start commands? I'd expect not to have to...
Deployed a vLLM model as shown below. It's unclear which model args (i.e. --engine-use-ray) are required, which env vars, and what to do about k8s settings like resources.limits.nvidia.com/gpu: 1 and env vars like CUDA_VISIBLE_DEVICES.
Our whole goal here is to run larger models than a single instance can support. The largest instance type at our disposal is g4dn.16xlarge. Do you think there's any performance gain to be had running a large model over many small nodes as opposed to one large node? Or is it a fool's errand, and one node is typically best?
The pod is running but fails with an error that too much GPU memory was requested, which tells me it's not using the distributed GPU pool.
Thanks
In the configuration below I am trying to run a large model on 4 single-GPU nodes. Each node has 16 GB, so together they have 64 GB, which is enough for the model. But any one pod only has 16 GB, so the model will choke.
# Tinkering with a configuration that runs in a Ray cluster on a distributed node pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 4  # <-- GPUs are expensive, so set to 0 when not in use
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # nodeSelector and tolerations combined give the following behavior:
      # 1. Only schedule on nodes with gpu-allowed
      # 2. Prevent other workloads from scheduling on the GPU node pool
      nodeSelector:
        workload-type: gpu-small
      tolerations:
        - key: gpu-allowed
          operator: Exists
        - key: ray.io/node-type
          operator: Equal
          value: worker
          effect: NoSchedule
      containers:
        - name: vllm-container
          image: my-vllm-image:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
              name: api
          resources:
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"  # <-- should this be the number of GPUs in the instance or in the pool?
          args: [
            "--model", "TheBloke/Marcoroni-70B-v1-AWQ",
            "--quantization", "awq",
            "--dtype", "float16",
            "--tensor-parallel-size", "4",
            "--engine-use-ray" ]  # <-- is this necessary/correct?
          volumeMounts:
            - mountPath: /dev/shm
              name: dshm
      # In Linux, /dev/shm is an in-memory tmpfs. Ray depends on it, and I have no idea what the best
      # size is, but since g4dn.12xlarge instances have about 192GB of memory, figured this wouldn't hurt.
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "3Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  annotations:
    # Important note: need to add health-check path annotations at the Service level if we are planning to use multiple targets in a load balancer
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  selector:
    app: vllm
  ports:
    - name: http
      port: 80
      targetPort: 8000
  type: ClusterIP
@Ryojikn @wejoncy @hughesadam87 I'm the maintainer of LiteLLM. We allow you to do this today using the LiteLLM Router: load balance between multiple deployments (vLLM, Azure, OpenAI). I'd love your feedback if this does not solve your problem.
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
Hi @hughesadam87, I am not familiar with K8s, but you have to ensure Ray and vLLM are correctly installed on each node. For example, say there are four nodes in the K8s cluster with IPs [10.0.0.2, 10.0.0.3, 10.0.0.4, 10.0.0.5].
Run
    pip install vllm
on each node.

Then run
    ray start --head --port 6379
on 10.0.0.2, and run
    ray start --address="10.0.0.2:6379"
on the other nodes.

Finally, start the server with
    args: [
        "--model", "TheBloke/Marcoroni-70B-v1-AWQ",
        "--quantization", "awq",
        "--dtype", "float16",
        "--tensor-parallel-size", "4",
        "--engine-use-ray" ]  # <-- is this necessary/correct? **## YES**
on rank-0 (10.0.0.2); the other nodes do nothing.
The server will then be available at 10.0.0.2:8000/generate.
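To sanity-check the deployment, you can query that endpoint directly from any machine that can reach the head node. A minimal sketch, assuming the plain /generate API server is listening on port 8000; the prompt and sampling parameters are arbitrary examples, not from the thread:

# Send a test generation request to the head node (10.0.0.2).
curl http://10.0.0.2:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 64, "temperature": 0}'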
Here is how KubeAGI runs distributed inference using multiple GPUs; see if it helps: http://kubeagi.k8s.com.cn/docs/category/distributed-inference
I've managed to deploy vLLM using the vLLM OpenAI-compatible entrypoint, successfully using all the GPUs available in my Kubernetes node.
However, now I have a question: can I leverage Ray between multiple nodes? With different GPU types?
I was wondering about this in order to run bigger LLM models.
My current cluster has something like 36 GPUs, but they're split into sets of 4 GPUs per node.