vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

How to deploy a vLLM model across multiple nodes in Kubernetes? #1363

Closed Ryojikn closed 7 months ago

Ryojikn commented 1 year ago

I've managed to deploy vLLM using the vLLM OpenAI-compatible entrypoint successfully across all the GPUs available in my Kubernetes node.

However, I have a question: can I leverage Ray across multiple nodes, and with different GPU types?

I'm asking because I'd like to run bigger LLM models.

My current cluster has something like 36 GPUs, but they're split into sets of 4 GPUs per node.

wejoncy commented 1 year ago

I've deployed an LLM on 4 nodes, each with 4 GPUs. But you have to make sure your cluster supports IB.

Ryojikn commented 1 year ago

What do you mean by IB?

So were you able to deploy the model using tensor parallelism, with 16 GPUs split across nodes?

My objective is to be able to deploy larger models with my current infrastructure.

wejoncy commented 1 year ago

Yes, the whole Llama2-70B model was split across 4 nodes and 16 GPUs. You just need to initialize each node with Ray.

https://en.wikipedia.org/wiki/InfiniBand

Ryojikn commented 1 year ago

@wejoncy, by chance do you have any guidance on how to start each node with Ray? A tutorial to follow on Kubernetes/OpenShift?

I need to talk with my infrastructure team about this possibility. Since these nodes are not exclusive to GPU workloads, using InfiniBand might be a problem; they're NVIDIA bare-metal servers.

wejoncy commented 1 year ago

Hi @Ryojikn

Can I ask why you want to run the model across multiple nodes? My experience tells me it would be super slow without IB support, since vLLM doesn't support pipeline parallelism (PP) yet.

The steps are:

  1. Start Ray on the head node: ray start --head --port 6379
  2. Start Ray on the other nodes: ray start --address="head_node_IP:6379"
  3. Set the environment on all nodes.
  4. Download the model to the same path on every node (or, of course, download it on the fly).
  5. Start vLLM on any node with --tensor-parallel-size $GPU_COUNT --engine-use-ray (see the sketch after these steps).
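For concreteness, a minimal sketch of these steps as shell commands for a 4-node, 16-GPU cluster running Llama-2-70B. The model path and the NCCL variable are illustrative assumptions, not something prescribed in this thread; which environment variables step 3 actually needs depends on your cluster's network.

# on the head node
ray start --head --port 6379

# on every other node, pointing at the head node's IP
ray start --address="HEAD_NODE_IP:6379"

# step 3 (illustrative): tell NCCL which network interface to use
export NCCL_SOCKET_IFNAME=eth0

# step 5: start the server on any node, tensor-parallel across all 16 GPUs
python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-2-70b-hf \
    --tensor-parallel-size 16 \
    --engine-use-ray
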
Ryojikn commented 1 year ago

My expectation was actually to deploy a larger model on the infrastructure we've already bought, while maintaining vLLM's latency and throughput.

However, from what you're saying, it won't be possible 😞 I honestly thought it was just a lack of knowledge on my part and nothing else.

wejoncy commented 1 year ago

Anyway, you can give it a try and test the latency/throughput first. There may be other solutions to get you the performance you need.

hughesadam87 commented 11 months ago

Coming here late, I'm finding it extremely difficult to make any progress with k8s + Ray + vLLM because of the lack of documentation. In the steps above you say "set the environment on all nodes"; can you elaborate, @wejoncy?

Here is what I've done so far:

  1. Followed the RayCluster quickstart guide to deploy Ray in my cluster (https://docs.ray.io/en/latest/cluster/kubernetes/getting-started/raycluster-quick-start.html). Is Ray now running, or do I still have to run the ray start commands? I'd expect not to have to.

  2. Deployed a vLLM model as shown below. It's unclear which model args (e.g. --engine-use-ray) are required. Which env vars? What about k8s settings like resources.limits.nvidia.com/gpu: 1 and env vars like CUDA_VISIBLE_DEVICES?

  3. Our whole goal here is to run larger models than a single instance can support. The largest instance type at our disposal is g4dn.16xlarge. Do you think there's any performance gain to be had running a large model over many small nodes as opposed to one large node? Or is it a fool's errand and one node is typically best?

The pods are running, but the server fails with an error that too much GPU memory was requested, which tells me it's not using the distributed GPU pool.

Thanks


In the configuration below I am trying to run a large model on 4 single-GPU nodes. Each node has 16 GB, so together they have 64 GB, which is enough for the model. But any one pod only has 16 GB, so the model will choke.

# Tinkering with a configuration that runs in a Ray cluster on a distributed node pool

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 4   # <-- GPUs are expensive, so set to 0 when not in use
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # nodeSelector and tolerations combined give the following behavior:
      #  1. Only schedule on nodes with gpu-allowed
      #  2. Prevent other workloads from scheduling on the GPU node pool
      nodeSelector:
        workload-type: gpu-small
      tolerations:
        - key: gpu-allowed
          operator: Exists

        - key: ray.io/node-type
          operator: Equal
          value: worker
          effect: NoSchedule

      containers:
        - name: vllm-container
          image: my-vllm-image:latest
          imagePullPolicy: Always
          ports:
            - containerPort: 8000
              name: api
          resources:
            limits:
              nvidia.com/gpu: "1"
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: "all"
            - name: CUDA_VISIBLE_DEVICES
              value: "0"    #<-- should this be # GPUs in the instance or in the pool?

          args: [
            "--model",
            "TheBloke/Marcoroni-70B-v1-AWQ",
            "--quantization", "awq",
            "--dtype", "float16",
            "--tensor-parallel-size", "4",   
            "--engine-use-ray" ]   #<-- this necesary/correct?

          volumeMounts:
            - mountPath: /dev/shm
              name: dshm

      # In Linux, /dev/shm is an in-memory tmpfs. Ray depends on it, and I have no idea what the best
      # size is, but since g4dn.12xlarge instances have around 192 GB of memory, I figured this wouldn't hurt.
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: "3Gi"

---
apiVersion: v1
kind: Service
metadata:
  name: vllm
  annotations:
    # Important note: health-check path annotations need to be added at the Service level if we plan to use multiple targets in a load balancer
    alb.ingress.kubernetes.io/healthcheck-path: /health
spec:
  selector:
    app: vllm
  ports:
    - name: http
      port: 80
      targetPort: 8000
  type: ClusterIP
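For reference, a quick way to check whether the server behind this Service actually comes up. This is a sketch; it assumes kubectl access to the cluster, and that the container runs the plain api_server, which is the entrypoint that exposes /generate (the OpenAI-compatible entrypoint serves /v1/completions instead).

# forward the ClusterIP service to localhost
kubectl port-forward svc/vllm 8000:80

# liveness probe
curl http://localhost:8000/health

# simple generation request against the api_server's /generate route
curl http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "San Francisco is a", "max_tokens": 16}'
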
ishaan-jaff commented 11 months ago

@Ryojikn @wejoncy @hughesadam87 I'm the maintainer of LiteLLM. We allow you to do this today using the LiteLLM router to load balance between multiple deployments (vLLM, Azure, OpenAI). I'd love your feedback if this doesn't solve your problem.

Here's how to use it. Docs: https://docs.litellm.ai/docs/routing

import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)

wejoncy commented 11 months ago

Hi @hughesadam87, I am not familiar with K8s, but you have to ensure Ray and vLLM are correctly installed on each node. For example, say there are four nodes in the K8s cluster with IPs [10.0.0.2, 10.0.0.3, 10.0.0.4, 10.0.0.5].

  1. pip install vllm on each node.
  2. Assume 10.0.0.2 is the head node (rank 0). Run ray start --head --port 6379 on 10.0.0.2 and run ray start --address="10.0.0.2:6379" on the other nodes.
  3. Run the server with the arguments

    args: [
            "--model",
            "TheBloke/Marcoroni-70B-v1-AWQ",
            "--quantization", "awq",
            "--dtype", "float16",
            "--tensor-parallel-size", "4",
            "--engine-use-ray" ]   # <-- is this necessary/correct? **## YES**

    on rank 0 (10.0.0.2) only; the other nodes do nothing. A consolidated sketch of the whole sequence follows below.

The server would then be reachable at 10.0.0.2:8000/generate.
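Put together, a minimal sketch of that sequence as shell commands. The IPs, model, and tensor-parallel size are the ones from this example; vllm.entrypoints.api_server is assumed as the entrypoint, since it is the one that serves /generate.

# on every node (10.0.0.2 .. 10.0.0.5)
pip install vllm

# on the head node (10.0.0.2)
ray start --head --port 6379

# on each of the other nodes (10.0.0.3 .. 10.0.0.5)
ray start --address="10.0.0.2:6379"

# on the head node only; tensor-parallel size = total number of GPUs in the Ray cluster
python -m vllm.entrypoints.api_server \
    --host 0.0.0.0 --port 8000 \
    --model TheBloke/Marcoroni-70B-v1-AWQ \
    --quantization awq \
    --dtype float16 \
    --tensor-parallel-size 4 \
    --engine-use-ray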

nkwangleiGIT commented 10 months ago

Here is how KubeAGI runs distributed inference using multiple GPUs; see if it helps: http://kubeagi.k8s.com.cn/docs/category/distributed-inference
