vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How do you set up vllm to work in a k8s/OpenShift cluster #4462

Open jayteaftw opened 6 months ago

jayteaftw commented 6 months ago

Your current environment

Edit 1

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-284.59.1.el9_2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3910.2529
CPU min MHz:                        1500.0000
BogoMIPS:                           5399.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          2 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           64 MiB (64 instances)
L3 cache:                           256 MiB (8 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     32-63,96-127    1               N/A
GPU1    SYS      X      SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

How would you like to use vllm

I want to run inference with Mixtral-8x7B-Instruct on an OpenShift cluster that already has the NVIDIA GPU Operator installed. When I run the following YAML file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mixtral-8x7b-instruct-vllm-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mixtral-8x7b-instruct-vllm-pod
  template:
    metadata:
      labels:
        app: mixtral-8x7b-instruct-vllm-pod
    spec:
      containers:
      - name: mixtral-8x7b-instruct-vllm-pod
        image:  vllm/vllm-openai:v0.2.7
        args: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1", "--tensor-parallel-size", "2", "--dtype", "half"]
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: huggingface-cache
          mountPath: /root/.cache/huggingface
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: xxxxxxx
        resources:
          limits:
            nvidia.com/gpu: "2"
      volumes:
      - name: huggingface-cache
        persistentVolumeClaim:
          claimName: example-pv-filesystem
      hostIPC: true

When I print out the logs, I get the following output, and then it freezes:

INFO 04-29 23:42:57 api_server.py:727] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 04-29 23:42:57 config.py:457] Casting torch.bfloat16 to torch.float16.
2024-04-29 23:42:59,442 INFO worker.py:1724 -- Started a local Ray instance.

Note that if I use the smaller Mistral model with one GPU, it functions as intended; it only freezes when I add 2 or more GPUs.
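
(One way to get more detail on where a multi-GPU launch hangs is to turn on NCCL's debug logging in the pod spec. The env entries below are an illustrative sketch that extends the env block of the Deployment above, using standard NCCL variables; they are not part of the original manifest:)

        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: xxxxxxx
        # Print NCCL initialization details to the container logs so a hang
        # during tensor-parallel startup is easier to localize.
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_DEBUG_SUBSYS
          value: "INIT"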

agt commented 6 months ago

v0.2.7 is fairly old; can you try with the current v0.4.1?

Also, the output of collect_env.py from within the container would be helpful, e.g.:

$ kubectl exec <pod-name> -- python3 -c 'import requests; exec(requests.get("https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py").text)'

jayteaftw commented 6 months ago

v0.2.7 is fairly old; can you try with the current v0.4.1?

Also, the output of collect_env.py from within the container would be helpful, e.g.:

$ kubectl exec <pod-name> -- python3 -c 'import requests; exec(requests.get("https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py").text)'

Thank you for the response! I should mention that I tested tags latest, v0.4.1, v0.4.0, v0.3.2, and v0.2.7 because of an issue related to #4455.

However, inside a vllm/vllm-openai:latest pod, I ran collect_env.py:

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-284.59.1.el9_2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3910.2529
CPU min MHz:                        1500.0000
BogoMIPS:                           5399.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          2 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           64 MiB (64 instances)
L3 cache:                           256 MiB (8 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     32-63,96-127    1               N/A
GPU1    SYS      X      SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

gujingit commented 6 months ago

vLLM 0.4.1 + Qwen-14B-Chat; the YAML is below:

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: vllm
        image: vllm/vllm-openai:0.4.1
        command:
        - "python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-14B-Chat --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            cpu: 4
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /mnt/models
          name: model
        - name: dshm
          mountPath: /dev/shm

jayteaftw commented 6 months ago

Okay, so I followed your example with a few modifications:

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  revisionHistoryLimit: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: example-pv-filesystem
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "15Gi"
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.4.1
        command:
        - "python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name mistral --model /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            cpu: 4
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /mnt/models
          name: model
        - name: dshm
          mountPath: /dev/shm

The image vllm/vllm-openai:0.4.1 doesn't exist, but vllm/vllm-openai:v0.4.1 does. The application won't start, and I get the error:

Error: container create failed: time="2024-05-16T02:11:29Z" level=error msg="runc create failed: unable to start container process: exec: \"python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name mistral --model /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 2\": stat python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name mistral --model /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 2: no such file or directory"

But when I comment out the command and look for /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 inside the pod, I can find the directory:

# pwd
/mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1
# ls
blobs  refs  snapshots
# 
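
(The runc message above is consistent with Kubernetes passing the command list to the container runtime in exec form, without a shell, so the single string element is looked up as one executable. A sketch of the container section with the command split into tokens, keeping the flags from the manifest above; the exact layout is illustrative:)

      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.4.1
        # Exec form: each token is its own list element, so no shell is involved.
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--port", "8080", "--trust-remote-code",
               "--served-model-name", "mistral",
               "--model", "/mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1",
               "--gpu-memory-utilization", "0.95",
               "--tensor-parallel-size", "2"]

(An equivalent alternative is shell form, e.g. command: ["/bin/sh", "-c", "python3 -m vllm.entrypoints.openai.api_server ..."], which lets a single string be parsed by a shell.)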

yangcao77 commented 5 months ago

I'm seeing the same issue

python3 -m vllm.entrypoints.openai.api_server  --model /model/model.file --port 8001 --trust-remote-code --gpu-memory-utilization 0.95: no such file or directory

Any luck on this, @jayteaftw?

jayteaftw commented 5 months ago

I'm seeing the same issue

python3 -m vllm.entrypoints.openai.api_server  --model /model/model.file --port 8001 --trust-remote-code --gpu-memory-utilization 0.95: no such file or directory

Any luck on this, @jayteaftw?

I'm failing to see how we have the same issue, but yes, my issue is still occurring, even in 0.4.3.

yangcao77 commented 5 months ago

@jayteaftw I see that Red Hat has a UBI-based vLLM image, and it works for me; you might want to try it as well: quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.4.2

It will download the model from Hugging Face for you, so for your case, set --model mistralai/Mixtral-8x7B-Instruct-v0.1 in container.args.
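
(For reference, a sketch of the relevant container fields under that suggestion, assuming the UBI image accepts the same flags as the upstream vllm/vllm-openai entrypoint; the snippet is illustrative:)

      containers:
      - name: vllm
        image: quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.4.2
        # Let the entrypoint pull the model from the Hugging Face Hub instead of
        # pointing --model at a local cache directory.
        args: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1",
               "--tensor-parallel-size", "2", "--dtype", "half"]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: xxxxxxx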

jayteaftw commented 5 months ago

@jayteaftw I see that Red Hat has a UBI-based vLLM image, and it works for me; you might want to try it as well: quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.4.2

It will download the model from Hugging Face for you, so for your case, set --model mistralai/Mixtral-8x7B-Instruct-v0.1 in container.args.

Thank you for the advice. However, I still have the same problem when trying to run their image on OpenShift: it just freezes when using more than one GPU. I even tried compiling from source to use the new 0.4.3, and it still has the same outcome.

as-herr commented 5 months ago

Running into the same problem as OP; I have tried 0.4.2 and 0.4.3 with no success. Would love some feedback from vLLM on the proper implementation.

Cognitus-Stuti commented 3 weeks ago

Has this been resolved yet? I'm getting a 500 error with no other logs when trying to run this.