vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Issue while passing chat template of llama3.2 11b to vllm server #10023

Closed sourabh-patil closed 3 weeks ago

sourabh-patil commented 3 weeks ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 20.04.3 LTS (x86_64) GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 Clang version: Could not collect CMake version: version 3.28.1 Libc version: glibc-2.31 Python version: 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0] (64-bit runtime) Python platform: Linux-5.15.0-1029-nvidia-x86_64-with-glibc2.31 Is CUDA available: True CUDA runtime version: 11.4.120 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A100-SXM4-40GB Nvidia driver version: 535.54.03 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 43 bits physical, 48 bits virtual CPU(s): 256 On-line CPU(s) list: 0-255 Thread(s) per core: 2 Core(s) per socket: 64 Socket(s): 2 NUMA node(s): 8 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7742 64-Core Processor Stepping: 0 Frequency boost: enabled CPU MHz: 3393.223 CPU max MHz: 2250.0000 CPU min MHz: 1500.0000 BogoMIPS: 4491.63 Virtualization: AMD-V L1d cache: 4 MiB L1i cache: 4 MiB L2 cache: 64 MiB L3 cache: 512 MiB NUMA node0 CPU(s): 0-15,128-143 NUMA node1 CPU(s): 16-31,144-159 NUMA node2 CPU(s): 32-47,160-175 NUMA node3 CPU(s): 48-63,176-191 NUMA node4 CPU(s): 64-79,192-207 NUMA node5 CPU(s): 80-95,208-223 NUMA node6 CPU(s): 96-111,224-239 NUMA node7 CPU(s): 112-127,240-255 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif 
v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es Versions of relevant libraries: [pip3] mypy-extensions==1.0.0 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.6.77 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] nvidia-pytriton==0.2.5 [pip3] pyzmq==23.2.1 [pip3] torch==2.4.0 [pip3] torchaudio==2.4.1 [pip3] torchvision==0.19.0 [pip3] transformers==4.45.2 [pip3] triton==3.0.0 [pip3] tritonclient==2.33.0 [conda] magma-cuda110 2.5.2 5 local [conda] mkl 2019.5 281 conda-forge [conda] mkl-include 2019.5 281 conda-forge [conda] numpy 1.22.4 pypi_0 pypi [conda] nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu11 10.2.10.91 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-dali-cuda110 1.6.0 pypi_0 pypi [conda] nvidia-dlprof-pytorch-nvtx 1.6.0 pypi_0 pypi [conda] nvidia-dlprofviewer 1.6.0 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.3.101 pypi_0 pypi [conda] nvidia-nvtx-cu11 11.7.91 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] nvidia-pyindex 1.0.9 pypi_0 pypi [conda] pytorch-quantization 2.1.0 pypi_0 pypi [conda] pyzmq 25.1.2 pypi_0 pypi [conda] sentence-transformers 2.2.2 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchtext 0.11.0a0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.32.0 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.3.post1 vLLM Build Flags: CUDA Archs: 5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95,208-223 5 N/A NIC0 SYS X PXB SYS SYS SYS SYS SYS SYS SYS SYS NIC1 SYS PXB X SYS SYS SYS SYS SYS SYS SYS SYS NIC2 SYS SYS SYS X PXB SYS SYS SYS SYS SYS SYS NIC3 SYS SYS SYS PXB X SYS SYS SYS SYS SYS SYS NIC4 SYS SYS SYS SYS SYS X PXB SYS SYS SYS SYS NIC5 SYS SYS SYS SYS SYS PXB X SYS SYS SYS SYS NIC6 PXB SYS SYS SYS SYS SYS SYS X PXB SYS SYS NIC7 PXB SYS SYS SYS SYS SYS SYS PXB X SYS SYS NIC8 SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX NIC9 SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X Legend: X = Self SYS = Connection traversing PCIe as well as the SMP 
interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks NIC Legend: NIC0: mlx5_0 NIC1: mlx5_1 NIC2: mlx5_2 NIC3: mlx5_3 NIC4: mlx5_4 NIC5: mlx5_5 NIC6: mlx5_6 NIC7: mlx5_7 NIC8: mlx5_8 NIC9: mlx5_9 (env_llama_3.2) root@06 ```

Model Input Dumps

No response

🐛 Describe the bug

I am setting up the vLLM server for the Llama 3.2 11B Vision model. This is the command I am using:

vllm serve meta-llama/Llama-3.2-11B-Vision --host 172.17.0.2 --port 6006 --gpu-memory-utilization 0.9 --trust-remote-code --limit-mm-per-prompt image=2 --max-model-len 8000 --max-num-seqs 16 --enforce-eager --chat-template /workspace/llm_test_dev/src/llama3.2/vllm_server/tool_chat_template_llama3.2_json.jinja

The reason for adding the chat template: when I ran without it, I could set up the server successfully, but when I tried to run inference through an OpenAI-compatible client, it threw this error:

openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': 'As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.', 'type': 'BadRequestError', 'param': None, 'code': 400}
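
For context, this message indicates that the tokenizer of the base (non-Instruct) checkpoint does not define a chat template, so one has to be supplied explicitly. A quick way to confirm that (a minimal sketch; it assumes `transformers` is installed and you have access to the gated meta-llama repo):

```python
# Minimal sketch: check whether the base checkpoint's tokenizer ships a chat template.
# If this prints None, the OpenAI-compatible server needs an explicit --chat-template
# (or a checkpoint, such as the Instruct variant, whose tokenizer defines one).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-11B-Vision")
print(tok.chat_template)
```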

But even after adding the chat template to the CLI command, it throws this error:

ValueError: The supplied chat template string (/workspace/llm_test_dev/src/llama3.2/vllm_server/tool_chat_template_llama3.2_json.jinja) appears path-like, but doesn't exist!

I downloaded this Jinja file from the vLLM repo:

https://github.com/vllm-project/vllm/blob/main/examples/tool_chat_template_llama3.2_json.jinja
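
One way to hit the "path-like, but doesn't exist" error is when the file was never saved to that exact path, or was saved outside the environment (e.g. the container) in which `vllm serve` runs. Here is a minimal sketch to verify the path and fetch the file if it is missing (the raw URL is simply the raw-file form of the link above; the target path is the one from the serve command):

```python
# Minimal sketch: confirm the template exists at the path passed to --chat-template,
# and download it from the vLLM repo if it is missing. Run this in the same
# environment (e.g. the same container) that `vllm serve` runs in.
import os
import urllib.request

template_path = (
    "/workspace/llm_test_dev/src/llama3.2/vllm_server/"
    "tool_chat_template_llama3.2_json.jinja"
)
template_url = (
    "https://raw.githubusercontent.com/vllm-project/vllm/main/"
    "examples/tool_chat_template_llama3.2_json.jinja"
)

if not os.path.isfile(template_path):
    os.makedirs(os.path.dirname(template_path), exist_ok=True)
    urllib.request.urlretrieve(template_url, template_path)

print("template exists:", os.path.isfile(template_path))
```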

Can anyone please help with this? Thanks!


sourabh-patil commented 3 weeks ago

This is the OpenAI-compatible client I am using, as per the docs:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://X.X.X.X:12003/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision",
    messages=[{
        "role": "user",
        "content": [
            # NOTE: The prompt formatting with the image token `<image>` is not needed
            # since the prompt will be processed automatically by the API server.
            {"type": "text", "text": "What’s in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```

ch9hn commented 2 weeks ago

@sourabh-patil can you tell where it was fixed?

sourabh-patil commented 2 weeks ago

@ch9hn I was making a silly mistake: the path given for the chat template was wrong. Also, I ended up using Llama-3.2-11B-Vision-Instruct, which does not require a chat template.

Here is the command that runs smoothly for me:

vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct --host X.X.X.X --port XXXX --gpu-memory-utilization 0.9 --trust-remote-code --download-dir llama32_11b --limit-mm-per-prompt image=1 --max-model-len 4000 --max-num-seqs 4 --enforce-eager

It can be accessed using the OpenAI-compatible client given above.
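
For completeness, a minimal sanity check that the server is reachable and exposes the expected model id before sending chat requests (host and port are the placeholders from the command above):

```python
# Minimal sketch: list the model ids the running vLLM server exposes via its
# OpenAI-compatible API; "meta-llama/Llama-3.2-11B-Vision-Instruct" should appear.
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://X.X.X.X:XXXX/v1")
print([model.id for model in client.models.list().data])
```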