Open victorserbu2709 opened 2 days ago
prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\\n'}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131004, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
Hello. I created a simple container image that contains the latest tool_chat_template_llama3.2_json.jinja:
FROM docker.io/vllm/vllm-openai:v0.6.3.post1
COPY tool_chat_template_llama3.2_json.jinja vllm-workspace/tool_chat_template_llama3.2_json.jinja
The container is started with:
localhost/vllm/vllm-openai:v0.6.3.post1-tools \
    --model neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic \
    --tensor-parallel-size 8 \
    --served-model-name "Llama3.2 90B" \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --distributed-executor-backend mp \
    --enforce-eager \
    --max-num-seqs 2 \
    --limit-mm-per-prompt image=5 \
    --tool-call-parser llama3_json --chat-template /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja --enable-auto-tool-choice
The vLLM OpenAI server receives the following request:
curl -v http://localhost:8000/v1/chat/completions -H 'content-type: application/json' --data '{"stream": false, "model": "Llama3.2 90B", "messages": [{"role": "system", "content": "you are a helpful assistant"}, {"role": "user", "content": "hello\n"}]}'
but in the vLLM logs I see user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\n'}]<|eot_id|, i.e. the user message is rendered as the Python repr of a list of content parts instead of plain text:
INFO 11-14 03:51:42 logger.py:37] Received request chat-585357994ead43ab8d485844b632d641: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\\n'}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131004, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 975, 4723, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 9514, 527, 264, 11190, 18328, 8439, 60, 128009, 128006, 882, 128007, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 15339, 1734, 8439, 60, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
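For context, here is a minimal sketch (plain jinja2, not vLLM's actual rendering code or the shipped tool_chat_template_llama3.2_json.jinja) of how the list ends up in the prompt: once `content` has been normalized into OpenAI-style parts, a template that prints `message['content']` directly emits the Python repr of that list instead of the text.

```python
# Illustrative sketch only: rendering a list-valued 'content' directly
# produces its Python repr in the final prompt string.
from jinja2 import Template

line = Template("user<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>")

# Plain-string content renders as expected text.
print(line.render(message={"role": "user", "content": "hello\n"}))

# Content already normalized into OpenAI-style parts leaks the list repr,
# matching the "[{'type': 'text', ...}]" seen in the log above.
print(line.render(message={"role": "user", "content": [{"type": "text", "text": "hello\n"}]}))
```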
However, if I remove only
--chat-template /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja
from the vLLM start options, the model receives the expected text (user<|end_header_id|>\n\nhello\n<|eot_id|):
INFO 11-14 04:00:43 logger.py:37] Received request chat-fb75d50bb91b4eb68814b86dbe0d4833: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131017, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 975, 4723, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 9514, 527, 264, 11190, 18328, 8439, 60, 128009, 128006, 882, 128007, 271, 15339, 198, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
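As a stopgap, a template fragment along these lines (a sketch of the idea only, not the shipped template and not the eventual fix) renders both content formats the same way:

```python
# Hypothetical template fragment that tolerates both a plain string and a list
# of {'type': 'text'} parts; shown via jinja2 purely for illustration.
from jinja2 import Template

fragment = Template(
    "{%- if message['content'] is string -%}"
    "{{ message['content'] }}"
    "{%- else -%}"
    "{%- for part in message['content'] if part['type'] == 'text' -%}"
    "{{ part['text'] }}"
    "{%- endfor -%}"
    "{%- endif -%}"
)

# Both calls print plain "hello\n".
print(fragment.render(message={"content": "hello\n"}))
print(fragment.render(message={"content": [{"type": "text", "text": "hello\n"}]}))
```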
Can you try out #10164?
Thank you @DarkLight1337, it works.
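As a side note, the rendered prompt can be checked offline before wiring a template into the server; here is a sketch using transformers, where the model name and file path simply mirror the commands above:

```python
# Render the jinja chat template offline to inspect the exact prompt string it
# produces; model name and path are taken from this report and may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic")
with open("tool_chat_template_llama3.2_json.jinja") as f:
    chat_template = f.read()

messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": "hello\n"},
]
print(tokenizer.apply_chat_template(
    messages,
    chat_template=chat_template,
    tokenize=False,
    add_generation_prompt=True,
))

# To mimic what the server passed in this issue, also try content given as parts:
# [{"type": "text", "text": "hello\n"}]
```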
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.7 (main, Oct 1 2024, 08:52:12) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 555.42.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.1
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY
```