Open victorserbu2709 opened 2 days ago
prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\\n'}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131004, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None)
Hello. I created a simple container image that contains the latest tool_chat_template_llama3.2_json.jinja:
FROM docker.io/vllm/vllm-openai:v0.6.3.post1
COPY tool_chat_template_llama3.2_json.jinja vllm-workspace/tool_chat_template_llama3.2_json.jinja
The container is started with:
localhost/vllm/vllm-openai:v0.6.3.post1-tools \
    --model neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic \
    --tensor-parallel-size 8 \
    --served-model-name "Llama3.2 90B" \
    --trust-remote-code \
    --gpu-memory-utilization 0.95 \
    --distributed-executor-backend mp \
    --enforce-eager \
    --max-num-seqs 2 \
    --limit-mm-per-prompt image=5 \
    --tool-call-parser llama3_json --chat-template /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja --enable-auto-tool-choice
The vLLM OpenAI server receives the following request:
curl -v http://localhost:8000/v1/chat/completions -H 'content-type: application/json' --data '{"stream": false, "model": "Llama3.2 90B", "messages": [{"role": "system", "content": "you are a helpful assistant"}, {"role": "user", "content": "hello\n"}]}'
but in the vLLM logs I see user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\n'}]<|eot_id|, i.e. the user message is rendered as the Python repr of a list of content parts instead of plain text:
INFO 11-14 03:51:42 logger.py:37] Received request chat-585357994ead43ab8d485844b632d641: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n[{'type': 'text', 'text': 'hello\\n'}]<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131004, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 975, 4723, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 9514, 527, 264, 11190, 18328, 8439, 60, 128009, 128006, 882, 128007, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 15339, 1734, 8439, 60, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
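For context, here is a minimal sketch (plain jinja2, not vLLM's actual rendering code or the shipped tool_chat_template_llama3.2_json.jinja) of how the list ends up in the prompt: once `content` has been normalized into OpenAI-style parts, a template that prints `message['content']` directly emits the Python repr of that list instead of the text.

```python
# Illustrative sketch only: rendering a list-valued 'content' directly
# produces its Python repr in the final prompt string.
from jinja2 import Template

line = Template("user<|end_header_id|>\n\n{{ message['content'] }}<|eot_id|>")

# Plain-string content renders as expected text.
print(line.render(message={"role": "user", "content": "hello\n"}))

# Content already normalized into OpenAI-style parts leaks the list repr,
# matching the "[{'type': 'text', ...}]" seen in the log above.
print(line.render(message={"role": "user", "content": [{"type": "text", "text": "hello\n"}]}))
```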
However, if I remove only
--chat-template /vllm-workspace/examples/tool_chat_template_llama3.2_json.jinja
from the vLLM start options, the model receives the expected text (user<|end_header_id|>\n\nhello\n<|eot_id|):
INFO 11-14 04:00:43 logger.py:37] Received request chat-fb75d50bb91b4eb68814b86dbe0d4833: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 14 Nov 2024\n\n[{'type': 'text', 'text': 'you are a helpful assistant'}]<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131017, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), guided_decoding=GuidedDecodingParams(json=None, regex=None, choice=None, grammar=None, json_object=None, backend=None, whitespace_pattern=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 975, 4723, 220, 2366, 19, 271, 58, 13922, 1337, 1232, 364, 1342, 518, 364, 1342, 1232, 364, 9514, 527, 264, 11190, 18328, 8439, 60, 128009, 128006, 882, 128007, 271, 15339, 198, 128009, 128006, 78191, 128007, 271], lora_request: None, prompt_adapter_request: None.
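As a stopgap, a template fragment along these lines (a sketch of the idea only, not the shipped template and not the eventual fix) renders both content formats the same way:

```python
# Hypothetical template fragment that tolerates both a plain string and a list
# of {'type': 'text'} parts; shown via jinja2 purely for illustration.
from jinja2 import Template

fragment = Template(
    "{%- if message['content'] is string -%}"
    "{{ message['content'] }}"
    "{%- else -%}"
    "{%- for part in message['content'] if part['type'] == 'text' -%}"
    "{{ part['text'] }}"
    "{%- endfor -%}"
    "{%- endif -%}"
)

# Both calls print plain "hello\n".
print(fragment.render(message={"content": "hello\n"}))
print(fragment.render(message={"content": [{"type": "text", "text": "hello\n"}]}))
```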
Can you try out #10164?
Thank you @DarkLight1337, it works.
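As a side note, the rendered prompt can be checked offline before wiring a template into the server; here is a sketch using transformers, where the model name and file path simply mirror the commands above:

```python
# Render the jinja chat template offline to inspect the exact prompt string it
# produces; model name and path are taken from this report and may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuralmagic/Llama-3.2-90B-Vision-Instruct-FP8-dynamic")
with open("tool_chat_template_llama3.2_json.jinja") as f:
    chat_template = f.read()

messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": "hello\n"},
]
print(tokenizer.apply_chat_template(
    messages,
    chat_template=chat_template,
    tokenize=False,
    add_generation_prompt=True,
))

# To mimic what the server passed in this issue, also try content given as parts:
# [{"type": "text", "text": "hello\n"}]
```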
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.7 (main, Oct 1 2024, 08:52:12) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 555.42.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] flashinfer==0.1.6+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.77
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.45.2
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.4.1
LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CUDA_MODULE_LOADING=LAZY
```