vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: when `echo=True`, vllm will append chat template(`assistant`) after the last message #7681

Open DIYer22 opened 2 months ago

DIYer22 commented 2 months ago

Your current environment

The output of `python collect_env.py`:

```text
root@vllm-cpu:/workspace# python3 collect_env.py
Collecting environment information...
INFO 08-20 07:37:37 importing.py:10] Triton not installed; certain GPU-related functions will be not be available.
PyTorch version: 2.4.0+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.25-nvidia-gpu-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 17
On-line CPU(s) list: 0-16
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 17
Stepping: 6
BogoMIPS: 4000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 544 KiB (17 instances)
L1i cache: 544 KiB (17 instances)
L2 cache: 68 MiB (17 instances)
L3 cache: 272 MiB (17 instances)
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] intel_extension_for_pytorch==2.4.0+gitfbaa4bc
[pip3] numpy==1.26.4
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0+cpu
[pip3] torchvision==0.19.0+cpu
[pip3] transformers==4.44.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@3f674a49b5033a6ed778ab960e86e03cfa64aa1f
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
```

🐛 Describe the bug

Sometimes we want to guide the model's output by prefilling part of the assistant response. However, calling the legacy completions API and manually concatenating the chat template is inconvenient, so I used the `echo: true` parameter of the chat completions API (a Python-client equivalent is sketched after the expected output below):

curl -X POST "http://127.0.0.1:8000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
           "model": "meta-llama/Meta-Llama-3-8B-Instruct",
           "temperature": 0,
           "stream": false,
           "messages": [
             {
               "role": "user",
               "content": "tell me a common saying"
             },
             {
               "role": "assistant",
               "content": "Here is a common saying about apple. An apple a day, keeps"
             }
           ],
           "echo": true,
           "add_generation_prompt": false
         }'

Response:

{"role":"assistant","content":"Here is a common saying about apple. An apple a day, keeps<|start_header_id|>assistant<|end_header_id|>\n\nI think I can finish that one for you!\n\n\"An apple a day keeps the doctor away!\"","tool_calls":[]}

But the expected response should be:

{"role":"assistant","content":"Here is a common saying about apple. An apple a day, keeps the doctor away!\"","tool_calls":[]}
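
For reference, here is a minimal sketch of the same request made with the `openai` Python client (assuming the vLLM server from the curl example above; the vLLM-specific fields `echo` and `add_generation_prompt` are not part of the standard OpenAI schema, so they go through `extra_body`):

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    temperature=0,
    messages=[
        {"role": "user", "content": "tell me a common saying"},
        {"role": "assistant", "content": "Here is a common saying about apple. An apple a day, keeps"},
    ],
    # vLLM-specific extensions to the chat completions request.
    extra_body={"echo": True, "add_generation_prompt": False},
)
print(response.choices[0].message.content)
```
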
Tostino commented 2 months ago

That is most likely a limitation of the chat template. Some don't allow you to continue a message like that (because the template itself is adding the extra `start_header_id`; see the sketch after this comment). Fix the template and you won't have any issues.

This has nothing to do with vLLM.
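
A quick way to check this for a given model is to render the conversation with its stock chat template and look at the tail; a minimal sketch, reusing the model id from the request above (what exactly gets appended will vary by template):

```python
# Minimal sketch: render the same messages with the model's stock chat template
# and inspect how the final (prefilled) assistant message is terminated.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "user", "content": "tell me a common saying"},
    {"role": "assistant", "content": "Here is a common saying about apple. An apple a day, keeps"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)
# If the template closes the last assistant turn (e.g. with <|eot_id|>),
# the prefilled message cannot simply be continued by the model.
print(repr(prompt[-80:]))
```
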

DIYer22 commented 2 months ago

@Tostino Thank you for the explanation. Could you name some models whose templates do not append the extra "assistant" content when using echo mode? I would like to find such a model for testing purposes.

Tostino commented 2 months ago

No, you are right... I just tried with the Llama 3.1 chat template, and I can see in there that it supports `"add_generation_prompt": false`, but I am still seeing the extra header added, just like you are.

~So there is a problem here.~

Edit: Never mind, they still didn't fix their chat template to support it. Give me a bit and I'll get you a fixed version. God, this thing is hard to read.

Tostino commented 2 months ago

Here is the modified chat template that worked when I just tested it.

from transformers import AutoTokenizer

# Define the custom chat template
custom_chat_template = "{{- bos_token }}\n{%- if custom_tools is defined %}\n    {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n    {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n    {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n    {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n    {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n    {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n    {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n    {#- Extract the first user message so we can plug it in here #}\n    {%- if messages | length != 0 %}\n        {%- set first_user_message = messages[0]['content']|trim %}\n        {%- set messages = messages[1:] %}\n    {%- else %}\n        {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n    {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n    {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n    {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n    {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n    {{- \"Do not use variables.\\n\\n\" }}\n    {%- for t in tools %}\n        {{- t | tojson(indent=4) }}\n        {{- \"\\n\\n\" }}\n    {%- endfor %}\n    {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n    {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n        {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim }}\n        {%- if not loop.last or add_generation_prompt %}\n            {{- '<|eot_id|>' }}\n        {%- endif %}\n    {%- elif 'tool_calls' in message %}\n        {%- if not message.tool_calls|length == 1 %}\n            {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n        {%- endif %}\n        {%- set tool_call = message.tool_calls[0].function %}\n        {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n            {%- for arg_name, arg_val in tool_call.arguments | items %}\n                {{- arg_name + '=\"' + arg_val + '\"' }}\n                {%- if not loop.last %}\n                    {{- \", \" }}\n                {%- endif %}\n                {%- endfor %}\n            {{- \")\" }}\n        {%- else  %}\n            {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n            {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n            {{- '\"parameters\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- \"}\" }}\n        {%- endif %}\n        {%- if builtin_tools is defined %}\n            {#- This means we're in ipython mode #}\n            {{- \"<|eom_id|>\" }}\n        {%- else %}\n            {{- \"<|eot_id|>\" }}\n        {%- endif %}\n    {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n        {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n        {%- if message.content is mapping or message.content is iterable %}\n            {{- message.content | tojson }}\n        {%- else %}\n            {{- message.content }}\n        {%- endif %}\n        {{- \"<|eot_id|>\" }}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n"

def apply_custom_chat_template(messages, add_generation_prompt=False):
    tokenizer = AutoTokenizer.from_pretrained("unsloth/Meta-Llama-3.1-8B-Instruct")

    # Apply the custom chat template
    chat_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        chat_template=custom_chat_template
    )

    return chat_text

def test_custom_chat_template():
    messages = [
        {"role": "user", "content": "tell me a common saying"},
        {"role": "assistant", "content": "Here is a common saying about apple. An apple a day, keeps"}
    ]

    # Test with add_generation_prompt=False
    result_false = apply_custom_chat_template(messages, add_generation_prompt=False)
    print("Result with add_generation_prompt=False:")
    print(result_false)
    print("\n" + "="*50 + "\n")

    # Test with add_generation_prompt=True
    result_true = apply_custom_chat_template(messages, add_generation_prompt=True)
    print("Result with add_generation_prompt=True:")
    print(result_true)

    # Check for the absence of <|eot_id|> at the end when add_generation_prompt is False
    if not result_false.strip().endswith("<|eot_id|>"):
        print("\nSUCCESS: <|eot_id|> is correctly absent at the end when add_generation_prompt is False.")
    else:
        print("\nERROR: <|eot_id|> is present at the end when add_generation_prompt is False.")

    # Check for the presence of an empty assistant turn when add_generation_prompt is True
    if result_true.strip().endswith("<|start_header_id|>assistant<|end_header_id|>"):
        print("SUCCESS: An empty assistant turn is correctly added when add_generation_prompt is True.")
    else:
        print("ERROR: No empty assistant turn is added when add_generation_prompt is True.")

if __name__ == "__main__":
    test_custom_chat_template()

You can pass that into the vLLM OpenAI server with `--chat-template "<paste template here>"`.
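
For example (a sketch; the file name is illustrative), you can instead write the template out to a file and point the server at it:

```python
# Sketch: save the fixed template (the custom_chat_template string from the
# snippet above) to a file, then start the server with it, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct \
#       --chat-template ./llama3_continue_final.jinja
with open("llama3_continue_final.jinja", "w") as f:
    f.write(custom_chat_template)
```
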

DIYer22 commented 2 months ago

@Tostino The template works and can be used. Thank you!

I'm curious: will the Llama team fix this issue in the future? Or will vLLM fix it? Or is this just a temporary workaround that is not intended to be merged into the main branch?

Tostino commented 2 months ago

@DIYer22, I opened a PR against each of the Llama 3.1 repos to fix it:
https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/108
https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct/discussions/26
https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct/discussions/24

I guess we will see.

Tostino commented 1 month ago

Well, there has been some discussion in https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct/discussions/26, and it looks like we simply need to use different chat templates at inference time and at training time for this to work, unless we get some changes upstreamed to transformers for `apply_chat_template`.

That sucks and was not the outcome I was hoping for.

Tostino commented 1 month ago

I decided to open an issue with the transformers project to see if I could move this along: https://github.com/huggingface/transformers/issues/33096