vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: chat API assistant prefill #6772

Open pseudotensor opened 4 months ago

pseudotensor commented 4 months ago

🚀 The feature, motivation and pitch

https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response https://www.anthropic.com/news/claude-2-1-prompting

I expected I could prefill the assistant response, but it seems like it doesn't work.

I should be able to do:

messages = [
    {
        "role": "user",
        "content": prompt,
    },
    {
        "role": "assistant",
        "content": "According to ",
    },
]

The server should render the conversation up through the assistant's partial message without closing that turn, so the model continues generating from it.

Anthropic has this feature, and it helps to control the responses.
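
For concreteness, a rough sketch (my illustration, assuming the Llama 3 chat format) of the prompt the server would need to render for the messages above; the key point is that nothing closes the assistant turn, so generation continues from the prefill:

# Illustration only (Llama 3 format assumed): the rendered prompt must not emit
# an <|eot_id|> after the assistant prefix, so the model keeps generating from
# "According to ".
rendered_prompt = (
    "<|begin_of_text|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    f"{prompt}<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "According to "
)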

Alternatives

Yes, one can avoid the chat API and build prompts by hand, but since the chat-template machinery is so pervasive and useful, it would be great to add this extension.

Additional context

I'm unclear whether this is even possible within the general chat framework. The Jinja2 template might support it out of the box, or it might depend on the template writer; but even if an existing template had to be tweaked, the chat API would still need to handle it.

michelg10 commented 4 months ago

This is purely a chat template thing; I have it implemented on my models with a custom-written chat template. For example, the following is a Llama 3 template that supports assistant response prefill:

{%- set loop_messages = messages %}
{%- for message in loop_messages %}
    {%- set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim %}
    {%- if loop.index0 == 0 %}
        {%- set content = bos_token + content %}
    {%- endif %}
    {%- if not (loop.last and message['role'] == 'assistant') %}
        {%- set content = content + '<|eot_id|>' %}
    {%- endif %}
    {{- content }}
{%- endfor %}
{%- if messages[-1]['role'] != 'assistant' %}
  {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
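
A minimal usage sketch, assuming the template above is saved as llama3_prefill.jinja (a placeholder file name) and the server is launched with vLLM's --chat-template option; with this template, a trailing assistant message acts as the prefill:

# Sketch only -- assumes a server started roughly like:
#   vllm serve meta-llama/Meta-Llama-3-8B-Instruct --chat-template llama3_prefill.jinja
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "Who wrote The Hobbit?"},
        # The trailing assistant message is the prefill; the template above leaves
        # it open (no <|eot_id|>), so the model continues from this text.
        {"role": "assistant", "content": "According to "},
    ],
)
print(resp.choices[0].message.content)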
Kelcin2 commented 3 months ago


It works! Thanks

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

ikarth commented 2 weeks ago

Ideally, this should be supported out of the box, or at least the documentation should note how to use a template to enable the behavior. It's a pretty common technique (both Anthropic and Mistral explicitly document support for it).

anthony2261 commented 4 days ago

You might find the continue_final_message extra call argument helpful - docs

Example:

curl -X POST "http://<vllm-address>/v1/chat/completions" -H "Content-Type: application/json" -d '
{
  "model": "<model name>",
  "messages": [
    {"role": "user", "content": "Hello there!"},
    {"role": "assistant", "content": "Hi! My name is"}
  ],
  "add_generation_prompt": false,
  "continue_final_message": true
}'

Make sure you're running the latest vLLM server version.
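
The same request from Python, as a hedged sketch using the OpenAI client (the base URL and model name are placeholders); vLLM-specific fields such as continue_final_message go through extra_body:

from openai import OpenAI

# Placeholders: point base_url at your vLLM server and use your model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="<model name>",
    messages=[
        {"role": "user", "content": "Hello there!"},
        {"role": "assistant", "content": "Hi! My name is"},
    ],
    # vLLM-specific fields are passed via extra_body; continue_final_message keeps
    # the last assistant message open so the model continues it instead of
    # starting a new turn.
    extra_body={
        "add_generation_prompt": False,
        "continue_final_message": True,
    },
)
print(resp.choices[0].message.content)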