vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: chat API assistant prefill #6772

Open pseudotensor opened 3 months ago

pseudotensor commented 3 months ago

🚀 The feature, motivation and pitch

https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/prefill-claudes-response https://www.anthropic.com/news/claude-2-1-prompting

I expected to be able to prefill the assistant response, but it seems like it doesn't work.

I should be able to do:

messages = [
    {
        "role": "user",
        "content": prompt,
    },
    {
        "role": "assistant",
        "content": "According to ",
    },
]

This should render the conversation up through the assistant's partial response without closing that turn, so the model continues from the prefilled text.

Anthropic has this feature, and it helps to control the responses.
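
For illustration, this is what such a request would look like against vLLM's OpenAI-compatible server using the openai client; the base URL, model name, and prompt are placeholders, and honoring the trailing assistant message is precisely the behavior being requested here:

from openai import OpenAI

# Placeholder endpoint for a locally running vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[
        {"role": "user", "content": "Who discovered penicillin?"},
        # Desired behavior: the model continues this partial assistant turn
        # instead of starting a fresh response.
        {"role": "assistant", "content": "According to "},
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)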

Alternatives

One can avoid the chat API entirely, but since the chat template support is so pervasive and useful, it would be great to add this extension.

Additional context

I'm unclear whether this is even possible within the general chat framework. A given Jinja2 template may or may not support it out of the box, and I'm not sure how much depends on the template writer; but even if one had to tweak an existing template, the chat API would still need to handle a trailing assistant message.
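
One way to check whether a model's stock template already supports this out of the box is to render the messages offline with the tokenizer and inspect the output; a minimal sketch, with the model name as a placeholder:

from transformers import AutoTokenizer

# Placeholder model; any chat model with a Jinja2 chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "Who discovered penicillin?"},
    {"role": "assistant", "content": "According to "},
]

# Render without tokenizing to see exactly what the template produces.
rendered = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False,
)
print(rendered)
# Stock Llama 3 templates typically append <|eot_id|> even to this final
# assistant message, i.e. they close the turn instead of leaving it open.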

michelg10 commented 3 months ago

This is purely a chat template thing; I have it implemented on my models with a custom-written chat template. For example, the following is a Llama 3 template that supports assistant response prefill:

{%- set loop_messages = messages %}
{%- for message in loop_messages %}
    {%- set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim %}
    {%- if loop.index0 == 0 %}
        {%- set content = bos_token + content %}
    {%- endif %}
    {#- leave the final assistant message open (no <|eot_id|>) so the model continues it #}
    {%- if not (loop.last and message['role'] == 'assistant') %}
        {%- set content = content + '<|eot_id|>' %}
    {%- endif %}
    {{- content }}
{%- endfor %}
{#- only add a new assistant header when there is no assistant prefill to continue #}
{%- if messages[-1]['role'] != 'assistant' %}
  {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
{%- endif %}
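
As a sanity check before pointing a server at it (vLLM's OpenAI-compatible server accepts a template file via --chat-template), the template can be rendered offline; a minimal sketch, assuming it is saved as llama3_prefill.jinja and using a placeholder model name:

from transformers import AutoTokenizer

# Hypothetical file name for the template above.
with open("llama3_prefill.jinja") as f:
    prefill_template = f.read()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "user", "content": "Who discovered penicillin?"},
    {"role": "assistant", "content": "According to"},
]

rendered = tokenizer.apply_chat_template(
    messages,
    chat_template=prefill_template,
    tokenize=False,
)
# The output should end with the partial assistant turn ("...According to")
# and no trailing <|eot_id|>, so generation continues the prefill.
print(rendered)

With the template passed to the server, a chat request whose last message has the assistant role is rendered with that turn left open, which is the prefill behavior requested above.
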
Kelcin2 commented 3 months ago

It works! Thanks

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!