huggingface / chat-ui

Open source codebase powering the HuggingChat app
https://huggingface.co/chat
Apache License 2.0

Generated answers with Llama 3 include <|start_header_id|>assistant<|end_header_id|> #1423

Closed (erickrf closed this issue 2 weeks ago)

erickrf commented 3 weeks ago

Bug description

I have set up a local endpoint serving Llama 3. All the answers I get from it start with <|start_header_id|>assistant<|end_header_id|>.

Steps to reproduce

Set up Llama 3 behind a local endpoint. In my .env.local, it is defined as follows:

MODELS=`[
    {
      "name": "llama3",
      "displayName": "Llama 3 loaded from GCS",
      "chatPromptTemplate": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{{preprompt}}<|eot_id|>{{#each messages}}{{#ifUser}}<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>{{/ifUser}}{{#ifAssistant}}{{content}}<|eot_id|>{{/ifAssistant}}{{/each}}",
      "preprompt": "You are a helpful AI assistant.",
      "parameters": {
        "stop": ["<|endoftext|>", "<|eot_id|>"],
        "temperature": 0.4,
        "max_new_tokens": 1024,
        "truncate": 3071
      },
      "endpoints": [{
        "type": "openai",
        "baseURL": "http://localhost:8080/openai/v1"
      }]
    }
]`

Context

I have tried several variations of the chat template, as well as not providing one at all. The <|start_header_id|>assistant<|end_header_id|> header is always there.

AFAIK, these tokens should be the last ones in the prompt, so that the model knows to continue with the assistant's answer. It seems they are not being appended to the prompt, and the model instead generates them itself at the start of its answer.
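For reference, this is roughly what I would expect the fully rendered prompt to look like for a single user turn, with the assistant header as the very last tokens (illustrative content, not taken from a real request):

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Hi, what is pizza?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The model would then generate the assistant's answer followed by <|eot_id|>.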

Logs

This is a sample request that my local server (running vLLM) receives:

INFO 08-21 11:47:18 async_llm_engine.py:529] Received request cmpl-d1482c4eb4ce49c2a259a2f782ee3712-0: prompt: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant. Unless otherwise specified, give concise and straightforward answers.<|eot_id|><|start_header_id|>user<|end_header_id|>

[ChatCompletionRequestMessageContentPartText(type='text', text='Hi, what is pizza?')]<|eot_id|>", sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.4, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<|endoftext|>', '<|eot_id|>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 15592, 18328, 13, 11115, 6062, 5300, 11, 3041, 64694, 323, 31439, 11503, 13, 128009, 128006, 882, 128007, 271, 58, 16047, 34290, 1939, 2097, 2831, 5920, 1199, 5930, 1151, 1342, 518, 1495, 1151, 13347, 11, 1148, 374, 23317, 30, 52128, 128009], lora_request: None.

Specs

Config

MONGODB_URL=mongodb://localhost:27017
HF_TOKEN=...

Notes

I'm not sure what the ChatCompletionRequestMessageContentPartText(...) in the prompt is supposed to mean. Is it some internal request object rendered as a string?
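My guess (unverified) is that the message content is being sent to the OpenAI-compatible endpoint as a list of content parts rather than as a plain string, and the server is rendering that list's Python repr into the prompt. In the OpenAI chat completions format, the two shapes would be:

A plain string:

{"role": "user", "content": "Hi, what is pizza?"}

A list of content parts (which seems to match the log above):

{"role": "user", "content": [{"type": "text", "text": "Hi, what is pizza?"}]}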

nsarrazin commented 3 weeks ago

Have you tried sourcing the chat template from the tokenizer? This is what we do for most models on HuggingChat and it works great, see here.
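Something along these lines should work (an untested sketch; the model id meta-llama/Meta-Llama-3-8B-Instruct is just an example, use whichever tokenizer repo matches your weights):

MODELS=`[
    {
      "name": "llama3",
      "displayName": "Llama 3 loaded from GCS",
      "tokenizer": "meta-llama/Meta-Llama-3-8B-Instruct",
      "preprompt": "You are a helpful AI assistant.",
      "parameters": {
        "stop": ["<|endoftext|>", "<|eot_id|>"],
        "temperature": 0.4,
        "max_new_tokens": 1024,
        "truncate": 3071
      },
      "endpoints": [{
        "type": "openai",
        "baseURL": "http://localhost:8080/openai/v1"
      }]
    }
]`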

erickrf commented 3 weeks ago

You mean, just providing the tokenizer key in the .env.local file instead of the chat template? I've tried that, and got the same result.

It really looks like Chat UI is sending this string ChatCompletionRequestMessageContentPartText(...) as part of the prompt.

nsarrazin commented 2 weeks ago

Could you share your vLLM config? I will try to reproduce this locally.

erickrf commented 2 weeks ago

My vLLM instance is running on a k8s cluster via KServe, basically this. I don't have complete access to it, so I can't give all the details.

But it turns out that this behavior of returning the assistant header also happens with other clients, for example when sending a simple request via curl. I'll try to run vLLM locally and debug further.
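The kind of curl request I mean is roughly the following (illustrative only, with the path and payload adapted to the OpenAI-compatible API at the baseURL above):

curl http://localhost:8080/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Hi, what is pizza?"}],
        "max_tokens": 128
      }'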

nsarrazin commented 2 weeks ago

Thanks for the update! So if I understood correctly, this is not an issue on the chat-ui side? In that case I'll close this issue, but if it turns out to be chat-ui specific, let me know and I'll reopen it.