Closed by phymbert 4 months ago
You can pass the stop tokens when calling the API. vLLM also prints its configuration on startup; check whether it picked up the right chat template and what the stop tokens are.
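For example, a minimal sketch with the OpenAI Python client, assuming a vLLM OpenAI-compatible server on localhost:8000 and a model served under the name below (adjust both to your setup):

```python
from openai import OpenAI

# Sketch only: assumes a vLLM OpenAI-compatible server at http://localhost:8000
# and that the model was started as "microsoft/phi-2" (adjust to your deployment).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="microsoft/phi-2",
    messages=[{"role": "user", "content": "Say hello."}],
    stop=["<|im_end|>", "<|im_start|>"],  # stop strings so ChatML markers end generation
    max_tokens=128,
)
print(resp.choices[0].message.content)
```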
I had the same problem.
I deploy the Qwen model with this command:
python -m vllm.entrypoints.openai.api_server --model /home/su.tz/Qwen-7B-Chat-Int4 --trust-remote-code --dtype=float16 --chat-template ./template_chatml.jinja
How can I pass the stop_token_ids? Since I use MetaGPT or LangChain, it is not easy to pass stop_token_ids from the client side.
How can I pass them from the server side instead?
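For reference, here is a per-request sketch with the plain OpenAI Python client, assuming the running vLLM build accepts the vLLM-specific stop_token_ids extra parameter on its OpenAI-compatible endpoint, and assuming the token ids below are Qwen's <|endoftext|>, <|im_start|> and <|im_end|>:

```python
from openai import OpenAI

# Sketch only: assumes the vLLM OpenAI-compatible server accepts the
# vLLM-specific "stop_token_ids" extra parameter on chat completions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/home/su.tz/Qwen-7B-Chat-Int4",
    messages=[{"role": "user", "content": "Hello"}],
    # Assumed Qwen special token ids: <|endoftext|>, <|im_start|>, <|im_end|>.
    extra_body={"stop_token_ids": [151643, 151644, 151645]},
)
print(resp.choices[0].message.content)
```

With LangChain, extra request parameters like this can usually be forwarded through the ChatOpenAI model_kwargs, though I have not verified that with MetaGPT.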
Context
I am doing a performance comparison between llama.cpp and vLLM in https://github.com/ggerganov/llama.cpp/pull/5941.
When running vLLM at commit 0bba88df03754c40bd9135fc2ff9554ffca59c87 with:
With the following chat completions request:
I got:
Issue
The model response contains unexpected extra tokens:
<|im_start|>
<|im_end|>
and the finish_reason is "length". A good answer should look like:
Could you please assist in configuring the proper chat template for phi?
Note: I want to compare results on quantized models (phi-gptq vs phi-gguf-q4_k_m) on a small GPU device, here an NVIDIA GeForce RTX 3050. Thanks