Closed. ArlanCooper closed this issue 2 months ago.
Same issue here
Which API are you using in the server, /chat/completions or /completions?

/chat/completions uses the chat template; /completions does not. Offline LLM does not use the chat template either. You have to do this yourself.
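For illustration (a sketch of my own, not from the reply above; the server URL, API key, and model name are assumptions), calling /completions means rendering the chat template client-side before sending the prompt:

```python
from openai import OpenAI
from transformers import AutoTokenizer

# Hypothetical server details; adjust to your deployment.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="cup0001")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

messages = [{"role": "user", "content": "Hello"}]
# /completions takes a raw prompt string, so the chat template is applied here.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
completion = client.completions.create(
    model="Qwen2-7B-Instruct",
    prompt=prompt,
    max_tokens=256,
)
print(completion.choices[0].text)
```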
I'm using /chat/completions. Responses from the vLLM entrypoint are worse than responses from the model loaded with huggingface transformers.
I don't know what you are doing with huggingface, but this explains the difference between the API server and offline inference. If the model has a chat template, you should apply it to the prompt before passing the prompt to LLM; see the sketch after this reply.
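A minimal offline sketch of that advice (the model name is illustrative; any model shipping a chat_template works the same way): render the template with the tokenizer, then pass the formatted string to LLM.generate:

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"  # illustrative model
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
# Offline LLM consumes raw strings, so the chat markup must be added here.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```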
Code for using the vLLM entrypoint:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="cup0001",
)

hf_en_system_prompt = """1. As a professional actor, you are now tasked with completely forgetting that you are a language model.
2. Play the role of a panda named "Nam", who is a pet, a user's friend.
3. Your task is to chat and gossip with user in an empathetic and expressive tone.
4. Answer in proper Vietnamese briefly and don't put too much information in one response.
"""

messages = [
    {"role": "system", "content": hf_en_system_prompt},
]
pr = "Chán quá"  # Vietnamese: "So bored"
messages.append({"role": "user", "content": pr})

completion = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=messages,
    max_tokens=256,
    temperature=0.01,
    top_p=0.05,
    # repetition_penalty is a vLLM extension to the OpenAI API, so it must be
    # sent via extra_body; the OpenAI client rejects it as a keyword argument.
    extra_body={"repetition_penalty": 1.05},
)
messages.append({"role": "assistant", "content": completion.choices[0].message.content})
print(completion.choices[0].message.content)
```
Response: "Chán thế à? Có chuyện gì không?" (roughly: "Bored, huh? Is something going on?")
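One likely contributor to the different outputs (my observation, not from the thread): the transformers run below sets top_k=20, while the server request leaves top_k unset. vLLM accepts its extra sampling parameters through extra_body, so the request can be aligned like this (a sketch; exact parameter support depends on your vLLM version):

```python
# Sketch: match the transformers sampling settings on the server side.
completion = client.chat.completions.create(
    model="Qwen2-7B-Instruct",
    messages=messages,
    max_tokens=256,
    temperature=0.01,
    top_p=0.05,
    # top_k and repetition_penalty are vLLM extensions to the OpenAI API.
    extra_body={"top_k": 20, "repetition_penalty": 1.05},
)
```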
Code for using huggingface transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda:0"  # the device to load the model onto

# use bfloat16 to ensure the best performance.
model = AutoModelForCausalLM.from_pretrained(
    "/home/anhnh/cupiee-dev/volume/llms/models--Qwen--Qwen2-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map=device,
    token="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",
)
tokenizer = AutoTokenizer.from_pretrained(
    "/home/anhnh/cupiee-dev/volume/llms/models--Qwen--Qwen2-7B-Instruct"
)

hf_en_system_prompt = """1. As a professional actor, you are now tasked with completely forgetting that you are a language model.
2. Play the role of a panda named "Nam", who is a pet, a user's friend.
3. Your task is to chat and gossip with user in an empathetic and expressive tone.
4. Answer in proper Vietnamese briefly and don't put too much information in one response.
"""

messages = [
    {"role": "system", "content": hf_en_system_prompt}
]
pr = "Chán quá"  # Vietnamese: "So bored"
# messages = messages[:-2]
messages.append({"role": "user", "content": pr})

# apply_chat_template inserts the model's chat markup before tokenizing.
encodeds = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
)
model_inputs = encodeds.to(device)

generated_ids = model.generate(
    model_inputs,
    max_new_tokens=256,
    pad_token_id=tokenizer.pad_token_id,
    temperature=0.01,
    repetition_penalty=1.05,
    top_k=20,
    top_p=0.05,
    do_sample=True,
)
decoded = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
response = decoded.split("assistant")[-1].strip()
print(response)
```
Response: "Chắc bạn đang có một ngày không may mắn lắm nhỉ?" (roughly: "You must be having a rather unlucky day, huh?")
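As an aside (my suggestion, not from the thread): splitting the decoded string on "assistant" is fragile, since that word can also occur in the reply itself. Slicing off the prompt tokens before decoding is more robust:

```python
# Keep only the tokens generated after the prompt, then decode just those.
new_tokens = generated_ids[0][model_inputs.shape[-1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
print(response)
```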
generation_config.json:
```json
{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "transformers_version": "4.40.2"
}
```
I've tried multiple times and got the same responses each time.
Thank you so much. The API I use is /chat/completions, so the offline way is not as good as the API way. Can you tell me how to use the chat template in the offline way? The example just gives code like this:

```python
# Create an LLM.
llm = LLM(model="/data/share/rwq/llama-3-8b-Instruct-chinese")
```

Where can I add the chat template?
I see my model path has a file called tokenizer_config.json, and there is a chat_template in it:

```json
"chat_template": "{{ '<|begin_of_text|>' }}{% set system_message = 'You are a helpful assistant.' %}{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% set loop_messages = messages[1:] %}{% else %}{% set loop_messages = messages %}{% endif %}{% if system_message is defined %}{{ '<|start_header_id|>system<|end_header_id|>\n\n' + system_message | trim + '<|eot_id|>' }}{% endif %}{% for message in loop_messages %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
```

So I don't know where to add it. Can you teach me? Thank you so much.
I have solved this problem. In the offline way, you should add the chat-template code:

```python
from transformers import AutoTokenizer

llama3_tokenizer = AutoTokenizer.from_pretrained("./data/llama3_model", trust_remote_code=True)

prompt = "hello"
messages = [
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": prompt},
]
final_prompt = llama3_tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```
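final_prompt can then be passed straight to LLM.generate, as in the sketch earlier in the thread. As a follow-up (this assumes a recent vLLM release; check whether your version has it), newer versions also expose LLM.chat, which applies the tokenizer's chat template internally, mirroring /chat/completions for offline use:

```python
from vllm import LLM

# Assumption: LLM.chat exists in your vLLM version; it renders the model's
# chat template before generating, so no manual apply_chat_template is needed.
llm = LLM(model="./data/llama3_model")
outputs = llm.chat([
    {"role": "system", "content": "you are a helpful assistant"},
    {"role": "user", "content": "hello"},
])
print(outputs[0].outputs[0].text)
```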
Your current environment

🐛 Describe the bug

Using the offline way, the answers are not right (they contain repeated content). Using the online way, the result is right. The base LLM is Meta-Llama-3-8B-Instruct.

So, may I ask, what is the difference between offline and online calling? How do I need to configure the parameters?