lonngxiang opened this issue 1 year ago
+1 After comparing the FastChat inference code with Qwen's official inference code, the functions used are different, and Qwen's roles and prompt format are special. I hope FastChat can provide better support.
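For context, Qwen-Chat expects a ChatML-style prompt. A minimal sketch of what that layout looks like (the system message here is illustrative, not taken from this thread):

# Sketch of Qwen's ChatML-style prompt layout; the system line is an assumed example
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n你能做啥<|im_end|>\n"
    "<|im_start|>assistant\n"
)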
@lonngxiang When I use the new qwen-14b-chat model and the qwen-chat template, inference works normally and the strange output no longer appears.
If you find the solutions (e.g., fix model loading or chat templates), please contribute a PR.
@lonngxiang When I use the new qwen-14b-chat model and the qwen-chat template, inference works normally and the strange output no longer appears.
I updated fschat and used the new qwen-7b-chat model, but inference still gives the same error. Can you show your qwen-chat template?
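For reference, one way to print the prompt that FastChat's built-in Qwen template produces is sketched below; the template name "qwen-7b-chat" is an assumption and may differ across FastChat versions, so check fastchat/conversation.py for the exact name.

from fastchat.conversation import get_conv_template

# "qwen-7b-chat" is assumed here; look up the registered name in fastchat/conversation.py
conv = get_conv_template("qwen-7b-chat")
conv.append_message(conv.roles[0], "你能做啥")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())       # the ChatML prompt FastChat will send to the worker
print(conv.stop_token_ids)     # stop token ids the template registers (if any)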
import requests
import json

headers = {"Content-Type": "application/json"}
pload = {
    # "model": "qwen-6b-model",
    "model": "Qwen-7B-Chat",
    "prompt": "<|im_start|>user\n你能做啥" + "<|im_end|>\n<|im_start|>assistant\n",
    # "prompt": "你能做啥",
    # "stop": "<|im_start|><|im_end|>",
    "stop": ["<|im_end|>", "<|im_start|>"],
    "max_new_tokens": 512,
}
response = requests.post(
    "http://*****:21002/worker_generate_stream",
    headers=headers, json=pload, stream=True, timeout=3,
)
# print(response.text)
for chunk in response.iter_lines(chunk_size=1024, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        # print(chunk.decode("utf-8"))
        data = json.loads(chunk.decode("utf-8"))
        print(data["text"])
cc @infwinston, who successfully deployed QWen chat on our website https://chat.lmsys.org/
Thanks, I found the solution by changing the API pload, but the speed is very slow.
python -m fastchat.serve.model_worker --model-path ./Qwen-7B-Chat --num-gpus 2 --host=0.0.0.0 --port=21002
import requests
import json

headers = {"Content-Type": "application/json"}
pload = {
    # "model": "qwen-6b-model",
    "model": "Qwen-7B-Chat",
    "prompt": "<|im_start|>user\n你能做啥" + "<|im_end|>\n<|im_start|>assistant\n",
    "stop": "<|endoftext|>",
    "stop_token_ids": [151643, 151644, 151645],
    "max_new_tokens": 512,
}
response = requests.post(
    "http://*****:21002/worker_generate_stream",
    headers=headers, json=pload, stream=True, timeout=3,
)
# print(response.text)
for chunk in response.iter_lines(chunk_size=1024, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        # print(chunk.decode("utf-8"))
        data = json.loads(chunk.decode("utf-8"))
        print(data["text"])
Why is Qwen so slow to run, while chatglm2 started with the same command is very fast?
python -m fastchat.serve.model_worker --model-path ./Qwen-7B-Chat --num-gpus 2 --host=0.0.0.0 --port=21002
@lonngxiang In fact, you can use our OpenAI API server (see the guide linked below), so you don't need to worry about the template or pload. https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md
Just use the code below to chat with the model:
import openai

openai.api_key = "EMPTY"  # FastChat's OpenAI-compatible server does not check the key
openai.api_base = "http://localhost:8000/v1"  # default openai_api_server address from the guide
completion = openai.ChatCompletion.create(
    model="Qwen-7B-Chat",
    messages=[{"role": "user", "content": "Hello! What is your name?"}]
)
print(completion.choices[0].message.content)
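To confirm the model name the server actually exposes before chatting, a quick check like the following can help (this assumes the API server is running on the default localhost:8000 address from the guide above):

import openai

openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"
# List the models registered with the controller; "Qwen-7B-Chat" should appear here
print([m.id for m in openai.Model.list().data])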
Well, but my point is that if you want a front-end page, then with the OpenAI API server you still have to write a front-end service on top of this API yourself.