lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Run the qwen model and the response is messy #2473

Open lonngxiang opened 1 year ago

lonngxiang commented 1 year ago

(screenshots of the garbled model output)

lonngxiang commented 1 year ago

(screenshot of the garbled model output)

kyriekevin commented 1 year ago

+1. After comparing FastChat's inference code with Qwen's official inference code, the functions used are different, and Qwen's roles and prompt format are special. I hope FastChat can provide better support.

kyriekevin commented 1 year ago

@lonngxiang When I use the new qwen-14b-chat model with the qwen-chat template, inference works normally and the garbled output no longer appears.
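For reference, a minimal sketch of building the prompt through FastChat's conversation registry instead of by hand (the template name "qwen-7b-chat" is an assumption based on fastchat/conversation.py; check the name in your installed version):

import fastchat.conversation as fc_conv

# Load the ChatML-style template registered for Qwen chat models
# (template name assumed; it may differ across FastChat versions).
conv = fc_conv.get_conv_template("qwen-7b-chat")
conv.append_message(conv.roles[0], "What can you do?")  # user turn
conv.append_message(conv.roles[1], None)                # leave the assistant turn open
prompt = conv.get_prompt()
print(prompt)  # should be a sequence of <|im_start|>role ... <|im_end|> blocks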

merrymercy commented 1 year ago

If you find the solutions (e.g., fix model loading or chat templates), please contribute a PR.

lonngxiang commented 1 year ago

@lonngxiang When I use the new qwen-14b-chat model with the qwen-chat template, inference works normally and the garbled output no longer appears.

I updated fschat and used the new qwen-7b-chat model, but the inference gives the same error. Can you show your qwen-chat template?
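One way to see which template a running worker uses is to ask the worker itself; a sketch, assuming the model_worker still exposes a worker_get_conv_template endpoint (verify against your FastChat version):

import requests

# Ask the running worker for its conversation template
# (endpoint name assumed from fastchat.serve.model_worker).
resp = requests.post("http://*****:21002/worker_get_conv_template", json={})
print(resp.json())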

lonngxiang commented 1 year ago

import requests
import json

headers = {"Content-Type": "application/json"}
pload = {
    "model": "Qwen-7B-Chat",
    # Qwen expects the ChatML format: <|im_start|>role\n...<|im_end|>
    "prompt": "<|im_start|>user\n你能做啥<|im_end|>\n<|im_start|>assistant\n",  # "What can you do?"
    "stop": ["<|im_end|>", "<|im_start|>"],
    "max_new_tokens": 512,
}
response = requests.post(
    "http://*****:21002/worker_generate_stream",
    headers=headers, json=pload, stream=True, timeout=3,
)
# The worker streams null-byte-delimited JSON chunks.
for chunk in response.iter_lines(chunk_size=1024, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode("utf-8"))
        print(data["text"])

(screenshot of the response)

merrymercy commented 1 year ago

cc @infwinston, who successfully deployed QWen chat on our website https://chat.lmsys.org/

lonngxiang commented 1 year ago

cc @infwinston, who successfully deployed QWen chat on our website https://chat.lmsys.org/

Thanks, I found a solution by changing the API pload, but the speed is very slow.

python -m fastchat.serve.model_worker --model-path ./Qwen-7B-Chat --num-gpus 2 --host=0.0.0.0 --port=21002

import requests
import json

headers = {"Content-Type": "application/json"}
pload = {
    "model": "Qwen-7B-Chat",
    "prompt": "<|im_start|>user\n你能做啥<|im_end|>\n<|im_start|>assistant\n",  # "What can you do?"
    # Stop on Qwen's special tokens rather than plain strings:
    # 151643 = <|endoftext|>, 151644 = <|im_start|>, 151645 = <|im_end|>
    "stop": "<|endoftext|>",
    "stop_token_ids": [151643, 151644, 151645],
    "max_new_tokens": 512,
}
response = requests.post(
    "http://*****:21002/worker_generate_stream",
    headers=headers, json=pload, stream=True, timeout=3,
)
for chunk in response.iter_lines(chunk_size=1024, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        data = json.loads(chunk.decode("utf-8"))
        print(data["text"])
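
As a sanity check, the hard-coded stop_token_ids can be compared against the model's own tokenizer; a sketch, assuming the official Qwen tokenizer exposes eod_id / im_start_id / im_end_id as in the Qwen repo:

from transformers import AutoTokenizer

# Qwen ships a custom tokenizer, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained("./Qwen-7B-Chat", trust_remote_code=True)

# Attribute names assumed from Qwen's tokenization_qwen.py; verify against your checkpoint.
print(tokenizer.eod_id, tokenizer.im_start_id, tokenizer.im_end_id)
# Expected: 151643 151644 151645 for <|endoftext|>, <|im_start|>, <|im_end|>
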
lonngxiang commented 1 year ago

Why is qwen so slow to run, while chatglm2 is very fast with the same command?

python -m fastchat.serve.model_worker --model-path ./Qwen-7B-Chat --num-gpus 2 --host=0.0.0.0 --port=21002

infwinston commented 1 year ago

@lonngxiang In fact, you can use our openai api server (see guide below), so you don't need to worry about the template or pload. https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md
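Per that guide, the stack is started as three processes (commands paraphrased from docs/openai_api.md; adjust paths and ports as needed):

python -m fastchat.serve.controller
python -m fastchat.serve.model_worker --model-path ./Qwen-7B-Chat
python -m fastchat.serve.openai_api_server --host localhost --port 8000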

Then just use the snippet below to chat with the model:

import openai

openai.api_key = "EMPTY"  # FastChat's server does not check the API key
openai.api_base = "http://localhost:8000/v1"
completion = openai.ChatCompletion.create(
    model=model,  # e.g. "Qwen-7B-Chat"
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
lonngxiang commented 1 year ago

our openai api server

Well, but my point is that if you want a front-end page, then even with the openai api server you still have to write a front-end service on top of this API.