lonngxiang opened 11 months ago
OP, did you get any errors when testing the API? For example, does v1/models return an empty data list?
I haven't tested the OpenAI endpoints yet; what I tested is the model_worker node.
The chatglm3 tokenizer gives different results for tokenizer(prompt) and tokenizer.build_chat_input(query, role, history), and FastChat uses the first one, which is more general. However, chatglm3 has some special cases here: in the chatglm3 demo repo, they use build_chat_input instead of calling the tokenizer directly. By the way, you can pass encode_special_tokens=True when initializing the chatglm3 tokenizer, but the two results still differ in the query part.
Thanks. May I ask whether encode_special_tokens can be passed as a parameter when running model_worker?
python -m fastchat.serve.model_worker --model-path chatglm3-6b
@Vonfry The model path I pass in also contains chatglm3 and I'm on the latest version, but the test results are still wrong, so the problem is probably not here.
python -m fastchat.serve.model_worker --model-path chatglm3-6b
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("<my_path>/THUDM/chatglm3-6b", trust_remote_code = True, encode_special_tokens = True)
tokenizer.build_chat_input('hello', role = 'user')
# {'input_ids': tensor([[64790, 64792, 64795, 30910, 13, 24954, 64796]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'position_ids': tensor([[0, 1, 2, 3, 4, 5, 6]])}
tokenizer("<|user|>\n hello<|assistant|>")
# {'input_ids': [64790, 64792, 64795, 30910, 13, 24954, 64796], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'position_ids': [0, 1, 2, 3, 4, 5, 6]}
As for whether the model run this way still produces those special markers in its output, I'm not sure either; I only recently noticed this difference.
Hmm, then maybe the problem is how I call the worker_generate_stream endpoint; calling it through this endpoint produces the problem above. Step 1, I launch chatglm3:
python -m fastchat.serve.model_worker --model-path chatglm3-6b
Step 2, call the endpoint:
import requests
import json

headers = {"Content-Type": "application/json"}
pload = {
    "model": "chatglm3-6b",
    "prompt": "<|user|>\n 介绍下广州<|assistant|>",
    "stop_token_ids": [64795, 64797, 2],
    "max_new_tokens": 512,
}
response = requests.post("http://1*****4:21002/worker_generate_stream", headers=headers, json=pload, stream=True, timeout=3)
# print(response.text)
for chunk in response.iter_lines(chunk_size=1024, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        # print(chunk.decode("utf-8"))
        data = json.loads(chunk.decode("utf-8"))
        print(data["text"])
        # print(data["text"].split(" "))
So, has a solution been found in the end?
No, but using vllm works well.
I'm hitting the same problem! Running chatglm3-6b with vllm, there is no output when accessing it through the OpenAI API!
@exceedzhang Check whether it's a port conflict; that was the cause in my case.
Are you calling the worker_generate_stream endpoint? In my tests I only used that one port.
Problem analysis: in FastChat's conversation.py, the chatglm3 prompt is assembled from a template and then tokenizer.encode is called to produce the input_ids. The original chatglm3, however, uses the tokenizer's build_chat_input method to build them from the history and query, where <|user|> and <|assistant|> map directly to ids in the tokenizer's special_tokens dictionary. In other words, tokenizer.encode does not treat <|user|> and <|assistant|> as special tokens unless they are handled specially. As a result, the chatglm3 input_ids produced by this version of FastChat differ from the original model's, which causes the problem above.
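A minimal sketch of this difference (the checkpoint path is illustrative and exact token ids may vary by tokenizer version): with default settings the role markers are tokenized as plain text, whereas build_chat_input pulls them from the special-token table via get_command.

from transformers import AutoTokenizer

# Hypothetical local path; adjust to wherever your chatglm3-6b checkpoint lives.
tok = AutoTokenizer.from_pretrained("<my_path>/THUDM/chatglm3-6b", trust_remote_code=True)

prompt = "<|user|>\n 介绍下广州<|assistant|>"

# Default call: "<|user|>" / "<|assistant|>" are split into ordinary text tokens.
plain_ids = tok(prompt)["input_ids"]

# Path used by the original chatglm3: role markers come from the tokenizer's
# special-token table (see get_command in tokenization_chatglm.py).
chat_ids = tok.build_chat_input("介绍下广州", role="user")["input_ids"].tolist()[0]

user_id = tok.get_command("<|user|>")
print(user_id in plain_ids)  # expected False with the default tokenizer settings
print(user_id in chat_ids)   # expected True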
One workable fix (launching chatglm3-6b-32k with the vllm worker): have the chatglm3 template in conversation.py return the messages directly (fastchat/conversation.py):
elif self.sep_style == SeparatorStyle.CHATGLM3:
    # ret = ""
    # if self.system_message:
    #     ret += system_prompt
    # for role, message in self.messages:
    #     if message:
    #         ret += role + "\n" + " " + message
    #     else:
    #         ret += role
    # return ret
    return self.messages
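For reference, the worker-side code below indexes into this list as messages[i][1] and messages[-2][1], which assumes FastChat's Conversation.messages convention of [role, content] pairs ending with the pending user turn plus an empty assistant slot. A purely illustrative sketch of that assumed shape (role strings depend on the chatglm3 conv template):

# Illustrative only: the assumed shape of the messages list returned by the
# modified CHATGLM3 template.
messages = [
    ["<|user|>", "hello"],           # earlier user turn
    ["<|assistant|>", "Hi there."],  # earlier assistant reply
    ["<|user|>", "介绍下广州"],       # current query -> messages[-2][1]
    ["<|assistant|>", None],         # empty slot for the pending reply
]

assert messages[-2][1] == "介绍下广州"  # what the worker passes to build_chat_input as `query`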
Then, in vllm_worker.py, build the history and query from the messages, call the tokenizer's build_chat_input method to produce the correct input_ids, and pass the converted input_ids to engine.generate instead of the original prompt string. Finally, the output needs some post-processing. The code is as follows (fastchat/serve/vllm_worker.py):
class VLLMWorker(BaseModelWorker):
    def __init__(
        self,
        controller_addr: str,
        worker_addr: str,
        worker_id: str,
        model_path: str,
        model_names: List[str],
        limit_worker_concurrency: int,
        no_register: bool,
        llm_engine: AsyncLLMEngine,
        conv_template: str,
    ):
        super().__init__(
            controller_addr,
            worker_addr,
            worker_id,
            model_path,
            model_names,
            limit_worker_concurrency,
            conv_template,
        )

        logger.info(
            f"Loading the model {self.model_names} on worker {worker_id}, worker type: vLLM worker..."
        )
        self.tokenizer = llm_engine.engine.tokenizer
        self.context_len = get_context_length(llm_engine.engine.model_config.hf_config)
        # Special handling for chatglm3
        self.is_chatglm3 = "chatglm3" in model_path

        if not no_register:
            self.init_heart_beat()

    async def generate_stream(self, params):
        self.call_ct += 1

        context = params.pop("prompt")
        # Build history and query from the messages, then call build_chat_input to get the input_ids
        if self.is_chatglm3:
            messages = context
            hist = []
            for i in range(0, len(messages), 2):
                hist.append({"role": "user", "content": messages[i][1]})
                hist.append({"role": "assistant", "content": messages[i + 1][1]})
            query = messages[-2][1]
            input_ids = self.tokenizer.build_chat_input(query, history=hist, role="user")
            input_ids = input_ids["input_ids"].tolist()[0]

        request_id = params.pop("request_id")
        temperature = float(params.get("temperature", 1.0))
        top_p = float(params.get("top_p", 1.0))
        top_k = params.get("top_k", -1.0)
        presence_penalty = float(params.get("presence_penalty", 0.0))
        frequency_penalty = float(params.get("frequency_penalty", 0.0))
        max_new_tokens = params.get("max_new_tokens", 256)
        stop_str = params.get("stop", None)
        stop_token_ids = params.get("stop_token_ids", None) or []
        if self.tokenizer.eos_token_id is not None:
            stop_token_ids.append(self.tokenizer.eos_token_id)
        echo = params.get("echo", True)
        use_beam_search = params.get("use_beam_search", False)
        best_of = params.get("best_of", None)

        # Handle stop_str
        stop = set()
        if isinstance(stop_str, str) and stop_str != "":
            stop.add(stop_str)
        elif isinstance(stop_str, list) and stop_str != []:
            stop.update(stop_str)

        for tid in stop_token_ids:
            if tid is not None:
                stop.add(self.tokenizer.decode(tid))

        # make sampling params in vllm
        top_p = max(top_p, 1e-5)
        if temperature <= 1e-5:
            top_p = 1.0
        sampling_params = SamplingParams(
            n=1,
            temperature=temperature,
            top_p=top_p,
            use_beam_search=use_beam_search,
            stop=list(stop),
            stop_token_ids=stop_token_ids,
            max_tokens=max_new_tokens,
            top_k=top_k,
            presence_penalty=presence_penalty,
            frequency_penalty=frequency_penalty,
            best_of=best_of,
        )
        # For chatglm3, pass the converted input_ids to the engine instead of the prompt string
        if self.is_chatglm3:
            results_generator = engine.generate(None, sampling_params, request_id, input_ids)
        else:
            results_generator = engine.generate(context, sampling_params, request_id)

        async for request_output in results_generator:
            prompt = request_output.prompt
            if echo:
                text_outputs = [
                    prompt + output.text for output in request_output.outputs
                ]
            else:
                text_outputs = [output.text for output in request_output.outputs]
            text_outputs = " ".join(text_outputs)

            partial_stop = any(is_partial_stop(text_outputs, i) for i in stop)
            # prevent yielding partial stop sequence
            if partial_stop:
                continue

            prompt_tokens = len(request_output.prompt_token_ids)
            completion_tokens = sum(
                len(output.token_ids) for output in request_output.outputs
            )
            # Post-process the generated text
            if self.is_chatglm3:
                temp = text_outputs.split("\n", maxsplit=1)
                text_outputs = temp[-1].strip().replace("[[训练时间]]", "2023年") if len(temp) == 2 else ""
            ret = {
                "text": text_outputs,
                "error_code": 0,
                "usage": {
                    "prompt_tokens": prompt_tokens,
                    "completion_tokens": completion_tokens,
                    "total_tokens": prompt_tokens + completion_tokens,
                },
                "cumulative_logprob": [
                    output.cumulative_logprob for output in request_output.outputs
                ],
                "finish_reason": request_output.outputs[0].finish_reason
                if len(request_output.outputs) == 1
                else [output.finish_reason for output in request_output.outputs],
            }
            # Emit twice here to ensure a 'finish_reason' with empty content in the OpenAI API response.
            # This aligns with the behavior of model_worker.
            if request_output.finished:
                yield (json.dumps(ret | {"finish_reason": None}) + "\0").encode()
            yield (json.dumps(ret) + "\0").encode()

    async def generate(self, params):
        async for x in self.generate_stream(params):
            pass
        return json.loads(x[:-1].decode())
After these changes, the pre- and post-processing match the original chatglm3, and the output is normal.
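As a quick sanity check (a sketch only, assuming the controller, the patched vllm worker, and fastchat.serve.openai_api_server are already running with default settings; host, port, and model name here are illustrative), the model can be queried through the OpenAI-compatible endpoint:

import requests

# Assumes the OpenAI-compatible API server listens on localhost:8000 and the
# worker registered the model as "chatglm3-6b-32k"; adjust both as needed.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "chatglm3-6b-32k",
        "messages": [{"role": "user", "content": "介绍下广州"}],
        "max_tokens": 512,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])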
Without the vllm worker: if you are not using the vllm worker, you can derive query and history directly from the messages and call the tokenizer's build_chat_input method to build the inputs. This requires modifying fastchat/model/model_chatglm.py:
@torch.inference_mode()
def generate_stream_chatglm(
    model,
    tokenizer,
    params,
    device,
    context_len=2048,
    stream_interval=2,
    judge_sent_end=False,
):
    prompt = params["prompt"]
    temperature = float(params.get("temperature", 1.0))
    repetition_penalty = float(params.get("repetition_penalty", 1.0))
    top_p = float(params.get("top_p", 1.0))
    max_new_tokens = int(params.get("max_new_tokens", 256))
    echo = params.get("echo", True)

    # For chatglm3, use the tokenizer's build_chat_input method to build the inputs
    is_chatglm3 = "chatglm3" in params["model"]
    if is_chatglm3:
        messages = prompt
        hist = []
        for i in range(0, len(messages), 2):
            hist.append({"role": "user", "content": messages[i][1]})
            hist.append({"role": "assistant", "content": messages[i + 1][1]})
        query = messages[-2][1]
        inputs = tokenizer.build_chat_input(query, history=hist, role="user").to(model.device)
    else:
        inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    input_echo_len = len(inputs["input_ids"][0])

    gen_kwargs = {
        "max_length": max_new_tokens + input_echo_len,
        "do_sample": True if temperature > 1e-5 else False,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "logits_processor": [invalid_score_processor],
    }
    if temperature > 1e-5:
        gen_kwargs["temperature"] = temperature

    total_len = 0
    for total_ids in model.stream_generate(**inputs, **gen_kwargs):
        total_ids = total_ids.tolist()[0]
        total_len = len(total_ids)
        if echo:
            output_ids = total_ids
        else:
            output_ids = total_ids[input_echo_len:]
        response = tokenizer.decode(output_ids)
        response = process_response(response)

        yield {
            "text": response,
            "usage": {
                "prompt_tokens": input_echo_len,
                "completion_tokens": total_len - input_echo_len,
                "total_tokens": total_len,
            },
            "finish_reason": None,
        }
References:
https://huggingface.co/THUDM/chatglm3-6b-32k/blob/main/modeling_chatglm.py
https://huggingface.co/THUDM/chatglm3-6b-32k/blob/main/tokenization_chatglm.py
After applying the changes above:
How should the API request payload and the stop parameter be written? The generated results are still not correct.
deploy model: python -m fastchat.serve.model_worker --model-path chatglm3-6b
code: