lonngxiang opened 11 months ago
OP, did you get any errors when testing the API? For example, does v1/models return an empty data list?
I haven't tested the OpenAI endpoints yet; what I tested is the model_worker node.
The chatglm3 tokenizer gives different results for tokenizer(prompt) and tokenizer.build_chat_input(query, role, history), and FastChat uses the first one, which is more general. However, chatglm3 has some special cases here: in the chatglm3 demo repo, they use build_chat_input instead of calling the tokenizer directly. By the way, you can pass encode_special_tokens=True when initializing the chatglm3 tokenizer, but the two results still differ in the query part.
Thanks. May I ask whether encode_special_tokens can be passed as a parameter when running model_worker?
python -m fastchat.serve.model_worker --model-path chatglm3-6b
@Vonfry The model path I pass in also contains chatglm3 and I'm on the latest version, but the test results are still wrong, so the problem is probably not here.
python -m fastchat.serve.model_worker --model-path chatglm3-6b
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("<my_path>/THUDM/chatglm3-6b", trust_remote_code = True, encode_special_tokens = True)
tokenizer.build_chat_input('hello', role = 'user')
# {'input_ids': tensor([[64790, 64792, 64795, 30910, 13, 24954, 64796]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]]), 'position_ids': tensor([[0, 1, 2, 3, 4, 5, 6]])}
tokenizer("<|user|>\n hello<|assistant|>")
# {'input_ids': [64790, 64792, 64795, 30910, 13, 24954, 64796], 'attention_mask': [1, 1, 1, 1, 1, 1, 1], 'position_ids': [0, 1, 2, 3, 4, 5, 6]}
As for whether the model run this way still produces those special markers in its output, I'm not sure either; I only recently noticed this difference.
Hmm, then maybe the problem is how I call the worker_generate_stream endpoint; calling it through this endpoint produces the problem above. Step 1, I launch chatglm3:
python -m fastchat.serve.model_worker --model-path chatglm3-6b
Step 2, call the endpoint:
import requests
import json

headers = {"Content-Type": "application/json"}
pload = {
    "model": "chatglm3-6b",
    "prompt": "<|user|>\n 介绍下广州<|assistant|>",
    "stop_token_ids": [64795, 64797, 2],
    "max_new_tokens": 512,
}
response = requests.post("http://1*****4:21002/worker_generate_stream", headers=headers, json=pload, stream=True, timeout=3)
# print(response.text)
for chunk in response.iter_lines(chunk_size=1024, decode_unicode=False, delimiter=b"\0"):
    if chunk:
        # print(chunk.decode("utf-8"))
        data = json.loads(chunk.decode("utf-8"))
        print(data["text"])
        # print(data["text"].split(" "))
So, has a solution been found in the end?
No, but using vllm works well.
I'm hitting the same problem! Running chatglm3-6b with vllm, there is no output when accessing it through the OpenAI API!
@exceedzhang Check whether it's a port conflict; that was the cause in my case.
Are you calling the worker_generate_stream endpoint? In my tests I only used that one port.
Problem analysis: in FastChat's conversation.py, the chatglm3 prompt is assembled from a template and then tokenizer.encode is called to produce the input_ids. The original chatglm3, however, uses the tokenizer's build_chat_input method to build them from the history and query, where <|user|> and <|assistant|> map directly to ids in the tokenizer's special_tokens dictionary. In other words, tokenizer.encode does not treat <|user|> and <|assistant|> as special tokens unless they are handled specially. As a result, the chatglm3 input_ids produced by this version of FastChat differ from the original model's, which causes the problem above.
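A minimal sketch of this difference (the checkpoint path is illustrative and exact token ids may vary by tokenizer version): with default settings the role markers are tokenized as plain text, whereas build_chat_input pulls them from the special-token table via get_command.

from transformers import AutoTokenizer

# Hypothetical local path; adjust to wherever your chatglm3-6b checkpoint lives.
tok = AutoTokenizer.from_pretrained("<my_path>/THUDM/chatglm3-6b", trust_remote_code=True)

prompt = "<|user|>\n 介绍下广州<|assistant|>"

# Default call: "<|user|>" / "<|assistant|>" are split into ordinary text tokens.
plain_ids = tok(prompt)["input_ids"]

# Path used by the original chatglm3: role markers come from the tokenizer's
# special-token table (see get_command in tokenization_chatglm.py).
chat_ids = tok.build_chat_input("介绍下广州", role="user")["input_ids"].tolist()[0]

user_id = tok.get_command("<|user|>")
print(user_id in plain_ids)  # expected False with the default tokenizer settings
print(user_id in chat_ids)   # expected True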
One workable fix (launching chatglm3-6b-32k with the vllm worker): have the chatglm3 template in conversation.py return the messages directly (fastchat/conversation.py):
elif self.sep_style == SeparatorStyle.CHATGLM3:
    # ret = ""
    # if self.system_message:
    #     ret += system_prompt
    # for role, message in self.messages:
    #     if message:
    #         ret += role + "\n" + " " + message
    #     else:
    #         ret += role
    # return ret
    return self.messages
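For reference, the worker-side code below indexes into this list as messages[i][1] and messages[-2][1], which assumes FastChat's Conversation.messages convention of [role, content] pairs ending with the pending user turn plus an empty assistant slot. A purely illustrative sketch of that assumed shape (role strings depend on the chatglm3 conv template):

# Illustrative only: the assumed shape of the messages list returned by the
# modified CHATGLM3 template.
messages = [
    ["<|user|>", "hello"],           # earlier user turn
    ["<|assistant|>", "Hi there."],  # earlier assistant reply
    ["<|user|>", "介绍下广州"],       # current query -> messages[-2][1]
    ["<|assistant|>", None],         # empty slot for the pending reply
]

assert messages[-2][1] == "介绍下广州"  # what the worker passes to build_chat_input as `query`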
Then, in vllm_worker.py, build the history and query from the messages, call the tokenizer's build_chat_input method to produce the correct input_ids, and pass the converted input_ids to engine.generate instead of the original prompt string. Finally, the output needs some post-processing. The code is as follows (fastchat/serve/vllm_worker.py):
class VLLMWorker(BaseModelWorker):
    def __init__(
        self,
        controller_addr: str,
        worker_addr: str,
        worker_id: str,
        model_path: str,
        model_names: List[str],
        limit_worker_concurrency: int,
        no_register: bool,
        llm_engine: AsyncLLMEngine,
        conv_template: str,
    ):
        super().__init__(
            controller_addr,
            worker_addr,
            worker_id,
            model_path,
            model_names,
            limit_worker_concurrency,
            conv_template,
        )

        logger.info(
            f"Loading the model {self.model_names} on worker {worker_id}, worker type: vLLM worker..."
        )
        self.tokenizer = llm_engine.engine.tokenizer
        self.context_len = get_context_length(llm_engine.engine.model_config.hf_config)
        # Special handling for chatglm3
        self.is_chatglm3 = "chatglm3" in model_path

        if not no_register:
            self.init_heart_beat()

    async def generate_stream(self, params):
        self.call_ct += 1

        context = params.pop("prompt")
        # Build history and query from the messages, then call build_chat_input to get the input_ids
        if self.is_chatglm3:
            messages = context
            hist = []
            for i in range(0, len(messages), 2):
                hist.append({"role": "user", "content": messages[i][1]})
                hist.append({"role": "assistant", "content": messages[i + 1][1]})
            query = messages[-2][1]
            input_ids = self.tokenizer.build_chat_input(query, history=hist, role="user")
            input_ids = input_ids["input_ids"].tolist()[0]

        request_id = params.pop("request_id")
        temperature = float(params.get("temperature", 1.0))
        top_p = float(params.get("top_p", 1.0))
        top_k = params.get("top_k", -1.0)
        presence_penalty = float(params.get("presence_penalty", 0.0))
        frequency_penalty = float(params.get("frequency_penalty", 0.0))
        max_new_tokens = params.get("max_new_tokens", 256)
        stop_str = params.get("stop", None)
        stop_token_ids = params.get("stop_token_ids", None) or []
        if self.tokenizer.eos_token_id is not None:
            stop_token_ids.append(self.tokenizer.eos_token_id)
        echo = params.get("echo", True)
        use_beam_search = params.get("use_beam_search", False)
        best_of = params.get("best_of", None)

        # Handle stop_str
        stop = set()
        if isinstance(stop_str, str) and stop_str != "":
            stop.add(stop_str)
        elif isinstance(stop_str, list) and stop_str != []:
            stop.update(stop_str)

        for tid in stop_token_ids:
            if tid is not None:
                stop.add(self.tokenizer.decode(tid))

        # make sampling params in vllm
        top_p = max(top_p, 1e-5)
        if temperature <= 1e-5:
            top_p = 1.0
        sampling_params = SamplingParams(
            n=1,
            temperature=temperature,
            top_p=top_p,
            use_beam_search=use_beam_search,
            stop=list(stop),
            stop_token_ids=stop_token_ids,
            max_tokens=max_new_tokens,
            top_k=top_k,
            presence_penalty=presence_penalty,
            frequency_penalty=frequency_penalty,
            best_of=best_of,
        )
        # For chatglm3, pass the converted input_ids to the engine instead of the prompt string
        if self.is_chatglm3:
            results_generator = engine.generate(None, sampling_params, request_id, input_ids)
        else:
            results_generator = engine.generate(context, sampling_params, request_id)

        async for request_output in results_generator:
            prompt = request_output.prompt
            if echo:
                text_outputs = [
                    prompt + output.text for output in request_output.outputs
                ]
            else:
                text_outputs = [output.text for output in request_output.outputs]
            text_outputs = " ".join(text_outputs)

            partial_stop = any(is_partial_stop(text_outputs, i) for i in stop)
            # prevent yielding partial stop sequence
            if partial_stop:
                continue

            prompt_tokens = len(request_output.prompt_token_ids)
            completion_tokens = sum(
                len(output.token_ids) for output in request_output.outputs
            )
            # Post-process the generated text
            if self.is_chatglm3:
                temp = text_outputs.split("\n", maxsplit=1)
                text_outputs = temp[-1].strip().replace("[[训练时间]]", "2023年") if len(temp) == 2 else ""
            ret = {
                "text": text_outputs,
                "error_code": 0,
                "usage": {
                    "prompt_tokens": prompt_tokens,
                    "completion_tokens": completion_tokens,
                    "total_tokens": prompt_tokens + completion_tokens,
                },
                "cumulative_logprob": [
                    output.cumulative_logprob for output in request_output.outputs
                ],
                "finish_reason": request_output.outputs[0].finish_reason
                if len(request_output.outputs) == 1
                else [output.finish_reason for output in request_output.outputs],
            }
            # Emit twice here to ensure a 'finish_reason' with empty content in the OpenAI API response.
            # This aligns with the behavior of model_worker.
            if request_output.finished:
                yield (json.dumps(ret | {"finish_reason": None}) + "\0").encode()
            yield (json.dumps(ret) + "\0").encode()

    async def generate(self, params):
        async for x in self.generate_stream(params):
            pass
        return json.loads(x[:-1].decode())
After these changes, the pre- and post-processing match the original chatglm3, and the output is normal.
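As a quick sanity check (a sketch only, assuming the controller, the patched vllm worker, and fastchat.serve.openai_api_server are already running with default settings; host, port, and model name here are illustrative), the model can be queried through the OpenAI-compatible endpoint:

import requests

# Assumes the OpenAI-compatible API server listens on localhost:8000 and the
# worker registered the model as "chatglm3-6b-32k"; adjust both as needed.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "chatglm3-6b-32k",
        "messages": [{"role": "user", "content": "介绍下广州"}],
        "max_tokens": 512,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])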
Without the vllm worker: if you are not using the vllm worker, you can derive query and history directly from the messages and call the tokenizer's build_chat_input method to build the inputs. This requires modifying fastchat/model/model_chatglm.py:
@torch.inference_mode()
def generate_stream_chatglm(
    model,
    tokenizer,
    params,
    device,
    context_len=2048,
    stream_interval=2,
    judge_sent_end=False,
):
    prompt = params["prompt"]
    temperature = float(params.get("temperature", 1.0))
    repetition_penalty = float(params.get("repetition_penalty", 1.0))
    top_p = float(params.get("top_p", 1.0))
    max_new_tokens = int(params.get("max_new_tokens", 256))
    echo = params.get("echo", True)

    # For chatglm3, use the tokenizer's build_chat_input method to build the inputs
    is_chatglm3 = "chatglm3" in params["model"]
    if is_chatglm3:
        messages = prompt
        hist = []
        for i in range(0, len(messages), 2):
            hist.append({"role": "user", "content": messages[i][1]})
            hist.append({"role": "assistant", "content": messages[i + 1][1]})
        query = messages[-2][1]
        inputs = tokenizer.build_chat_input(query, history=hist, role="user").to(model.device)
    else:
        inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    input_echo_len = len(inputs["input_ids"][0])

    gen_kwargs = {
        "max_length": max_new_tokens + input_echo_len,
        "do_sample": True if temperature > 1e-5 else False,
        "top_p": top_p,
        "repetition_penalty": repetition_penalty,
        "logits_processor": [invalid_score_processor],
    }
    if temperature > 1e-5:
        gen_kwargs["temperature"] = temperature

    total_len = 0
    for total_ids in model.stream_generate(**inputs, **gen_kwargs):
        total_ids = total_ids.tolist()[0]
        total_len = len(total_ids)
        if echo:
            output_ids = total_ids
        else:
            output_ids = total_ids[input_echo_len:]
        response = tokenizer.decode(output_ids)
        response = process_response(response)

        yield {
            "text": response,
            "usage": {
                "prompt_tokens": input_echo_len,
                "completion_tokens": total_len - input_echo_len,
                "total_tokens": total_len,
            },
            "finish_reason": None,
        }
References:
https://huggingface.co/THUDM/chatglm3-6b-32k/blob/main/modeling_chatglm.py
https://huggingface.co/THUDM/chatglm3-6b-32k/blob/main/tokenization_chatglm.py
After applying the changes above:
How should the API request payload and the stop parameter be written? The generated results are still not correct.
deploy model: python -m fastchat.serve.model_worker --model-path chatglm3-6b
code: