Geaming-CHN opened 1 month ago
As described in https://huggingface.co/THUDM/glm-4-9b-chat, vLLM can support GLM-4-9B-Chat directly.
When I use openai_api_server to call the model, it can't stop talking.
Same error here.
Did you add the stop_token_ids of ChatGLM when configuring openai_api_server? You could configure the generate function in api_server.py with non-default SamplingParams, or just pass the stop_token_ids parameter through the request_dict (see vllm/entrypoints/api_server.py line 47).
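For example, something like this should work against the demo server (a minimal sketch; the host/port are placeholders and max_tokens is an illustrative value — the demo server forwards extra JSON fields into SamplingParams):

import requests

# The demo server at vllm/entrypoints/api_server.py pops "prompt" from the
# request body and passes the remaining fields into SamplingParams, so
# stop_token_ids can be sent directly in the JSON payload.
response = requests.post(
    "http://localhost:8000/generate",  # placeholder host/port
    json={
        "prompt": "你好",
        "max_tokens": 256,
        "stop_token_ids": [151329, 151336, 151338],  # GLM-4 stop tokens
    },
)
print(response.json())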
I think you can refer to:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4

# GLM-4-9B-Chat
# If you hit OOM, reduce max_model_len or increase tp_size.
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M, enable the following parameters if you hit OOM:
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
This snippet is copied from https://huggingface.co/THUDM/glm-4-9b-chat.
How do I use this with the OpenAI interface?
AsyncEngineArgs
Specifically, how should the OpenAI interface be modified? Could you please write it out in full?
@godcrying Solved; just pass the corresponding parameters through extra_body in the OpenAI interface:
from openai import OpenAI
# from openai._client import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://192.1****:10860/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Streaming response
stream = client.chat.completions.create(
    model="/glm-9b",
    messages=[{"role": "user", "content": "你是谁"}],
    extra_body={
        "stop_token_ids": [151329, 151336, 151338]
    },
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        # print(chunk.choices[0].delta.content)
        print(chunk.choices[0].delta.content, end="")
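For a non-streaming call the same extra_body applies; a minimal sketch reusing the client object above:

# Non-streaming variant: omit stream=True and read the full message.
completion = client.chat.completions.create(
    model="/glm-9b",
    messages=[{"role": "user", "content": "你是谁"}],
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
)
print(completion.choices[0].message.content)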
Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?
import requests

url = "http://*****:10860/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "EMPTY"
}
data = {
    "model": "/glm-9b",
    "messages": [{"role": "user", "content": "你是谁"}],
    "stop_token_ids": [151329, 151336, 151338],
    "stream": True
}
response = requests.post(url, headers=headers, json=data, stream=True)
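To actually print the streamed tokens you still need to consume the response; a minimal sketch, assuming the server emits standard OpenAI-style SSE lines prefixed with "data: " and terminated by "data: [DONE]":

import json

# Each event line looks like:
# data: {"choices": [{"delta": {"content": "..."}}], ...}
for line in response.iter_lines():
    if not line:
        continue
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    if delta.get("content"):
        print(delta["content"], end="", flush=True)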
Could you please tell me how the server side is written?
https://blog.csdn.net/weixin_42357472/article/details/139504731
https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#
Hey guys, I wonder if you know how to use function calling with glm-4-9b-chat through vLLM?
Run cp config.json generation_config.json in the model dir (presumably so the eos_token_id entries in config.json are picked up as generation stop tokens).
When I use vLLM to start serving the glm-4-9b-chat-1m model, there is an error: RuntimeError: Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.
python -m vllm.entrypoints.openai.api_server \
--model /path/to/glm-4-9b-chat \
--served-model-name glm-4-9b-chat \
--max-model-len 1024 \
--trust-remote-code \
--host=0.0.0.0 --port=8001 --enforce-eager
Running the command above to launch the server and using the request code provided by @lonngxiang works perfectly.
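As a quick sanity check against that server (a hedged sketch; host, port, and the served model name follow the launch command above):

import requests

# stop_token_ids is accepted by vLLM's OpenAI-compatible server as an
# extra (non-standard) request field.
resp = requests.post(
    "http://0.0.0.0:8001/v1/chat/completions",
    json={
        "model": "glm-4-9b-chat",
        "messages": [{"role": "user", "content": "你好"}],
        "stop_token_ids": [151329, 151336, 151338],
    },
)
print(resp.json()["choices"][0]["message"]["content"])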
The model to consider.
https://huggingface.co/THUDM/glm-4-9b-chat
The closest model vllm already supports.
chatglm
What's your difficulty of supporting the model you want?
No response