vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

GLM-4-9B-Chat: #5306

Open Geaming-CHN opened 1 month ago

Geaming-CHN commented 1 month ago

The model to consider.

https://huggingface.co/THUDM/glm-4-9b-chat

The closest model vllm already supports.

chatglm

What's your difficulty of supporting the model you want?

No response

jeejeelee commented 1 month ago

As described in https://huggingface.co/THUDM/glm-4-9b-chat, vLLM can support GLM-4-9B-Chat directly.

godcrying commented 1 month ago

When I use openai_api_server to call the model, it can't stop talking.

lonngxiang commented 1 month ago

When I use openai_api_server to call the model, it can't stop talking.

Same error here.

ShangmingCai commented 1 month ago

When I use openai_api_server to call the model, it can't stop talking.

Did you add the stop_token_ids of ChatGLM when configuring openai_api_server? You could configure the generate function in api_server.py with a non-default SamplingParams, or just pass the stop_token_ids parameter through the request_dict (see vllm/entrypoints/api_server.py line 47).
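
For example, a minimal sketch (assuming the demo api_server is running locally on the default port 8000 and that extra JSON fields in the request body are forwarded into SamplingParams):

import requests

# Demo server assumed started with:
#   python -m vllm.entrypoints.api_server --model THUDM/glm-4-9b-chat --trust-remote-code
url = "http://localhost:8000/generate"

payload = {
    "prompt": "你好",
    "max_tokens": 256,
    # GLM-4 stop tokens, so generation terminates instead of running on.
    "stop_token_ids": [151329, 151336, 151338],
}

# Extra keys in the request body are treated as SamplingParams by the demo server.
response = requests.post(url, json=payload)
print(response.json())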

jeejeelee commented 1 month ago

I think you can refer to:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# GLM-4-9B-Chat-1M
# max_model_len, tp_size = 1048576, 4

# GLM-4-9B-Chat
# If you encounter OOM, reduce max_model_len or increase tp_size.
max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    # For GLM-4-9B-Chat-1M: if you encounter OOM, enable the parameters below.
    # enable_chunked_prefill=True,
    # max_num_batched_tokens=8192
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

This snippet is copied from https://huggingface.co/THUDM/glm-4-9b-chat

lonngxiang commented 1 month ago

I think you can refer to: (snippet quoted above)

How do I use this through the OpenAI-compatible API?

lonngxiang commented 1 month ago

AsyncEngineArgs

Specifically, how should the OpenAI API server be configured? Could you please write it out in full?

lonngxiang commented 1 month ago

@godcrying Solved: just pass the corresponding parameters via extra_body when calling the OpenAI API.

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://192.1****:10860/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Streaming response
stream = client.chat.completions.create(
    model="/glm-9b",
    messages=[{"role": "user", "content": "你是谁"}],
    # GLM-4 stop tokens, passed through vLLM's extra_body extension.
    extra_body={
        "stop_token_ids": [151329, 151336, 151338]
    },
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
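
A non-streaming variant should work the same way via extra_body (a minimal sketch, reusing the server address and model name above):

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://192.1****:10860/v1",  # same vLLM server as above
)

# Non-streaming request; stop_token_ids still go through vLLM's extra_body extension.
completion = client.chat.completions.create(
    model="/glm-9b",
    messages=[{"role": "user", "content": "你是谁"}],
    extra_body={"stop_token_ids": [151329, 151336, 151338]},
)
print(completion.choices[0].message.content)
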
orderer0001 commented 1 month ago

Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?

lonngxiang commented 1 month ago

Is it possible to achieve streaming output without using the OpenAI interface? Is there an example?

import requests

url = "http://*****:10860/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "EMPTY"
}
data = {
    "model": "/glm-9b",
    "messages": [{"role": "user", "content": "你是谁"}],
    "stop_token_ids": [151329, 151336, 151338],
    "stream": True
}

# stream=True keeps the connection open so the SSE chunks can be read as they arrive.
response = requests.post(url, headers=headers, json=data, stream=True)
for line in response.iter_lines(decode_unicode=True):
    if line:
        print(line)

orderer0001 commented 1 month ago

(quoting the requests example above)

Could you please tell me how the server side is set up?

lonngxiang commented 1 month ago

Could you please tell me how the server side is set up?

https://blog.csdn.net/weixin_42357472/article/details/139504731

https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#

lllloda commented 1 month ago

Hey guys, does anyone know how to use function calling with glm-4-9b-chat through vLLM?

lockmatrix commented 1 month ago

Could you please tell me how the server side is set up?

Run cp config.json generation_config.json in the model directory.

gabohouhou commented 1 month ago

When I use vLLM to start serving the glm-4-9b-chat-1m model, there is an error: RuntimeError: Failed to load the model config. If the model is a custom model not yet available in the HuggingFace transformers library, consider setting trust_remote_code=True in LLM or using the --trust-remote-code flag in the CLI.
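
As the error message suggests, the flag just needs to be passed through to whichever entry point you use; a minimal sketch for the offline LLM path (the model path here is a placeholder, and the CLI equivalent is --trust-remote-code on the api_server command):

from vllm import LLM

# trust_remote_code=True lets vLLM load GLM's custom configuration/model code.
llm = LLM(
    model="THUDM/glm-4-9b-chat-1m",  # or a local path to the downloaded weights
    trust_remote_code=True,
)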

SunLemuria commented 2 weeks ago

Could you please tell me how the server side is set up?

python -m vllm.entrypoints.openai.api_server \
    --model /path/to/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --max-model-len 1024 \
    --trust-remote-code \
    --host=0.0.0.0 --port=8001 --enforce-eager

Run the command above to launch the server, then use the request code provided by @lonngxiang. Works perfectly.
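
One detail worth noting: with --served-model-name glm-4-9b-chat, the model field in the client request should match that served name rather than a local path. A minimal sanity check against the command above (host and port assumed from its flags):

import requests

# "model" must match --served-model-name from the launch command above.
data = {
    "model": "glm-4-9b-chat",
    "messages": [{"role": "user", "content": "你好"}],
    "stop_token_ids": [151329, 151336, 151338],
}
# Port comes from --port=8001; adjust the host if the server runs remotely.
response = requests.post("http://localhost:8001/v1/chat/completions", json=data)
print(response.json()["choices"][0]["message"]["content"])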