According to OpenAI's documentation, the Chat Completions API returns logprobs in a different format from the one used by the legacy Completions API: https://platform.openai.com/docs/api-reference/chat/create
But vLLM uses the Completions API format to return the logprobs for Chat Completions responses too. This is the example chat completion output from the OpenAI API documentation:
{ "id": "chatcmpl-123", "object": "chat.completion", "created": 1702685778, "model": "gpt-3.5-turbo-0125", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Hello" }, "logprobs": { "content": [ { "token": "Hello", "logprob": -0.31725305, "bytes": [72, 101, 108, 108, 111], "top_logprobs": [ { "token": "Hello", "logprob": -0.31725305, "bytes": [72, 101, 108, 108, 111] }, { "token": "Hi", "logprob": -1.3190403, "bytes": [72, 105] } ] } ] }, "finish_reason": "length" } ], "usage": { "prompt_tokens": 10, "completion_tokens": 1, "total_tokens": 11 }, "system_fingerprint": null }
And this is what vLLM is returning with commit 901cf4c5:
{ "id": "cmpl-feb5333e02ef436e95c09b8f2255e4c0", "object": "chat.completion", "created": 997761, "model": "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Hello" }, "logprobs": { "text_offset": [ 0 ], "token_logprobs": [ -0.015691734850406647 ], "tokens": [ "Hello" ], "top_logprobs": [ { "Hello": -0.015691734850406647, "How": -5.140691757202148 } ] }, "finish_reason": "length" } ], "usage": { "prompt_tokens": 10, "total_tokens": 11, "completion_tokens": 1 } }
Here's how the Completion's LogProbs schema is reused inside the ChatCompletion response:
https://github.com/vllm-project/vllm/blob/901cf4c52bf65472ca13aa4f996d631d00c2228d/vllm/entrypoints/openai/protocol.py#L245-L249
https://github.com/vllm-project/vllm/blob/901cf4c52bf65472ca13aa4f996d631d00c2228d/vllm/entrypoints/openai/protocol.py#L289-L293
And then the _create_logprobs function that was made for the Completions API is reused in serving_chat.py:
https://github.com/vllm-project/vllm/blob/901cf4c52bf65472ca13aa4f996d631d00c2228d/vllm/entrypoints/openai/serving_chat.py#L241
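For comparison, here is a minimal sketch of what chat-specific logprob models could look like, mirroring the field names from the OpenAI example above (these Pydantic classes are illustrative, modeled on the openai client's types, and are not vLLM's actual code):

from typing import List, Optional
from pydantic import BaseModel

# Illustrative models mirroring OpenAI's chat logprobs format;
# class names follow the openai client, not vLLM's protocol.py.
class TopLogprob(BaseModel):
    token: str
    logprob: float
    bytes: Optional[List[int]] = None

class ChatCompletionTokenLogprob(BaseModel):
    token: str
    logprob: float
    bytes: Optional[List[int]] = None
    top_logprobs: List[TopLogprob]

class ChatLogProbs(BaseModel):
    content: Optional[List[ChatCompletionTokenLogprob]] = None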
Hello! How do you get logprobs using vLLM? Can you please make the code public?
@siyuyuan In the example the vLLM server was started with something like this:
python -m vllm.entrypoints.openai.api_server \
--model /mnt/models/gptq/Qwen1.5-72B-Chat-GPTQ-Int4 \
--served-model-name Qwen/Qwen1.5-72B-Chat-GPTQ-Int4 \
--max-model-len 2000 \
--quantization gptq \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.98 \
--kv-cache-dtype fp8_e5m2 \
--disable-custom-all-reduce \
--enforce-eager \
--host 127.0.0.1 \
--port 5000
And then curl was used to make a request from the command line:
curl --silent http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
"messages": [{"role": "user", "content": "Hi!"}],
"stream": false,
"temperature": 1,
"max_tokens": 1,
"logprobs": true,
"top_logprobs": 2
}'
The difference is in the "logprobs" field returned. Notice how in the OpenAI response a token can be accessed with logprobs.content[0].token, while in the vLLM one it is logprobs.tokens[0].
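To make the difference concrete, here is a small self-contained sketch of the two access patterns, using trimmed versions of the responses shown above:

# Trimmed vLLM-style choice (what the server currently returns, see above).
vllm_choice = {
    "logprobs": {
        "tokens": ["Hello"],
        "token_logprobs": [-0.0157],
        "top_logprobs": [{"Hello": -0.0157, "How": -5.14}],
    }
}

# Trimmed OpenAI chat-style choice (what the spec describes).
openai_choice = {
    "logprobs": {
        "content": [{"token": "Hello", "logprob": -0.3173, "top_logprobs": []}]
    }
}

# OpenAI chat format: logprobs.content is a list of per-token objects.
print(openai_choice["logprobs"]["content"][0]["token"])  # "Hello"

# vLLM's completions-style format: parallel lists instead.
print(vllm_choice["logprobs"]["tokens"][0])              # "Hello"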
Thank you for your response! I have tried using LangChain to get logprobs but it does not work :( For your reference:
from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

chat = ChatOpenAI(model=model_name, temperature=0,
                  model_kwargs={"logprobs": True, "top_logprobs": 1})
content = "XXXXXX"
result = chat.generate([[HumanMessage(content=content)]])
I will try your method. Thanks again!
@anon998 Sorry to bother you.
import openai

# openai.api_base / openai.api_key are assumed to already point at the vLLM server.
result = openai.ChatCompletion.create(
    temperature=0,
    model=model,
    messages=[{"role": "user", "content": "Hi!"}],
    max_tokens=500,
    top_logprobs=2,
    logprobs=True,
)
This is my code. However, the result is:
<OpenAIObject chat.completion id=cmpl-d954fdf1cb924488a7fd68eb1f2d7641 at 0x7fe2713f2860> JSON: {
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"content": " Hello! How can I help you today?",
"role": "assistant"
}
}
],
"created": 7253,
"id": "cmpl-d954fdf1cb924488a7fd68eb1f2d7641",
"model": "/home/huggingface_models/models--openchat--openchat-3.5-1210/snapshots/e5df841b685e5b5ca11ce142f29c6c731bf087a0",
"object": "chat.completion",
"usage": {
"completion_tokens": 10,
"prompt_tokens": 19,
"total_tokens": 29
}
}
Do you know why this is?
@siyuyuan It could be your vLLM version; chat logprobs were added in v0.3.3. I have version 1.12.0 of the openai module installed, and I built vLLM from the latest git source code, with this code:
from openai import OpenAI
client = OpenAI(
base_url="http://127.0.0.1:5000/v1",
api_key="test",
)
result = client.chat.completions.create(
temperature=0,
model="Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
messages=[{"role": "user", "content": "Hi!"}],
max_tokens=1,
top_logprobs=2,
logprobs=True,)
print(result)
It returns this:
ChatCompletion(id='cmpl-3d38fe12182145f498944537677a6b18',
choices=[
Choice(
finish_reason='length',
index=0,
logprobs=ChoiceLogprobs(
content=None,
text_offset=[0],
token_logprobs=[-0.041397884488105774],
tokens=['Hello'],
top_logprobs=[
{'Hello': -0.041397884488105774, 'How': -4.072648048400879}
]
),
message=ChatCompletionMessage(
content='Hello',
role='assistant',
function_call=None,
tool_calls=None
)
)
],
created=35590,
model='Qwen/Qwen1.5-72B-Chat-GPTQ-Int4',
object='chat.completion',
system_fingerprint=None,
usage=CompletionUsage(completion_tokens=1, prompt_tokens=20, total_tokens=21)
)
But the logprobs part should be more like this instead:
logprobs=ChoiceLogprobs(
content=[
ChatCompletionTokenLogprob(
token='Hello',
bytes=None,
logprob=-0.04109497368335724,
top_logprobs=[
TopLogprob(
token='Hello',
bytes=None,
logprob=-0.04109497368335724
),
TopLogprob(
token='How',
bytes=None,
logprob=-4.080157279968262)
]
)
]
),
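For anyone stuck on an older vLLM server, a rough client-side shim could rebuild the expected content list from the completions-style dict; convert_logprobs below is a hypothetical helper, not part of vLLM or the openai package:

def convert_logprobs(old):
    """Map a completions-style logprobs dict (tokens / token_logprobs /
    top_logprobs as parallel lists) to the chat-style content list."""
    if old is None:
        return None
    content = []
    for i, token in enumerate(old["tokens"]):
        top = [
            {"token": t, "logprob": lp, "bytes": None}
            for t, lp in (old["top_logprobs"][i] or {}).items()
        ]
        content.append({
            "token": token,
            "logprob": old["token_logprobs"][i],
            "bytes": None,
            "top_logprobs": top,
        })
    return {"content": content}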
@anon998 Thank you so much! After updating vLLM, it works!!
@anon998 Have you found a workaround to fix this issue yet, or do you just patch it on the client?
Fixed by #5029