vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

The logprobs in the ChatCompletion's responses is incorrectly reusing the Completions's schema; not following the OpenAI API's spec #3179

Closed: anon998 closed this issue 3 months ago

anon998 commented 6 months ago

According to OpenAI's documentation, logprobs are returned in a different format by the Chat Completions API than by the legacy Completions API: https://platform.openai.com/docs/api-reference/chat/create

But vLLM reuses the Completions API format to return logprobs in its Chat Completions responses too. This is the example chat completion output from the OpenAI API documentation:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1702685778,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello"
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 1,
    "total_tokens": 11
  },
  "system_fingerprint": null
}

And this is what vLLM is returning with commit 901cf4c5:

{
  "id": "cmpl-feb5333e02ef436e95c09b8f2255e4c0",
  "object": "chat.completion",
  "created": 997761,
  "model": "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello"
      },
      "logprobs": {
        "text_offset": [
          0
        ],
        "token_logprobs": [
          -0.015691734850406647
        ],
        "tokens": [
          "Hello"
        ],
        "top_logprobs": [
          {
            "Hello": -0.015691734850406647,
            "How": -5.140691757202148
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 11,
    "completion_tokens": 1
  }
}

Here's how the Completion's LogProbs schema is reused inside the ChatCompletion response: https://github.com/vllm-project/vllm/blob/901cf4c52bf65472ca13aa4f996d631d00c2228d/vllm/entrypoints/openai/protocol.py#L245-L249

https://github.com/vllm-project/vllm/blob/901cf4c52bf65472ca13aa4f996d631d00c2228d/vllm/entrypoints/openai/protocol.py#L289-L293

And then the _create_logprobs function that was made for the Completion API is reused in serving_chat.py: https://github.com/vllm-project/vllm/blob/901cf4c52bf65472ca13aa4f996d631d00c2228d/vllm/entrypoints/openai/serving_chat.py#L241
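
For reference, here is a minimal sketch (not vLLM's actual code; the class names mirror the types used by the openai client and are otherwise hypothetical) of Pydantic models that would match the Chat Completions logprobs format shown above:

from typing import List, Optional

from pydantic import BaseModel


class TopLogprob(BaseModel):
    token: str
    logprob: float
    bytes: Optional[List[int]] = None


class ChatCompletionTokenLogprob(BaseModel):
    token: str
    logprob: float
    bytes: Optional[List[int]] = None
    top_logprobs: List[TopLogprob] = []


class ChatCompletionLogProbs(BaseModel):
    # Chat Completions return a "content" list of per-token entries instead of
    # the legacy parallel arrays (tokens / token_logprobs / top_logprobs / text_offset).
    content: Optional[List[ChatCompletionTokenLogprob]] = None

In the spec, the top-k alternatives live inside each token entry rather than in a separate parallel list.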

siyuyuan commented 6 months ago

Hello! How do you get logprobs using vLLM? Could you please share the code publicly?

anon998 commented 6 months ago

@siyuyuan In the example above, the vLLM server was started with something like this:

python -m vllm.entrypoints.openai.api_server \
    --model /mnt/models/gptq/Qwen1.5-72B-Chat-GPTQ-Int4 \
    --served-model-name Qwen/Qwen1.5-72B-Chat-GPTQ-Int4 \
    --max-model-len 2000 \
    --quantization gptq \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.98 \
    --kv-cache-dtype fp8_e5m2 \
    --disable-custom-all-reduce \
    --enforce-eager \
    --host 127.0.0.1 \
    --port 5000

And then curl was used to make a request from the command line:

curl --silent http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
     "model": "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
     "messages": [{"role": "user", "content": "Hi!"}],
     "stream": false,
     "temperature": 1,
     "max_tokens": 1,
     "logprobs": true,
     "top_logprobs": 2
   }'

The difference is in the "logprobs" field that is returned. Notice how in the OpenAI response a token can be accessed with logprobs.content[0].token, while in the vLLM one it is logprobs.tokens[0].
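
To make the difference concrete, here is a small runnable sketch using just the logprobs objects from the two example responses above:

# The spec-compliant Chat Completions logprobs object (from the OpenAI example above).
openai_logprobs = {
    "content": [
        {"token": "Hello", "logprob": -0.31725305, "bytes": [72, 101, 108, 108, 111]}
    ]
}

# The legacy Completions-style object that vLLM currently returns.
vllm_logprobs = {
    "text_offset": [0],
    "token_logprobs": [-0.015691734850406647],
    "tokens": ["Hello"],
    "top_logprobs": [{"Hello": -0.015691734850406647, "How": -5.140691757202148}],
}

# Spec shape: per-token objects inside a "content" list.
print(openai_logprobs["content"][0]["token"], openai_logprobs["content"][0]["logprob"])

# vLLM's current shape: parallel arrays, one entry per token.
print(vllm_logprobs["tokens"][0], vllm_logprobs["token_logprobs"][0])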

siyuyuan commented 6 months ago

Thank you for your response! I have tried using LangChain to get logprobs, but it does not work :( For your reference:

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage

chat = ChatOpenAI(model=model_name, temperature=0, model_kwargs={"logprobs": True, "top_logprobs": 1})
content = "XXXXXX"
result = chat.generate([[HumanMessage(content=content)]])

I will try your method. Thanks again!

siyuyuan commented 6 months ago

@anon998 Sorry to bother you.

import openai

result = openai.ChatCompletion.create(
    temperature=0,
    model=model,
    messages=[{"role": "user", "content": "Hi!"}],
    max_tokens=500,
    top_logprobs=2,
    logprobs=True,
)

This is my code. However, the result is:

<OpenAIObject chat.completion id=cmpl-d954fdf1cb924488a7fd68eb1f2d7641 at 0x7fe2713f2860> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": " Hello! How can I help you today?",
        "role": "assistant"
      }
    }
  ],
  "created": 7253,
  "id": "cmpl-d954fdf1cb924488a7fd68eb1f2d7641",
  "model": "/home/huggingface_models/models--openchat--openchat-3.5-1210/snapshots/e5df841b685e5b5ca11ce142f29c6c731bf087a0",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 19,
    "total_tokens": 29
  }
}

Do you know why that is?

anon998 commented 6 months ago

@siyuyuan It could be your vLLM version; chat logprobs support was added in v0.3.3. I have version 1.12.0 of the openai module installed and I built vLLM from the latest git source code. With this code:

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:5000/v1",
    api_key="test",
)

result = client.chat.completions.create(
    temperature=0,
    model="Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
    messages=[{"role": "user", "content": "Hi!"}],
    max_tokens=1,
    top_logprobs=2,
    logprobs=True,)

print(result)

It returns this:

ChatCompletion(id='cmpl-3d38fe12182145f498944537677a6b18', 
  choices=[
    Choice(
      finish_reason='length', 
      index=0,
      logprobs=ChoiceLogprobs(
        content=None, 
        text_offset=[0], 
        token_logprobs=[-0.041397884488105774], 
        tokens=['Hello'], 
        top_logprobs=[
          {'Hello': -0.041397884488105774, 'How': -4.072648048400879}
        ]
      ), 
      message=ChatCompletionMessage(
        content='Hello', 
        role='assistant', 
        function_call=None, 
        tool_calls=None
      )
    )
  ], 
  created=35590, 
  model='Qwen/Qwen1.5-72B-Chat-GPTQ-Int4', 
  object='chat.completion', 
  system_fingerprint=None, 
  usage=CompletionUsage(completion_tokens=1, prompt_tokens=20, total_tokens=21)
)

But the logprobs part should be more like this instead:

      logprobs=ChoiceLogprobs(
        content=[
          ChatCompletionTokenLogprob(
            token='Hello', 
            bytes=None, 
            logprob=-0.04109497368335724, 
            top_logprobs=[
              TopLogprob(
                token='Hello', 
                bytes=None, 
                logprob=-0.04109497368335724
              ), 
              TopLogprob(
                token='How', 
                bytes=None, 
                logprob=-4.080157279968262)
            ]
          )
        ]
      ),
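
Until the server-side fix lands, a rough client-side conversion could reshape the old fields into the spec's content list. This is only a sketch with a hypothetical helper name, and it assumes the legacy dict shape shown in the responses above:

# Hypothetical client-side patch: reshape legacy Completions-style logprobs
# into the Chat Completions "content" list described by the OpenAI spec.
def convert_legacy_logprobs(legacy: dict) -> dict:
    content = []
    top = legacy.get("top_logprobs") or []
    for i, token in enumerate(legacy.get("tokens", [])):
        top_entry = top[i] if i < len(top) and top[i] else {}
        content.append({
            "token": token,
            "logprob": legacy["token_logprobs"][i],
            "bytes": list(token.encode("utf-8")),
            "top_logprobs": [
                {"token": t, "logprob": lp, "bytes": list(t.encode("utf-8"))}
                for t, lp in top_entry.items()
            ],
        })
    return {"content": content}

# Example with the legacy-shaped logprobs from the vLLM response above:
legacy = {
    "text_offset": [0],
    "token_logprobs": [-0.015691734850406647],
    "tokens": ["Hello"],
    "top_logprobs": [{"Hello": -0.015691734850406647, "How": -5.140691757202148}],
}
print(convert_legacy_logprobs(legacy))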

siyuyuan commented 6 months ago

@anon998 Thank you so much! After updating vLLM, it works!

SachitS commented 5 months ago

@anon998 Have you found a workaround for this issue yet, or do you just patch it on the client?

DarkLight1337 commented 3 months ago

Fixed by #5029