microsoft / semantic-kernel


Python: Introduced a new condition to yield `StreamingChatMessageContent` directly when usage data is available. #9753

Open ymuichiro opened 5 days ago

ymuichiro commented 5 days ago

Motivation and Context

issue: https://github.com/microsoft/semantic-kernel/issues/9751

This pull request addresses a bug where setting `stream_options.include_usage` to `True` does not return token usage, resulting in `None` for the `usage` field.

The issue occurs when using Azure OpenAI's GPT-4o and GPT-4o mini models. In particular, when the last chunk of the response has an empty `choices` list, the chunk is skipped entirely and the token usage it carries is never processed.

In the Azure OpenAI implementation, a chunk that includes usage information should still be processed. However, the current code skips processing whenever `choices` is empty. This pull request fixes that behavior so the chunk is processed when usage is present, even if `choices` is empty.

Description

This fix introduces a new condition in the streaming response handler so that a chunk is no longer skipped when its `choices` list is empty but usage data is present.

With these changes, setting `stream_options.include_usage` to `True` will correctly return token usage data, even for chunks where the `choices` list is empty.
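For illustration, the decision at the heart of the fix can be written as a small predicate over the raw OpenAI streaming chunks. This is a hedged sketch of the condition only, not the PR's actual diff:

```python
from openai.types.chat import ChatCompletionChunk

def should_process_chunk(chunk: ChatCompletionChunk) -> bool:
    """Sketch of the new condition (illustrative, not the exact code).

    Previously a chunk with an empty `choices` list was skipped outright,
    which dropped the final usage-only chunk the API emits when
    stream_options.include_usage is True. The fix processes such a chunk
    whenever it carries usage data.
    """
    return len(chunk.choices) > 0 or chunk.usage is not None
```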

TaoChenOSU commented 5 days ago

Hi @ymuichiro, thank you for your contribution!

If you read the comments on this particular `_inner_get_streaming_chat_message_contents` method, you will see why `stream_options` is not allowed with Azure OpenAI.

Did you observe a different behavior with Azure OpenAI?

yuichiromukaiyama commented 5 days ago

@TaoChenOSU Of course. I have verified this for each API version. In my environment, regardless of which API version I choose, no errors occur, and the token usage for the stream is returned.

Am I misunderstanding something? It does indeed feel odd that it works even with older API versions.

Below are the API versions that succeeded, along with sample code:

2024-10-01-preview
2024-09-01-preview
2024-07-01-preview
2024-10-21
2024-06-01

To rule out other causes, I created the request below directly in the shell; even when using Semantic Kernel, the same error could not be reproduced.

payload="{\
  \"messages\": [\
    {\
      \"role\": \"user\",\
      \"content\": [\
        {\
          \"type\": \"text\",\
          \"text\": \"hi.\"\
        }\
      ]\
    }\
  ],\
  \"temperature\": 0.7,\
  \"top_p\": 0.95,\
  \"stream\": true,\
  \"stream_options\": { \"include_usage\": true },\
  \"max_tokens\": 10\
}"

curl "https://${***********************}.openai.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2024-10-21" \
  -H "Content-Type: application/json" \
  -H "api-key: **************************" \
  -d "$payload"

```python
import asyncio

import semantic_kernel as sk
from semantic_kernel.connectors.ai.open_ai import (
    AzureChatCompletion,
    AzureChatPromptExecutionSettings,
)
from semantic_kernel.contents import ChatHistory
from semantic_kernel.functions import KernelArguments

# AZURE_OPENAI_COMPLETION_* values come from my environment configuration.

async def stream_sample() -> None:
    kernel = sk.Kernel()
    service_id: str = "dummy"

    kernel.add_service(
        AzureChatCompletion(
            service_id=service_id,
            deployment_name=AZURE_OPENAI_COMPLETION_DEPLOYMENT_NAME,
            endpoint=AZURE_OPENAI_COMPLETION_ENDPOINT,
            api_key=AZURE_OPENAI_COMPLETION_API_KEY,
            api_version="2024-06-01",
        )
    )

    service = kernel.get_service(service_id=service_id)
    settings = service.get_prompt_execution_settings_class()(service_id=service_id)

    if isinstance(settings, AzureChatPromptExecutionSettings):
        # Ask the API to append a final usage-only chunk to the stream.
        settings.extra_body = {
            "stream_options": {
                "include_usage": True,
            }
        }

    history = ChatHistory()
    history.add_user_message("hello")

    async for chunk in service.get_streaming_chat_message_contents(
        chat_history=history,
        settings=settings,
        kernel=kernel,
        arguments=KernelArguments(settings=settings),
    ):
        print(chunk)

asyncio.run(stream_sample())
```

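For reference, `get_streaming_chat_message_contents` yields lists of `StreamingChatMessageContent`, and the usage object appears to be attached in each message's metadata; the `"usage"` key below is my assumption about that layout, not a documented contract:

```python
# Hypothetical consumption loop; the "usage" metadata key is assumed,
# based on how the OpenAI connector appears to build streaming metadata.
async for messages in service.get_streaming_chat_message_contents(
    chat_history=history,
    settings=settings,
):
    for message in messages:
        usage = (message.metadata or {}).get("usage")
        if usage is not None:
            print(f"prompt={usage.prompt_tokens} completion={usage.completion_tokens}")
```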
ymuichiro commented 4 days ago

Sorry, I used the wrong account, but it's the same person.

TaoChenOSU commented 4 days ago

Hi @ymuichiro,

I just verified; it seems they have resolved the issue. Could you remove the override of `_inner_get_streaming_chat_message_contents` in `AzureChatCompletion`? The default implementation in `OpenAIChatCompletionBase` already handles streaming tokens correctly.
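The shape of the change would be roughly this (an illustrative sketch, not the actual class bodies):

```python
# Illustrative only: the real classes contain far more than this.
class OpenAIChatCompletionBase:
    async def _inner_get_streaming_chat_message_contents(self, *args, **kwargs):
        ...  # base implementation, which already handles usage-bearing chunks

class AzureChatCompletion(OpenAIChatCompletionBase):
    # Before: an Azure-specific override of the method above lived here
    # because stream_options was once unsupported on Azure OpenAI.
    # After: the override is deleted and the base implementation is inherited.
    pass
```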

ymuichiro commented 4 days ago

hi @TaoChenOSU

sure, is this ok? I have confirmed that it works.

https://github.com/microsoft/semantic-kernel/pull/9753/commits/076c792be15da0172ef823ea69ca8be8d3b275f1

TaoChenOSU commented 3 days ago

Yes, this is right. Just one minor comment and we are good!

TaoChenOSU commented 2 days ago

We have a unit test failure because of the change. Could you fix that too?

ymuichiro commented 1 day ago

@TaoChenOSU

Sorry, I missed it. I found that `stream_options` is now always set, so I added the argument to the test code. I verified that the tests pass locally.
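Roughly, the adjustment looks like this (a self-contained sketch with made-up names such as `mock_create`; the real test patches the OpenAI client instead):

```python
from unittest.mock import ANY, AsyncMock

async def sketch_of_updated_assertion() -> None:
    # Stand-in for the patched client call; illustrative names only.
    mock_create = AsyncMock()
    await mock_create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "hi"}],
        stream=True,
        stream_options={"include_usage": True},  # now always forwarded
    )
    # The failing test's expected call needed to gain stream_options:
    mock_create.assert_awaited_once_with(
        model=ANY,
        messages=ANY,
        stream=True,
        stream_options={"include_usage": True},
    )
```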