microsoft / autogen

A programming framework for agentic AI 🤖
https://microsoft.github.io/autogen/

Return valid Prompt and Completion Token usage counts from `create_stream` #3971

Open auphof opened 4 hours ago

auphof commented 4 hours ago

What happened?

The create_stream method in BaseOpenAIChatCompletionClient, as used by OpenAIChatCompletionClient in _openai_client.py, returns 0 for prompt and completion token usage counts. The source code currently contains a TODO raised by @jackgerrits about these usage counts: https://github.com/microsoft/autogen/blob/f31ff663685a37f7960c4911b1837d36f1f32a13/python/packages/autogen-ext/src/autogen_ext/models/_openai/_openai_client.py#L661

What did you expect to happen?

The create_stream method should, like create, return usage correctly, showing accurate token counts and maintaining the expected flow of processing messages without errors. The OpenAI API and the LiteLLM proxy API both support stream_options={"include_usage": True}, but setting this in the init of OpenAIChatCompletionClient raises a No stop reason found error at the end of token stream handling.
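
For reference, here is a minimal sketch (not part of the original report) of how the underlying OpenAI Python SDK surfaces usage in streaming mode: with stream_options={"include_usage": True}, one extra final chunk arrives with an empty choices list and the real token counts in its usage field, which is the chunk create_stream would need to pick up.

import asyncio
from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    stream = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "What is the capital of France?"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="")
        if chunk.usage is not None:
            # Final chunk: choices is empty, usage carries the real counts.
            print(f"\nprompt={chunk.usage.prompt_tokens} completion={chunk.usage.completion_tokens}")

asyncio.run(main())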

How can we reproduce it (as minimally and precisely as possible)?

Pre-requisites:

For OpenAI gpt-4o-mini: export OPENAI_API_KEY=xxxxxxxxxxx
For local model usage, install ollama and litellm, then run:

ollama pull llama3.2:3b  
ollama run llama3.2:3b
litellm --model ollama_chat/llama3.2:3b

Code Example:


# from autogen_core.components.models import OpenAIChatCompletionClient, UserMessage, CreateResult
from autogen_ext.models import OpenAIChatCompletionClient
from autogen_core.components.models import UserMessage, CreateResult

model_client = OpenAIChatCompletionClient(
    # ------- using OpenAI API -----------------
    model="gpt-4o-mini",
    # stream_options={"include_usage": True},
    # -------- for local model use -------------(see above ollama and litellm config)
    # model="gpt-4o",
    # api_key="NotRequiredSinceWeAreLocal",
    # base_url="http://localhost:4000", # first run litellm --model ollama_chat/llama3.2:3b
    # stream_options={"include_usage": True},
)

# Stream the result (run inside an async context, e.g. a notebook or an async function)
model_client_result = model_client.create_stream(
    messages=[
        UserMessage(content="What is the capital of France?", source="user"),
    ],
    extra_create_args={"stream_options": {"include_usage": True}},
)

try:
    async for chunk in model_client_result: 
        print(f"chunk: {type(chunk)}: {chunk}")
        if isinstance(chunk, CreateResult):
            assert (
                chunk.usage.prompt_tokens != 0 and chunk.usage.completion_tokens != 0
            ), f"Assert: token counts should not be zero, {chunk.usage}"
except ValueError as e:
    print(f"❌ a bug (🪲), Exception (ValueError): `{e}`")
except Exception as e:
    print(f"❌ a bug (🪲), Exception: `{e}`")
else:
    print(
        f"✅: Finished Normally, last chunk is `{type(chunk).__name__}` with usage `{chunk.usage}`"
    )
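
The loop above needs an async context (the original was presumably run in a notebook). Below is a minimal sketch of the same check wrapped so it runs as a plain script, assuming the autogen 0.4 package layout used in the imports above; the final item yielded by create_stream is expected to be a CreateResult carrying usage.

import asyncio

from autogen_ext.models import OpenAIChatCompletionClient
from autogen_core.components.models import UserMessage, CreateResult


async def main() -> None:
    model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")
    stream = model_client.create_stream(
        messages=[UserMessage(content="What is the capital of France?", source="user")],
        extra_create_args={"stream_options": {"include_usage": True}},
    )
    last_chunk = None
    async for chunk in stream:
        last_chunk = chunk
    # The final yielded item should be a CreateResult with the usage counts attached.
    assert isinstance(last_chunk, CreateResult), f"unexpected final chunk: {type(last_chunk)}"
    print(last_chunk.usage)


if __name__ == "__main__":
    asyncio.run(main())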

AutoGen version
0.4

Which package was this bug in
autogen_ext, autogen_core.components.models

Model used
OpenAI gpt-4o-mini and llama3.2:3b via ollama and litellm proxy

Python version
3.11.9

Operating system
Ubuntu 22.04

Any additional info you think would be helpful for fixing this bug

I suggest that extra_create_args={"stream_options": {"include_usage": True}} should be the default in create_stream. I have a proposed fix for this issue, which I will submit as a PR. The fix aims to properly return the usage token counts by handling the stream_options={"include_usage": True} setting across both the OpenAI and LiteLLM contexts without raising the No stop reason found error.
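
To illustrate the direction such a fix could take (an illustrative sketch only, not the actual PR): when include_usage is on, the trailing chunk carries usage but an empty choices list, so stream handling has to remember the finish reason from earlier chunks instead of expecting it on the last one. The helper name below is hypothetical and operates on raw OpenAI chunk objects.

from typing import AsyncIterator, Tuple

from openai.types.chat import ChatCompletionChunk


async def collect_stream(chunks: AsyncIterator[ChatCompletionChunk]) -> Tuple[str, str, int, int]:
    """Accumulate content, stop reason, and token usage from an OpenAI chat stream."""
    content_parts: list[str] = []
    stop_reason = None
    prompt_tokens = completion_tokens = 0

    async for chunk in chunks:
        if chunk.choices:
            choice = chunk.choices[0]
            if choice.delta.content:
                content_parts.append(choice.delta.content)
            if choice.finish_reason is not None:
                # Remember the stop reason here; it will not appear on the usage-only chunk.
                stop_reason = choice.finish_reason
        if chunk.usage is not None:
            # The usage-only chunk (empty choices) arrives last when include_usage is set.
            prompt_tokens = chunk.usage.prompt_tokens
            completion_tokens = chunk.usage.completion_tokens

    if stop_reason is None:
        # Only raise if a stop reason truly never arrived on any chunk.
        raise ValueError("No stop reason found")
    return "".join(content_parts), stop_reason, prompt_tokens, completion_tokens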

I have not been able to verify whether the same issue occurs with AzureOpenAIChatCompletionClient.

ekzhu commented 3 hours ago

@auphof thank you very much for the issue. If you already have a fix, you are welcome to submit a PR for it. We can test it for the Azure OpenAI API.