[.Net][Feature Request]: Enhanced Support for IMessage and usage data in AutoGen .NET

Is your feature request related to a problem? Please describe.

Yes, my feature request is related to several problems I have encountered while working with the .NET version of AutoGen.

Specifically: The need to manually convert TextMessages to MessageEnvelopes to receive responses and usage data. The inability to receive usage data when using .RegisterMessageConnector(). The challenge of using multiple Middlewares, such as MistralAITokenCounterMiddleware, in conjunction with .RegisterMessageConnector(). Lack of clear documentation and examples for mixing text and image messages and retrieving both responses and usage data.

Describe the solution you'd like

If MessageEnvelope is the way to get usage data, then introduce a built-in method to automatically convert TextMessages or other IMessages to MessageEnvelopes. Ensure that usage data is retained when using .RegisterMessageConnector() or add an easy way to retrieve usage data. Provide comprehensive documentation about Middlewares and examples on how to mix text and image messages, send them to GPT-4o, and retrieve both the message and usage data. Additionally, clarify whether GPT-4o can generate images and include examples if possible.

Additional context

Dear AutoGen Support Team,

I have been working with the .NET version of AutoGen and encountered several challenges regarding the use of IMessage, MessageEnvelope, GenerateReplyAsync, and SendAsync. Despite thoroughly reviewing the documentation, I have been unable to find relevant information to resolve my issues.

My primary use case, for educational purposes, involves creating an agent with a specific configuration, sending a list of messages, and retrieving responses along with usage data.

Here is the code that works fine for me:

var openaiClient = new OpenAIClient(openAIKey);
var openAIChatAgent = new OpenAIChatAgent(
    openAIClient: openaiClient,
    name: "assistant",
    modelName: MySelectedModel,
    systemMessage: MySystemMessage,
    temperature: MyTemperature,
    maxTokens: MyMaxTokens,
    seed: MySeedNum
);

However, I would like to set the httpClient.Timeout value and request multiple responses. While these features would be nice to have, they are not my main concern.

I am currently generating a list of TextMessages or IMessages and sending them to get a response. Unfortunately, I encounter an "Invalid message type" error. If I send a list of MessageEnvelope, everything works fine, and I receive the response along with the usage data. However, manually converting TextMessages to MessageEnvelopes is cumbersome, especially as I plan to use more modalities in the future.

I discovered that by applying openAIChatAgent.RegisterMessageConnector();, I should be able to use any message of type IMessage. This initially seemed to solve my problem. However, after applying .RegisterMessageConnector(), I no longer receive usage data.

I attempted to use MistralAITokenCounterMiddleware but still did not receive usage data. Additionally, I cannot use .RegisterMessageConnector() with MistralAITokenCounterMiddleware simultaneously. The Middleware documentation is not very helpful in this regard.

After spending a significant amount of time trying to work with IMessage and .RegisterMessageConnector(), I feel I am missing some critical knowledge. The only way I can get it to work is without .RegisterMessageConnector() and by manually creating MessageEnvelopes. Is this the intended approach? What exactly is a MessageEnvelope, and is there a built-in way to convert messages to MessageEnvelope without manually building MessageEnvelope for each TextMessage or other kinds of messages?

In conclusion, how can I mix images and texts in GPT-4o? Could you provide a sample case of sending a list of IMessages (texts and images) to GPT-4o and retrieving both the message and the usage data? Additionally, can GPT-4o generate images? If so, please provide a sample case of sending a list of IMessages (text and images) to GPT-4o and retrieving both the image and the usage data. If such typical use cases require the use of Middlewares, please include the necessary Middlewares in the library for OpenAI models, as this will save a significant amount of time for many developers.

Thank you for your time and assistance.

Hi Dear @vasemax

Thanks for using AutoGen.Net and I'll try my best to answer your questions here.

Firstly, let's start with the easy one

Can GPT-4o generate images? If so, please provide a sample case of sending a list of IMessages (text and images) to GPT-4o and retrieving both the image and the usage data.

Not through the GPT-4o API. GPT-4o accepts both text and image input and generates text output. To generate an image with OpenAI, you would need to use a DALL-E series model.

What exactly is a MessageEnvelope, and is there a built-in way to convert messages to MessageEnvelope without manually building MessageEnvelope for each TextMessage or other kinds of messages?

MessageEnvelope is used to convert types that are not IMessage to IMessage, so you shouldn't need to create MessageEnvelope for each TextMessage because TextMessage is already an IMessage.

To convert messages to MessageEnvelope, you can call the MessageEnvelope.Create API. However, you shouldn't need to touch MessageEnvelope once you register the message connector with the OpenAIChatAgent, because the message connector does the conversion for you.

The reason why we introduce MessageEnvelope is to convert an original message type from LLM SDKs like Azure.AI.OpenAI to an IMessage type and convert it back afterward.

Using OpenAIChatAgent as an example, the original incoming and outgoing message types it supports are ChatMessageRequest and ChatMessageResponse, both from the Azure.AI.OpenAI package. OpenAIChatRequestMessageConnector (the middleware applied to OpenAIChatAgent when you call RegisterMessageConnector) converts built-in message types like TextMessage and ImageMessage into ChatMessageRequest for incoming messages, and converts ChatMessageResponse back into built-in message types for outgoing messages from OpenAIChatAgent. However, because OpenAIChatAgent is an IAgent, ChatMessageRequest and ChatMessageResponse need to be wrapped in a MessageEnvelope before being passed to OpenAIChatAgent.GenerateReplyAsync. Inside OpenAIChatAgent, the original messages are taken out of the MessageEnvelope and sent to OpenAI using Azure.AI.OpenAI.OpenAIClient.
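The wrap-and-unwrap idea above can be sketched without AutoGen at all. Every type below (IMessageSketch, EnvelopeSketch<T>, SdkRequest, EnvelopeDemo) is a hypothetical stand-in invented for illustration, not the library's real definition. The sketch only shows how an SDK-specific object can travel through an interface typed on a common message contract and be recovered on the other side.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for a common message contract (not AutoGen's IMessage).
public interface IMessageSketch
{
    string? From { get; }
}

// Hypothetical generic envelope: carries any payload while satisfying the contract.
public sealed record EnvelopeSketch<T>(T Content, string? From = null) : IMessageSketch;

// Hypothetical SDK-specific request type that knows nothing about the contract.
public sealed record SdkRequest(string Prompt);

public static class EnvelopeDemo
{
    // An "agent" that only understands envelopes holding SdkRequest and
    // rejects everything else, similar in spirit to an "Invalid message type" error.
    public static string Handle(IMessageSketch message) =>
        message is EnvelopeSketch<SdkRequest> envelope
            ? $"sending '{envelope.Content.Prompt}' to the model"
            : throw new ArgumentException("Invalid message type");
}
```

In this sketch, passing new EnvelopeSketch<SdkRequest>(new SdkRequest("hi"), "user") to EnvelopeDemo.Handle succeeds, while an envelope around any other payload type throws. A connector middleware would sit in front of Handle and build the right envelope from friendlier message types.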

When using OpenAIChatAgent, why can I get the usage data without RegisterMessageConnector, but not after registering the message connector?

The usage data comes from ChatMessageResponse and is not part of AutoGen's built-in message types such as TextMessage. So the usage data is lost once RegisterMessageConnector is applied, because the connector converts ChatMessageResponse to TextMessage.

Why can't I use MistralAITokenCounterMiddleware with OpenAIChatAgent? Additionally, I cannot use .RegisterMessageConnector() with MistralAITokenCounterMiddleware simultaneously.

MistralAITokenCounterMiddleware collects usage information only from the original message types of MistralAIClient, and the original message types used with MistralAIClient and with OpenAIChatAgent are different. Therefore MistralAITokenCounterMiddleware can't collect usage information from Azure.AI.OpenAI.ChatMessageResponse, and GetCompletionTokenCounts would always return 0.
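A minimal sketch of why a type-specific counter reports 0 for replies it does not recognise. All types here (OpenAiStyleResponse, MistralStyleResponse, PlainTextReply, MistralStyleTokenCounter) are hypothetical and only mimic the shape of the situation; they are not the real SDK or AutoGen types.

```csharp
#nullable enable
using System.Collections.Generic;
using System.Linq;

// Hypothetical raw response types of two different SDKs.
public sealed record OpenAiStyleResponse(string Text, int CompletionTokens);
public sealed record MistralStyleResponse(string Text, int CompletionTokens);

// Hypothetical plain text message, which has no field for usage at all.
public sealed record PlainTextReply(string Text);

// A counter that only recognises one raw response type.
public sealed class MistralStyleTokenCounter
{
    private readonly List<MistralStyleResponse> _responses = new();

    // Records the reply only when the type pattern matches; anything else is ignored.
    public void Observe(object reply)
    {
        if (reply is MistralStyleResponse response)
        {
            _responses.Add(response);
        }
    }

    public int GetCompletionTokenCount() => _responses.Sum(r => r.CompletionTokens);
}
```

Feeding this counter an OpenAiStyleResponse or a PlainTextReply leaves the count at 0, because neither matches the pattern. The same reasoning applies to a reply that has already been converted to a text message: the usage field no longer exists by the time the counter sees it.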

In conclusion, how can I mix images and texts in GPT-4o? Could you provide a sample case of sending a list of IMessages (texts and images) to GPT-4o and retrieving both the message and the usage data?

// Modified from Example 15
// Copyright (c) Microsoft Corporation. All rights reserved.
// Example15_GPT4V_BinaryDataImageMessage.cs

using AutoGen.Core;
using AutoGen.OpenAI;
using Azure.AI.OpenAI;

namespace AutoGen.BasicSample;

/// <summary>
/// This example shows usage of ImageMessage. The image is loaded as BinaryData and sent to the model (gpt-4o in this modified version).
/// <br/>
/// <br/>
/// Add additional images to the ImageResources folder to load and send more images.
/// </summary>
public static class Example15_GPT4V_BinaryDataImageMessage
{
    #region token_counter_middleware
    public class OpenAITokenCounterMiddleware : IMiddleware
    {
        private readonly List<ChatCompletions> responses = new List<ChatCompletions>();
        public string? Name => nameof(OpenAITokenCounterMiddleware);

        public async Task<IMessage> InvokeAsync(MiddlewareContext context, IAgent agent, CancellationToken cancellationToken = default)
        {
            var reply = await agent.GenerateReplyAsync(context.Messages, context.Options, cancellationToken);

            if (reply is IMessage<ChatCompletions> message)
            {
                responses.Add(message.Content);
            }

            return reply;
        }

        public int GetCompletionTokenCount()
        {
            return responses.Sum(r => r.Usage.CompletionTokens);
        }
    }
    #endregion token_counter_middleware

    private static readonly string ImageResourcePath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "ImageResources");

    private static Dictionary<string, string> _mediaTypeMappings = new()
    {
        { ".png", "image/png" },
        { ".jpeg", "image/jpeg" },
        { ".jpg", "image/jpeg" },
        { ".gif", "image/gif" },
        { ".webp", "image/webp" }
    };

    public static async Task RunAsync()
    {
        var openAIKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY") ?? throw new Exception("Please set OPENAI_API_KEY environment variable.");
        var openaiClient = new OpenAIClient(openAIKey);
        var tokenCounterMiddleware = new OpenAITokenCounterMiddleware();
        var messageConnector = new OpenAIChatRequestMessageConnector();
        var gpt4o = new OpenAIChatAgent(
            openaiClient,
            modelName: "gpt-4o",
            name: "gpt",
            systemMessage: "You are a helpful AI assistant",
            temperature: 0)
            .RegisterMiddleware(tokenCounterMiddleware)
            .RegisterMiddleware(messageConnector)
            .RegisterPrintMessage();

        List<IMessage> messages =
            [new TextMessage(Role.User, "What is this image?", from: "user")];
        AddMessagesFromResource(ImageResourcePath, messages);

        var multiModalMessage = new MultiModalMessage(Role.User, messages, from: "user");
        var response = await gpt4o.SendAsync(multiModalMessage);

        Console.WriteLine($"Completion token count: {tokenCounterMiddleware.GetCompletionTokenCount()}");
    }

    private static void AddMessagesFromResource(string imageResourcePath, List<IMessage> messages)
    {
        foreach (string file in Directory.GetFiles(imageResourcePath))
        {
            if (!_mediaTypeMappings.TryGetValue(Path.GetExtension(file).ToLowerInvariant(), out var mediaType))
            {
                continue;
            }

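            using var fs = new FileStream(file, FileMode.Open, FileAccess.Read);
            using var ms = new MemoryStream();
            fs.CopyTo(ms);
            ms.Seek(0, SeekOrigin.Begin);
            // BinaryData.FromStream copies the bytes, so both streams can be disposed afterwards.
            var imageData = BinaryData.FromStream(ms, mediaType);
            // The image is part of the user's turn, so use Role.User to match from: "user".
            messages.Add(new ImageMessage(Role.User, imageData, from: "user"));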
        }
    }
}

output

Thank you very much for your time and your response!

At least now I have a working sample. I found out what is needed to make it work, but it's not entirely clear to me why it wasn't working before. Here is what I noticed:

When I register Middleware after building the Agent, it doesn't work. For example, this works:

var gpt4o = new OpenAIChatAgent(
    openaiClient,
    modelName: "gpt-4o",
    name: "gpt",
    systemMessage: "You are a helpful AI assistant",
    temperature: 0)
    .RegisterMiddleware(tokenCounterMiddleware)
    .RegisterMiddleware(messageConnector)
    .RegisterPrintMessage();

But this doesn't work (throws "Invalid message type"):

var gpt4o = new OpenAIChatAgent(
    openaiClient,
    modelName: "gpt-4o",
    name: "gpt",
    systemMessage: "You are a helpful AI assistant",
    temperature: 0);

gpt4o.RegisterMiddleware(tokenCounterMiddleware)
    .RegisterMiddleware(messageConnector)
    .RegisterPrintMessage();

If I use a GPTAgent, it works but it doesn't retrieve the usage because the response has already been converted, losing the usage info. I assume this is because the GPTAgent probably implements .RegisterMessageConnector() internally before we add our token counter middleware. I don't have experience with the Middleware concept in ASP.NET, and maybe the docs need a bit more explanation on this concept. I'll mention what I understood and please confirm that I understood correctly: Is the innerAgent an agent added as middleware? So the "outer" agent will initiate the innerAgent, and depending on the innerAgent's response, the outer agent will follow. This seems useful when we want an agent to respond by asking another agent that doesn't belong to the conversation. Is that right?

Additionally, I want to emphasize that usage data is essential information for agents, as a performant multi-agent process depends on both the correctness of the result and the total usage. I suggest building such functionality into AutoGen.

Thank you again for your assistance!

Why the order of middlewares matters when registering them with an agent

Middlewares are invoked in FILO (first in, last out) order relative to registration, so the last registered middleware is invoked first to process the incoming message and invoked last to process the outgoing message.
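The ordering can be illustrated with plain delegates, independent of AutoGen. This is only a sketch: MiddlewareOrderDemo, Wrap and Run are hypothetical names, and each "middleware" just records when it sees the incoming message and when it sees the reply.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class MiddlewareOrderDemo
{
    // Wraps an inner handler and returns a new handler that logs before and after the inner call.
    public static Func<string, string> Wrap(
        Func<string, string> inner, string name, List<string> trace) =>
        input =>
        {
            trace.Add($"{name}: in");
            var reply = inner(input);
            trace.Add($"{name}: out");
            return reply;
        };

    public static List<string> Run()
    {
        var trace = new List<string>();
        Func<string, string> agent = input =>
        {
            trace.Add("agent");
            return "reply";
        };

        // Same registration order as in the sample: counter first, connector second.
        var pipeline = Wrap(agent, "tokenCounter", trace);
        pipeline = Wrap(pipeline, "messageConnector", trace);

        pipeline("hello");
        return trace;
    }
}
```

Run() records the sequence "messageConnector: in", "tokenCounter: in", "agent", "tokenCounter: out", "messageConnector: out". The counter, registered first, sits closest to the agent, so it sees the raw reply before the connector converts it.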

Why registering middleware after creating the agent doesn't work

RegisterMiddleware doesn't modify the existing agent. Instead, it creates a new agent with that middleware and returns it. If you change the second code snippet to the following, it should work. I can update the comment on RegisterMiddleware to make this clearer.

IAgent gpt4o = new OpenAIChatAgent(
    openaiClient,
    modelName: "gpt-4o",
    name: "gpt",
    systemMessage: "You are a helpful AI assistant",
    temperature: 0);

gpt4o = gpt4o.RegisterMiddleware(tokenCounterMiddleware)
    .RegisterMiddleware(messageConnector)
    .RegisterPrintMessage();

If I use a GPTAgent, it works but it doesn't retrieve the usage because the response has already been converted, losing the usage info. I assume this is because the GPTAgent probably implements .RegisterMessageConnector() internally before we add our token counter middleware.

You are 100% right here.

I don't have experience with the Middleware concept in ASP.NET, and maybe the docs need a bit more explanation on this concept.

The doc for ASP.NET middleware can be found here. The agent middleware pattern is similar to it. See the picture below (BTW, the picture also explains why the token counter middleware needs to be registered before the message connector).

Is the innerAgent an agent added as middleware? So the "outer" agent will initiate the innerAgent, and depending on the innerAgent's response, the outer agent will follow. This seems useful when we want an agent to respond by asking another agent that doesn't belong to the conversation. Is that right?

When you call RegisterMiddleware on a TAgent, a MiddlewareAgent<TAgent> is created as a wrapper agent and returned. When you call RegisterMiddleware multiple times in a chain, the previously created MiddlewareAgent is passed to the next middleware as innerAgent. You can check the implementation of MiddlewareAgent<T> to see how middlewares get chained together.
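A rough sketch of the wrapper idea, under the assumption that a middleware is just a function receiving the input and the inner agent. IAgentSketch, EchoAgent, MiddlewareAgentSketch and the RegisterMiddleware extension below are hypothetical and much simpler than the real MiddlewareAgent<T>; they only show that registration returns a new wrapper and leaves the original agent unchanged.

```csharp
#nullable enable
using System;

// Hypothetical minimal agent contract (not AutoGen's IAgent).
public interface IAgentSketch
{
    string Reply(string input);
}

public sealed class EchoAgent : IAgentSketch
{
    public string Reply(string input) => input;
}

// Hypothetical wrapper: holds the inner agent and a middleware function.
public sealed class MiddlewareAgentSketch : IAgentSketch
{
    private readonly IAgentSketch _innerAgent;
    private readonly Func<string, IAgentSketch, string> _middleware;

    public MiddlewareAgentSketch(IAgentSketch innerAgent, Func<string, IAgentSketch, string> middleware)
    {
        _innerAgent = innerAgent;
        _middleware = middleware;
    }

    // The middleware decides if and how to call the inner agent.
    public string Reply(string input) => _middleware(input, _innerAgent);
}

public static class AgentSketchExtensions
{
    // Returns a NEW wrapper; the agent passed in is left untouched.
    public static IAgentSketch RegisterMiddleware(
        this IAgentSketch agent, Func<string, IAgentSketch, string> middleware) =>
        new MiddlewareAgentSketch(agent, middleware);
}
```

With an upper-casing middleware such as (input, inner) => inner.Reply(input).ToUpperInvariant(), calling agent.RegisterMiddleware(...) and discarding the result leaves agent.Reply("hi") returning "hi". Only after assigning the returned wrapper back to the variable does Reply("hi") return "HI".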

This seems useful when we want an agent to respond by asking another agent that doesn't belong to the conversation. Is that right?

Yes, you are 100% right. For example, you can ask a summarizer to summarize the chat history before passing it to the inner agent.

Additionally, I want to emphasize that usage data is essential information for agents, as a performant multi-agent process depends on both the correctness of the result and the total usage. I suggest including such functionality built-in AutoGen.

Thanks for the feedback. Can you elaborate on how you plan to use the total usage information? It would be super helpful for us when designing the related API.

Thank you very much for the detailed responses. It is now much clearer to me how all of this works.

The total usage information will be used as a measure of performance for a multi-agent process. We need a multi-agent process to complete a task correctly, but we also need to do it efficiently, meaning with as few tokens as possible. If we have several configurations of multi-agent processes that successfully complete a task, we can measure their performance by a token counter or the total cost.

The token counter will also be useful for all of us who pay the API bills, as multi-agent processes can be quite expensive!

A simple solution would be to include a built-in "AgentWithUsageData" for the popular APIs. I have already built this custom solution myself using your suggested middleware approach. I can share the code if anyone is interested. So far, I have only worked with a single agent, so I'm not sure if this will work as expected in multi-agent processes. I'll get back to you once it has been tested enough.

A better solution would be to get this data directly in the reply of the agent.

Thank you again for your assistance!
