microsoft / autogen

A programming framework for agentic AI. Discord: https://aka.ms/autogen-dc. Roadmap: https://aka.ms/autogen-roadmap
https://microsoft.github.io/autogen/

[Issue]: Autogen with vision models like GPT-4o creates HUGE spike in usage and bill #2827

Open daniel-counto opened 1 month ago

daniel-counto commented 1 month ago

Describe the issue

I created three agents to read document images: black-and-white financial documents that are not very large in resolution (around 1k x 2k pixels or smaller). The model I used for all of them is GPT-4o.

The flow it creates is mostly linear, i.e. agent 1 -> agent 2 -> a final agent that summarizes the output. However, for only the 400 images I uploaded, it has already cost me over USD 200, and the context usage is about 28+ million tokens!

I wonder whether this is because AutoGen inserts the image bytes into the prompt itself. If so, wouldn't the better approach be to upload the images somewhere and insert only the image URL into the prompts?
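
For reference, the OpenAI chat format does accept a hosted image by URL instead of inline base64 data; a minimal sketch of that variant (the bucket URL and prompt below are hypothetical) looks like:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "Extract the line items from this invoice."},
        {
            "type": "image_url",
            "image_url": {"url": "https://example-bucket.blob.core.windows.net/invoices/0001.png", "detail": "low"},
        },
    ],
}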

Steps to reproduce

Step 1 - The agents are constructed as follows:


import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

# Vision-capable agent that extracts content from each document image
image_agent = MultimodalConversableAgent(
    name="image-content-extracter",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_gpt4, "temperature": 0.05, "max_tokens": 1024, "cache_seed": None},
    human_input_mode="NEVER",
)

# Reviews the extractor's output against the original image and corrects mistakes
agent_1 = MultimodalConversableAgent(
    name="agent_1",
    system_message="""You are a helpful agent.
    Look at the image, compare the extraction results against those extracted by image-content-extracter, and correct them if any mistakes are found.""",
    max_consecutive_auto_reply=4,
    llm_config={"config_list": config_list_gpt4, "temperature": 0, "max_tokens": 1024, "cache_seed": None},
    human_input_mode="NEVER",
)

# Formats the finalized results as JSON
agent_2 = MultimodalConversableAgent(
    name="agent_2",
    system_message="You are agent_1's assistant. You put the finalized results in JSON format.",
    max_consecutive_auto_reply=2,
    llm_config={"config_list": config_list_gpt4, "temperature": 0, "max_tokens": 800, "response_format": {"type": "json_object"}, "cache_seed": None},
    human_input_mode="NEVER",
)

# Coding assistant (defined but not added to the group chat below)
coder = autogen.AssistantAgent(
    name="coding_assistant",
    system_message="Helpful coding assistant.",
    llm_config={"config_list": config_list_gpt4, "temperature": 0.1, "max_tokens": 2048},
)

# user_proxy is assumed to be a standard UserProxyAgent (its definition was not shown above)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

groupchat = autogen.GroupChat(agents=[user_proxy, image_agent, agent_1, agent_2], messages=[], max_round=5)
group_chat_manager = autogen.GroupChatManager(groupchat=groupchat, llm_config={"config_list": config_list_gpt4})

Step 2 - Initiate the chat:

user_prompt = "<this is a detailed prompt of about 1700 tokens>"

session = user_proxy.initiate_chat(
    group_chat_manager,
    message=user_prompt,
)

Step 3

Execute the multi-agent flow above on about 500 images; each is a standard invoice image.
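
A minimal sketch of such a batch run, assuming the invoices sit in a local ./invoices directory and are passed to the multimodal agents with AutoGen's <img ...> tag syntax (the directory, file pattern, and prompt reuse are assumptions):

from pathlib import Path

for image_path in sorted(Path("./invoices").glob("*.png")):  # hypothetical folder of ~500 invoice images
    user_proxy.initiate_chat(
        group_chat_manager,
        message=f"{user_prompt}\n<img {image_path}>",  # MultimodalConversableAgent expands <img ...> tags into image content
    )

Each initiate_chat call starts a fresh conversation by default, so one document's image should not be carried over into the next document's prompts.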

Screenshots and logs

[Screenshot 2024-05-28 211431: single-day token usage, with only about 400 images uploaded.]

Additional Information

The right way to send an image to the OpenAI API is not as a plain string but in this format:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "How many bananas?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/x-png;base64,{base64_image}", "detail": "low"},
        },
    ],
}
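
As a concrete comparison point outside AutoGen, a minimal sketch of sending one image directly with the openai Python client (v1 interface; the file path is hypothetical) looks like:

import base64
from openai import OpenAI

client = OpenAI()

with open("invoice_0001.png", "rb") as f:  # hypothetical local invoice image
    base64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "How many bananas?"},
                {
                    "type": "image_url",
                    # "low" requests the cheaper, fixed-size image representation
                    "image_url": {"url": f"data:image/png;base64,{base64_image}", "detail": "low"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)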

Please make the changes.

tytung2020 commented 1 month ago

Yes, wondering how images are being inserted in AutoGen.

qingyun-wu commented 1 month ago

@BeibinLi for awareness!

BeibinLi commented 1 month ago

The charge for reading images does not depend on how the images are inserted but on how large the images are. Moreover, AutoGen uses the same format as the vanilla OpenAI API; see this line of code.

Even if there are only 400 images, because of the multi-agent design, the chat history may contain more than one copy of each image. This is particularly true for group chats, because each agent sees the history of all other agents. For instance, if there are 10 agent interactions in the group chat, the number of image occurrences across the prompts is 1 + 2 + 3 + ... + 10 = 55.
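
A rough back-of-the-envelope sketch of how these two factors multiply, assuming OpenAI's documented high-detail image pricing for GPT-4-class vision models (85 base tokens plus 170 tokens per 512 x 512 tile) and reusing the 10-interaction example above; the actual numbers for any given run will differ:

import math

def image_tokens(width, height, detail="high"):
    # Estimate per OpenAI's documented vision pricing (assumed here, not measured)
    if detail == "low":
        return 85
    scale = min(1.0, 2048 / max(width, height))        # fit within a 2048 x 2048 square
    width, height = width * scale, height * scale
    scale = min(1.0, 768 / min(width, height))         # shortest side scaled down to 768 px
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

per_image = image_tokens(1000, 2000)                   # ~1105 tokens for one 1k x 2k page
interactions = 10                                      # hypothetical turns per document
occurrences = interactions * (interactions + 1) // 2   # 1 + 2 + ... + 10 = 55, as in the example above
print(400 * per_image * occurrences)                   # ~24 million image tokens for 400 documents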

For more details, please read here.