microsoft / autogen

A programming framework for agentic AI 🤖
Creative Commons Attribution 4.0 International

MultimodalConversableAgent in autogenstudio? #1169

Open antoan opened 8 months ago

antoan commented 8 months ago

Is it currently possible or are there plans to support this in the future?

rickyloynd-microsoft commented 8 months ago

@victordibia fyi

victordibia commented 8 months ago

Hi @antoan,

Thanks for the note. Currently, there is only support for core autogen agent classes - UserProxy, Assistant (GroupChat support currently in development and on the roadmap). We plan to start supporting more agent types from contrib in the future but this is not currently on the roadmap.

If you would consider describing your envisioned use case in a bit more detail, that would be helpful once we get there.

sonichi commented 8 months ago

In the meantime, @BeibinLi is thinking about implementing multimodal support in the core. Knowing the use case here would also help with that.

antoan commented 8 months ago

I see, thanks for letting me know.

My use case involves periodic visual monitoring of an industrial hangar for anomalies via a camera stream, e.g. people present in the hangar when none should be.

I initially intended to use a multimodal agent in conjunction with AutoGen Studio to render anomalous detection frames to the user, and a GUI is the only component I lack to complete the experience.

Please let me know if this is sufficient.

gagb commented 8 months ago

There was already a P3 for supporting contrib agents; I appended multimodal to that list.

Alblahm commented 3 months ago

It already works as it is now. I'm using AutoGen Studio without any changes: you just have to add a skill in the Build skills tab and then add the newly created skill to your workflow. For instance, open the General Assistant workflow and add the skill to primary_assistant. Then you can use it to describe images or for any other text-and-image task. The only thing you have to take into account is the folder where the system looks for the OAI_CONFIG_LIST and the image.

The skill file I'm using is this one:

import autogen


def describe_image_with_gp4o(task_description: str, image_name: str) -> str:
    """
    Describe the content of an image based on a given task description.

    Args:
        task_description (str): A description of what you want the agent to do.
        image_name (str): The name of the image file to be described.

    Returns:
        str: The description of the image content.
    """
    # Define the LLM configuration directly
    gpt4_llm_config = {
        "model": "gpt-4o",
        "temperature": 0.5,
        "max_tokens": 300,
    }

    # Create the multimodal conversable agent
    from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

    # NOTE: the constructor arguments were lost from the original post;
    # the name below is a placeholder, not the author's value.
    image_agent = MultimodalConversableAgent(
        name="image-explainer",
        llm_config=gpt4_llm_config,
    )

    # Create the user proxy agent
    # NOTE: only system_message survives from the original post; the other
    # arguments are assumptions chosen so the skill runs without human input.
    user_proxy = autogen.UserProxyAgent(
        name="user_proxy",
        system_message="A human admin.",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=0,
        code_execution_config=False,
    )

    # Initiate the chat with the image agent
    user_proxy.initiate_chat(image_agent, message=f"""What's on the image? <img {image_name}>. {task_description}""")

    # Read the image agent's last message from the chat history
    response = user_proxy.chat_messages[image_agent][-1]["content"]

    return response

# Example usage of the function:  
# try:  
#     description = describe_image_with_gp4o("Please describe the main objects and their colors.", "imagen_2.jpg")  
#     print(f"Image description: {description}")  
# except Exception as e:  
#     print(f"An error occurred: {e}")
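
Regarding the folder caveat mentioned above, here is a minimal sketch of one way to make the image lookup explicit instead of depending on whatever directory the skill happens to run in. The helper names `resolve_image_path` and `build_image_message` and the `search_dirs` parameter are hypothetical and not part of AutoGen or AutoGen Studio; only the `<img ...>` tag format is taken from the skill above.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_image_path(image_name: str, search_dirs: Optional[Iterable[str]] = None) -> Path:
    """Return the absolute path of the first existing file matching image_name.

    search_dirs is a hypothetical list of folders to try in order; when it is
    omitted, only the current working directory is searched.
    """
    candidate = Path(image_name)

    # An absolute path is used as-is; it only has to exist.
    if candidate.is_absolute():
        if candidate.is_file():
            return candidate
        raise FileNotFoundError(f"Image not found: {candidate}")

    dirs = [Path(d) for d in search_dirs] if search_dirs is not None else [Path.cwd()]
    for directory in dirs:
        path = (directory / candidate).resolve()
        if path.is_file():
            return path

    searched = ", ".join(str(d) for d in dirs)
    raise FileNotFoundError(f"{image_name!r} not found in: {searched}")


def build_image_message(task_description: str, image_path: Path) -> str:
    """Build the chat message in the same format the skill above sends."""
    return f"What's on the image? <img {image_path}>. {task_description}"
```

Inside the skill, the resolved path could be passed in place of the bare `image_name`, so a missing file fails early with a clear error rather than inside the agent chat.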