Open antoan opened 8 months ago
@victordibia fyi
Hi @antoan,
Thanks for the note. Currently, only the core autogen agent classes are supported - UserProxy and Assistant (GroupChat support is in development and on the roadmap). We plan to start supporting more agent types from contrib in the future, but this is not currently on the roadmap.
If you would consider describing your envisioned use case in a bit more detail, that would be helpful once we get there.
In the meantime, @BeibinLi is thinking about implementing multimodal support in the core. Knowing the use case here would help with that as well.
I see, thanks for letting me know.
My use case involves the periodic visual monitoring of an industrial hangar for anomalies - e.g., people present in the hangar when none should be - via a camera stream.
I initially intended to use a multimodal agent in conjunction with AutoGen Studio to render the frames flagged as anomalous to the user, and a GUI is the only component I lack to complete the experience.
Please let me know if this is sufficient.
There was already a P3 for supporting contrib agents; I appended multimodal to that list.
It is working as it is now. I'm using AutoGen Studio without any changes: you just have to add a skill in the build skills tab and then add the newly created skill to your workflow. For instance, open the general assistant workflow and add this skill to the primary_assistant. Then you can use it to describe images or for any other text-image task. The only thing you have to take into account is the folder where the system tries to find the OAI_CONFIG_LIST and the image.
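For reference, the OAI_CONFIG_LIST file that autogen looks for is a JSON list of model entries along these lines (the api_key value here is a placeholder):

[
    {
        "model": "gpt-4o",
        "api_key": "sk-..."
    }
]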
The skill file I'm using is this one:
import autogen
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent


def describe_image_with_gp4o(task_description: str, image_name: str) -> str:
    """
    Describe the content of an image based on a given task description.

    Args:
        task_description (str): A description of what you want the agent to do.
        image_name (str): The name of the image file to be described.

    Returns:
        str: The description of the image content.
    """
    # Load the model configuration from OAI_CONFIG_LIST (see the note above
    # about the folder the system searches) and restrict it to gpt-4o
    config_list = autogen.config_list_from_json(
        "OAI_CONFIG_LIST", filter_dict={"model": ["gpt-4o"]}
    )
    gpt4_llm_config = {
        "config_list": config_list,
        "temperature": 0.5,
        "max_tokens": 300,
    }

    # Create the multimodal conversable agent that analyzes the image
    image_agent = MultimodalConversableAgent(
        name="image-explainer",
        max_consecutive_auto_reply=10,
        llm_config=gpt4_llm_config,
    )

    # Create the user proxy agent that sends the request
    user_proxy = autogen.UserProxyAgent(
        name="User_proxy",
        system_message="A human admin.",
        human_input_mode="NEVER",
        max_consecutive_auto_reply=0,
    )

    # Initiate the chat; the <img ...> tag tells the multimodal agent which image to load
    user_proxy.initiate_chat(
        image_agent,
        message=f"""What's on the image? <img {image_name}>. {task_description}""",
    )

    # The agent's reply is the last message in the conversation history
    response = user_proxy.chat_messages[image_agent][-1]["content"]
    return response


# Example usage of the function:
# try:
#     description = describe_image_with_gp4o(
#         "Please describe the main objects and their colors.", "imagen_2.jpg"
#     )
#     print(f"Image description: {description}")
# except Exception as e:
#     print(f"An error occurred: {e}")
Is it currently possible, or are there plans to support this in the future?