Open · whiskyboy opened this issue 7 months ago
Re 1. We have some ongoing work in https://github.com/microsoft/autogen/pull/1929 and https://github.com/microsoft/autogen/pull/2414 on adding functions as a function store. This is similar to your idea of adding built-in Hugging Face tools. It's best to discuss this direction with @gagb @afourney and @LeoLjl and see if you can combine efforts.
Re 2. Sounds interesting! I think we can start with a notebook example to show how this works, and then decide whether to do just a notebook PR or a contrib agent.
Re 3. cc @WaelKarkoub @BeibinLi: we already have a text-to-image capability.
@whiskyboy We implemented a VisionCapability that adds the vision modality (image-to-text) to any LLM: https://microsoft.github.io/autogen/docs/notebooks/agentchat_lmm_gpt-4v/#behavior-with-and-without-visioncapability-for-agents. We also implemented an ImageGeneration capability that allows any LLM to generate images (text-to-image): https://microsoft.github.io/autogen/docs/notebooks/agentchat_image_generation_capability
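For reference, here is roughly how both capabilities attach to an agent. This is a minimal sketch adapted from those notebooks; the config values are placeholders, and exact class and parameter names may shift between versions:

```python
from autogen import ConversableAgent
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability
from autogen.agentchat.contrib.capabilities import generate_images

# Placeholder configs: substitute your own models and keys.
llm_config = {"config_list": [{"model": "gpt-3.5-turbo", "api_key": "YOUR_KEY"}]}
lmm_config = {"config_list": [{"model": "gpt-4-vision-preview", "api_key": "YOUR_KEY"}]}
dalle_config = {"config_list": [{"model": "dall-e-3", "api_key": "YOUR_KEY"}]}

agent = ConversableAgent(name="assistant", llm_config=llm_config)

# Image-to-text: incoming images get captioned so a text-only LLM can "see" them.
VisionCapability(lmm_config=lmm_config).add_to_agent(agent)

# Text-to-image: the agent can now reply with generated images (DALL-E backend here).
dalle_generator = generate_images.DalleImageGenerator(llm_config=dalle_config)
generate_images.ImageGeneration(image_generator=dalle_generator).add_to_agent(agent)
```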
Other multimodal features are currently in progress; you can track them in this roadmap: https://github.com/microsoft/autogen/issues/1975. Let me know if you have other ideas that we could add to the roadmap.
@ekzhu It's good to know there will be a function store in AutoGen soon! I will also try to provide a PoC of the #2 approach in the next couple of days.
@WaelKarkoub Thank you for sharing this awesome roadmap! I'm also thinking of adding some similar multimodal capabilities, like TTS or document QA, but with non-OpenAI models (more specifically, with open-source models from the Hugging Face Hub). Although the current implementation of some capabilities accepts a custom process function, built-in support for Hugging Face models is also attractive (to me at least). Additionally, we could achieve more capabilities, like image-to-image and audio separation, by leveraging hf-hub.
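For example, if the custom process function I mean is VisionCapability's custom_caption_func, an open-source Hub model might already slot in there. A rough sketch, assuming the hook signature from the lmm notebook (image URL plus optional decoded image data and client), with BLIP used purely as an example model:

```python
from transformers import pipeline
from autogen.agentchat.contrib.capabilities.vision_capability import VisionCapability

# An open-source captioning model pulled from the Hugging Face Hub.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def hf_caption(image_url, image_data=None, lmm_client=None) -> str:
    # The pipeline accepts a URL, file path, or PIL image and returns
    # a list of dicts like [{"generated_text": "..."}].
    return captioner(image_url)[0]["generated_text"]

# No lmm_config needed: captions come from the local HF model instead.
vision = VisionCapability(custom_caption_func=hf_caption)
# vision.add_to_agent(some_agent)
```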
@whiskyboy Just for awareness, I have a PR that handles text-to-speech and speech-to-text: https://github.com/microsoft/autogen/pull/2098. I'm still experimenting with the architecture, but it mostly works.
@WaelKarkoub @ekzhu Drafted a PR here: #2599
Is your feature request related to a problem? Please describe.
The HuggingFace Hub provides an elegant Python client that gives users access to 100,000+ Hugging Face models and lets them run inference on these models for a variety of multimodal tasks, like image-to-text, text-to-speech, etc. By connecting to this hub, a text-only LLM like gpt-3.5-turbo could also gain the multimodal capability to handle images, video, audio, and documents, in a cost-efficient way.
However, it still takes some additional coding work to let an AutoGen agent interact with a huggingface-hub client, such as wrapping client methods into functions, parsing the different input/output types, and managing model deployment. That's why I'm asking whether AutoGen could offer an out-of-the-box solution for this integration.
Other similar work: JARVIS, Transformers Agent
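To make the wrapping work concrete, this is roughly the boilerplate needed today. It is a sketch only; describe_image is my own illustrative wrapper around huggingface_hub's InferenceClient:

```python
import autogen
from huggingface_hub import InferenceClient

llm_config = {"config_list": [{"model": "gpt-3.5-turbo", "api_key": "YOUR_KEY"}]}  # placeholder
assistant = autogen.AssistantAgent("assistant", llm_config=llm_config)
user_proxy = autogen.UserProxyAgent(
    "user_proxy", human_input_mode="NEVER", code_execution_config=False
)

client = InferenceClient()  # serverless inference; optionally pass model=... or token=...

def describe_image(image_path: str) -> str:
    """Caption an image with a Hub-hosted image-to-text model."""
    out = client.image_to_text(image_path)
    # Older huggingface_hub versions return a str, newer ones a dataclass
    # with a .generated_text field; normalize either way.
    return out if isinstance(out, str) else out.generated_text

# The assistant proposes the call; the user proxy executes it.
autogen.register_function(
    describe_image,
    caller=assistant,
    executor=user_proxy,
    description="Generate a text caption for an image file.",
)
```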
Describe the solution you'd like
I'm considering a few possible approaches:
1. Built-in Hugging Face tools: ship ready-made function wrappers around huggingface-hub tasks that any agent can register and call.
2. A new contrib agent, e.g. huggingface_agent, like Transformers Agent. This agent would essentially consist of a pairing between an assistant and a user-proxy agent, both registered with the huggingface-hub toolkit. Users could seamlessly access this agent to leverage its multimodal capabilities, without the need to manually register the toolkit for execution.
3. A new multimodal capability that converts multimodal content in incoming messages via the process_last_received_message method. However, it may not be straightforward for some tasks such as text-to-image. (A rough sketch of this hook-based approach follows this list.)
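A rough sketch of what the capability in option 3 could look like, built on ConversableAgent's process_last_received_message hook. HFCaptionCapability and its toy message handling are purely illustrative:

```python
from typing import Optional

from autogen import ConversableAgent
from huggingface_hub import InferenceClient

class HFCaptionCapability:
    """Illustrative capability: swap image references in the last received
    message for captions generated by a Hugging Face Hub model."""

    def __init__(self, model: Optional[str] = None):
        self._client = InferenceClient(model=model)

    def add_to_agent(self, agent: ConversableAgent) -> None:
        agent.register_hook("process_last_received_message", self._process)

    def _process(self, message):
        # Toy logic: if the message is a path/URL to an image, caption it.
        if isinstance(message, str) and message.lower().endswith((".png", ".jpg", ".jpeg")):
            out = self._client.image_to_text(message)
            caption = out if isinstance(out, str) else out.generated_text
            return f"(image described by an HF model) {caption}"
        return message
```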
Additional context
I'd like to hear your suggestions, and I'm happy to contribute in any of these directions.