OmAgent

English | 中文

🗓️ Updates

📖 Introduction

OmAgent is a multimodal intelligent agent system built to harness multimodal large language models and other multimodal algorithms to accomplish interesting tasks. The project includes a lightweight agent framework, omagent_core, designed specifically for multimodal challenges. With this framework, we have built a long-form video comprehension system, OmAgent, and you are equally free to use it to realize your own ideas.
OmAgent comprises three core components, which are described in detail in our paper OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer.

🛠️ How To Install

General Task Processing

  1. Create a configuration file and set the necessary variables (an example config.yaml is sketched after this list)

    cd workflows/general && vim config.yaml
    | Configuration Name | Usage |
    | --- | --- |
    | custom_openai_endpoint | API address for calling OpenAI GPT or another MLLM, format: {custom_openai_endpoint}/chat/completions |
    | custom_openai_key | api_key provided by the MLLM provider |
    | bing_api_key | Bing API key, used for web search |
  2. Set up run.py

    def run_agent(task):
        logging.init_logger("omagent", "omagent", level="INFO")
        registry.import_module(project_root=Path(__file__).parent, custom=["./engine"])
        bot_builder = Builder.from_file("workflows/general") # General task processing workflow configuration directory
        input = DnCInterface(bot_id="1", task=AgentTask(id=0, task=task))
    
        bot_builder.run_bot(input)
        return input.last_output
    
    if __name__ == "__main__":
        run_agent("Your Query") # Enter your query
  3. Start OmAgent by running python run.py.
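
For reference, a minimal workflows/general/config.yaml might look like the sketch below. It is only a sketch: the keys mirror the table above, all values are placeholders, and the exact file structure may differ in your version of the repository.

    # workflows/general/config.yaml -- placeholder values
    custom_openai_endpoint: https://api.openai.com/v1   # requests are sent to {custom_openai_endpoint}/chat/completions
    custom_openai_key: sk-your-api-key                   # key issued by your MLLM provider
    bing_api_key: your-bing-api-key                      # used for web search

Before starting the agent, you can sanity-check the endpoint and key with a plain chat-completions request, assuming the endpoint is OpenAI-compatible; the model name below is only an example of what a provider might serve.

    import requests

    endpoint = "https://api.openai.com/v1"   # same value as custom_openai_endpoint
    api_key = "sk-your-api-key"              # same value as custom_openai_key

    # Send a minimal chat-completions request and print the model's reply.
    resp = requests.post(
        f"{endpoint}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "gpt-4o",  # example model name; use the model your provider serves
            "messages": [{"role": "user", "content": "ping"}],
        },
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])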

Video Understanding Task

Environment Preparation

Running Preparation

  1. Create a configuration file and set the necessary environment variables

    cd workflows/video_understanding && vim config.yaml
  2. Configure the API addresses and API keys for the MLLM and tools (an example config.yaml is sketched after this list).

    | Configuration Name | Usage |
    | --- | --- |
    | custom_openai_endpoint | API address for calling OpenAI GPT or another MLLM, format: {custom_openai_endpoint}/chat/completions |
    | custom_openai_key | api_key provided by the respective API provider |
    | bing_api_key | Bing API key, used for web search |
    | ovd_endpoint | OVD tool API address. If using OmDet, the address should be http://host:8000/inf_predict |
    | ovd_model_id | Model ID used by the OVD tool. If using OmDet, the model ID should be OmDet-Turbo_tiny_SWIN_T |
  3. Set up run.py

    def run_agent(task):
        logging.init_logger("omagent", "omagent", level="INFO")
        registry.import_module(project_root=Path(__file__).parent, custom=["./engine"])
        bot_builder = Builder.from_file("workflows/video_understanding") # Video understanding task workflow configuration directory
        input = DnCInterface(bot_id="1", task=AgentTask(id=0, task=task))
    
        bot_builder.run_bot(input)
        return input.last_output
    
    if __name__ == "__main__":
        run_agent("") # You will be prompted to enter the query in the console
  4. Start OmAgent by running python run.py. Enter the path of the video you want to process, wait a moment for the video to be processed, then enter your query; OmAgent will answer based on the video and your query.
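
For reference, a minimal workflows/video_understanding/config.yaml might look like the sketch below. It extends the general configuration with the two OVD entries from the table above; the OmDet endpoint address and model ID come from that table, the other values are placeholders, and the exact file structure may differ in your version of the repository.

    # workflows/video_understanding/config.yaml -- placeholder values
    custom_openai_endpoint: https://api.openai.com/v1    # requests are sent to {custom_openai_endpoint}/chat/completions
    custom_openai_key: sk-your-api-key                    # key issued by your API provider
    bing_api_key: your-bing-api-key                       # used for web search
    ovd_endpoint: http://host:8000/inf_predict            # OmDet inference endpoint; replace host with your server
    ovd_model_id: OmDet-Turbo_tiny_SWIN_T                 # OmDet model ID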

🔗 Related works

If you are interested in multimodal algorithms, large language models, and agent technologies, we invite you to explore our related research:
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI 2024)
🏠 Github Repository

🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 Github Repository

⭐️ Citation

If you find our repository beneficial, please cite our paper:

@article{zhang2024omagent,
  title={OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer},
  author={Zhang, Lu and Zhao, Tiancheng and Ying, Heting and Ma, Yibo and Lee, Kyusong},
  journal={arXiv preprint arXiv:2406.16620},
  year={2024}
}