OmAgent is a multimodal intelligent agent system dedicated to harnessing the power of multimodal large language models and other multimodal algorithms to accomplish intriguing tasks. The OmAgent project encompasses a lightweight intelligent agent framework, `omagent_core`, designed to address multimodal challenges. With this framework, we have constructed a long-form video comprehension system, OmAgent. Naturally, you are free to use it to realize any of your own innovative ideas.
OmAgent comprises three core components. For more details, check out our paper *[OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer](https://arxiv.org/abs/2406.16620)*.
```bash
cd omagent-core
pip install -e .
cd ..
pip install -r requirements.txt
```
Create a configuration file and set the necessary variables:

```bash
cd workflows/general && vim config.yaml
```
| Configuration Name | Usage |
|---|---|
| custom_openai_endpoint | API address for calling OpenAI GPT or another MLLM, in the format `{custom_openai_endpoint}/chat/completions` |
| custom_openai_key | `api_key` provided by the MLLM provider |
| bing_api_key | Bing API key, used for web search |
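For reference, a minimal `config.yaml` might look like the sketch below. The keys mirror the table above and the values are placeholders; the actual file layout depends on the workflow template, so treat this as an assumption rather than the canonical format:

```yaml
custom_openai_endpoint: https://api.openai.com/v1  # base URL; /chat/completions is appended
custom_openai_key: sk-your-api-key                 # key issued by your MLLM provider
bing_api_key: your-bing-api-key                    # used for web search
```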
Set up `run.py`:
```python
from pathlib import Path

# NOTE: the omagent_core import paths below are indicative; adjust them
# to match your installed version.
from omagent_core.utils.logger import logging
from omagent_core.utils.registry import registry
from omagent_core.utils.build import Builder
from omagent_core.schemas import AgentTask, DnCInterface

def run_agent(task):
    logging.init_logger("omagent", "omagent", level="INFO")
    registry.import_module(project_root=Path(__file__).parent, custom=["./engine"])
    bot_builder = Builder.from_file("workflows/general")  # general task processing workflow configuration directory
    input = DnCInterface(bot_id="1", task=AgentTask(id=0, task=task))
    bot_builder.run_bot(input)
    return input.last_output

if __name__ == "__main__":
    run_agent("Your Query")  # Enter your query
```
Start OmAgent by running `python run.py`.
Optional
OmAgent uses Milvus Lite as its vector database by default. If you wish to use the full Milvus service, you can deploy the Milvus vector database with Docker. The vector database stores video feature vectors and retrieves relevant vectors based on queries to reduce MLLM computation. Don't have Docker installed? Refer to the Docker installation guide.
```bash
# Download the milvus startup script
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh
# Start milvus in standalone mode
bash standalone_embed.sh start
```
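If you later need to stop or tear down the standalone service, the same script provides `stop` and `delete` subcommands per the Milvus documentation (check the script itself if your version differs):

```bash
bash standalone_embed.sh stop    # stop the standalone Milvus container
bash standalone_embed.sh delete  # remove the container and its data
```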
After the deployment, fill in the relevant configuration information in `workflows/video_understanding/config.yml`.
Optional
Configure the face recognition algorithm. The face recognition algorithm can be called as a tool by the agent, but it is optional. You can disable this feature by modifying the `workflows/video_understanding/tools/video_tools.json` configuration file and removing the FaceRecognition section. The default face recognition database is stored in the `data/face_db` directory, with different folders corresponding to different individuals.
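As an illustration, the face database layout might look like the tree below, where each folder holds images of one person (the person and file names here are hypothetical):

```
data/face_db/
├── alice/
│   ├── photo_1.jpg
│   └── photo_2.jpg
└── bob/
    └── photo_1.jpg
```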
Optional
The Open Vocabulary Detection (ovd) service is used to enhance OmAgent's ability to recognize various objects. The ovd tools depend on this service, but it is optional. You can disable the ovd tools by removing the following from `workflows/video_understanding/tools/video_tools.json`:
```json
{
    "name": "ObjectDetection",
    "ovd_endpoint": "$<ovd_endpoint::http://host_ip:8000/inf_predict>",
    "model_id": "$<ovd_model_id::OmDet-Turbo_tiny_SWIN_T>"
}
```
If you want to use the ovd tools, we use OmDet for demonstration.

Install the dependencies needed to expose OmDet Inference as an API:

```bash
pip install pydantic fastapi uvicorn
```

Create a `wsgi.py` file to expose OmDet Inference as an API:

```bash
cd OmDet && vim wsgi.py
```

Copy the OmDet Inference API code into `wsgi.py`, then start the service:

```bash
python wsgi.py
```
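The OmDet Inference API code itself is not reproduced here. As a rough sketch of the shape such a `wsgi.py` could take, the stub below wires a FastAPI route at `/inf_predict`, matching the default `ovd_endpoint`. The request schema and the `run_omdet` helper are hypothetical placeholders, not the actual OmDet API; replace the stub with the real inference code:

```python
# wsgi.py -- hypothetical sketch; swap the stub for the real OmDet inference code
from fastapi import FastAPI
from pydantic import BaseModel
import uvicorn

app = FastAPI()

class PredictRequest(BaseModel):
    # Assumed request fields; the actual OmDet API may expect different ones.
    model_id: str        # e.g. "OmDet-Turbo_tiny_SWIN_T"
    data: list[str]      # base64-encoded images
    labels: list[str]    # open-vocabulary class names to look for

def run_omdet(model_id: str, images: list[str], labels: list[str]) -> list[dict]:
    # Placeholder: load the OmDet model and run detection here.
    return []

@app.post("/inf_predict")
def inf_predict(req: PredictRequest):
    return {"objects": run_omdet(req.model_id, req.data, req.labels)}

if __name__ == "__main__":
    # Port 8000 matches the default ovd_endpoint http://host:8000/inf_predict
    uvicorn.run(app, host="0.0.0.0", port=8000)
```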
Download some interesting videos
Create a configuration file and set the necessary environment variables:

```bash
cd workflows/video_understanding && vim config.yaml
```
Configure the API addresses and API keys for MLLM and tools.
| Configuration Name | Usage |
|---|---|
| custom_openai_endpoint | API address for calling OpenAI GPT or another MLLM, in the format `{custom_openai_endpoint}/chat/completions` |
| custom_openai_key | `api_key` provided by the respective API provider |
| bing_api_key | Bing API key, used for web search |
| ovd_endpoint | ovd tool API address. If using OmDet, the address should be `http://host:8000/inf_predict` |
| ovd_model_id | Model ID used by the ovd tool. If using OmDet, the model ID should be `OmDet-Turbo_tiny_SWIN_T` |
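Extending the earlier `config.yaml` sketch, the two ovd-specific entries might look like this (the values are the OmDet defaults from the table; the flat-key layout is again an assumption):

```yaml
ovd_endpoint: http://host:8000/inf_predict  # replace host with your OmDet server address
ovd_model_id: OmDet-Turbo_tiny_SWIN_T
```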
Set up `run.py`:
```python
# Same imports as the general-purpose run.py above.

def run_agent(task):
    logging.init_logger("omagent", "omagent", level="INFO")
    registry.import_module(project_root=Path(__file__).parent, custom=["./engine"])
    bot_builder = Builder.from_file("workflows/video_understanding")  # video understanding task workflow configuration directory
    input = DnCInterface(bot_id="1", task=AgentTask(id=0, task=task))
    bot_builder.run_bot(input)
    return input.last_output

if __name__ == "__main__":
    run_agent("")  # You will be prompted to enter the query in the console
```
Start OmAgent by running `python run.py`. Enter the path of the video you want to process, wait a moment, then enter your query, and OmAgent will answer based on the query.
If you are intrigued by multimodal algorithms, large language models, and agent technologies, we invite you to delve deeper into our research endeavors:
🔆 How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection (AAAI24)
🏠 Github Repository
🔆 OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network (IET Computer Vision)
🏠 Github Repository
If you find our repository beneficial, please cite our paper:
```bibtex
@article{zhang2024omagent,
  title={OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer},
  author={Zhang, Lu and Zhao, Tiancheng and Ying, Heting and Ma, Yibo and Lee, Kyusong},
  journal={arXiv preprint arXiv:2406.16620},
  year={2024}
}
```