
[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

Qilang Ye1, Zitong Yu*1, Rui Shao2, Xinyu Xie1, Philip Torr3, Xiaochun Cao4
1 Great Bay University
2 Harbin Institute of Technology, Shenzhen
3 University of Oxford
4 Shenzhen Campus of Sun Yat-sen University

*Corresponding author

[![arXiv](https://img.shields.io/badge/arXiv-2403.04640-b31b1b.svg)](https://arxiv.org/abs/2403.04640) [![License](https://img.shields.io/badge/Code%20License-Apache2.0-yellow)](https://github.com/rikeilong/Bay-CAT/blob/main/LICENSE)

News :loudspeaker:

Introduction :bulb:

We introduce CAT, which enhances MLLMs in three ways:
1) We design a clue aggregator that gathers question-related clues in dynamic audio-visual scenarios, enriching the detailed knowledge available to the large language model.
2) CAT is trained on a mixed multimodal dataset, allowing direct application to audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset, AVinstruct, to further strengthen CAT's ability to model cross-semantic correlations.
3) We propose AI-assisted ambiguity-aware direct preference optimization (ADPO), a strategy that retrains the model to favor non-ambiguous responses and improves its ability to localize specific audio-visual objects (see the sketch after this list).
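
The full ADPO recipe is documented in ADPO.md. For orientation only, below is a minimal sketch of the vanilla DPO-style preference loss that ADPO builds on; the function name, tensor arguments, and the `beta` value are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a vanilla DPO-style preference loss (illustrative only;
# the actual ambiguity-aware ADPO objective is described in ADPO.md).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Prefer the chosen (non-ambiguous) response over the rejected
    (ambiguous) one, regularized by a frozen reference model."""
    # Implicit rewards: log-ratio of policy vs. reference per response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```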

Demo 🤗

Training & Validation

We have collected an audio-visual joint instruction dataset named AVinstruct; see Data.md for details.

The fine-tuning process is described in SFT.md.

The ADPO process is described in ADPO.md.

Citation ✏️

If you find this work useful for your research, please cite our paper and star this repo.

```bibtex
@misc{ye2024cat,
      title={CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios}, 
      author={Qilang Ye and Zitong Yu and Rui Shao and Xinyu Xie and Philip Torr and Xiaochun Cao},
      year={2024},
      eprint={2403.04640},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```