This repository provides the official PyTorch implementation of the following paper:
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate
Qidong Huang1,2, Xiaoyi Dong2,3, Pan Zhang2, Yuhang Zang2, Yuhang Cao2, Jiaqi Wang2, Dahua Lin2, Weiming Zhang1, Nenghai Yu1
1University of Science and Technology of China, 2Shanghai AI Laboratory, 3The Chinese University of Hong Kong
[2024.10.10] 🚀 We release the paper on arXiv and HuggingFace!
[2024.10.10] 🚀 This project page has been built!
If you just want to use MIR as the pre-training indicator of your own model, no additional environment is required:

1. Make sure `torch`, `numpy`, and `scipy` are installed.
2. Adapt `mir.py` with your own model's code; we display LLaVA's code as the reference.
3. Run the command:

```bash
python mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --text_data_path PATH/TO/TEXT/DATA --image_data_path PATH/TO/VISION/DATA --eval_num 100 --mode fast
```
Note that `base_llm` is not required if you haven't trained the base LLM during pre-training.

You can also adjust the args to match the initialization style of your own model.
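For orientation, here is a minimal sketch of what swapping in your own model's initialization could look like. This is an assumption-laden illustration, not the code in `mir.py`: the helper name `load_custom_model` and the generic `transformers` auto-classes are placeholders for however your model is actually loaded.

```python
# Hypothetical replacement for the LLaVA-specific loading step in mir.py.
# Assumes a Hugging Face-style checkpoint; adapt to your own model's API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoImageProcessor

def load_custom_model(model_path: str, device: str = "cuda"):
    """Return the tokenizer, model, and image processor used to extract features for MIR."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    image_processor = AutoImageProcessor.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
    model = model.to(device).eval()
    return tokenizer, model, image_processor
```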
If you just want to use MoCa on your own model, we recommend you follow the steps below:

1. Add `modality_mask` to your model; please refer to Line 183-184, Line 269-276, and Line 373-382 in `llava/model/llava_arch.py`. Also, make sure that `modality_mask` can be successfully delivered into the model forward pass, e.g., by adding it as a formal parameter of each forward function, like Line 70, Line 88, Line 96, Line 106, Line 127, Line 137, Line 145, Line 157, Line 166, and Line 174-175 in `llava/model/language_model/llava_llama.py` (see the sketch after this list).
2. Support training with `use_moca=True`; it is recommended to search for `use_moca` in this repo to find which places should be revised:
   1) Add it to the model config (here).
   2) Add it to the training arguments (here).
   3) Unlock it during training (here).
   4) Ensure the correct checkpoint saving (here1, here2, here3).
3. Add `--use_moca` when running the training command to enable MoCa.
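As a rough illustration of step 1, the following sketch shows the pattern of threading `modality_mask` through nested forward calls. It is not the repo's actual code: the module names, shapes, and the assumption that visual tokens sit at the front of the sequence are placeholders; only the plumbing of the extra argument is the point.

```python
import torch
import torch.nn as nn
from typing import Optional

class ToyLanguageModel(nn.Module):
    """Stand-in for the base LLM: modality_mask arrives as an explicit forward argument."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.layer = nn.Linear(hidden_size, hidden_size)

    def forward(self, inputs_embeds: torch.Tensor,
                modality_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
        # modality_mask: (batch, seq_len) bool, True where a token comes from the vision encoder.
        # A MoCa-style method can use it here to treat visual and text tokens differently.
        return self.layer(inputs_embeds)

class ToyVLM(nn.Module):
    """Stand-in for the wrapper model: it builds modality_mask and passes it down unchanged."""
    def __init__(self):
        super().__init__()
        self.llm = ToyLanguageModel()

    def forward(self, inputs_embeds: torch.Tensor, num_image_tokens: int) -> torch.Tensor:
        batch, seq_len, _ = inputs_embeds.shape
        # Illustrative assumption: the first num_image_tokens positions are visual tokens.
        modality_mask = torch.zeros(batch, seq_len, dtype=torch.bool,
                                    device=inputs_embeds.device)
        modality_mask[:, :num_image_tokens] = True
        return self.llm(inputs_embeds, modality_mask=modality_mask)
```

For example, `ToyVLM()(torch.randn(2, 10, 64), num_image_tokens=4)` runs the mask through both forwards end to end.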
If you want to use our codebase (modified from LLaVA) for reproduction, we recommend building a new environment through the steps below. The following steps are listed for Linux; if you are using macOS or Windows, please refer to LLaVA.
```bash
git clone https://github.com/shikiw/Modality-Integration-Rate.git
cd Modality-Integration-Rate
conda create -n llava python=3.10 -y
conda activate llava
python -m pip install --upgrade pip  # enable PEP 660 support
python -m pip install -e .
python -m pip install -e transformers-4.37.2
python -m pip install -e ".[train]"
python -m pip install flash-attn --no-build-isolation
```
To reproduce the MIR implementation on this codebase, you can follow these steps:

1. Specify the `text_data_path` and `image_data_path` for MIR calculation. You can also specify them like Line 55-64 in `mir.py`, using TextVQA val images and CNN/DM text by default, i.e.:
   1) Download TextVQA_0.5.1_val.json and images and extract them to `PATH/TO/VISION/DATA`.
   2) Download CNN stories and extract them to `PATH/TO/TEXT/DATA`.
   3) Modify Line 55-64 with the text data path and image data path.
2. Run the command:

```bash
python mir.py --model_path PATH/TO/MODEL --base_llm PATH/TO/LLM --eval_num 100 --mode fast
```

If the base LLM was not trained during pre-training, `--base_llm` can be omitted:

```bash
python mir.py --model_path PATH/TO/MODEL --eval_num 100 --mode fast
```
Our codebase supports `--use_moca` to activate the implementation of MoCa. Check out `scripts/v1_5/pre_sft_moca.sh` for more details.
Model | Size | Schedule | Average | MMStar | MME | MMB | MMB-CN | SEED-IMG | TextVQA | MM-Vet | POPE | GQA |
---|---|---|---|---|---|---|---|---|---|---|---|---|
LLaVA-v1.5 | 7B | full_ft-1e | 59.1 | 30.3 | 1510.7 | 64.3 | 58.3 | 66.1 | 58.2 | 31.1 | 85.9 | 62.0 |
+MoCa | 7B | full_ft-1e | 60.6 | 36.5 | 1481.0 | 66.8 | 60.0 | 67.0 | 58.7 | 32.2 | 86.9 | 62.8 |
The pretrained and finetuned checkpoints are released.
This codebase is based on LLaVA and ShareGPT4V, where we introduce some new features; it now supports the following options in the launch script (a hedged example command follows the list):

1) `--tune_vision_tower` and `--tune_vit_from_layer`
2) `--tune_language_model` and `--tune_llm_utill_layer`
3) `--tune_entire_model`
4) `--data_scale`
5) `--use_moca` and `--moca_std`
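To make the options concrete, here is an illustrative launch snippet. The launcher, layer indices, data scale, and `--moca_std` value are placeholders in the style of the LLaVA launch scripts under `scripts/v1_5/`, not a verbatim excerpt; check the actual scripts for the full argument lists.

```bash
# Illustrative only: how the new options might be combined in a pre-training launch
# (the usual LLaVA arguments for model, data, and output paths are omitted here).
deepspeed llava/train/train_mem.py \
    --tune_vision_tower True \
    --tune_vit_from_layer 12 \
    --tune_language_model True \
    --tune_llm_utill_layer 16 \
    --data_scale 200000 \
    --use_moca True \
    --moca_std 0.01
```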
Some cases for reference:
To pre-train the model with a customized data scale (e.g., 200K):
```bash
sh scripts/v1_5/pre_data_scale.sh
```

To pre-train the model (unlocking layers 13-24 of the ViT and layers 1-16 of the base LLM), then SFT (unlocking the entire LLM by default):
```bash
sh scripts/v1_5/pre_unlock_vit-12_llm-16_sft.sh
```

To pre-train the model (unlocking layers 13-24 of the ViT and the entire base LLM), then SFT (unlocking the entire LLM by default):
```bash
sh scripts/v1_5/pre_unlock_vit-12_llm-all_sft.sh
```

To apply MoCa in training:
```bash
sh scripts/v1_5/pre_sft_moca.sh
```
We follow the original evaluation in LLaVA for most benchmarks. For MMStar, we use VLMEvalKit.
See Evaluation.md.
This repo is based on the codebase of LLaVA and ShareGPT4V. Thanks for their impressive work!
If you find this work useful for your research, please cite our paper:
```bibtex
@article{huang2024deciphering,
  title={Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate},
  author={Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Jiaqi and Lin, Dahua and Zhang, Weiming and Yu, Nenghai},
  journal={arXiv preprint arXiv:2410.07167},
  year={2024}
}
```