JSON files with the score results for numerous vision language benchmarks in MoAI are also available on Google Drive.
In-Progress
[x] Code is public (only inference is supported).
[x] MoAI-7B is available for download on Hugging Face.
[x] Hugging Face README.md for simple running.
[x] Short running code for an image example is available.
[ ] Upload GPT-aided evaluation.
Official PyTorch implementation of the technical part of Mixture of All Intelligence (MoAI), which improves performance on numerous zero-shot vision language tasks. This code is developed on top of two baseline codebases: XDecoder: Generalized Decoding for Pixel, Image, and Language (accepted at CVPR 2023) and InternLM (technical paper). Please note that the current version is a combination of these two code implementations.
Citation
@misc{lee2024moai,
title={MoAI: Mixture of All Intelligence for Large Language and Vision Models},
author={Byung-Kwan Lee and Beomchan Park and Chae Won Kim and Yong Man Ro},
year={2024},
eprint={2403.07508},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Summary
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence, namely (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR without enlarging the model size or curating extra visual instruction tuning datasets.
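The blending idea in the summary can be illustrated with a deliberately small sketch. This is not MoAI's implementation: the actual MoAI-Mixer relies on attention-based expert modules with learned gating, whereas the snippet below only shows the general Mixture of Experts pattern of combining several expert outputs with softmax gate weights. The function names `softmax` and `mix_experts`, the feature values, and the gate scores are all hypothetical.

```python
import math

def softmax(scores):
    """Turn raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(expert_outputs, gate_scores):
    """Weighted sum of expert output vectors (all of the same length)."""
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]

# Hypothetical 2-dimensional outputs standing in for the three feature types.
visual = [1.0, 0.0]
auxiliary = [0.0, 1.0]
language = [1.0, 1.0]

# Equal gate scores give every expert the same weight (1/3 each).
mixed = mix_experts([visual, auxiliary, language], [0.0, 0.0, 0.0])
print(mixed)
```

In a trained model the gate scores would be produced by learned parameters rather than fixed by hand.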
Interesting Questions for Architecture Choices
Q1. Have you tried just feeding the auxiliary features into the LLM without compression? Is the compression of all auxiliary features down to just 64 tokens mainly for efficiency?
A1. From the paper "Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study", we gained the insight that feeding auxiliary information without compression leads to performance degradation. We think this is because wrong information from the external computer vision models directly damages the outputs of MoAI. External CV models, like any other models, are imperfect and cannot predict everything correctly, for many reasons. Therefore, the referenced paper assigns learnable parameters to the auxiliary information. The main purpose of compression is, of course, efficiency, but we also expect MoAI-Compressor to correct wrong information or eliminate information that is irrelevant to vision language tasks.
Q2. Have you tried just concatenating all of the features together (marked by begin/end boundary tokens, for example) instead of using cross attention in each of the "experts"?
A2. We can answer this question for two cases. Without the compressed 64 tokens, there are about 1500 auxiliary tokens on average. Training MoAI with image tokens + auxiliary tokens + language tokens therefore imposes a heavy burden, and wrong information can damage the output of MoAI. With the compressed tokens, on the other hand, simply concatenating all features as you suggest can be an appropriate infusion strategy only when the transformer decoder blocks themselves are trained, for example with supervised fine-tuning, LoRA, or QLoRA. However, we want to connect the original transformer decoder block with MoAI-Mixer and train only MoAI-Mixer, instead of directly tuning the transformer decoder, because we believe this plug-and-play infusion strategy can boost the utilization of LLMs without editing the original ones. Furthermore, our use of MoE is motivated by its effectiveness in the paper "MoE-LLaVA: Mixture of Experts for Large Vision-Language Models", which provided a key to effectively harmonizing auxiliary features with visual and language features: the self-attended and cross-attended features are expected to independently capture various aspects, compared with jointly concatenated features.
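To make the efficiency argument in A2 concrete, the following back-of-the-envelope sketch compares sequence lengths with roughly 1500 uncompressed auxiliary tokens versus the 64 compressed ones. The image and language token counts are placeholder assumptions chosen only for illustration, and the quadratic cost is the usual rough estimate for self-attention, not a measurement of MoAI.

```python
AUX_UNCOMPRESSED = 1500  # average auxiliary tokens without compression (from A2)
AUX_COMPRESSED = 64      # auxiliary tokens after compression (from Q1)

# Assumed values for illustration only; not taken from the MoAI paper or code.
IMAGE_TOKENS = 576
LANGUAGE_TOKENS = 128

def sequence_length(aux_tokens, image_tokens=IMAGE_TOKENS,
                    language_tokens=LANGUAGE_TOKENS):
    """Total tokens when image, auxiliary, and language tokens are concatenated."""
    return image_tokens + aux_tokens + language_tokens

def relative_attention_cost(length_a, length_b):
    """Ratio of self-attention cost under the rough O(n^2) scaling in length."""
    return (length_a ** 2) / (length_b ** 2)

long_seq = sequence_length(AUX_UNCOMPRESSED)   # 576 + 1500 + 128 = 2204
short_seq = sequence_length(AUX_COMPRESSED)    # 576 + 64 + 128 = 768
ratio = relative_attention_cost(long_seq, short_seq)
print(long_seq, short_seq, round(ratio, 2))
```

Under these assumed counts, the uncompressed sequence is almost three times longer, and the attention cost estimate grows by roughly the square of that factor.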
Directory Layout
.
├── asset                    # Required package lists (Important)
├── trainer                  # Training MoAI and initializing optimizer (Not Supported Now)
├── utils                    # Miscellaneous util files (Not important)
├── moai                     # MoAI architecture & loading moai (Important)
├── pipeline                 # Evaluating zero-shot vision language tasks (Important)
│
├── datasets                 # Important
│   ├── dataset_mappers      # data parsing including augmentation for loader
│   ├── evaluation           # measuring evaluation for each dataset
│   └── registration         # register dataset
│
├── configs
│   ├── accel                # Accelerate Config files (Support Deepspeed, DDP, Multinode)
│   └── moai_eval.yaml       # Evaluating MoAI
│
├── modeling                 # Not Important
│   ├── architectures        # training the prototype of moai (Not Supported Now)
│   ├── utils                # utils for modeling (Not important)
│   └── BaseModel            # loading and saving model (Important)
│
├── lbk_entry.py             # main code of control tower (Important)
├── run                      # bash file for running the evaluation (Important)
│
├── install                  # install required packages (Important)
└── README.md
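As a quick sanity check after cloning, a small script along these lines could confirm that the top-level entries marked "(Important)" in the layout are present. It is only a sketch: the helper name `missing_entries` is hypothetical, and the list is copied from the layout above rather than from any official manifest.

```python
from pathlib import Path

# Top-level entries marked "(Important)" in the directory layout.
IMPORTANT_ENTRIES = [
    "asset",
    "moai",
    "pipeline",
    "datasets",
    "lbk_entry.py",
    "run",
    "install",
]

def missing_entries(repo_root, entries=IMPORTANT_ENTRIES):
    """Return the entries that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in entries if not (root / name).exists()]

if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_entries(".")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("All important top-level entries are present.")
```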
How to Run?
You should first run the lines in the bash file 'install'.
Create a directory named 'checkpoints' in moai/sgg and place the Scene Graph Generation checkpoint there after downloading it. The checkpoint filename should be 'psgtr_r50_epoch_60.pth'.
Download the checkpoint labeled 'PSGTR' from Panoptic SGG, or download it from my Google Drive.
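The checkpoint placement described above could be scripted roughly as follows. This is a sketch, assuming it is run from the repository root; the helper names `prepare_checkpoint_dir` and `checkpoint_ready` are hypothetical, and the download itself is left as a manual step because no download URL is stated here.

```python
from pathlib import Path

CHECKPOINT_DIR = Path("moai/sgg/checkpoints")
CHECKPOINT_NAME = "psgtr_r50_epoch_60.pth"

def prepare_checkpoint_dir(repo_root="."):
    """Create moai/sgg/checkpoints and return the expected checkpoint path."""
    target_dir = Path(repo_root) / CHECKPOINT_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / CHECKPOINT_NAME

def checkpoint_ready(repo_root="."):
    """True if the checkpoint file is in place with the expected filename."""
    return (Path(repo_root) / CHECKPOINT_DIR / CHECKPOINT_NAME).is_file()

if __name__ == "__main__":
    expected = prepare_checkpoint_dir(".")
    if checkpoint_ready("."):
        print(f"Checkpoint found: {expected}")
    else:
        print(f"Place the downloaded PSGTR checkpoint at: {expected}")
```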
In the init_detector function in mmdet/apis/inference.py, lines 95-110 should be commented out for compatibility.
# if palette != 'none':
#     model.dataset_meta['palette'] = palette
# else:
#     test_dataset_cfg = copy.deepcopy(config.test_dataloader.dataset)
#     # lazy init. We only need the metainfo.
#     test_dataset_cfg['lazy_init'] = True
#     metainfo = DATASETS.build(test_dataset_cfg).metainfo
#     cfg_palette = metainfo.get('palette', None)
#     if cfg_palette is not None:
#         model.dataset_meta['palette'] = cfg_palette
#     else:
#         if 'palette' not in model.dataset_meta:
#             warnings.warn(
#                 'palette does not exist, random is used by default. '
#                 'You can also set the palette to customize.')
#             model.dataset_meta['palette'] = 'random'
In the inference_detector function in mmdet/apis/inference.py, the code from line 179 onward should be replaced with the following lines.
# build the data pipeline
data_ = test_pipeline(data_)
data_['inputs'] = data_['inputs'].unsqueeze(0)
data_['data_samples'] = [data_['data_samples']]
# forward the model
with torch.no_grad():
    results = model.test_step(data_)[0]
In mmcv/transforms/processing.py, line 388 should be commented out for compatibility.
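Commenting out a line range by hand is error-prone, so a small helper such as the one below might be useful for the mmdet and mmcv steps above. It is a sketch under assumptions: `comment_out_lines` is a hypothetical helper, the file paths in the usage comments are placeholders, and the line numbers quoted above may not match your installed versions, so inspect the lines and back up the files before modifying them.

```python
from pathlib import Path

def comment_out_lines(path, start, end=None):
    """Prefix lines start..end (1-indexed, inclusive) of a file with '# '.

    Blank lines and lines that are already comments are left unchanged.
    Returns the number of lines modified.
    """
    end = start if end is None else end
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf-8").splitlines(keepends=True)
    changed = 0
    for idx in range(start - 1, min(end, len(lines))):
        stripped = lines[idx].lstrip()
        if not stripped or stripped.startswith("#"):
            continue
        # Keep the original indentation in front of the comment marker.
        indent = lines[idx][: len(lines[idx]) - len(stripped)]
        lines[idx] = f"{indent}# {stripped}"
        changed += 1
    file_path.write_text("".join(lines), encoding="utf-8")
    return changed

# Hypothetical usage; the site-packages location depends on your environment.
# comment_out_lines("path/to/site-packages/mmdet/apis/inference.py", 95, 110)
# comment_out_lines("path/to/site-packages/mmcv/transforms/processing.py", 388)
```

The helper keeps indentation intact so the commented block remains readable and easy to restore later.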
In addition to running the lines in the bash file 'install', you should set the environment variables that specify the dataset path.
Download the MoAI model and then run the demo script.
To validate zero-shot performance on numerous datasets, run the bash file 'run'.
Note that you must change the following two parts to evaluate the dataset you want. (This is very important!)
DATASETS.TEST
qbench_dev
scienceqa_test
textvqa_val
pope_test
mme
mmbench_test or mmbench_test_cn
mm-vet
mathvista_testmini
ai2d
seed
hallusionbench
PIPELINE
QBenchPipeline
SQAPipeline
TextVQAPipeline
POPEPipeline
MMEPipeline
MMBenchPipeline
MMVetPipeline
MathVistaPipeline
AI2DPipeline
SEEDPipeline
HallusionPipeline
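Because the two settings must be changed together, it can help to keep the pairing in one place. The lookup below is a sketch: the pairing is inferred from the order and naming of the two lists above (an assumption, not an official table), `mmbench_test_cn` is assumed to use the same pipeline as `mmbench_test`, and the names `DATASET_TO_PIPELINE` and `pipeline_for` are hypothetical.

```python
# Pairing inferred from the order and naming of the two lists (an assumption).
DATASET_TO_PIPELINE = {
    "qbench_dev": "QBenchPipeline",
    "scienceqa_test": "SQAPipeline",
    "textvqa_val": "TextVQAPipeline",
    "pope_test": "POPEPipeline",
    "mme": "MMEPipeline",
    "mmbench_test": "MMBenchPipeline",
    "mmbench_test_cn": "MMBenchPipeline",
    "mm-vet": "MMVetPipeline",
    "mathvista_testmini": "MathVistaPipeline",
    "ai2d": "AI2DPipeline",
    "seed": "SEEDPipeline",
    "hallusionbench": "HallusionPipeline",
}

def pipeline_for(dataset_name):
    """Return the PIPELINE value paired with a DATASETS.TEST value."""
    try:
        return DATASET_TO_PIPELINE[dataset_name]
    except KeyError:
        known = ", ".join(sorted(DATASET_TO_PIPELINE))
        raise ValueError(
            f"Unknown dataset {dataset_name!r}; expected one of: {known}"
        ) from None

print(pipeline_for("pope_test"))  # prints POPEPipeline
```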
GPT-4 Aid Evaluation for AI2D, MM-Vet, SEED
This code will be public soon!
Download Datasets