irthomasthomas / undecidability


MoAI/README.md at master · ByungKwanLee/MoAI #722

Open irthomasthomas opened 2 months ago

irthomasthomas commented 2 months ago

MoAI/README.md at master · ByungKwanLee/MoAI

Description

MoAI: Mixture of All Intelligence for Large Language and Vision Models

📰 News


🎨 In-Progress

Official PyTorch implementation of the technical components of Mixture of All Intelligence (MoAI), which improves performance on numerous zero-shot vision-language tasks. The code is built on two baseline codebases: XDecoder (Generalized Decoding for Pixel, Image, and Language, accepted at CVPR 2023) and InternLM (technical report). Please note that the current version combines these two implementations.

📖 Citation

@misc{lee2024moai,
      title={MoAI: Mixture of All Intelligence for Large Language and Vision Models}, 
      author={Byung-Kwan Lee and Beomchan Park and Chae Won Kim and Yong Man Ro},
      year={2024},
      eprint={2403.07508},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

🏝️ Summary

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence, (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
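
To make the mixing idea concrete, here is a minimal, purely illustrative sketch in PyTorch; the class name ToyMoAIMixer, the tensor shapes, and the soft gating scheme are assumptions for illustration, not the authors' implementation:

# Illustrative sketch only; the class, shapes, and gating scheme are assumptions,
# not the official MoAI code.
import torch
import torch.nn as nn

class ToyMoAIMixer(nn.Module):
    """Blend visual, auxiliary (external CV), and language features with a soft MoE-style gate."""
    def __init__(self, dim: int):
        super().__init__()
        # One lightweight "expert" projection per intelligence type.
        self.experts = nn.ModuleDict({
            "visual": nn.Linear(dim, dim),
            "auxiliary": nn.Linear(dim, dim),
            "language": nn.Linear(dim, dim),
        })
        self.gate = nn.Linear(dim, 3)  # per-token mixing weights over the three experts

    def forward(self, visual, auxiliary, language):
        # All inputs: (batch, seq, dim). The auxiliary stream stands in for verbalized
        # CV outputs that a compressor has already aligned and condensed to this shape.
        weights = torch.softmax(self.gate(language), dim=-1)   # (batch, seq, 3)
        return (weights[..., 0:1] * self.experts["visual"](visual)
                + weights[..., 1:2] * self.experts["auxiliary"](auxiliary)
                + weights[..., 2:3] * self.experts["language"](language))

# Example with random tensors
mixer = ToyMoAIMixer(dim=64)
v, a, l = (torch.randn(2, 16, 64) for _ in range(3))
print(mixer(v, a, l).shape)  # torch.Size([2, 16, 64])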

🚀 Highlights

Comparison of scores and accuracies on numerous VL benchmarks for various open-source and closed-source LLVMs versus MoAI

Overview of MoAI architecture

Illustration of zero-shot vision-language performance

Download MoAI-7B

| Model | Q-Bench | SQA-IMG | TextVQA | POPE | MME-P | MME-C | MM-Bench | MMB-CN | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|
| InstructBLIP-7B | 56.7 | 49.2 | 60.5 | 50.1 | - | - | 36.0 | 23.7 | 25.6 |
| Qwen-VL-7B | 59.4 | 67.1 | 63.8 | - | - | - | 38.2 | 7.4 | - |
| LLaVA1.5-7B | 58.7 | 66.8 | 58.2 | 85.9 | 1511 | 294 | 64.3 | 58.3 | 30.5 |
| MoAI-7B | 70.2 | 83.5 | 67.8 | 87.1 | 1714 | 561 | 79.3 | 76.5 | 43.7 |

Interesting Questions for Architecture Choices

📂 Directory Layout

    .
    ├── asset                           # Required package lists (Important)
    ├── trainer                         # Training MoAI and initializing optimizer (not supported yet)
    ├── utils                           # Miscellaneous util files (Not important)
    ├── moai                            # MoAI architecture & loading moai (Important)
    ├── pipeline                        # Evaluating zero-shot vision language tasks (Important)
    │
    ├── datasets                        # Important
    │   ├── dataset_mappers             # data parsing including augmentation for loader
    │   ├── evaluation                  # measuring evaluation for each dataset
    │   └── registration                # register dataset
    │
    ├── configs
    │   ├── accel                       # Accelerate config files (supports DeepSpeed, DDP, multi-node)
    │   └── moai_eval.yaml              # Evaluating MoAI
    │
    ├── modeling                        # Not Important
    │   ├── architectures               # training the prototype of moai (not supported yet)
    │   ├── utils                       # utils for modeling (Not important)
    │   └── BaseModel                   # loading and saving model (Important)
    │
    ├── lbk_entry.py                    # main code of control tower (Important)
    ├── run                             # bash file for running the evaluation (Important)
    │
    ├── install                         # install required packages (Important)
    └── README.md

💡 How to Run?

In the bash file under install, you should first run the following lines.

conda create -n moai python=3.9
conda activate moai
conda clean -a && pip cache purge
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
pip install flash-attn --no-build-isolation
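
A quick post-install sanity check (an illustrative sketch, not part of the repository) can confirm that the pinned PyTorch build sees the GPU before proceeding:

# Post-install sanity check (illustrative; not part of the official repo).
import torch

print(torch.__version__)          # expected: 2.0.1 per the pinned install above
print(torch.cuda.is_available())  # should be True for the CUDA 11.8 build
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))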

In addition, set the following environment variables to point to your dataset paths.

export DETECTRON2_DATASETS=/path/to/dataset
export DATASET=/path/to/dataset
export DATASET2=/path/to/dataset
export VLDATASET=/path/to/dataset
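
The dataset loaders presumably read these variables from the process environment; a minimal check (an illustrative sketch using only the variable names listed above) is:

# Verify the dataset environment variables are set and point to real directories (illustrative).
import os

for var in ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET"):
    path = os.environ.get(var)
    assert path and os.path.isdir(path), f"{var} is unset or not a directory: {path}"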

Create a directory named 'checkpoints' under moai/sgg and place the downloaded Scene Graph Generation checkpoint there; the checkpoint file should be named 'psgtr_r50_epoch_60.pth'.

Download the checkpoint labeled 'PSGTR' from the Panoptic SGG repository, or download it from the author's Google Drive.
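
As a quick check that the checkpoint landed where the instructions above expect it (a sketch assuming the repository root as the working directory):

# Confirm the SGG checkpoint is in place (illustrative sketch).
from pathlib import Path

ckpt = Path("moai/sgg/checkpoints/psgtr_r50_epoch_60.pth")
assert ckpt.is_file(), f"missing SGG checkpoint: {ckpt}"
print(f"{ckpt} ({ckpt.stat().st_size / 1e6:.1f} MB)")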

In the init_detector function in mmdet/apis/inference.py, lines 95-110 should be commented out for compatibility.

# if palette != 'none':
#     model.dataset_meta['palette'] = palette
# else:
#     test_dataset_cfg = copy.deepcopy(config.test_dataloader.dataset)
#     # lazy init. We only need the metainfo.
#     test_dataset_cfg['lazy_init'] = True
#     metainfo = DATASETS.build(test_dataset_cfg).metainfo
#     cfg_palette = metainfo.get('palette', None)
#     if cfg_palette is not None:
#         model.dataset_meta['palette'] = cfg_palette
#     else:
#         if 'palette' not in model.dataset_meta:
#             warnings.warn(
#                 'palette does not exist, random is used by default. '
#                 'You can also set the palette to customize.')
#             model.dataset_meta['palette'] = 'random'

In the inference_detector function in mmdet/apis/inference.py, lines 179 onward should be replaced with the following lines.

# build the data pipeline
data_ = test_pipeline(data_)

data_['inputs'] = data_['inputs'].unsqueeze(0)
data_['data_samples'] = [data_['data_samples']]

# forward the model
with torch.no_grad():
    results = model.test_step(data_)[0]

In mmcv/transforms/processing.py, line 388 should be commented out for compatibility.

# results['img_shape'] = padded_img.shape[:2]

Download the MoAI model and then run the demo script:

"""
MoAI-7B

Simple Six Steps
"""

# [1] Loading Image
from PIL import Image
from torchvision.transforms import Resize
from torchvision.transforms.functional import pil_to_tensor
image_path = "figures/moai_mystery.png"
image = Resize(size=(490, 490), antialias=False)(pil_to_tensor(Image.open(image_path)))

# [2] Instruction Prompt
prompt = "Describe this image in detail."

# [3] Loading MoAI
from moai.load_moai import prepare_moai
moai_model, moai_processor, seg_model, seg_processor, od_model, od_processor, sgg_model, ocr_model \
    = prepare_moai(moai_path='/mnt/ssd/lbk-cvpr/MoAI/final', bits=4, grad_ckpt=False, lora=False, dtype='fp16')

# [4] Pre-processing for MoAI
moai_inputs = moai_model.demo_process(image=image, 
                                    prompt=prompt, 
                                    processor=moai_processor,
                                    seg_model=seg_model,
                                    seg_processor=seg_processor,
                                    od_model=od_model,
                                    od_processor=od_processor,
                                    sgg_model=sgg_model,
                                    ocr_model=ocr_model,
                                    device='cuda:0')

# [5] Generate
import torch
with torch.inference_mode():
    generate_ids = moai_model.generate(**moai_inputs, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=256, use_cache=True)

# [6] Decoding
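# keep only the text before the first '[U' marker in the decoded output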
answer = moai_processor.batch_decode(generate_ids, skip_special_tokens=True)[0].split('[U')[0]
print(answer)
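
If deterministic output is preferred over sampling, step [5] can presumably be run with greedy decoding instead; this is an illustrative variant that assumes moai_model.generate follows HuggingFace-style generation arguments, as the call above suggests.

# Deterministic (greedy) variant of step [5]; illustrative, assuming HuggingFace-style generate arguments.
with torch.inference_mode():
    generate_ids = moai_model.generate(**moai_inputs, do_sample=False, max_new_tokens=256, use_cache=True)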

If you want to validate zero-shot performance on numerous datasets, run the bash file 'run'.

GPU_DEVICE="0,1,2,3,4,5"
length=${#GPU_DEVICE}          # character length of the GPU_DEVICE string
n_gpu=$(((length+1)/2))        # number of GPUs in the comma-separated list
main_port=10000
test_batch=1                   # (required)

CUDA_VISIBLE_DEVICES=$GPU_DEVICE \
accelerate launch --config_file configs/accel/ddp_accel.yaml \
    --num_processes=$n_gpu \
    --main_process_port=$main_port \
    lbk_entry.py eval \
    --conf_files configs/moai_eval.yaml \
    --overrides \
    WANDB False \
    DATASETS.TEST mme \
    PIPELINE MMEPipeline \
    MME.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SCIENCEQA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    POPE.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMVET.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    AI2D.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    HALLUSIONBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MATHVISTA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    QBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SEED.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SAVE_DIR /path/to/MoAI_DIR \
    WEIGHT True \
    RESUME_FROM /path/to/MoAI_WEIGHT

Note that you must change the following two settings to evaluate the dataset you want (this is very important!):

DATASETS.TEST

PIPELINE

GPT-4-Aided Evaluation for AI2D, MM-Vet, and SEED

This code will be made public soon!

πŸ… Download Datasets

Suggested labels

{'label-name': 'zero-shot-tasks', 'label-description': 'Keywords related to zero-shot vision language tasks and evaluations.', 'confidence': 61.64}

irthomasthomas commented 2 months ago

Related content

- #628 (similarity score: 0.9)
- #184 (similarity score: 0.89)
- #706 (similarity score: 0.86)
- #383 (similarity score: 0.86)
- #554 (similarity score: 0.86)
- #494 (similarity score: 0.85)