mbzuai-oryx / PALO

(WACV 2025) Vision-language conversation in 10 languages including English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu.
Apache License 2.0

🌍 PALO: A Polyglot Large Multimodal Model for 5B People (WACV 2025)

Hanoona Rasheed, Muhammad Maaz, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Timothy Baldwin, Michael Felsberg and Fahad Khan


📢 Latest Updates


Overview

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population).

[Figure: PALO results]

🏆 Contributions

  1. We develop PALO: the first multilingual Large Multimodal Model (LMM), capable of generating responses in 10 languages.
  2. We create an extensive multilingual instruction-tuning dataset (~2.1M instructions) by translating LLaVA-Instruct-150K.
  3. We train models at three distinct scales (1.7B, 7B, and 13B parameters) to demonstrate the scalability of our training pipeline. The models perform well on low-resource languages, e.g., Hindi, Arabic, Bengali, and Urdu, without compromising their high performance on high-resource languages, e.g., English, Chinese, French, and Spanish.

📂 PALO Multi-Lingual Dataset Access

We develop a diverse instruction set (~2.1M instructions) comprising conversations in ten languages. Specifically, 665K instructions from LLaVA-Instruct-665K are used for English, and approximately 150K conversations from LLaVA-Instruct-150K are translated into Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi, Bengali and Urdu using our proposed semi-automated translation pipeline.
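As a rough sanity check on the composition described above, the nominal counts can be tallied in a few lines of Python. This is only a sketch: it assumes exactly 150K translated conversations per language, whereas the text says "approximately 150K", so the nominal total lands slightly below the quoted ~2.1M.

```python
# Nominal instruction counts taken from the paragraph above.
ENGLISH_INSTRUCTIONS = 665_000      # LLaVA-Instruct-665K, used for English
TRANSLATED_PER_LANGUAGE = 150_000   # "approximately 150K" per language (assumed exact here)
TRANSLATED_LANGUAGES = [
    "Chinese", "French", "Spanish", "Russian", "Japanese",
    "Arabic", "Hindi", "Bengali", "Urdu",
]


def nominal_total(english: int, per_language: int, n_languages: int) -> int:
    """Return the nominal dataset size: English set plus all translated sets."""
    return english + per_language * n_languages


total = nominal_total(
    ENGLISH_INSTRUCTIONS, TRANSLATED_PER_LANGUAGE, len(TRANSLATED_LANGUAGES)
)
print(f"{total:,} instructions (~{total / 1e6:.1f}M)")
# prints: 2,015,000 instructions (~2.0M)
```

The gap between this nominal figure and ~2.1M is consistent with the per-language count being somewhat above 150K.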

📥 Download the Training Dataset: Access our multi-lingual dataset on Hugging Face: MBZUAI/palo_multilingual_dataset.
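For a scripted download, one option is the `huggingface_hub` Python package (an assumption; it is not listed in this README and may need `pip install huggingface_hub`). The sketch below fetches the whole dataset repository. The default target folder is a choice made here to match the path used in the training commands, not something the repository prescribes.

```python
from pathlib import Path

DATASET_REPO = "MBZUAI/palo_multilingual_dataset"  # repository named above


def snapshot_kwargs(local_dir: str) -> dict:
    """Build the keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": DATASET_REPO,
        "repo_type": "dataset",  # a dataset repository, not a model repository
        "local_dir": str(Path(local_dir)),
    }


def download_dataset(local_dir: str = "data/palo_multilingual_dataset") -> str:
    """Download a snapshot of the dataset and return the local folder path."""
    # Imported lazily so snapshot_kwargs() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**snapshot_kwargs(local_dir))
```

Calling `download_dataset()` from the repository root starts the download; the same helper should work for the evaluation set if `DATASET_REPO` is changed accordingly.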

We also develop a multi-lingual evaluation set to conduct a comprehensive evaluation across various languages. This set is constructed by translating the LLaVA-Bench into all target languages using GPT-4-Turbo, with particular attention to preserving linguistic authenticity and mitigating common issues of automated translations through careful human correction.

📥 Download the Evaluation Dataset: Access our multi-lingual evaluation dataset on Hugging Face: MBZUAI/multilingual-llava-bench-in-the-wild.

🧠 Model Zoo

| Model Name | HuggingFace Link |
|---|---|
| MobilePALO-1.7B | MBZUAI/MobilePALO-1.7B |
| PALO-7B | MBZUAI/PALO-7B |
| PALO-13B | MBZUAI/PALO-13B |

🔧 Installation

We recommend setting up a conda environment for the project:

conda create --name=palo python=3.10
conda activate palo

git clone https://github.com/mbzuai-oryx/PALO
cd PALO

pip install -r requirements.txt
pip install flash-attn==2.3.2

export PYTHONPATH="./:$PYTHONPATH"
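After installation, a quick check that the active environment matches the versions pinned above can save debugging time later. The following is a minimal sketch using only the Python standard library; the helper names are illustrative and not part of the PALO codebase.

```python
import sys
from importlib import metadata

EXPECTED_PYTHON = (3, 10)       # from `conda create --name=palo python=3.10`
EXPECTED_FLASH_ATTN = "2.3.2"   # from `pip install flash-attn==2.3.2`


def installed_version(package: str):
    """Return the installed version string of `package`, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if sys.version_info[:2] != EXPECTED_PYTHON:
        found_py = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found_py} found, expected 3.10")
    found = installed_version("flash-attn")
    if found != EXPECTED_FLASH_ATTN:
        problems.append(
            f"flash-attn {found or 'not installed'}, expected {EXPECTED_FLASH_ATTN}"
        )
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

Run it inside the activated `palo` environment; no output means both checks passed.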

💿 Running Demo Offline

Please follow the instructions below to run the PALO demo on your local GPU machine.

1. Launch a controller

python palo/serve/controller.py --host 0.0.0.0 --port 10000

2. Launch a Gradio web server

python palo/serve/gradio_web_server.py --controller http://localhost:10000 --model-list-mode reload

3. Launch a model worker

python palo/serve/model_worker.py --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path MBZUAI/PALO-13B

You can launch as many workers as you want, and compare between different model checkpoints in the same Gradio interface. Please keep the --controller the same, and modify the --port and --worker to a different port number for each worker.
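When serving several checkpoints, the per-worker commands differ only in the port and the model path. The sketch below generates them from a list of checkpoints, keeping the controller fixed as described above. Assigning consecutive ports starting at 40000 is a convention chosen here for illustration.

```python
import shlex

CONTROLLER_URL = "http://localhost:10000"  # same controller for every worker


def worker_command(model_path: str, port: int) -> list:
    """Build the model_worker.py command for one checkpoint on one port."""
    return [
        "python", "palo/serve/model_worker.py",
        "--host", "0.0.0.0",
        "--controller", CONTROLLER_URL,
        "--port", str(port),
        "--worker", f"http://localhost:{port}",
        "--model-path", model_path,
    ]


def worker_commands(model_paths, first_port=40000):
    """Assign consecutive ports, starting at first_port, to the given checkpoints."""
    return [worker_command(path, first_port + i) for i, path in enumerate(model_paths)]


# Print one ready-to-paste shell command per checkpoint.
for cmd in worker_commands(["MBZUAI/PALO-7B", "MBZUAI/PALO-13B"]):
    print(shlex.join(cmd))
```

Each printed line can be run in its own terminal, or passed to `subprocess.Popen` as a list.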

🚋 Training

1. Prepare data

Please download the annotations from MBZUAI/palo_multilingual_dataset, along with the images for each dataset in the layout below (COCO, GQA, OCR-VQA, TextVQA, and Visual Genome).

After downloading everything, organize the data as follows in ./playground/data:

data
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    ├── vg
    │   ├── VG_100K
    │   └── VG_100K_2
    └── palo_multilingual_dataset
        └── palo_multilingual_dataset.json

Please note that all images should be in the .jpg format.
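Before launching a long training run, it can help to verify the folder layout and the .jpg requirement programmatically. The following is a minimal sketch, assuming the directory names shown above; `check_layout` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path

# Image folders from the layout above, relative to the data root.
IMAGE_DIRS = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]
ANNOTATION_FILE = "palo_multilingual_dataset/palo_multilingual_dataset.json"


def check_layout(root: str) -> dict:
    """Report missing paths and files whose extension is not .jpg."""
    base = Path(root)
    missing = [d for d in IMAGE_DIRS if not (base / d).is_dir()]
    if not (base / ANNOTATION_FILE).is_file():
        missing.append(ANNOTATION_FILE)
    non_jpg = []
    for d in IMAGE_DIRS:
        folder = base / d
        if folder.is_dir():
            # Only the file extension is checked, not the actual image encoding.
            non_jpg += [
                str(p.relative_to(base))
                for p in folder.iterdir()
                if p.is_file() and p.suffix.lower() != ".jpg"
            ]
    return {"missing": missing, "non_jpg": sorted(non_jpg)}
```

For example, `check_layout("./playground/data")` returns a dictionary whose two lists are both empty when the layout is complete and every image file ends in .jpg.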

2. Download Pretrained Projection Weights

| Model Name | Projector Weights |
|---|---|
| MobilePALO-1.7B | MBZUAI/palo_1.7B_stage1_mm_projector |
| PALO-7B | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5 |
| PALO-13B | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5 |

3. Run Training

# For MobilePALO-1.7B
bash scripts/train/finetune_palo.sh "mtgv/MobileLLaMA-1.4B-Chat" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to palo_1.7B_stage1_mm_projector.bin> "ldpnet" "results/PALO-1.7B" "2" "2e-5"

# For PALO-7B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-7b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5.bin> "mlp2x_gelu" "results/PALO-7B" "3" "2e-4"

# For PALO-13B
bash scripts/train/finetune_lora_palo.sh "lmsys/vicuna-13b-v1.5" "data/palo_multilingual_dataset/palo_multilingual_dataset.json" <path to llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5.bin> "mlp2x_gelu" "results/PALO-13B" "3" "2e-4"

📊 Quantitative Evaluation

Please download the PALO multi-lingual evaluation data from MBZUAI/multilingual-llava-bench-in-the-wild and arrange it as follows:

data
    └── multilingual-llava-bench-in-the-wild
        ├── arabic
        │   ├── question.jsonl
        │   ├── answers.jsonl
        │   └── context.jsonl
        ├── bengali
        │   ├── question.jsonl
        │   ├── answers.jsonl
        │   └── context.jsonl
        ...
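A short check can confirm that every language folder contains the three files shown in the layout. The sketch below walks whichever language folders are present, so it does not assume any folder names beyond those listed; `missing_eval_files` is a hypothetical helper written for this illustration.

```python
from pathlib import Path

# Files expected inside each language folder, per the layout above.
REQUIRED_FILES = ("question.jsonl", "answers.jsonl", "context.jsonl")


def missing_eval_files(bench_root: str) -> dict:
    """Map each language folder to the required files it lacks."""
    report = {}
    for lang_dir in sorted(Path(bench_root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files such as a README
        absent = [name for name in REQUIRED_FILES if not (lang_dir / name).is_file()]
        if absent:
            report[lang_dir.name] = absent
    return report
```

Calling `missing_eval_files("data/multilingual-llava-bench-in-the-wild")` returns an empty dictionary when every language folder is complete.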

Use the following script to run the evaluation on all languages:

bash scripts/eval/eval_all_languages.sh <path to the trained model> <Output file name> <OpenAI API Key>
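To avoid typing the API key directly into the shell, the command can be assembled from an environment variable. The sketch below assumes the key is stored in `OPENAI_API_KEY` (a common convention, not something this README specifies) and keeps the three positional arguments in the order shown above.

```python
import os


def eval_command(model_path: str, output_name: str, env=os.environ) -> list:
    """Build the eval_all_languages.sh command, reading the key from `env`."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set OPENAI_API_KEY before running the evaluation.")
    # Positional order: trained model path, output file name, OpenAI API key.
    return ["bash", "scripts/eval/eval_all_languages.sh", model_path, output_name, api_key]
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root. The key is still handed to the script as a command-line argument, so it may remain visible in the process list while the evaluation runs.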

[Figure: PALO results]

📚 Qualitative Examples of Multilingual Capabilities

[Figures: PALO sample outputs]

📜 Citation


    @inproceedings{PALO,
        title={Palo: A Large Multilingual Multimodal Language Model},
        author={Rasheed, Hanoona and Maaz, Muhammad and Shaker, Abdelrahman and Khan, Salman and Cholakkal, Hisham and Anwer, Rao M. and Baldwin, Tim and Felsberg, Michael and Khan, Fahad S.},
        booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV 2025)},
        year={2025}
    }