haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
20.53k stars 2.27k forks source link

[Question] LLaVA Pretraining with Mixtral 8×7B #1417

Open ShawnAn-WHU opened 7 months ago

ShawnAn-WHU commented 7 months ago

Question

Does anyone have carried out the pretraining with Mixtral 8×7B? When I run the petraining script, one problem occured like the figure shown below. I just add a llava_mixtral.py to the llava/model/language_model and some necessary supplementary code. image

martinakaduc commented 7 months ago

I have not faced this issue. Can you give me the reproducing command.

ShawnAn-WHU commented 7 months ago

@martinakaduc Thank you very much for your prompt reply! Below is my pretraining script. The --model_name_or_path is the model I downloaded from HF mistralai/Mixtral-8x7B-v0.1. Despite the warnings, running this script will produce a mm_projector.bin file. When pretraining, the loss decreases from ~15 to ~6 and does not decrease any more. Can you figure out the problem?

image

martinakaduc commented 7 months ago

Have you merged my pull request about adding mixtral? If not, you can use my modified repo here: https://github.com/martinakaduc/LLaVA

My pretraining script: deepspeed llava/train/train_mem.py --deepspeed ./scripts/zero3_offload.json --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --version plain --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json --image_folder ./playground/data/LLaVA-Pretrain/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/Mixtral-pt --num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 4 --gradient_accumulation_steps 2 --evaluation_strategy "no" --save_strategy "steps" --save_steps 200 --save_total_limit 2 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 32768 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to neptune

And fine-tuning script: deepspeed llava/train/train_mem.py --deepspeed ./scripts/zero3_offload.json --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 --version mistral_instruct --data_path ./playground/data/llava_v1_5_mix665k.json --image_folder ./playground/data --vision_tower openai/clip-vit-large-patch14-336 --pretrain_mm_mlp_adapter ./checkpoints/Mixtral-pt/checkpoint-400/mm_projector.bin --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir ./checkpoints/Mixtral-sft --num_train_epochs 1 --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --gradient_accumulation_steps 8 --evaluation_strategy "no" --save_strategy "steps" --save_steps 200 --save_total_limit 2 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True --model_max_length 32768 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to neptune

ShawnAn-WHU commented 7 months ago

@martinakaduc Thank you! I will try it now and find out the problem!

accupham commented 7 months ago

Interesting, would pretraining on mixtral-8x-22b also be possible?

martinakaduc commented 7 months ago

I think it is possible. However I have not tested yet.

ShawnAn-WHU commented 7 months ago

@martinakaduc Hi, I'm using your pretrained MixSUraV model downloaded from HF to finetune on my own dataset. The script I use is like Figure 1, is it correct? If correct, I found it infeasible when using 8 3090 GPUs (24G) even with 4-bit quantification (set --bits 4, like the red rectangle in the figure). The code for model loading is like Figure 2. However, when I use the code shown in Figure 3, only 3 GPUs are more than enough (may be 1 is ok). Is there any difference between these two codes? And could you please tell me your finetuning script and computational resources needed if you have done this? Thank tou so much! image image image

fisher75 commented 7 months ago

Hi, how do you know the training was effecitve? Did you use the default training setting? I LoRA with default parameters and basically no improvement.

ShawnAn-WHU commented 7 months ago

Hi, how do you know the training was effecitve? Did you use the default training setting? I LoRA with default parameters and basically no improvement.

@fisher75 I have LoRA finetuned with my own dataset using LLaVA-v1.5 and the qualitative results are better than the original LLaVA-v1.5.

fisher75 commented 7 months ago

ShawnAn-WHU

Hi @ShawnAn-WHU thanks for your reply. I am also working on this, may I ask is the improvement is very obvious? May I see the training and inference scripts(mostly I am curious about the parameter settings), btw if possible, may I add your WeChat? Could be very helpful to share some details.

ShawnAn-WHU commented 7 months ago

@fisher75 Sure, e-mail me your WeChat ID is ok.