Open ShawnAn-WHU opened 7 months ago
I have not faced this issue. Can you give me the command to reproduce it?
@martinakaduc Thank you very much for your prompt reply! Below is my pretraining script. The --model_name_or_path is the model I downloaded from HF mistralai/Mixtral-8x7B-v0.1. Despite the warnings, running this script will produce a mm_projector.bin file. When pretraining, the loss decreases from ~15 to ~6 and does not decrease any more. Can you figure out the problem?
Have you merged my pull request that adds Mixtral support? If not, you can use my modified repo here: https://github.com/martinakaduc/LLaVA
My pretraining script:
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/Mixtral-pt \
    --num_train_epochs 1 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 1e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to neptune
And fine-tuning script:
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --lora_enable True \
    --lora_r 128 \
    --lora_alpha 256 \
    --mm_projector_lr 2e-5 \
    --model_name_or_path mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --version mistral_instruct \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/Mixtral-pt/checkpoint-400/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/Mixtral-sft \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 200 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to neptune
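When comparing the two scripts above with other setups, it can help to work out the effective (global) batch size, which is the per-device batch size × gradient accumulation steps × number of GPUs. The sketch below is only illustrative: the GPU count is not stated anywhere in this thread, so `N_GPUS = 8` is purely an assumption, and `global_batch_size` is a hypothetical helper, not part of LLaVA.

```python
# Effective batch size for the two scripts above.
# ASSUMPTION: 8 GPUs. The thread does not say how many GPUs were used.
N_GPUS = 8


def global_batch_size(per_device: int, grad_accum: int, n_gpus: int) -> int:
    """Samples contributing to one optimizer step across all GPUs."""
    return per_device * grad_accum * n_gpus


# Values taken from --per_device_train_batch_size and
# --gradient_accumulation_steps in the two commands.
pretrain_batch = global_batch_size(per_device=32, grad_accum=2, n_gpus=N_GPUS)
finetune_batch = global_batch_size(per_device=4, grad_accum=8, n_gpus=N_GPUS)

if __name__ == "__main__":
    print(f"pretraining global batch size: {pretrain_batch}")
    print(f"fine-tuning global batch size: {finetune_batch}")
```

If the number of GPUs differs, changing `N_GPUS` (or adjusting the accumulation steps) shows how far the effective batch size moves from the one the learning rate was chosen for.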
@martinakaduc Thank you! I will try it now and find out the problem!
Interesting, would pretraining on Mixtral-8x22B also be possible?
I think it is possible. However, I have not tested it yet.
@martinakaduc Hi, I'm using your pretrained MixSUraV model downloaded from HF to fine-tune on my own dataset. Is the script shown in Figure 1 correct? If so, I found it infeasible on eight 3090 GPUs (24 GB each), even with 4-bit quantization (setting --bits 4, marked by the red rectangle in the figure). The model-loading code is shown in Figure 2. However, when I use the code shown in Figure 3, 3 GPUs are more than enough (maybe even 1 would do). Is there any difference between these two pieces of code? If you have done this yourself, could you please share your fine-tuning script and the computational resources needed? Thank you so much!
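As a rough sanity check on the memory question above, here is a weights-only estimate. It is a sketch under stated assumptions, not a diagnosis of the code in the figures: it assumes the model has roughly as many parameters as Mixtral 8x7B (about 46.7B in total), and it ignores activations, LoRA and optimizer state, CUDA context, and quantization metadata. `weights_gb` and `per_gpu_gb` are hypothetical helpers written for this illustration.

```python
# Rough weights-only memory estimate. Ignores activations, LoRA/optimizer
# state, CUDA context, and quantization metadata.

# ASSUMPTION: approximate total parameter count of Mixtral 8x7B.
MIXTRAL_8X7B_PARAMS = 46.7e9


def weights_gb(n_params: float, bits_per_param: int) -> float:
    """Size of the weights in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def per_gpu_gb(total_gb: float, n_gpus: int, sharded: bool) -> float:
    """Weights held by each GPU: an even split if sharded, a full copy otherwise."""
    return total_gb / n_gpus if sharded else total_gb


if __name__ == "__main__":
    for bits in (16, 4):
        total = weights_gb(MIXTRAL_8X7B_PARAMS, bits)
        replicated = per_gpu_gb(total, 8, sharded=False)
        split = per_gpu_gb(total, 8, sharded=True)
        print(f"{bits}-bit: total {total:.2f} GB, "
              f"full copy per GPU {replicated:.2f} GB, "
              f"split over 8 GPUs {split:.2f} GB each")
```

Under these assumptions the 4-bit weights alone come to about 23 GB, which nearly fills a single 24 GB card if every GPU holds a full copy, but is only a few GB per card if the weights are split across GPUs. Whether either loading path actually replicates or splits the weights depends on the code in the figures, which is not reproduced here.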
Hi, how do you know the training was effective? Did you use the default training settings? I ran LoRA fine-tuning with the default parameters and saw basically no improvement.
@fisher75 I have LoRA fine-tuned LLaVA-v1.5 on my own dataset, and the qualitative results are better than those of the original LLaVA-v1.5.
Hi @ShawnAn-WHU, thanks for your reply. I am also working on this. May I ask whether the improvement is very obvious? Could I see the training and inference scripts (I am mostly curious about the parameter settings)? By the way, if possible, may I add you on WeChat? Sharing some details could be very helpful.
@fisher75 Sure, just e-mail me your WeChat ID.
Question
Has anyone carried out pretraining with Mixtral 8×7B? When I run the pretraining script, a problem occurs, as shown in the figure below. I just added a llava_mixtral.py to llava/model/language_model, along with some necessary supplementary code.