haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] About finetuning llama 2 with liuhaotian/llava-pretrain-llama-2-7b-chat #1504


llv22 commented 5 months ago

Describe the issue

Issue: I am trying to do visual instruction tuning using the pretrained projector liuhaotian/llava-pretrain-llama-2-7b-chat, but I ran into the error below. I downloaded the projector from https://huggingface.co/liuhaotian/llava-pretrain-llama-2-7b-chat to ./checkpoints/llava-pretrain-llama-2-7b-chat. According to https://github.com/haotian-liu/LLaVA/blob/main/scripts/v1_5/finetune.sh and https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md, I believe I should use meta-llama/Llama-2-7b-chat-hf during fine-tuning, but the run fails; please see the details in the Log section.

Command:

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-pretrain-llama-2-7b-chat/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-llama2-7b-finetune \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Log:

2024-05-15 11:48:42.708 ERROR train - global_exception_handler: Uncaught exception Error(s) in loading state_dict for Sequential:
    Missing key(s) in state_dict: "0.weight", "0.bias", "2.weight", "2.bias". 
    Unexpected key(s) in state_dict: "weight", "bias". 
NoneType: None
2024-05-15 11:48:42.708 ERROR train - global_exception_handler: <class 'RuntimeError'>
2024-05-15 11:48:42.708 ERROR train - global_exception_handler: <class 'RuntimeError'>
2024-05-15 11:48:42.709 ERROR train - global_exception_handler: 
      File "/data/orlando/workspace/AndroidAgentModelZoo/models/LLaVA_forward/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/data/orlando/workspace/AndroidAgentModelZoo/models/LLaVA_forward/llava/train/train.py", line 1302, in train
    model.get_model().initialize_vision_modules(
  File "/data/orlando/workspace/AndroidAgentModelZoo/models/LLaVA_forward/llava/model/llava_arch.py", line 97, in initialize_vision_modules
    self.mm_projector.load_state_dict(get_w(mm_projector_weights, 'mm_projector'))
  File "/usr/local/anaconda3/envs/agentbackend/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2153, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(

My guess is that this is caused by an inconsistency between --model_name_or_path and the base model the projector was pretrained with. However, in the projector's config the only model name I can see is the local path ./checkpoints/llama_2/llama-2-7b-chat (https://huggingface.co/liuhaotian/llava-pretrain-llama-2-7b-chat/blob/main/config.json). Could you clarify which llama2 model I should use for --model_name_or_path?
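
For what it's worth, the mismatch can be seen by inspecting the keys stored in the downloaded projector file. A minimal sketch (assuming the file sits at the path passed to --pretrain_mm_mlp_adapter above; the exact key prefix is my guess from the traceback's get_w call):

import torch

# Minimal sketch: look at what the pretrained projector checkpoint actually contains.
weights = torch.load(
    "./checkpoints/llava-pretrain-llama-2-7b-chat/mm_projector.bin",
    map_location="cpu",
)
print(list(weights.keys()))
# A single weight/bias pair corresponds to one nn.Linear. That cannot be loaded
# into the nn.Sequential built for mlp2x_gelu, which expects "0.weight", "0.bias",
# "2.weight", "2.bias" -- exactly the missing keys reported in the log above.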

PS: In my understanding, the pretraining phase focuses on language-image alignment (feature alignment), so its goal is to train a projector that maps image features into the language embedding space. With this projector, we can then fine-tune both the language and vision sides to improve task performance. My guess is that meta-llama/Llama-2-7b-chat-hf should be fine (it is the Hugging Face conversion of Meta's official llama2 release); alternatively, according to https://github.com/haotian-liu/LLaVA/blob/main/docs/LLaVA_from_LLaMA2.md, I would need to download the original llama2 checkpoints and use those (I tried this, but it failed because that format cannot be loaded by the Hugging Face API).
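
As a quick sanity check that the HF-converted weights load cleanly (unlike Meta's raw release format), here is a minimal sketch using the transformers API; it assumes you are logged in to a Hugging Face account that has been granted access to the gated meta-llama repo:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: confirm meta-llama/Llama-2-7b-chat-hf is loadable via the
# Hugging Face API (access to the gated meta-llama repo is assumed).
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
print(model.config.hidden_size)  # 4096 for the 7B model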

Current follow-up: I am now trying to use meta-llama/Llama-2-7b-chat-hf to pretrain a projector myself and then follow the fine-tuning process.

Could you clarify which language model I should use with llava-pretrain-llama-2-7b-chat/mm_projector.bin? Please correct me if anything in my description is wrong.

Really appreciate your help

Orlando

aybora commented 5 months ago

You need to change mm_projector_type to linear. mlp2x_gelu is for Vicuna.
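
For context, here is a paraphrased sketch (not the verbatim repo code) of how LLaVA builds the projector from mm_projector_type in llava/model/multimodal_projector/builder.py; it shows why the llama-2 pretrained projector, which holds a single weight/bias pair, only loads when the type is linear:

import re
import torch.nn as nn

# Paraphrased sketch of the projector construction; dimensions are the usual
# CLIP ViT-L/14 (1024) and LLaMA-2-7B (4096) hidden sizes.
def build_vision_projector(mm_hidden_size, hidden_size, projector_type="linear"):
    if projector_type == "linear":
        # state_dict keys: "weight", "bias" -> matches llava-pretrain-llama-2-7b-chat
        return nn.Linear(mm_hidden_size, hidden_size)
    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        depth = int(match.group(1))
        layers = [nn.Linear(mm_hidden_size, hidden_size)]
        for _ in range(1, depth):
            layers += [nn.GELU(), nn.Linear(hidden_size, hidden_size)]
        # state_dict keys: "0.weight", "0.bias", "2.weight", "2.bias", ...
        return nn.Sequential(*layers)
    raise ValueError(f"Unknown projector type: {projector_type}")

print(build_vision_projector(1024, 4096, "linear").state_dict().keys())
print(build_vision_projector(1024, 4096, "mlp2x_gelu").state_dict().keys())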

llv22 commented 5 months ago

@aybora So if I want to use llama2 with the mlp2x_gelu projector, do I need to train the first (pretraining) phase myself to get my own projector?