haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

About projector weights #577

Open LH019 opened 10 months ago

LH019 commented 10 months ago

Question

Hi, if I choose liuhaotian/llava-v1.5-7b as my base model, which projector weights should I download? I tried to download LLaMA-2-7B-Chat, but I get an error like `Error(s) in loading state_dict for Sequential: Missing key(s) in state_dict: "0.weight", "0.bias", "2.weight", "2.bias". Unexpected key(s) in state_dict: "weight", "bias".` What should I do to solve this issue?

haotian-liu commented 10 months ago

It's this one: https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5
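A minimal way to double-check which projector architecture a downloaded `mm_projector.bin` corresponds to before finetuning is to print its keys. The sketch below assumes a local file name and PyTorch-format checkpoint; adjust the path to your download.

```python
# Minimal sketch: inspect a downloaded projector checkpoint. llava-v1.5 uses
# an mlp2x_gelu projector (Linear -> GELU -> Linear), so its state_dict should
# contain keys ending in "0.weight", "0.bias", "2.weight", "2.bias"; a
# checkpoint with only "...weight"/"...bias" is a single-linear projector and
# will raise the key-mismatch error quoted above.
import torch

state_dict = torch.load("mm_projector.bin", map_location="cpu")  # local path (illustrative)
for key, tensor in state_dict.items():
    print(key, tuple(tensor.shape))
```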

adrielkuek commented 10 months ago

Hi. In reference to this question, I'm a little confused about the selection of the projector weights. From the answer above, if we want to fine-tune from llava-v1.5-7b, the projector weights should be taken from https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5. I was under the impression that llava-v1.5-7b was pretrained from LLaMA-2 Chat rather than Vicuna v1.5? Thanks for the clarification!

haotian-liu commented 10 months ago

So there are two types of finetuning:

  1. You start with Vicuna (a pure LLM) and want to finetune it for multimodal capability (visual image reasoning). Here, you start with Vicuna, pretrain a connector (or use our pretrained one), and finetune on the visual instruction tuning data mixture, and you obtain LLaVA-v1.5. During this "finetuning" process, both the LLM and the projector are updated.
  2. You start with LLaVA-v1.5 (which already has visual capability) and want to finetune it further for a specific task. Here, you start with LLaVA-v1.5, do not worry about the projector at all because it is already there, and finetune on the task data using this script.
adrielkuek commented 10 months ago

> So there are two types of finetuning:
>
>   1. You start with Vicuna (a pure LLM) and want to finetune it for multimodal capability (visual image reasoning). Here, you start with Vicuna, pretrain a connector (or use our pretrained one), and finetune on the visual instruction tuning data mixture, and you obtain LLaVA-v1.5. During this "finetuning" process, both the LLM and the projector are updated.
>
>   2. You start with LLaVA-v1.5 (which already has visual capability) and want to finetune it further for a specific task. Here, you start with LLaVA-v1.5, do not worry about the projector at all because it is already there, and finetune on the task data using this script.

Thanks for the clarifications! Can I check, then, for (2): when finetuning on a specific task starting from LLaVA-1.5 with a custom visual instruction dataset, are we "finetuning" both the LLM (I suppose Vicuna v1.5 in this case) as well as the projector?

haotian-liu commented 10 months ago

Yes, we are finetuning both the LLM and the projector, but note that this LLM is the LLM already finetuned as part of LLaVA-1.5, not the original Vicuna.

  1. Init
    1. LLM: Vicuna
    2. proj: None
  2. Pretrain:
    1. LLM: Vicuna
    2. proj: llava-pretrain
  3. Instruction tuning (resulting model: LLaVA-1.5)
    1. LLM: Vicuna-llava-finetune
    2. proj: llava-pretrain-llava-finetune
  4. Task-specific finetuning (resulting model: finetuned LLaVA-1.5)
    1. LLM: Vicuna-llava-finetune-task-finetune
    2. proj: llava-pretrain-llava-finetune-task-finetune

Hope this makes it clear which weights are modified in each stage.
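If you want to confirm that the projector from the pretrain and instruction-tuning stages is indeed baked into the released LLaVA-1.5 checkpoint (so no separate projector download is needed for task-specific finetuning), the sketch below lists the projector tensors in the merged checkpoint. The shard pattern and key prefix are assumptions about how the checkpoint is saved, not something stated in this thread.

```python
# Minimal sketch: confirm the merged llava-v1.5-7b checkpoint already contains
# the (instruction-tuned) projector, so task finetuning does not need a
# separate mm_projector.bin. Shard pattern and key names are assumptions;
# adjust if your copy ships safetensors instead of .bin shards.
import glob
import torch

LLAVA_DIR = "llava-v1.5-7b"  # local download path (illustrative)

for shard in sorted(glob.glob(f"{LLAVA_DIR}/pytorch_model-*.bin")):
    state_dict = torch.load(shard, map_location="cpu")
    for key, tensor in state_dict.items():
        if "mm_projector" in key:
            print(shard, key, tuple(tensor.shape))

# Expected (if the assumptions hold): keys like model.mm_projector.0.weight,
# model.mm_projector.0.bias, model.mm_projector.2.weight, model.mm_projector.2.bias
```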

Road2Redemption commented 9 months ago

Hi! Thank you for your work! I was wondering what would happen if I used finetune_lora.sh starting from LLaVA-1.5. The script has the projector argument, and after reading the discussion above I realize I may have used that argument incorrectly. Are there any downsides to doing this, or does it just prevent LLaVA from performing at its best? Here is my script:

```shell
deepspeed --include="localhost:2,3" \
    --master_port='29501' \
    llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path /data6/xyc/models/llava-v1.5-7b \
    --version v1 \
    --data_path ./playground/data/train_mix.json \
    --image_folder /data6/xyc/data/baseline_data-v2/train \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter /data6/xyc/models/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-lora \
    --num_train_epochs 5 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
```
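Regarding the `--pretrain_mm_mlp_adapter` line in the script above: since llava-v1.5-7b already contains an instruction-tuned projector (see the earlier sketch in this thread), loading the pretrain-stage adapter on top of it would presumably replace that projector with the older, pretrain-only one. A hedged way to check how different the two actually are is to compare the tensors directly; the paths are taken from the script above, while the shard pattern and key matching are assumptions about the released checkpoints.

```python
# Hedged sketch: compare the projector inside llava-v1.5-7b with the
# pretrain-stage mm_projector.bin. If the tensors differ, loading the
# pretrain adapter on top of llava-v1.5-7b starts task finetuning from an
# older projector. Paths follow the script above; shard pattern and key
# names are assumptions.
import glob
import torch

LLAVA_DIR = "/data6/xyc/models/llava-v1.5-7b"
PRETRAIN_PROJ = ("/data6/xyc/models/"
                 "llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin")

# Collect the projector tensors from the merged checkpoint shards.
merged_proj = {}
for shard in sorted(glob.glob(f"{LLAVA_DIR}/pytorch_model-*.bin")):
    for key, tensor in torch.load(shard, map_location="cpu").items():
        if "mm_projector" in key:
            merged_proj[key] = tensor

pretrain_proj = torch.load(PRETRAIN_PROJ, map_location="cpu")

for key, tensor in pretrain_proj.items():
    if key in merged_proj:
        same = torch.allclose(tensor.float(), merged_proj[key].float())
        print(key, "identical" if same else "differs")
    else:
        print(key, "no matching key in the merged checkpoint (naming may differ)")
```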