haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Reuse the MLP projection layer, or retrain it? #1026


gameveloster commented 9 months ago

Question

I want to try changing liuhaotian/llava-v1.5-13b to use a different vision tower in place of clip-vit-large-patch14.

  1. After changing the vision tower, is it necessary to pretrain the MLP projection layer from scratch, or can we reuse the pretrained projector weights in the liuhaotian/llava-v1.5-13b checkpoint?

  2. How can the vision tower be replaced? (See the sketch after this list.)

  3. Similarly, if I want to use a different LLaMA-2 fine-tune of the same parameter size in place of lmsys/vicuna-13b-v1.5 (the LLM used in the liuhaotian/llava-v1.5-13b checkpoint), can I simply swap in the new LLM without pretraining the MLP projection layer from scratch?

  4. How can the LLM be replaced?
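
For reference, in this repo both swaps are driven by flags to the training entry point rather than by code changes: `--vision_tower` selects the image encoder and `--model_name_or_path` selects the base LLM. One caveat: as far as I can tell, the stock builder in `llava/model/multimodal_encoder/builder.py` only instantiates CLIP-style towers (a local path, or a Hub name starting with `openai` or `laion`), so anything else first needs a new tower class registered there. Below is a minimal stage-1 sketch adapted from `scripts/v1_5/pretrain.sh`; the two swapped-in model names are placeholders for illustration, not tested configurations.

```bash
# Stage 1 (projector pretraining) with swapped components, adapted from
# scripts/v1_5/pretrain.sh. --tune_mm_mlp_adapter True trains only the MLP
# projector; the vision tower and the LLM stay frozen.
# "my-org/my-llama2-13b-finetune" and the laion ViT-H tower are illustrative
# placeholders, not tested configurations.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path my-org/my-llama2-13b-finetune \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower laion/CLIP-ViT-H-14-laion2B-s32B-b79K \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --bf16 True \
    --output_dir ./checkpoints/llava-custom-pretrain
```

On question 1, there is also a hard shape constraint: the v1.5-13b projector's first linear layer maps the CLIP ViT-L/14 hidden size (1024) to the Vicuna-13B hidden size (5120). A tower with a different hidden size (ViT-H outputs 1280, for example) makes those weights unusable as-is, so changing the vision tower generally means re-pretraining the projector. Swapping in a same-sized LLaMA-2 fine-tune keeps the shapes intact, though the learned alignment may still degrade without at least some retraining.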

ggcr commented 5 months ago

I am interested in knowing whether training the MLP projector alone would be sufficient, since recent work by Apple (see paper) suggests the projector is one of the most important factors.
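
Worth noting: projector-only training is exactly what stage 1 above performs (`--tune_mm_mlp_adapter True` freezes everything else). Stage 2 then loads that projector and unfreezes the LLM. A sketch of the hand-off, adapted from `scripts/v1_5/finetune.sh` and reusing the placeholder names from the stage-1 sketch:

```bash
# Stage 2 (visual instruction tuning): loads the stage-1 projector via
# --pretrain_mm_mlp_adapter, then trains the LLM and projector jointly
# while the vision tower stays frozen. Model names and checkpoint paths
# are the same placeholders as in the stage-1 sketch above.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path my-org/my-llama2-13b-finetune \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower laion/CLIP-ViT-H-14-laion2B-s32B-b79K \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-custom-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --bf16 True \
    --output_dir ./checkpoints/llava-custom-13b
```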