haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Reuse the MLP projection layer, or retrain it? #1026


gameveloster commented 9 months ago

Question

I want to try changing liuhaotian/llava-v1.5-13b to use a different vision tower in place of clip-vit-large-patch14.

  1. After changing the vision tower, is it necessary to pretrain the MLP projection layer from scratch, or can we reuse the pretrained projector weights in the liuhaotian/llava-v1.5-13b checkpoint?

  2. How can the vision tower be replaced? (See the sketch after this list.)

  3. Similarly, if I want to use a different LLaMA-2 fine-tune of the same parameter size in place of lmsys/vicuna-13b-v1.5 (the LLM used in the liuhaotian/llava-v1.5-13b checkpoint), can I simply swap in the new LLM without pretraining the MLP projection layer from scratch?

  4. How can the LLM be replaced?
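
For reference, in this repo both swaps are driven by flags to the training entry point rather than by code changes: `--vision_tower` selects the image encoder and `--model_name_or_path` selects the base LLM. One caveat: as far as I can tell, the stock builder in `llava/model/multimodal_encoder/builder.py` only instantiates CLIP-style towers (a local path, or a Hub name starting with `openai` or `laion`), so anything else first needs a new tower class registered there. Below is a minimal stage-1 sketch adapted from `scripts/v1_5/pretrain.sh`; the two swapped-in model names are placeholders for illustration, not tested configurations.

```bash
# Stage 1 (projector pretraining) with swapped components, adapted from
# scripts/v1_5/pretrain.sh. --tune_mm_mlp_adapter True trains only the MLP
# projector; the vision tower and the LLM stay frozen.
# "my-org/my-llama2-13b-finetune" and the laion ViT-H tower are illustrative
# placeholders, not tested configurations.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path my-org/my-llama2-13b-finetune \
    --version plain \
    --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder ./playground/data/LLaVA-Pretrain/images \
    --vision_tower laion/CLIP-ViT-H-14-laion2B-s32B-b79K \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --bf16 True \
    --output_dir ./checkpoints/llava-custom-pretrain
```

On question 1, there is also a hard shape constraint: the v1.5-13b projector's first linear layer maps the CLIP ViT-L/14 hidden size (1024) to the Vicuna-13B hidden size (5120). A tower with a different hidden size (ViT-H outputs 1280, for example) makes those weights unusable as-is, so changing the vision tower generally means re-pretraining the projector. Swapping in a same-sized LLaMA-2 fine-tune keeps the shapes intact, though the learned alignment may still degrade without at least some retraining.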

ggcr commented 5 months ago

I am interested in knowing whether training the MLP projector alone would be sufficient, since recent work by Apple (see paper) suggests the projector is one of the most important factors.
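
Worth noting: projector-only training is exactly what stage 1 above performs (`--tune_mm_mlp_adapter True` freezes everything else). Stage 2 then loads that projector and unfreezes the LLM. A sketch of the hand-off, adapted from `scripts/v1_5/finetune.sh` and reusing the placeholder names from the stage-1 sketch:

```bash
# Stage 2 (visual instruction tuning): loads the stage-1 projector via
# --pretrain_mm_mlp_adapter, then trains the LLM and projector jointly
# while the vision tower stays frozen. Model names and checkpoint paths
# are the same placeholders as in the stage-1 sketch above.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path my-org/my-llama2-13b-finetune \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower laion/CLIP-ViT-H-14-laion2B-s32B-b79K \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-custom-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --bf16 True \
    --output_dir ./checkpoints/llava-custom-13b
```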