I want to try changing liuhaotian/llava-v1.5-13b to use a different image tower instead of clip-vit-large-patch14.
After changing the vision tower, is it necessary to pretrain the MLP projection layer from scratch, or can we reuse the pretrained projector weights in the liuhaotian/llava-v1.5-13b checkpoint?
How can the vision tower be replaced?
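For context, as I understand the LLaVA codebase, the image encoder is selected by the `--vision_tower` flag of the training script, so I assume swapping it would look roughly like the stage-1 pretraining command below (abridged from memory of `scripts/v1_5/pretrain.sh` in the LLaVA repo; please correct me if the flags have changed, and the data paths are placeholders):

```shell
# Stage-1 pretraining (projector alignment); --vision_tower would point
# at the replacement image encoder instead of the CLIP ViT-L/14 default.
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path lmsys/vicuna-13b-v1.5 \
    --version plain \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --data_path /path/to/pretrain_data.json \
    --image_folder /path/to/images \
    --output_dir ./checkpoints/llava-pretrain-new-tower
```

Is changing `--vision_tower` here (and then rerunning both stages) the intended way, or can stage 1 be skipped?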
Similarly, if I want to use a different LLaMA-2 finetune of the same parameter size in place of lmsys/vicuna-13b-v1.5, which is used in the liuhaotian/llava-v1.5-13b checkpoint, can I simply swap in the new LLM without pretraining the MLP projection layer from scratch? How can the LLM be replaced?
I am also interested in whether training the MLP projector alone would be sufficient, since recent work by Apple (see paper) suggests that it is one of the most important factors.
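To make my question about reusing the projector weights concrete, here is a minimal sketch of the shape constraint as I understand it. The dimensions for the current checkpoint should be the real ones (CLIP ViT-L/14 vision features are 1024-dim, Vicuna-13B's hidden size is 5120, and llava-v1.5's mlp2x_gelu projector maps between them); the 1152-dim replacement tower is just a hypothetical example:

```python
# Whether the pretrained mm_projector can even be *loaded* after a swap
# depends on both of its interface dimensions matching.
# llava-v1.5-13b: CLIP ViT-L/14 features (1024) -> Vicuna-13B hidden (5120).
OLD = {"vision_dim": 1024, "llm_dim": 5120}

def projector_weights_reusable(old, new):
    """True iff the pretrained projector's weight shapes fit the new setup.

    Even when shapes match, the new tower or LLM has a different feature
    distribution, so I suspect stage-1 re-alignment is still needed --
    that is exactly what I am asking about.
    """
    return old["vision_dim"] == new["vision_dim"] and old["llm_dim"] == new["llm_dim"]

# A hypothetical replacement tower with 1152-dim features cannot reuse
# the checkpoint's projector weights at all:
print(projector_weights_reusable(OLD, {"vision_dim": 1152, "llm_dim": 5120}))  # False

# A different 13B LLaMA-2 finetune keeps the 5120 hidden size, so the
# weights would at least load:
print(projector_weights_reusable(OLD, {"vision_dim": 1024, "llm_dim": 5120}))  # True
```

So my question is really about the shape-compatible case: do the loaded weights transfer, or is the distribution shift large enough that projector pretraining is required anyway?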