In the paper, you say "Since the original BLIP-2 models do not include checkpoints for Vicuna, we perform pre-training with Vicuna using the same procedure as BLIP-2". Does this mean InstructBLIP is trained from the second-stage model? But the second-stage model drops the Q-Former's text decoder. Is the new feed-forward layer randomly initialized, or is it initialized from the first-stage model?