yeppp27 opened this issue 7 months ago
Hi! Thanks for your great work, which has inspired us! I have some doubts about the pretrained weights of the cue aggregator. Are the parameters of the forward-order block and backward-order block randomly initialized? And are the parameters of the Q-former randomly initialized or pretrained from BLIP-2?

We randomly initialize the tiny transformer network, and the Q-former loads the pre-trained parameters from here.

Thanks for your kind reply. I would like to know why the randomly initialized transformer can capture the fusion of text embeddings and image embeddings, and why it can be directly adapted to a pretrained Q-former.

Our work was inspired by Video-LLaMA, which similarly loads pre-trained parameters. For the text fusion, our approach is modified from previous work, e.g., Learning to Answer Questions in Dynamic Audio-Visual Scenarios.

Thanks a lot~
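For concreteness, here is a minimal PyTorch sketch of the initialization scheme discussed above: the tiny transformer blocks start from random weights (PyTorch's default at construction time), while only the Q-former branch is restored from a BLIP-2-style checkpoint. The module shapes, the checkpoint filename `blip2_pretrained.pth`, the `"model"` key, and the `Qformer.` prefix are all illustrative assumptions, not the repository's actual code.

```python
import torch
import torch.nn as nn

# Tiny transformer (e.g., the forward-order / backward-order blocks):
# left with PyTorch's default random initialization.
tiny_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2,
)

# Q-former branch: a stand-in module here; the real model would use the
# actual Q-former architecture from the repo.
qformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

# Load pretrained Q-former weights from a BLIP-2-style checkpoint.
# Both the filename and the "Qformer." key prefix are assumptions.
ckpt = torch.load("blip2_pretrained.pth", map_location="cpu")
qformer_state = {
    k[len("Qformer."):]: v
    for k, v in ckpt["model"].items()
    if k.startswith("Qformer.")
}
# strict=False tolerates keys that differ between checkpoint and module.
missing, unexpected = qformer.load_state_dict(qformer_state, strict=False)
```

The point of the scheme, as described in the replies above, is simply that only the Q-former starts from pretrained BLIP-2 parameters; the tiny transformer is trained from scratch and learns the text-image fusion during training.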