rikeilong / Bay-CAT

[ECCV’24] Official Implementation for CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
Apache License 2.0

About cue aggregator #1

Open yeppp27 opened 7 months ago

yeppp27 commented 7 months ago

Hi! Thanks for your great work, which inspired us! I have some doubts about the pretrained weights of the cue aggregator. Are the parameters of the forward order block and the backward order block randomly initialized? And are the parameters of the Q-former randomly initialized, or pretrained from BLIP-2?

rikeilong commented 7 months ago

We just randomly initialize the tiny transformer network; the Q-former needs to load the pre-trained parameters from here.
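To make the split concrete, here is a minimal PyTorch sketch of that setup. This is not the repo's actual code: the module names (`order_block_fwd`, `order_block_bwd`), sizes, and the checkpoint path are all illustrative assumptions; only the idea (small transformer blocks left at random init, Q-former weights loaded from a pretrained checkpoint) comes from the answer above.

```python
import torch
import torch.nn as nn

# Tiny transformer blocks for the forward/backward order modeling:
# left at PyTorch's default random initialization, no pretrained weights.
# (d_model, nhead, num_layers here are placeholder values.)
def make_order_block():
    layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

order_block_fwd = make_order_block()
order_block_bwd = make_order_block()

# Sanity check: the randomly initialized block maps a (batch, seq, dim)
# feature sequence to the same shape.
feats = torch.randn(1, 10, 768)
out = order_block_fwd(feats)
print(out.shape)  # torch.Size([1, 10, 768])

# Q-former: in contrast, its parameters would be loaded from a
# pretrained BLIP-2 checkpoint rather than randomly initialized.
# The path and state-dict layout below are hypothetical:
# state = torch.load("blip2_pretrained.pth", map_location="cpu")
# model.qformer.load_state_dict(state["model"], strict=False)
```

The point of the contrast is that the order blocks are small enough to train from scratch on the task, while the Q-former reuses BLIP-2's vision-language alignment.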

yeppp27 commented 7 months ago

Thanks for your kind reply. I'd also like to know: why can a randomly initialized transformer capture the fusion of text embeddings and image embeddings, and why can it be directly adapted to a pretrained Q-former?

rikeilong commented 7 months ago

Our work was inspired by Video-LLaMA, which similarly loads pre-trained parameters. For text fusion, our approach builds on previous work, e.g., Learning to Answer Questions in Dynamic Audio-Visual Scenarios.

yeppp27 commented 7 months ago

Thanks a lot~