rxtan2 / Koala-video-llm


Regarding Attention Heatmap #5

Open rbsohee opened 1 month ago

rbsohee commented 1 month ago

Hi, thank you for this interesting work :)

I was wondering how the "attention heatmap" in the paper was drawn. If I have understood your method correctly, the learnable parameters are added only to the "Video Q-former", which cross-attends to the 32 x T queries generated by the frozen "Visual Q-former". The 32 visual queries attend to different regions of each frame, but since they are frozen, their attention maps should not have changed.
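For concreteness, here is roughly how I picture the setup, as a minimal runnable sketch with a single stand-in cross-attention layer (all names, dimensions, and the one-layer structure are my assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: B clips, T frames, 32 queries per frame, hidden dim D.
B, T, D = 2, 8, 768
visual_queries = torch.randn(B, 32 * T, D)        # frozen Visual Q-former output
video_queries = nn.Parameter(torch.randn(64, D))  # learnable Video Q-former queries

# Stand-in for one Video Q-former cross-attention layer.
cross_attn = nn.MultiheadAttention(D, num_heads=12, batch_first=True)
q = video_queries.unsqueeze(0).expand(B, -1, -1)
out, attn_weights = cross_attn(q, visual_queries, visual_queries)
# attn_weights: [B, 64, 32*T], i.e. which frozen visual tokens each learnable
# video query attends to (averaged over heads by default).
```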

It would really help if you could share the code or method you used to visualize the attention maps.

rxtan2 commented 4 weeks ago

Hi rbsohee, thank you very much for your interest in our work! I apologize for the delay; we have been busy with some deadlines.

We use a simplified method similar to attention rollout to extract the attention weights from the Video Q-former. You are right that the 32 visual queries are frozen. However, we append learnable queries that interact with the visual queries through the self-attention layers, which changes the representations of all the queries and, in turn, the attention weights. Due to the complexity of the model, we initially used this simplified version and are now evaluating better ways to extract such attention maps.

We are cleaning up and testing the script for extracting the attention maps and will release it for public use once it is ready.