salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License

A quick question about visual grounding and visualizing Grad-CAM #43

Closed bellos1203 closed 2 years ago

bellos1203 commented 2 years ago

Thanks for releasing the code!

I have a quick question about visualizing Grad-CAM. Is there a particular reason for using the 3rd layer of the multimodal encoder? I tried other layers using your demo code for visualization, but the results generated from the 4th and 5th layers are quite inaccurate.

Thanks in advance :)

LiJunnan1992 commented 2 years ago

Thanks for your question! As shown in our appendix (Figure 8), we empirically find that the 3rd cross-attention layer specializes in localization.
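
For readers comparing layers, here is a minimal sketch of how a Grad-CAM heatmap can be computed from one cross-attention layer's attention map and its gradient. It is not the repository's demo code: the function name `attention_gradcam`, the array layout (heads, text tokens, 1 + image patches, with the first key assumed to be the image `[CLS]` token), and the 24x24 patch grid are all assumptions made for illustration. Extracting the attention map and gradient from a given layer is left out; the random arrays below only stand in for them.

```python
import numpy as np


def attention_gradcam(attn, grad, grid_size):
    """Grad-CAM over one cross-attention layer (hypothetical helper).

    attn, grad: arrays of shape (num_heads, num_text_tokens, 1 + grid_size**2).
    Assumption: key index 0 is the image [CLS] token, the rest are patches.
    Returns an array of shape (num_text_tokens, grid_size, grid_size).
    """
    attn = np.asarray(attn, dtype=np.float64)
    grad = np.asarray(grad, dtype=np.float64)
    if attn.shape != grad.shape:
        raise ValueError("attn and grad must have the same shape")
    num_heads, num_tokens, num_keys = attn.shape
    if num_keys != 1 + grid_size * grid_size:
        raise ValueError("expected 1 + grid_size**2 keys per text token")

    # Drop the [CLS] column so only image patches remain.
    attn_patches = attn[:, :, 1:]
    # Keep only gradients that increase the target score.
    grad_patches = np.maximum(grad[:, :, 1:], 0.0)

    # Weight attention by its positive gradient, then average over heads.
    cam = (attn_patches * grad_patches).mean(axis=0)
    return cam.reshape(num_tokens, grid_size, grid_size)


# Stand-in data: 12 heads, 4 text tokens, 24x24 patches plus [CLS].
rng = np.random.default_rng(0)
attn = rng.random((12, 4, 1 + 24 * 24))
grad = rng.standard_normal((12, 4, 1 + 24 * 24))
heatmaps = attention_gradcam(attn, grad, grid_size=24)
print(heatmaps.shape)  # (4, 24, 24)
```

The helper is layer-agnostic, so feeding it maps taken from different cross-attention layers is one way to reproduce the kind of per-layer comparison discussed in this thread.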

bellos1203 commented 2 years ago

Thanks for your kind reply! I missed the appendix.