Open Qmagine opened 1 week ago
Thank you for your interest in our work. We referred to the Grad-CAM visualization code in this work: https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb. If you have further questions about Grad-CAM visualization, I will also be glad to help you.
Thank you very much for your prompt response! I had previously looked at the work you linked, but due to my limited coding skills I didn't know where to start making modifications. I realize this may be asking a lot on my part, but could you share your source code? It would greatly help me understand and apply these concepts in the future. Of course, if it's inconvenient for you to share, that's perfectly fine. Thank you again for your response :)
@gordonhu608 Following up on the Grad-CAM questions: I just reproduced the Grad-CAM visualization of LLaVA 1.5 on some public datasets, and I want to confirm three details.
I will reproduce Grad-CAM on your excellent work next week; I just want to make sure my pipeline matches yours. Thanks!
1. For our work, we chose the cross-attention layer, so I assume for LLaVA it would be the projection layer.
2. We tested on a QA example and computed the loss on the output tokens of the 'answer' part.
3. I did test LLaVA's Grad-CAM results, and they are sometimes noisy. I conjecture that each of the 576 visual tokens attends to very different image information and captures some kind of relations.
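For anyone else following this thread, here is a minimal sketch of the recipe above (hook a cross-attention layer, backprop a loss from the answer tokens, weight the attention map by its gradient). The module and loss below are hypothetical stand-ins, not the authors' code; in practice you would replace `ToyCrossAttention` with the real resampler attention and `loss` with the language-model loss over the answer tokens.

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    """Hypothetical stand-in for a resampler's multi-head cross-attention."""
    def __init__(self, dim=8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, queries, image_feats):
        # (B, Q, N) attention of learned queries over N visual tokens
        attn = torch.softmax(
            self.q(queries) @ self.k(image_feats).transpose(-1, -2), dim=-1
        )
        attn.retain_grad()   # keep the gradient on this non-leaf tensor
        self.attn = attn     # save the activation for Grad-CAM
        return attn @ image_feats

def grad_cam(attn, attn_grad):
    # Grad-CAM on attention: gradient-weighted map, ReLU'd,
    # then averaged over the query dimension.
    cam = torch.relu(attn_grad * attn)   # (B, Q, N)
    return cam.mean(dim=1)               # (B, N) relevance per visual token

torch.manual_seed(0)
layer = ToyCrossAttention()
queries = torch.randn(1, 4, 8)       # learned queries
image_feats = torch.randn(1, 16, 8)  # e.g. 16 visual tokens

out = layer(queries, image_feats)
loss = out.sum()   # placeholder for the loss on the answer tokens
loss.backward()

cam = grad_cam(layer.attn.detach(), layer.attn.grad)
print(cam.shape)   # torch.Size([1, 16])
```

The resulting `(B, N)` map can then be reshaped to the visual-token grid (e.g. 24×24 for 576 tokens) and upsampled over the input image.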
Thanks for the explanation! Just to confirm: when you say "cross-attention layer", are you referring to the multi-head attention in the resampler in MQT?
Yes, correct.
Your work is excellent, and I have seen the Grad-CAM visualization results you provided. Could you please share the code for the Grad-CAM visualization? I would be very grateful.