gordonhu608 / MQT-LLaVA

Matryoshka Query Transformer for Large Vision-Language Models
Apache License 2.0

About grad-cam visualization #2

Open Qmagine opened 1 week ago

Qmagine commented 1 week ago

Your work is excellent, and I have seen the Grad-CAM visualization results you provided. Could you please share the code for the Grad-CAM visualization? I would be very grateful.

gordonhu608 commented 1 week ago

Thank you for your interest in our work. We referred to the Grad-CAM visualization code in this work: https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb. If you have further questions about Grad-CAM visualization, I will also be glad to help you.
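
For reference, here is a minimal sketch of that ALBEF-style Grad-CAM computation (attention map × its gradient, averaged over heads and query tokens, then reshaped to the patch grid). The names `model`, `attn_layer`, and `compute_loss` are placeholders for your own setup, not MQT-LLaVA's actual API, and the hook assumes the attention module returns its attention weights as the second output:

```python
import torch
import torch.nn.functional as F

saved = {}

def save_attn(module, inputs, output):
    # Assumes the hooked module returns (hidden_states, attn_weights, ...);
    # keep the attention map and retain its gradient for the backward pass.
    attn = output[1]
    attn.retain_grad()
    saved["attn"] = attn

def gradcam_from_attention(attn, grid_size):
    """ALBEF-style Grad-CAM: elementwise product of the attention map and its
    ReLU-ed gradient, averaged over heads and query tokens, reshaped to the
    image patch grid (24x24 = 576 patches for a 336px CLIP ViT-L/14)."""
    grads = attn.grad.clamp(min=0)          # [batch, heads, queries, patches]
    cam = (attn * grads).mean(dim=1)        # average over heads
    cam = cam.mean(dim=1)                   # average over query tokens
    cam = cam.reshape(-1, grid_size, grid_size)
    cam = F.interpolate(cam[:, None], size=(336, 336),
                        mode="bilinear", align_corners=False)
    return cam.squeeze(1).detach()

# Usage sketch (placeholder names):
# handle = attn_layer.register_forward_hook(save_attn)
# loss = compute_loss(model, image, question, answer)  # loss on answer tokens
# loss.backward()
# heatmap = gradcam_from_attention(saved["attn"], grid_size=24)
# handle.remove()
```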

Qmagine commented 1 week ago


Thank you very much for your helpful response! I had previously looked at the work you pointed to, but due to my limited coding skills, I didn't know where to start making modifications. Although I realize this is asking a lot without much effort on my part, I still want to ask whether you could share your source code. It would greatly help me understand and apply these concepts in the future. Of course, if it's inconvenient for you to share, that's perfectly fine. And thank you very much for your response :)

Oscar860601 commented 1 week ago

@gordonhu608 Following up on the Grad-CAM questions: I just reproduced the Grad-CAM visualization of LLaVA-1.5 on some public datasets, and I want to confirm three details:

  1. Which layer of the model did you use to get the results in the paper? Vision encoder layer 22? The projector layer?
  2. If the output is a sequence, how did you aggregate the Grad-CAM results from the different output tokens? By choosing some keywords, or by just averaging the gradients and activations?
  3. Did you notice any odd results in the LLaVA-1.5 Grad-CAM visualizations? In my own experiments, most of the Grad-CAM heatmaps produced by LLaVA-1.5 look noisy.

I will reproduce Grad-CAM on your brilliant work next week; I just want to make sure my pipeline is the same as yours. Thanks!

gordonhu608 commented 1 week ago

  1. For our work, we chose the cross-attention layer, so for LLaVA I assume it would be the projection layer.
  2. We tested on a QA example and computed the loss on the output tokens of the 'answer' part.
  3. I did test LLaVA's Grad-CAM results, and they are sometimes noisy. My conjecture is that each of the 576 visual tokens attends to very different image information and captures different kinds of relations.
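
To illustrate point 2, here is a hedged sketch of restricting the language-modeling loss to the answer span, so that the backward pass (and hence Grad-CAM) is driven only by the answer tokens. `model`, `input_ids`, `image`, and `answer_start` are placeholders for your own tokenized QA example; the `images=` and `output_attentions=` keyword arguments follow the usual LLaVA/Hugging Face forward signature but may differ in your codebase:

```python
import torch

def answer_only_loss(model, input_ids, image, answer_start):
    # Mask everything before the answer span with -100 so the standard
    # cross-entropy loss ignores the image/question/prompt tokens.
    labels = input_ids.clone()
    labels[:, :answer_start] = -100
    outputs = model(input_ids=input_ids,
                    images=image,
                    labels=labels,
                    output_attentions=True)  # so attention hooks see the maps
    return outputs.loss                      # loss over answer tokens only

# Usage sketch (placeholder names):
# loss = answer_only_loss(model, input_ids, image_tensor, answer_start)
# loss.backward()  # gradients now flow back to the hooked (cross-)attention layer
```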

Oscar860601 commented 1 week ago


Thanks for the explanation! Just to confirm: when you say "cross-attention layer", are you referring to the multi-head attention in the Resampler in MQT?

gordonhu608 commented 1 week ago

Yes, correct.