Hi, you can look at our code in LAVIS, which provides a GradCAM computation function for the BLIP image-text matching model: https://github.com/salesforce/LAVIS/blob/a9939492f8f992d03088e7575bc711089b06544a/lavis/models/blip_models/blip_image_text_matching.py#L151
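For reference, a minimal usage sketch of that function could look like the snippet below. The loader name, processor keys, tokenizer call, return values, and the block_num value are taken from memory of the LAVIS GradCAM tutorial, so treat them as assumptions and verify them against the linked file.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess
from lavis.models.blip_models.blip_image_text_matching import compute_gradcam

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP image-text matching model plus its image/text preprocessors
model, vis_processors, text_processors = load_model_and_preprocess(
    "blip_image_text_matching", "large", device=device, is_eval=True
)

raw_image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
caption = "a dog running on the beach"

img = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
txt = text_processors["eval"](caption)
txt_tokens = model.tokenizer(txt, return_tensors="pt").to(device)

# One GradCAM map per text token, aligned with the image patches
gradcam, _ = compute_gradcam(model, img, txt, txt_tokens, block_num=7)
```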
Does that mean only the image-text matching model can perform GradCAM? My model is an image captioning model (see https://huggingface.co/docs/transformers/model_doc/blip#transformers.BlipForConditionalGeneration).
If it only supports the image-text matching model, do I need to build a separate image-text matching model just for GradCAM?
You can adapt the GradCAM code to work with an image captioning model.
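One way to do that with the Hugging Face BlipForConditionalGeneration is to capture a cross-attention map and its gradient with hooks, then apply the usual GradCAM-style combination for a chosen generated token. The sketch below is a rough adaptation, not the official recipe: in particular, the "crossattention.self.dropout" module name, the output field names, and the token/patch indexing are assumptions about the current transformers BLIP implementation, so verify them before trusting the maps.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device).eval()

saved = {}

def forward_hook(module, inputs, output):
    # In eval mode the dropout is an identity, so `output` is the cross-attention map
    # with shape (batch, heads, text_len, num_image_tokens).
    saved["attn"] = output
    if output.requires_grad:  # generate() runs under no_grad, so skip hooking there
        output.register_hook(lambda grad: saved.__setitem__("grad", grad))

# Hook the last decoder layer's cross-attention (assumed module naming, see above).
cross_attn_dropouts = [
    m for n, m in model.named_modules() if n.endswith("crossattention.self.dropout")
]
handle = cross_attn_dropouts[-1].register_forward_hook(forward_hook)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image
inputs = processor(images=image, return_tensors="pt").to(device)

# Generate a caption, then re-run it through the model to get per-token logits.
caption_ids = model.generate(**inputs, max_length=30)
out = model(pixel_values=inputs["pixel_values"], input_ids=caption_ids)

# GradCAM-style map for one generated token (token_pos must be < caption length);
# loop over positions to get word-by-word maps.
token_pos = 3
token_id = caption_ids[0, token_pos]
model.zero_grad()
out.logits[0, token_pos - 1, token_id].backward()  # logits at t-1 predict the token at t

attn = saved["attn"][0, :, token_pos - 1, 1:]      # drop the global image token at index 0
grad = saved["grad"][0, :, token_pos - 1, 1:].clamp(min=0)
cam = (attn * grad).mean(dim=0).detach()           # average over attention heads
side = int(cam.numel() ** 0.5)                     # image patches form a square grid
cam = cam.reshape(side, side)
cam = F.interpolate(cam[None, None], size=image.size[::-1], mode="bilinear")[0, 0]
handle.remove()
```

The resulting map can be overlaid on the original image; repeating the backward call for each token position (with retain_graph=True, or by re-running the forward pass per token) gives per-word maps.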
Thank you, I will try it.
Hi, I am also working on visualization beyond the image-text matching model, and I've encountered some difficulties when accessing 'attn_gradients' and 'attention_map'. Have you had any success with this, and if so, can you share the code or provide some guidance? Thank you very much!
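In case it helps: in the original ALBEF/BLIP code those two attributes are only filled in when save_attention is enabled on the cross-attention module before the forward pass, and attn_gradients only exists after a backward pass, because it is written by a gradient hook. Below is a small self-contained toy that paraphrases that mechanism; ToyCrossAttention is invented purely for illustration, and the real logic lives in the models' BertSelfAttention.

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.save_attention = False
        self.attention_map = None
        self.attn_gradients = None

    def forward(self, text_states, image_states):
        scores = self.q(text_states) @ self.k(image_states).transpose(-1, -2)
        attn = (scores / text_states.size(-1) ** 0.5).softmax(-1)
        if self.save_attention:
            self.attention_map = attn                                        # read back later
            attn.register_hook(lambda g: setattr(self, "attn_gradients", g))  # filled by backward()
        return attn @ self.v(image_states)

layer = ToyCrossAttention()
layer.save_attention = True                 # forgetting this leaves both attributes None
text, image = torch.randn(1, 5, 16), torch.randn(1, 10, 16)
out = layer(text, image)
out.sum().backward()                        # populates attn_gradients via the hook
print(layer.attention_map.shape, layer.attn_gradients.shape)
```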
Sure, if I solve it, I will let you know.
Did you manage to solve this?
Hi, I used BlipForConditionalGeneration from transformers for image captioning. I want to visualize, word by word, which image regions the generated caption is grounded in, similar to GradCAM.
I found code from ALBEF (https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb), but it uses an image-text matching model, not an image captioning model.
Can you give me any hints or simple code for this?
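For the word-by-word part specifically, the aggregation used in that ALBEF notebook boils down to a small amount of tensor math once you have the cross-attention maps and their gradients, however you obtain them. Here is a sketch with a hypothetical helper name; the shapes are assumed to be (batch, heads, text_len, num_image_tokens) with a square patch grid, so adjust patch_grid to your image and patch size.

```python
import torch

def per_token_gradcam(cams, grads, patch_grid=24):
    """One heatmap per text token from cross-attention maps and their gradients."""
    cams = cams[..., 1:]                      # drop the global image token at index 0
    grads = grads[..., 1:].clamp(min=0)       # keep only positive gradients, as in GradCAM
    gradcam = (cams * grads).mean(dim=1)      # average over attention heads
    # reshape the patch dimension into a 2D grid: (batch, text_len, grid, grid)
    return gradcam.reshape(gradcam.size(0), gradcam.size(1), patch_grid, patch_grid)
```

Each per-token map can then be upsampled to the image resolution and overlaid on the input image, one heatmap per generated word.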