salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

BLIP Image Captioning GradCAM? #155

Open · gwyong opened this issue 1 year ago

gwyong commented 1 year ago

Hi, I used BlipForConditionalGeneration from transformers for image captioning. I want to visualize the reasoning behind the generated caption (word by word), like GradCAM.

I found code from ALBEF (https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb), but it uses an image-text matching model, not an image captioning model.

Can you give me any hints or simple code for this?

LiJunnan1992 commented 1 year ago

Hi, you can look at our code in LAVIS, which provides a GradCAM computation function for the BLIP image-text matching model: https://github.com/salesforce/LAVIS/blob/a9939492f8f992d03088e7575bc711089b06544a/lavis/models/blip_models/blip_image_text_matching.py#L151
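
At its core, that recipe backpropagates the ITM "match" logit and then weights the saved cross-attention map of one layer by the positive part of its gradient, averaged over heads. A minimal, model-agnostic sketch of that weighting step (the function name and tensor shapes below are illustrative, not the LAVIS API):

```python
import torch

def gradcam_from_cross_attention(attn_map, attn_grad, token_mask=None):
    """GradCAM-style weighting over cross-attention (sketch).

    attn_map, attn_grad: (batch, heads, text_tokens, image_tokens)
        saved cross-attention probabilities and their gradients w.r.t. the
        scalar that was backpropagated (the ITM "match" logit).
    token_mask: (batch, text_tokens), 1 for real tokens, 0 for padding.
    Returns: (batch, text_tokens, image_tokens) relevance map.
    """
    cam = attn_map * attn_grad.clamp(min=0)   # keep only positively contributing attention
    cam = cam.mean(dim=1)                     # average over attention heads
    if token_mask is not None:
        cam = cam * token_mask.unsqueeze(-1)  # zero out padding tokens
    return cam
```

For visualization, the first image token ([CLS]) is usually dropped and the remaining patch scores reshaped to the ViT grid (e.g. 24x24 for 384-pixel inputs with 16-pixel patches) before upsampling over the image.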

gwyong commented 1 year ago

Does this mean that only the image-text matching model can perform GradCAM? My model is an image captioning model (see https://huggingface.co/docs/transformers/model_doc/blip#transformers.BlipForConditionalGeneration).

If it only supports the image-text matching model, do I need to set up a separate image-text matching model for GradCAM?

LiJunnan1992 commented 1 year ago

You can adapt the GradCAM code to work with an image captioning model.
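
The main change is the scalar you backpropagate: for image-text matching it is the "match" logit, while for captioning you can run one teacher-forced forward pass with the already-generated caption and backpropagate the logit of each generated token in turn, reusing the same attention-times-gradient weighting per token. A rough sketch, assuming `model` returns per-token logits in `outputs.logits` and `recorder` exposes one cross-attention layer's saved map and gradient (both accessors are hypothetical placeholders, e.g. filled via forward/backward hooks or the repo's save_attention helpers):

```python
import torch

def per_token_gradcam(model, pixel_values, caption_ids, recorder):
    """Compute one GradCAM map per generated caption token (sketch).

    caption_ids: (1, seq_len) token ids of the caption produced by generate().
    recorder:    hypothetical object exposing .attn_map and .attn_grad
                 (batch, heads, text_tokens, image_tokens) after forward/backward.
    """
    # Teacher-forced pass: feed the generated caption back in so every token
    # gets a logit, and gradients can flow from it to the cross-attention maps.
    outputs = model(pixel_values=pixel_values, input_ids=caption_ids)
    logits = outputs.logits  # assumed shape: (1, seq_len, vocab_size)

    cams = []
    for t in range(1, caption_ids.shape[1]):
        model.zero_grad()
        # Logit of the token that was actually generated at position t
        # (a causal decoder predicts position t from position t - 1).
        score = logits[0, t - 1, caption_ids[0, t]]
        score.backward(retain_graph=True)
        cam = recorder.attn_map * recorder.attn_grad.clamp(min=0)  # GradCAM weighting
        cams.append(cam.mean(dim=1)[0, t - 1])  # (image_tokens,) relevance for token t
    return torch.stack(cams)
```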

gwyong commented 1 year ago

Thank you, I will try it.

Michi-3000 commented 1 year ago

Hi, I am also working on visualization that goes beyond the image-text matching model, and I've encountered some difficulties when calling 'attn_gradients' and 'attention_map'. Have you had any success with this? If so, could you share the code or provide some guidance? Thank you very much!
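
One likely cause: those helpers only store anything when `save_attention` is enabled on the chosen cross-attention layer before the forward pass, and `attn_gradients` only exists after a backward pass. A short sketch of that setup, assuming this repo's blip_decoder with the ALBEF-style helpers in models/med.py (the attribute path below is an assumption; print the module tree of your own model to confirm it, and note that the Hugging Face BLIP port is organized differently):

```python
# Assumption: `model` is this repo's captioning model (models/blip.py, blip_decoder),
# whose text decoder uses the ALBEF-style attention helpers from models/med.py.
block_num = 8  # which decoder layer's cross-attention to visualize (arbitrary choice)
cross_attn = model.text_decoder.bert.encoder.layer[block_num].crossattention.self

cross_attn.save_attention = True  # ask the layer to store its attention probabilities
                                  # in forward and hook their gradient in backward

# ... run a forward pass producing a scalar (e.g. one token's logit), then backward() ...

attn_map = cross_attn.get_attention_map()    # saved during forward when save_attention is True
attn_grad = cross_attn.get_attn_gradients()  # filled by the hook during backward
```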

gwyong commented 1 year ago

Sure, if I solve it, I will let you know.

dip9811111 commented 1 year ago

> Sure, if I solve it, I will let you know.

Did you manage to solve this?