Open gwyong opened 1 year ago
Hi, I used BlipForConditionalGeneration from transformers for image captioning. I want to visualize, word by word, which image regions drive each generated caption token, like Grad-CAM.
I found code from ALBEF (https://github.com/salesforce/ALBEF/blob/main/visualization.ipynb), but it uses an image-text matching model, not an image captioning model.
Can you give me any hints or simple code for this?
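In case it helps anyone looking for a starting point: the ALBEF notebook's idea (weight the cross-attention maps by their gradients w.r.t. a target score) can be transplanted to BLIP's caption decoder. Below is a minimal, untested sketch against `BlipForConditionalGeneration` from `transformers`; the checkpoint name, the choice of decoder layer, and the 24×24 patch-grid assumption for the base 384px checkpoint are my assumptions, not an official recipe.

```python
# Grad-CAM-style relevance for BLIP captioning, adapted from the ALBEF
# visualization idea: relevance = ReLU(gradient * cross-attention),
# averaged over heads. This is a sketch under assumptions, not BLIP's
# official visualization code.
import torch


def relevance_map(cross_attn: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Grad-CAM-style relevance, averaged over attention heads.

    cross_attn, grad: (num_heads, text_len, num_image_tokens)
    returns:          (text_len, num_image_tokens)
    """
    cam = (grad * cross_attn).clamp(min=0)  # keep positively contributing attention
    return cam.mean(dim=0)                  # average over attention heads


def blip_caption_gradcam(image, layer: int = -1):
    """Caption an image, then backprop through the decoder's cross-attentions.

    `layer` picks which decoder layer to visualize (an assumption; ALBEF
    also hand-picks one layer). Returns (caption, per-token relevance maps).
    """
    from transformers import BlipProcessor, BlipForConditionalGeneration

    name = "Salesforce/blip-image-captioning-base"  # assumed checkpoint
    processor = BlipProcessor.from_pretrained(name)
    model = BlipForConditionalGeneration.from_pretrained(name)
    model.eval()

    # 1) Generate a caption normally.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    caption_ids = model.generate(pixel_values=pixel_values)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)

    # 2) Re-run a forward pass with the generated caption as labels so we
    #    can take gradients w.r.t. the cross-attention maps.
    image_embeds = model.vision_model(pixel_values)[0]
    outputs = model.text_decoder(
        input_ids=caption_ids,
        encoder_hidden_states=image_embeds,
        labels=caption_ids,
        output_attentions=True,
    )
    attn = outputs.cross_attentions[layer]       # (1, heads, text_len, img_tokens)
    grad = torch.autograd.grad(outputs.loss, attn)[0]

    cam = relevance_map(attn[0], grad[0])        # (text_len, img_tokens)
    # Drop the [CLS] image token and reshape patches to a square grid
    # (24x24 for ViT-B/16 at 384x384 -- an assumption about the checkpoint).
    side = int((cam.shape[-1] - 1) ** 0.5)
    return caption, cam[:, 1:].reshape(-1, side, side)
```

Each row of the returned maps can then be upsampled and overlaid on the input image for the corresponding caption token, as in the ALBEF notebook.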
Hi, can BLIP2 generate captions for images in batches?
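Yes — the `transformers` processor accepts a list of images, so batching is just chunking the image list. A rough sketch below; the checkpoint name, batch size, dtype, and device placement are assumptions you would adapt to your hardware.

```python
# Hedged sketch: batched image captioning with BLIP-2 via transformers.
# Checkpoint, batch size, fp16, and "cuda" device are assumptions.
from typing import Iterator, List


def chunks(items: List, size: int) -> Iterator[List]:
    """Yield successive fixed-size slices of a list (the last may be shorter)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def caption_in_batches(images: List, batch_size: int = 8) -> List[str]:
    """Caption a list of PIL images in batches and return one caption each."""
    import torch
    from transformers import Blip2Processor, Blip2ForConditionalGeneration

    name = "Salesforce/blip2-opt-2.7b"  # assumed checkpoint
    processor = Blip2Processor.from_pretrained(name)
    model = Blip2ForConditionalGeneration.from_pretrained(
        name, torch_dtype=torch.float16
    ).to("cuda")

    captions = []
    for batch in chunks(images, batch_size):
        inputs = processor(images=batch, return_tensors="pt").to("cuda", torch.float16)
        out = model.generate(**inputs, max_new_tokens=30)
        captions.extend(processor.batch_decode(out, skip_special_tokens=True))
    return captions
```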