salesforce / ALBEF

Code for ALBEF: a new vision-language pre-training method
BSD 3-Clause "New" or "Revised" License
1.53k stars 195 forks

grad-cam code #54

Closed ziyanyang closed 2 years ago

ziyanyang commented 2 years ago

Hi,

Thank you for this amazing work! I have a question about getting the Grad-CAM for ITM. From your code, the label for positive image-text pairs should be 1 and the label for negative pairs should be 0, as shown here:

```python
itm_labels = torch.cat([torch.ones(bs, dtype=torch.long), torch.zeros(2 * bs, dtype=torch.long)], dim=0).to(image.device)
```

However, in your visualization code, the loss is calculated as:

```python
vl_output = model.itm_head(vl_embeddings)
loss = vl_output[:, 1].sum()
```

I am confused about `loss = vl_output[:,1].sum()`: won't `loss.backward()` make this ITM score smaller? I thought the ground-truth label should be 1 here, so I tried to calculate the loss as:

```python
itm_labels = torch.LongTensor([1])
loss = F.cross_entropy(output, itm_labels)
```

but the Grad-CAM obtained from this loss does not make sense. Could you explain why the loss is calculated as `loss = vl_output[:,1].sum()`? Thanks!

LiJunnan1992 commented 2 years ago

Hi, as explained in the gradcam paper (https://arxiv.org/pdf/1610.02391.pdf), the gradient is computed w.r.t the prediction score for the ground-truth class, which is vl_output[:,1] in ALBEF's ITM.
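To illustrate the point above, here is a minimal, self-contained sketch of that recipe. It is not ALBEF's actual code: `feats` stands in for the attention/activation maps ALBEF hooks into, and the small linear `head` stands in for `model.itm_head`; both are toy stand-ins. The key step is backpropagating the raw positive-class score `vl_output[:, 1]`, not a cross-entropy loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins (hypothetical shapes, not ALBEF's real tensors):
feats = torch.randn(1, 4, 7, 7, requires_grad=True)  # (batch, channels, H, W) activation maps
head = nn.Linear(4, 2)                               # 2-way ITM head: [no-match, match]

pooled = feats.mean(dim=(2, 3))                      # global average pool -> (1, 4)
vl_output = head(pooled)                             # logits, shape (1, 2)

# Backprop the *prediction score of the ground-truth class* (index 1 = match),
# exactly as the visualization code does; this fills feats.grad with
# d(score_match) / d(feats).
loss = vl_output[:, 1].sum()
loss.backward()

# Grad-CAM: channel weights = spatially averaged gradients, then a ReLU over
# the weighted sum of the activation maps.
weights = feats.grad.mean(dim=(2, 3), keepdim=True)  # (1, 4, 1, 1)
cam = torch.relu((weights * feats).sum(dim=1))       # (1, 7, 7) localization map
print(cam.shape)
```

One way to see why the cross-entropy variant gives a different map: the gradient of `F.cross_entropy` with respect to the positive logit is `p_match - 1`, a negative number, so the feature gradients are sign-flipped and scaled relative to the raw score, and the ReLU in Grad-CAM then keeps the opposite regions.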

ziyanyang commented 2 years ago

> Hi, as explained in the gradcam paper (https://arxiv.org/pdf/1610.02391.pdf), the gradient is computed w.r.t the prediction score for the ground-truth class, which is vl_output[:,1] in ALBEF's ITM.

Got it. I misunderstood gradcam. Thanks!