Closed ziyanyang closed 2 years ago
Hi, as explained in the Grad-CAM paper (https://arxiv.org/pdf/1610.02391.pdf), the gradient is computed w.r.t. the prediction score for the ground-truth class, which is `vl_output[:, 1]` in ALBEF's ITM head.
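To make that concrete, here is a minimal NumPy sketch of the Grad-CAM recipe (toy shapes and a hypothetical linear head, not ALBEF's actual code): the gradients of the chosen class score are average-pooled over the spatial dimensions to get per-channel weights, and the CAM is the ReLU of the weighted sum of feature maps. With a linear head on globally pooled features, the gradient can even be written down analytically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: an 8-channel 4x4 feature map and a
# linear 2-way ITM head acting on globally averaged features.
C, H, W = 8, 4, 4
A = rng.standard_normal((C, H, W))   # "feature maps"
head = rng.standard_normal((2, C))   # head weights: row 0 = not-matched, row 1 = matched

# The Grad-CAM objective is the raw class score, like vl_output[:, 1]:
# score = head[1] . global_average_pool(A)
pooled = A.mean(axis=(1, 2))
score = head[1] @ pooled

# For this linear head, d(score)/dA[c, i, j] = head[1, c] / (H * W),
# so backprop would fill the gradient map with that constant per channel.
grads = np.broadcast_to(head[1][:, None, None] / (H * W), A.shape)

# Grad-CAM: spatially average the gradients to get channel weights alpha_c,
# then ReLU the alpha-weighted sum of feature maps.
alphas = grads.mean(axis=(1, 2))
cam = np.maximum((alphas[:, None, None] * A).sum(axis=0), 0.0)
```

In the real visualization code the gradient comes from `loss.backward()` through the cross-attention maps rather than this closed form, but the pooling-and-ReLU step is the same.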
Got it. I misunderstood Grad-CAM. Thanks!
Hi,
Thank you for this amazing work! I have a question about getting the Grad-CAM for ITM. From your code, the label for positive image-text pairs should be 1 and the label for negative pairs should be 0, as shown here:

```python
itm_labels = torch.cat([torch.ones(bs, dtype=torch.long), torch.zeros(2 * bs, dtype=torch.long)], dim=0).to(image.device)
```

However, in your visualization code, the loss is calculated as:

```python
vl_output = model.itm_head(vl_embeddings)
loss = vl_output[:, 1].sum()
```

I feel confused about `loss = vl_output[:, 1].sum()`: won't `loss.backward()` make this ITM score smaller? I thought the ground-truth label should be 1 here, so I tried to calculate the loss as:

```python
itm_labels = torch.LongTensor([1])
loss = F.cross_entropy(vl_output, itm_labels)
```

but the Grad-CAM obtained from this loss does not make sense. Could you explain why the loss is calculated as `loss = vl_output[:, 1].sum()`? Thanks!
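For what it's worth, the difference between the two objectives can be seen analytically on a toy linear head (a hypothetical sketch, not ALBEF's code). Backpropagating the raw positive logit gives a gradient proportional to the head's "matched" weights, whereas backpropagating cross-entropy gives `softmax - onehot`, which flips the sign on the positive-class component and mixes in the negative-class weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-logit ITM head on a pooled feature vector x.
C = 8
x = rng.standard_normal(C)
head = rng.standard_normal((2, C))

logits = head @ x                    # [not-matched, matched]
p = np.exp(logits - logits.max())
p /= p.sum()                         # softmax probabilities

# Grad-CAM objective (the raw matched logit): d(logit_1)/dx = head[1].
grad_logit = head[1]

# Cross-entropy with label 1: dCE/dz = p - onehot(1), chained through head,
# so the head[1] component is scaled by (p[1] - 1) < 0 and head[0] leaks in.
grad_ce = p[0] * head[0] + (p[1] - 1.0) * head[1]
```

So the cross-entropy gradient is not just a rescaled version of the logit gradient: its positive-class component points the other way, and it vanishes as `p[1] -> 1` on confidently matched pairs, which is consistent with the CE-based Grad-CAM looking wrong.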