In the attention.py demo, the get_attention_by_gradcam method takes both image_input and text_input. I want to know why text_input is the one chosen for processing. The demo is shown below.
def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
    output = model(image_input, text_input)
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()
    image_size = 256
    temp = int(np.sqrt(image_size))
    # The mask multiplies padding tokens by 0 so that they do not contribute to cams and grads.
    # Because of the text preprocessing in ALBEF and TCL, the mask has no effect here.
    mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()
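To make the mask step concrete, here is a minimal sketch of how I understand it, not code from the repo. It uses NumPy arrays as stand-ins for the torch tensors, and the sizes, the random values, and the names `side`, `gradcam` and `per_token` are all my own assumptions. It also assumes the maps are shaped (batch, heads, text tokens, side, side), so that the mask taken from text_input lines up with the text-token axis.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch, heads, tokens, side = 1, 2, 4, 3

# Stand-ins for the cross-attention map and its gradient,
# assumed to be shaped (batch, heads, text tokens, side, side).
rng = np.random.default_rng(0)
cams = rng.random((batch, heads, tokens, side, side))
grads = rng.standard_normal((batch, heads, tokens, side, side))

# Attention mask over the text tokens: 1 = real token, 0 = padding.
attention_mask = np.array([[1, 1, 1, 0]])

# Same reshape as in the demo: (batch, 1, tokens, 1, 1), so the mask
# broadcasts over the head axis and both spatial axes.
mask = attention_mask.reshape(batch, 1, -1, 1, 1)

# Grad-CAM style weighting (an assumption about the later steps):
# keep only positive gradients, then zero out the padding tokens.
gradcam = cams * np.clip(grads, 0, None) * mask

# Average over heads: one (side, side) map per text token.
per_token = gradcam.mean(axis=1)
```

In this sketch the row for the padded token is all zeros, and there is one map per text token, which is why I guess the mask has to come from text_input rather than image_input.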
I have the same question about the 'albef' attention, where the demo sets attr_name to 'text_encoder'. The demo is shown below.