Why does the attention demo use a language model layer to capture the model's attention?

In the attention.py demo, the get_attention_by_gradcam method takes both image_input and text_input. I want to know why text_input (and the text-side encoder) is the one used to compute the attention map. The demo code is shown below.

def get_attention_by_gradcam(self, model, tokenizer, image_path, image_input, text_input, attr_name, target_layer):
    encoder_name = getattr(model, attr_name, None)
    encoder_name.encoder.layer[target_layer].crossattention.self.save_attention = True
    output = model(image_input, text_input)
    loss = output[:, 1].sum()
    model.zero_grad()
    loss.backward()
    image_size = 256
    temp = int(np.sqrt(image_size))
    # The mask multiplies padding tokens by 0 so that they do not contribute to the cams and grads.
    # Because of the text preprocessing in ALBEF and TCL, the mask has no effect here.
    mask = text_input.attention_mask.view(text_input.attention_mask.size(0), 1, -1, 1, 1)
    grads = encoder_name.encoder.layer[target_layer].crossattention.self.get_attn_gradients()
    cams = encoder_name.encoder.layer[target_layer].crossattention.self.get_attention_map()
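
For context on what the snippet above is heading toward: the hooked module is a cross-attention layer (`...encoder.layer[target_layer].crossattention`), so its attention map presumably has one row per text token and one column per image patch. The sketch below shows how such maps and their gradients are commonly combined into per-token heatmaps over the image. It is only an illustration under stated assumptions, not the demo's actual continuation: the function name `gradcam_from_cross_attention`, the tensor shapes, the 12 heads, and the leading image [CLS] column are all hypothetical, and NumPy arrays stand in for the real tensors. The 16 x 16 grid matches `temp = int(np.sqrt(image_size))` with `image_size = 256`.

```python
import numpy as np


def gradcam_from_cross_attention(cams, grads, attention_mask, grid):
    """Combine cross-attention maps and their gradients into heatmaps.

    cams, grads: shape (batch, heads, text_len, 1 + grid * grid). Index 0 on
        the last axis is ASSUMED to be the image [CLS] token and is dropped.
    attention_mask: shape (batch, text_len), 1 for real tokens, 0 for padding.
    Returns an array of shape (batch, text_len, grid, grid).
    """
    batch, heads, text_len, _ = cams.shape
    # Broadcastable mask, mirroring the .view(batch, 1, -1, 1, 1) in the demo.
    mask = attention_mask.reshape(batch, 1, text_len, 1, 1)
    # Drop the [CLS] column and lay the patch axis out as a 2D grid.
    cams = cams[:, :, :, 1:].reshape(batch, heads, text_len, grid, grid) * mask
    # Keep only positive gradients, as is usual for Grad-CAM.
    grads = np.clip(grads[:, :, :, 1:], 0, None)
    grads = grads.reshape(batch, heads, text_len, grid, grid) * mask
    # Weight attention by gradient, then average over attention heads.
    return (cams * grads).mean(axis=1)


# Hypothetical inputs: 1 sample, 12 heads, 5 text tokens (last one is padding).
rng = np.random.default_rng(0)
grid = 16
cams = rng.random((1, 12, 5, 1 + grid * grid))
grads = rng.standard_normal((1, 12, 5, 1 + grid * grid))
attention_mask = np.array([[1, 1, 1, 1, 0]])

heatmaps = gradcam_from_cross_attention(cams, grads, attention_mask, grid)
print(heatmaps.shape)
```

Under these assumptions, the text tokens act as queries and the image patches as keys, which is why the map is read from the text-side cross-attention while still producing a heatmap over the image.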

A related question: for the 'albef' attention, the demo sets attr_name to 'text_encoder'. Why is the text encoder chosen here? The demo code is shown below.

def getAttMap(self, image_path, text):
    if self.model_name.lower() == 'albef':
        engine = ALBEF('ALBEF_4M.pth')
        model, tokenizer = engine.load_model(engine.model_id)
        image_input = engine.load_data(src_type='local', data=[image_path])[0]
        text_input = tokenizer(engine.pre_caption(text), return_tensors="pt")
        self.get_attention_by_gradcam(model, tokenizer, image_path, image_input, text_input,
                                          attr_name='text_encoder', target_layer=8)

om-ai-lab / VL-CheckList, issue #14