hila-chefer / Transformer-MM-Explainability

[ICCV 2021- Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.

Is this really using the technique from the publication? #31

Closed: entrity closed this issue 1 year ago

entrity commented 1 year ago

The top of this repo's README links to the paper [ICCV 2021- Oral] PyTorch Implementation of Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, but I'm looking at the CLIP_explainability.ipynb notebook, and it appears to me that it does not demonstrate the technique introduced in the paper. Have I missed something? Or should this notebook be updated?

Each of the examples in the notebook does no more than produce heatmaps from the output of the helper function interpret, which uses only self-attention, whereas the publication computes self-attention relevancy with contributions from co-attention (cf. equation 11).

Here's a relevant excerpt from interpret, with comments indicating the correspondence between the Python code and the publication. Equation 11 is absent.

    R = torch.eye(num_tokens, num_tokens, dtype=image_attn_blocks[0].attn_probs.dtype).to(device) # eq 1: self-attn Relevancy map
    R = R.unsqueeze(0).expand(batch_size, num_tokens, num_tokens)
    for i, blk in enumerate(image_attn_blocks):
        if i < start_layer:
            continue
        grad = torch.autograd.grad(one_hot, [blk.attn_probs], retain_graph=True)[0].detach()
        cam = blk.attn_probs.detach() # A (attention map)
        cam = cam.reshape(-1, cam.shape[-1], cam.shape[-1])
        grad = grad.reshape(-1, grad.shape[-1], grad.shape[-1])
        cam = grad * cam # A-bar, eq 5
        cam = cam.reshape(batch_size, -1, cam.shape[-1], cam.shape[-1])
        cam = cam.clamp(min=0).mean(dim=1)
        R = R + torch.bmm(cam, R) # eq 6. It's not eq 7 b/c 7 starts from an R which is zeros, whereas this starts from an R which is identity.
hila-chefer commented 1 year ago

Hi @entrity, yes, it really does apply the technique from the paper. As mentioned in the paper, we support all attention-based architectures, including pure self-attention models (ViT or CLIP). For pure self-attention models the co-attention rules are not required; however, this is not a simple visualization of the attention maps. We weight the attention by its gradients in order to average across the attention heads, and you can also control the number of layers you wish to propagate back from.

Overall, the answer is yes: this is the method from the paper, applied to a pure self-attention model. We have similar experiments with ViT in the paper, and a notebook for that as well.
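
To make this concrete, here is a minimal, self-contained sketch of the propagation rule, with random tensors and made-up sizes standing in for the attention maps and gradients that the notebook collects from CLIP's blocks (blk.attn_probs and torch.autograd.grad):

    import torch

    torch.manual_seed(0)

    # Toy sizes only; made up for illustration, not the notebook's real dimensions.
    num_layers, num_heads, num_tokens = 4, 8, 50

    # Stand-ins for the hooked attention maps A and their gradients w.r.t. the
    # chosen logit (in the notebook: blk.attn_probs and torch.autograd.grad).
    attn = [torch.rand(num_heads, num_tokens, num_tokens).softmax(dim=-1)
            for _ in range(num_layers)]
    grads = [torch.randn(num_heads, num_tokens, num_tokens)
             for _ in range(num_layers)]

    R = torch.eye(num_tokens)                     # relevancy starts as identity
    for A, G in zip(attn, grads):
        A_bar = (G * A).clamp(min=0).mean(dim=0)  # gradient-weighted head average (eq. 5)
        R = R + A_bar @ R                         # relevancy propagation (eq. 6)

    # Relevance of every patch token w.r.t. the CLS token (first row, CLS excluded).
    patch_relevance = R[0, 1:]
    print(patch_relevance.shape)                  # torch.Size([49])

Plain attention rollout would average the heads directly; weighting the attention by its gradient before averaging is what ties the relevancy map to the chosen output.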