Closed: KP-Zhang closed this issue 3 years ago
Hi @KP-Zhang, thanks for your interest in our work!
As I understand it (and please correct me if I'm wrong), not only is the patch size different, but the attention is also applied within each patch. If I understand correctly, I would expand the attention map. Assume your attention map covers tokens 2, 3, 4 and there are 6 tokens in total: I would expand the 3x3 attention map to a 6x6 attention map by padding with zeros and placing the values from the 3x3 map in the appropriate positions. All the calculations in this case remain exactly the same, except that each layer now has several updates, one per patch to which self-attention is applied (these are independent, since each one updates different tokens). Please let me know if this is helpful. Since I'm not 100% familiar with the implementation details of Swin Transformers, there may be something I'm missing here, but generally speaking this should work.
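For concreteness, here is a minimal sketch of the zero-padding idea (the sizes and token indices are only illustrative, not from the repo):

```python
import torch

# Illustrative only: a window's 3x3 attention map covering tokens 2, 3, 4
# out of 6 tokens in total.
num_tokens = 6
window_tokens = torch.tensor([2, 3, 4])

window_attn = torch.rand(3, 3).softmax(dim=-1)  # attention within the window

# Expand to a full 6x6 map: zeros everywhere except the window's entries.
full_attn = torch.zeros(num_tokens, num_tokens)
full_attn[window_tokens.unsqueeze(1), window_tokens.unsqueeze(0)] = window_attn

print(full_attn)
```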
All the best, Hila.
Hi @hila-chefer, thank you for your response.
I just get the gradient of each attention map with attn.register_hook(self.save_attn_gradients) and grad = blk.attn.get_attn_gradients(). The shapes of the attention maps and their gradients are as follows:

- torch.Size([64, 3, 49, 49]) / torch.Size([64, 3, 49, 49])
- torch.Size([16, 6, 49, 49]) / torch.Size([16, 6, 49, 49])
- torch.Size([4, 12, 49, 49]) / torch.Size([4, 12, 49, 49])
- torch.Size([4, 12, 49, 49]) / torch.Size([4, 12, 49, 49])
- torch.Size([4, 12, 49, 49]) / torch.Size([4, 12, 49, 49])
- torch.Size([1, 24, 49, 49]) / torch.Size([1, 24, 49, 49])

The second dimension (i.e., 3, 6, 12, 24) is the number of heads in each attention block. I am trying to reduce all the gradients to a shape of [1, number of heads, 49, 49] before applying your algorithm, roughly as sketched below. Do you think it will work? I will let you know the result. Thank you for your time.
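A rough sketch of what I have in mind (averaging over the window dimension is my own assumption, not code from your repo):

```python
import torch

def reduce_windows(attn, grad):
    # attn, grad: [num_windows, num_heads, 49, 49]
    # Collapse the window dimension so both become [1, num_heads, 49, 49].
    # Averaging over windows is my own assumption, not code from the repo.
    return attn.mean(dim=0, keepdim=True), grad.mean(dim=0, keepdim=True)

# Example with the shapes of the first block:
attn = torch.rand(64, 3, 49, 49)
grad = torch.rand(64, 3, 49, 49)
attn, grad = reduce_windows(attn, grad)
print(attn.shape, grad.shape)  # torch.Size([1, 3, 49, 49]) torch.Size([1, 3, 49, 49])
```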
Best regards, Kevin
Hi @KP-Zhang,
could you please explain the full shapes in each layer? How many patches are there? Is it 49 patches? Thanks.
Hi, @hila-chefer. Thank you for your attention. Yes, the patch number is 49, which corresponds to a 7x7 attention map. I have already reduced all the gradients to a shape of [1, number of heads, 49, 49] and visualized the attention map. To some extent, the visualized attention map makes sense. However, a 7x7 attention map seems to miss a lot of detail. I am trying to enlarge the attention map. Do you have any advice?
@KP-Zhang great news! With ViT, what we did was interpolate the relevance map from the attention-map size back to the original dimensions of the image (we used bilinear interpolation and it seemed to work well), roughly as sketched below. Also, I'd really appreciate it if you could open a PR with your code (you'll get credit, of course) on this repo to share your great work; I'm sure it could help others as well :) In that case I'll also go over the code, since you say the relevance maps seem to miss some details.
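Something along these lines (a minimal sketch; the 7x7 relevance map and the 224x224 input size are just assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# relevance: a 7x7 map from the 49 tokens, shaped [N, C, H, W] for interpolation
relevance = torch.rand(1, 1, 7, 7)

# Upsample back to the input resolution (assuming 224x224) with bilinear interpolation.
relevance_full = F.interpolate(relevance, size=(224, 224),
                               mode="bilinear", align_corners=False)

# Normalize to [0, 1] before overlaying it on the image as a heatmap.
relevance_full = (relevance_full - relevance_full.min()) / \
                 (relevance_full.max() - relevance_full.min())
print(relevance_full.shape)  # torch.Size([1, 1, 224, 224])
```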
Thanks, Hila.
@hila-chefer Hi, Hila. Thank you for your inspirational work. I am planning to add a PR with my code. Since the attention-map visualization is part of ongoing work, we are still finalizing some details. Hopefully, I can finish the work in two weeks; I am sorry for the delay. I am still studying your code and will share mine with you as soon as possible. Thank you for your time.
Best regards, Kevin
@KP-Zhang thanks Kevin! I was happy to help, and I wish you the best of luck with your work! I look forward to hearing about it soon.
All the best, Hila.
Hi Kevin,
Do you still plan to add a PR for the Swin Transformer? @KP-Zhang
Thank you.
Best, Ethan
Hello Hila, thank you for your great work. It is impressive. Right now, I am working on visualizing attention maps with the Swin Transformer, and your work brings me some interesting insights. In your code (CLIP-explainability.ipynb), the shapes of grad and cam are expected to be consistent across attention blocks. However, in the Swin Transformer the patch size changes across blocks, which results in attention maps of different sizes. Can you give me some advice on how I can apply your work to generate relevance maps for the Swin Transformer? A rough sketch of the per-block update I am referring to is included below. Thank you for your time.
Best wishes, Kevin
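For reference, this is my own paraphrase of the per-block relevance update from the notebook (not a verbatim copy; the helper name is mine):

```python
import torch

def apply_self_attention_rule(R, cam, grad):
    # cam, grad: [num_heads, num_tokens, num_tokens]
    # R:         [num_tokens, num_tokens], relevance accumulated so far.
    # Weight the attention by its gradient, keep the positive part,
    # average over heads, and propagate the relevance.
    cam = (grad * cam).clamp(min=0).mean(dim=0)
    return R + torch.matmul(cam, R)

# Toy example with 49 tokens and 3 heads.
num_tokens, num_heads = 49, 3
R = torch.eye(num_tokens)
cam = torch.rand(num_heads, num_tokens, num_tokens)
grad = torch.rand(num_heads, num_tokens, num_tokens)
R = apply_self_attention_rule(R, cam, grad)
print(R.shape)  # torch.Size([49, 49])
```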