Closed: KP-Zhang closed this issue 3 years ago
Hi @KP-Zhang, thanks for your interest in our work!
As I understand it (and please correct me if I'm wrong), not only is the patch size different, but the attention is also applied within each patch. If I understand correctly, I would expand the attention map. Assume your attention map covers tokens 2, 3, 4 and there are 6 tokens in total: I would expand the 3x3 attention map to a 6x6 attention map by padding with zeros and placing the values from the 3x3 map in the appropriate positions. All the calculations in this case remain exactly the same, except that each layer now has several updates, one per patch to which self-attention is applied (these are independent, since each one updates different tokens). Please let me know if this is helpful. Since I'm not 100% familiar with the implementation details of Swin Transformers, there may be something I'm missing here, but generally speaking this should work.
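For concreteness, here is a minimal sketch of the zero-padding idea (the sizes and token indices are only illustrative, not from the repo):

```python
import torch

# Illustrative only: a window's 3x3 attention map covering tokens 2, 3, 4
# out of 6 tokens in total.
num_tokens = 6
window_tokens = torch.tensor([2, 3, 4])

window_attn = torch.rand(3, 3).softmax(dim=-1)  # attention within the window

# Expand to a full 6x6 map: zeros everywhere except the window's entries.
full_attn = torch.zeros(num_tokens, num_tokens)
full_attn[window_tokens.unsqueeze(1), window_tokens.unsqueeze(0)] = window_attn

print(full_attn)
```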
All the best, Hila.
Hi @hila-chefer, thank you for your response.
I just get the gradient of each attention map with attn.register_hook(self.save_attn_gradients) and grad = blk.attn.get_attn_gradients(). The shapes of the attention maps and their gradients are as follows:

- torch.Size([64, 3, 49, 49]) / torch.Size([64, 3, 49, 49])
- torch.Size([16, 6, 49, 49]) / torch.Size([16, 6, 49, 49])
- torch.Size([4, 12, 49, 49]) / torch.Size([4, 12, 49, 49])
- torch.Size([4, 12, 49, 49]) / torch.Size([4, 12, 49, 49])
- torch.Size([4, 12, 49, 49]) / torch.Size([4, 12, 49, 49])
- torch.Size([1, 24, 49, 49]) / torch.Size([1, 24, 49, 49])

The second dimension (i.e., 3, 6, 12, 24) is the number of heads in each attention block. I am trying to reduce all the gradients to a shape of [1, number of heads, 49, 49] before applying your algorithm, roughly as sketched below. Do you think it will work? I will let you know the result. Thank you for your time.
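A rough sketch of what I have in mind (averaging over the window dimension is my own assumption, not code from your repo):

```python
import torch

def reduce_windows(attn, grad):
    # attn, grad: [num_windows, num_heads, 49, 49]
    # Collapse the window dimension so both become [1, num_heads, 49, 49].
    # Averaging over windows is my own assumption, not code from the repo.
    return attn.mean(dim=0, keepdim=True), grad.mean(dim=0, keepdim=True)

# Example with the shapes of the first block:
attn = torch.rand(64, 3, 49, 49)
grad = torch.rand(64, 3, 49, 49)
attn, grad = reduce_windows(attn, grad)
print(attn.shape, grad.shape)  # torch.Size([1, 3, 49, 49]) torch.Size([1, 3, 49, 49])
```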
Best regards, Kevin
Hi @KP-Zhang,
could you please explain the full shapes in each layer? How many patches are there? Is it 49 patches? Thanks.
Hi, @hila-chefer. Thank you for your attention. Yes, the patch number is 49, which corresponds to a 7x7 attention map. I have already reduced all the gradients to a shape of [1, number of heads, 49, 49] and visualized the attention map. To some extent, the visualized attention map makes sense. However, a 7x7 attention map seems to miss a lot of detail. I am trying to enlarge the attention map. Do you have any advice?
@KP-Zhang great news! With ViT, what we did was interpolate the relevance map from the attention-map size back to the original dimensions of the image (we used bilinear interpolation and it seemed to work well), roughly as sketched below. Also, I'd really appreciate it if you could open a PR with your code (you'll get credit, of course) on this repo to share your great work; I'm sure it could help others as well :) In that case I'll also go over the code, since you say the relevance maps seem to miss some details.
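Something along these lines (a minimal sketch; the 7x7 relevance map and the 224x224 input size are just assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# relevance: a 7x7 map from the 49 tokens, shaped [N, C, H, W] for interpolation
relevance = torch.rand(1, 1, 7, 7)

# Upsample back to the input resolution (assuming 224x224) with bilinear interpolation.
relevance_full = F.interpolate(relevance, size=(224, 224),
                               mode="bilinear", align_corners=False)

# Normalize to [0, 1] before overlaying it on the image as a heatmap.
relevance_full = (relevance_full - relevance_full.min()) / \
                 (relevance_full.max() - relevance_full.min())
print(relevance_full.shape)  # torch.Size([1, 1, 224, 224])
```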
Thanks, Hila.
@hila-chefer Hi, Hila. Thank you for your inspirational work. I am planning to add a PR with my code. Since the attention-map visualization is part of ongoing work, we are still finalizing some details. Hopefully, I can finish the work in two weeks; I am sorry for the delay. I am still studying your code and will share mine with you as soon as possible. Thank you for your time.
Best regards, Kevin
@KP-Zhang thanks Kevin! I was happy to help, and I wish you the best of luck with your work! I look forward to hearing about it soon.
All the best, Hila.
Hi Kevin,
Do you still plan to add a PR for the Swin Transformer? @KP-Zhang
Thank you.
Best, Ethan
Hello Hila, thank you for your great work. It is impressive. Right now, I am working on visualizing attention maps with the Swin Transformer, and your work brings me some interesting insights. In your code (CLIP-explainability.ipynb), the shapes of grad and cam are expected to be consistent across attention blocks. However, in the Swin Transformer the patch size changes across blocks, which results in attention maps of different sizes. Can you give me some advice on how I can apply your work to generate relevance maps for the Swin Transformer? A rough sketch of the per-block update I am referring to is included below. Thank you for your time.
Best wishes, Kevin
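For reference, this is my own paraphrase of the per-block relevance update from the notebook (not a verbatim copy; the helper name is mine):

```python
import torch

def apply_self_attention_rule(R, cam, grad):
    # cam, grad: [num_heads, num_tokens, num_tokens]
    # R:         [num_tokens, num_tokens], relevance accumulated so far.
    # Weight the attention by its gradient, keep the positive part,
    # average over heads, and propagate the relevance.
    cam = (grad * cam).clamp(min=0).mean(dim=0)
    return R + torch.matmul(cam, R)

# Toy example with 49 tokens and 3 heads.
num_tokens, num_heads = 49, 3
R = torch.eye(num_tokens)
cam = torch.rand(num_heads, num_tokens, num_tokens)
grad = torch.rand(num_heads, num_tokens, num_tokens)
R = apply_self_attention_rule(R, cam, grad)
print(R.shape)  # torch.Size([49, 49])
```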