lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
MIT License

Attention maps #1

Open Tato14 opened 4 years ago

Tato14 commented 4 years ago

Hi! First, thanks for the great resource. I was wondering how difficult it would be to implement the attention visualizations they show in Fig. 6 and Fig. 13 of the paper. I am not quite familiar with transformers. Is this similar to Grad-CAM, or some different approach?

lucidrains commented 4 years ago

@Tato14 Hi Joan! It seems the approach came from https://arxiv.org/pdf/2005.00928.pdf. I'll have to read it after I get through my queue of papers this week to see how difficult it is to implement! Feel free to keep this issue open in the meantime.

lucidrains commented 4 years ago

@Tato14 the naive attention map for an individual layer is stored in the variable `attn`: https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L56
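For anyone following along, here is a minimal sketch (with assumed shapes and names) of how that `attn` variable arises inside a ViT attention layer: scaled dot-product scores between queries and keys, softmax-normalized per query. Each head's `attn` slice is the map you would visualize.

```python
import torch

batch, heads, n, dim_head = 1, 8, 65, 64   # 64 patches + 1 CLS token (assumed sizes)
q = torch.randn(batch, heads, n, dim_head)
k = torch.randn(batch, heads, n, dim_head)

scale = dim_head ** -0.5
dots = torch.matmul(q, k.transpose(-1, -2)) * scale  # (batch, heads, n, n) scores
attn = dots.softmax(dim=-1)                          # the per-head attention maps

# each row is a probability distribution over the n tokens
print(attn.shape)  # torch.Size([1, 8, 65, 65])
```

Averaging or max-pooling `attn` over the `heads` dimension gives a single per-layer map per image.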

lukasfolle commented 3 years ago

It's probably this line: https://github.com/lucidrains/vit-pytorch/blob/6c8dfc185ea41f4d2388e4d33bbb76f900ff8a0a/vit_pytorch/vit_pytorch.py#L63

PascalHbr commented 3 years ago

Why is the softmax only applied to dim=-1? Shouldn't the softmax be calculated over the last 2 dimensions, i.e. over the whole matrix, instead of just one dimension of the matrix?

edit: I'll open a separate Issue
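To answer the question inline: `dim=-1` is the intended behavior. Each row of the score matrix belongs to one query, and the softmax turns that row into a probability distribution over all keys. A softmax over the whole matrix would instead make rows compete with each other. A small sketch on a toy score matrix:

```python
import torch

scores = torch.randn(4, 4)  # toy (queries x keys) score matrix

row_softmax = scores.softmax(dim=-1)                            # what the code does
flat_softmax = scores.flatten().softmax(dim=-1).reshape(4, 4)   # whole-matrix alternative

print(row_softmax.sum(dim=-1))  # every query's weights sum to 1
print(flat_softmax.sum())       # only the matrix as a whole sums to 1
```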

suarezjessie commented 3 years ago

Hi @lucidrains, is there any news for the attention map visualization? Thanks!

jpgard commented 3 years ago

It seems this has been implemented; see the description in the README here.
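For readers who land here without vit-pytorch installed, the same idea can be sketched in plain PyTorch: `nn.MultiheadAttention` can return its (head-averaged) attention map directly, which you can then reshape into a patch-grid heatmap. The sizes below (65 tokens, 8x8 patch grid) are illustrative assumptions, not the library's defaults.

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 65, 128)  # (batch, 1 CLS + 64 patches, embed dim) — assumed sizes
mha = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)

# need_weights=True returns the attention map averaged over heads: (batch, 65, 65)
out, attn_map = mha(tokens, tokens, tokens, need_weights=True)
print(attn_map.shape)  # torch.Size([1, 65, 65])

# row 0 is the CLS token's attention; its weights over the 64 patches
# reshape into an 8x8 heatmap suitable for overlaying on the image
cls_to_patches = attn_map[0, 0, 1:].reshape(8, 8)
```

The README's wrapper works the same way conceptually, collecting these per-layer maps during a forward pass instead of requiring you to hook each layer by hand.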