Tato14 opened this issue 4 years ago
@Tato14 Hi Joan! Seems like the approach came from https://arxiv.org/pdf/2005.00928.pdf. I'll have to read it after I get through my queue of papers this week to see how difficult it is to implement! Feel free to keep this issue open in the meantime.
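For reference, the technique in that paper is attention rollout: multiply the per-layer attention matrices together, mixing in the identity to account for the residual connections. Below is a rough sketch of the idea, not the exact recipe from the paper, assuming you have already collected the per-layer maps as tensors of shape (batch, heads, tokens, tokens):

```python
import torch

# illustrative sketch of attention rollout: average over heads, mix in the
# identity for the residual connection, re-normalize, and compose the
# per-layer matrices from the first layer to the last.
def attention_rollout(attentions):
    # attentions: list of tensors of shape (batch, heads, tokens, tokens), one per layer
    rollout = None
    for attn in attentions:
        attn = attn.mean(dim=1)                                 # (batch, tokens, tokens)
        eye = torch.eye(attn.size(-1), device=attn.device)
        attn = 0.5 * attn + 0.5 * eye                           # account for residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)            # keep rows summing to 1
        rollout = attn if rollout is None else attn @ rollout   # compose across layers
    return rollout  # (batch, tokens, tokens)
```

Something like `attention_rollout(per_layer_maps)[0, 0, 1:]` would then give the CLS-to-patch weights, which you could reshape into a heatmap over the image patches.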
@Tato14 the naive attention map for individual layers is this `attn` variable:
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit_pytorch.py#L56
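One way to pull those per-layer `attn` maps out at inference time is sketched below. This assumes the softmax inside each Attention block is an `nn.Softmax` submodule; if your version computes it with a functional call like `dots.softmax(dim=-1)`, you would instead modify `Attention.forward` to also stash or return `attn`. The ViT hyperparameters here are just the README example values.

```python
import torch
from vit_pytorch import ViT

# hyperparameters taken from the README example, purely for illustration
model = ViT(image_size=256, patch_size=32, num_classes=1000,
            dim=1024, depth=6, heads=16, mlp_dim=2048)

attn_maps = []

def save_attn(module, inputs, output):
    # the softmax output is the attention map, shape (batch, heads, tokens, tokens)
    attn_maps.append(output.detach())

# only works if the softmax is an nn.Softmax submodule of each Attention block
hooks = [m.register_forward_hook(save_attn)
         for m in model.modules() if isinstance(m, torch.nn.Softmax)]

img = torch.randn(1, 3, 256, 256)
_ = model(img)

for h in hooks:
    h.remove()

if attn_maps:
    print(len(attn_maps), attn_maps[0].shape)  # expect one map per transformer layer
```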
Why is the softmax only applied to dim=-1? Shouldn't the softmax be calculated over the last 2 dimensions, i.e. over the whole matrix, instead of just one dimension of it?
edit: I'll open a separate Issue
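For what it's worth, here is a toy comparison of the two normalizations being asked about, assuming a (queries, keys) score matrix: softmax over dim=-1 turns each query row into its own distribution over keys, while a softmax over the whole matrix would make all entries sum to 1 jointly.

```python
import torch
import torch.nn.functional as F

scores = torch.randn(4, 4)  # toy (queries, keys) attention scores

per_query = F.softmax(scores, dim=-1)                      # each row sums to 1
whole     = F.softmax(scores.flatten(), dim=0).view(4, 4)  # all 16 entries sum to 1 jointly

print(per_query.sum(dim=-1))  # tensor([1., 1., 1., 1.])
print(whole.sum())            # tensor(1.)
```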
Hi @lucidrains, is there any news on the attention map visualization? Thanks!
Hi! First, thanks for the great resource. I was wondering how difficult it would be to implement the attention results shown in Fig. 6 and Fig. 13 of the paper. I am not very familiar with transformers. Is this similar to Grad-CAM, or is it a different approach?