ouusan / some-papers


Advancing Vision Transformers with Group-Mix Attention (Efficient Attention) #1

Open ouusan opened 8 months ago

ouusan commented 8 months ago

1. Public code and paper link: I have installed the code from https://github.com/AILab-CVC/GroupMixFormer. Paper link: https://arxiv.org/abs/2311.15157

2. What does this work do?

3. What is the heart of this method?

(1). Pre-attention branches: x[0], x[1], x[2], x[3]

x[0]: applies an identity mapping to its segment instead of an aggregator, preserving the network's ability to model individual token-to-token correlations (individual patterns).

x[1], x[2], x[3]: apply aggregators with different kernel sizes (3, 5, and 7) to generate group proxies (group patterns).

(2). The non-attention branch: x[4] (x_local)

x[4]: to construct diverse connections, the rightmost branch applies aggregation but no attention.
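The five-branch split above can be sketched in NumPy. This is a simplified illustration, not the paper's implementation: the aggregators here are plain sliding-window averages standing in for the depthwise-conv aggregators, and the function names (`aggregate`, `group_mix_attention`) are mine.

```python
import numpy as np

def aggregate(x, k):
    # Stand-in for a depthwise-conv aggregator: average over a
    # sliding window of size k along the token dimension.
    n, _ = x.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(n)])

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def group_mix_attention(q, k, v):
    # Split the channel dimension into 5 equal segments x[0..4].
    qs, ks, vs = (np.split(t, 5, axis=-1) for t in (q, k, v))
    kernels = [None, 3, 5, 7]  # x[0]: identity; x[1..3]: aggregators
    outs = []
    for i, ker in enumerate(kernels):
        qi, ki, vi = qs[i], ks[i], vs[i]
        if ker is not None:
            # Group proxies: aggregate tokens before attention.
            qi, ki, vi = (aggregate(t, ker) for t in (qi, ki, vi))
        attn = softmax(qi @ ki.T / np.sqrt(qi.shape[-1]))
        outs.append(attn @ vi)
    # x[4]: non-attention branch, aggregation only.
    outs.append(aggregate(vs[4], 3))
    return np.concatenate(outs, axis=-1)
```

Example: for 8 tokens with 20 channels, `group_mix_attention(x, x, x)` returns an (8, 20) array, with each 4-channel segment produced by its own branch.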

4. Results

5. Can I remember some related works?

ouusan commented 8 months ago

(screenshot)

Inspiration: most visualization tools such as GradCAM can only visualize image classification and segmentation features, because those models' output is a plain tensor rather than a dict: they end with a classifier head, so the tools expect len(model_output.shape) == 1. Tools like GradCAM/GradCAMPlusPlus therefore can't be applied directly to our multi-output tasks. So how can we visualize?
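One common workaround is to wrap the multi-output model so the CAM tool only ever sees one head's tensor. A minimal sketch, assuming the model returns a dict of outputs; `SingleOutputWrapper` and the key names are hypothetical, not from the paper's code:

```python
class SingleOutputWrapper:
    """Wrap a multi-output model so CAM tools that expect a single
    tensor output (a classifier-style head) see only one chosen head.

    Hypothetical helper: `key` selects which entry of the model's
    dict output is exposed to the CAM tool.
    """

    def __init__(self, model, key):
        self.model = model
        self.key = key

    def __call__(self, x):
        out = self.model(x)   # dict of outputs, e.g. {"cls": ..., "seg": ...}
        return out[self.key]  # a single tensor for the CAM tool
```

With pytorch-grad-cam, for example, one would then construct the CAM on `SingleOutputWrapper(net, "cls")` instead of `net`, so the selected head behaves like an ordinary classification output.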

done.

ouusan commented 7 months ago

Paper link: https://arxiv.org/pdf/2308.10305v1.pdf (see the preliminary section).
