Inspiration: most visualization tools like Grad-CAM can only visualize image classification and segmentation features, since those models return a plain tensor rather than a dict and end in a classifier head, so the output satisfies `len(model_output.shape) == 1`. Tools such as GradCAM/GradCAMPlusPlus therefore can't be applied to our multi-output tasks. So how can we visualize them?
done.
Paper link: https://arxiv.org/pdf/2308.10305v1.pdf (see the preliminary section).
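For context, a common workaround with the `pytorch-grad-cam` package is to wrap the dict-returning model so that its `forward()` yields a single logits tensor, which is what the CAM classes expect. A minimal sketch, assuming a hypothetical `multi_task_model` whose output dict has a `"cls_logits"` head; the target-layer path and the class index are placeholders as well:

```python
import torch.nn as nn
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

class DictOutputWrapper(nn.Module):
    """Wrap a model whose forward() returns a dict so it returns one tensor."""
    def __init__(self, model, key):
        super().__init__()
        self.model = model
        self.key = key  # which head of the multi-output dict to visualize

    def forward(self, x):
        return self.model(x)[self.key]  # e.g. a (B, num_classes) logits tensor

# hypothetical names: multi_task_model, its "cls_logits" head, and the layer path
wrapped = DictOutputWrapper(multi_task_model, key="cls_logits")
target_layers = [wrapped.model.backbone.stages[-1]]  # last stage of the backbone
cam = GradCAM(model=wrapped, target_layers=target_layers)
grayscale_cam = cam(input_tensor=images,                     # (B, 3, H, W) batch
                    targets=[ClassifierOutputTarget(281)])   # class index to explain
```

For non-classification heads, `targets` can instead be any callables that map the model output to a scalar, and for transformer backbones a `reshape_transform` is usually also needed.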
1. Public code and paper link: I have installed the following code: https://github.com/AILab-CVC/GroupMixFormer; paper link: https://arxiv.org/abs/2311.15157
2. Main idea: the attention map generated from the Query and Key captures only token-to-token correlations at a single granularity, while this paper argues that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity.
Thereby, they propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes.
They also adopt stochastic depth in the ViT backbone (a sketch follows).
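Stochastic depth randomly skips a block's residual branch per sample during training. A minimal sketch of the standard DropPath module (in the style of timm's implementation; not necessarily the exact code in this repo):

```python
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drop the residual branch per sample (Huang et al., 2016)."""
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x  # no dropping at inference or when disabled
        keep_prob = 1.0 - self.drop_prob
        # one Bernoulli draw per sample, broadcast over the remaining dims
        mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep_prob)
        return x * mask / keep_prob  # rescale so the expected value is unchanged

# typical use inside a transformer block: x = x + drop_path(attn(norm(x)))
```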
3. What is the heart of this method?
(1). Pre-attention branches: x[0], x[1], x[2], x[3]
x[0]: employs an identity mapping on one segment instead of an aggregator, to maintain the network's ability to model individual token correlations (individual patterns).
x[1], x[2], x[3]: use aggregators with different kernel sizes (3, 5, and 7) to generate group proxies (group patterns).
(2). The non-attention branch: x[4] / x_local
x[4]: to construct diverse connections, the rightmost branch applies aggregation but without attention. A sketch of both branch types follows this list.
4. Results
5. Can I recall some related works?