Inspiration: most visualization tools like Grad-CAM can only visualize image classification and segmentation features, since those models return a plain tensor rather than a dict and end in a classifier head, so the output satisfies `len(model_output.shape) == 1`. Tools such as GradCAM/GradCAMPlusPlus therefore can't be applied to our multi-output tasks. So how can we visualize them?
done.
Paper link: https://arxiv.org/pdf/2308.10305v1.pdf (see the preliminary section).
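For context, a common workaround with the `pytorch-grad-cam` package is to wrap the dict-returning model so that its `forward()` yields a single logits tensor, which is what the CAM classes expect. A minimal sketch, assuming a hypothetical `multi_task_model` whose output dict has a `"cls_logits"` head; the target-layer path and the class index are placeholders as well:

```python
import torch.nn as nn
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

class DictOutputWrapper(nn.Module):
    """Wrap a model whose forward() returns a dict so it returns one tensor."""
    def __init__(self, model, key):
        super().__init__()
        self.model = model
        self.key = key  # which head of the multi-output dict to visualize

    def forward(self, x):
        return self.model(x)[self.key]  # e.g. a (B, num_classes) logits tensor

# hypothetical names: multi_task_model, its "cls_logits" head, and the layer path
wrapped = DictOutputWrapper(multi_task_model, key="cls_logits")
target_layers = [wrapped.model.backbone.stages[-1]]  # last stage of the backbone
cam = GradCAM(model=wrapped, target_layers=target_layers)
grayscale_cam = cam(input_tensor=images,                     # (B, 3, H, W) batch
                    targets=[ClassifierOutputTarget(281)])   # class index to explain
```

For non-classification heads, `targets` can instead be any callables that map the model output to a scalar, and for transformer backbones a `reshape_transform` is usually also needed.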
1. Public code and paper link: I have installed the following code: https://github.com/AILab-CVC/GroupMixFormer; paper link: https://arxiv.org/abs/2311.15157
2. Main idea: the attention map generated from the Query and Key captures only token-to-token correlations at a single granularity, while this paper argues that self-attention should have a more comprehensive mechanism to capture correlations among tokens and groups (i.e., multiple adjacent tokens) for higher representational capacity.
Thereby, they propose Group-Mix Attention (GMA) as an advanced replacement for traditional self-attention, which can simultaneously capture token-to-token, token-to-group, and group-to-group correlations with various group sizes.
They also adopt stochastic depth in the ViT backbone (a sketch follows).
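Stochastic depth randomly skips a block's residual branch per sample during training. A minimal sketch of the standard DropPath module (in the style of timm's implementation; not necessarily the exact code in this repo):

```python
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly drop the residual branch per sample (Huang et al., 2016)."""
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x  # no dropping at inference or when disabled
        keep_prob = 1.0 - self.drop_prob
        # one Bernoulli draw per sample, broadcast over the remaining dims
        mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep_prob)
        return x * mask / keep_prob  # rescale so the expected value is unchanged

# typical use inside a transformer block: x = x + drop_path(attn(norm(x)))
```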
3. What is the heart of this method?
(1). Pre-attention branches: x[0], x[1], x[2], x[3]
x[0]: employs an identity mapping on one segment instead of an aggregator, to maintain the network's ability to model individual token correlations (individual patterns).
x[1], x[2], x[3]: use aggregators with different kernel sizes (3, 5, and 7) to generate group proxies (group patterns).
(2). The non-attention branch: x[4] / x_local
x[4]: to construct diverse connections, the rightmost branch applies aggregation but without attention. A sketch of both branch types follows this list.
4. Results
5. Can I recall some related works?