ma-xu / FCViT

Apache License 2.0

A Close Look at Spatial Modeling: From Attention to Convolution [arXiv]

by Xu Ma, Huan Wang, Can Qin, Kunpeng Li, Xingchen Zhao, Jie Fu, Yun Fu


Motivation

Figure 1: Attention map visualizations of Vision Transformers. For each pair, we show the query point and its corresponding attention map (from the last block and the last head). Images and query points were selected randomly for illustration. The color bar on the right indicates the values of the normalized attention maps.

:eyes: :bangbang: Observations & Motivations:

Solution: From Attention to Convolution

Figure 2: Illustration of an FCViT block. Following MetaFormer, FCViT treats a block as a combination of a token-mixer and a channel-mixer, with residual connections and layer normalization (LN). In the token-mixer, we dynamically integrate the global context with the input tokens via the token-global similarity. A depth-wise convolution is employed to fuse local information. To improve the generalization ability of the global context, we introduce a competition-driven information bottleneck structure.
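
For intuition, the token-mixer described above can be sketched in a few lines of PyTorch. This is a minimal sketch under assumed simplifications (mean-pooled global context, a sigmoid-gated token-global similarity, a linear bottleneck, and a 3x3 depth-wise convolution); the class and parameter names are ours, not the repository's actual implementation.

import torch
import torch.nn as nn

class FCViTBlockSketch(nn.Module):
    # Simplified token-mixer + channel-mixer block in the spirit of Figure 2.
    def __init__(self, dim, mlp_ratio=4, bottleneck_ratio=0.25):
        super().__init__()
        hidden = int(dim * bottleneck_ratio)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # information bottleneck applied to the pooled global context (assumed form)
        self.bottleneck = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        # depth-wise convolution to fuse local information
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # channel-mixer: a standard MLP
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):  # x: (B, H, W, C) tokens
        t = self.norm1(x)
        g = self.bottleneck(t.mean(dim=(1, 2)))                                # global context, (B, C)
        sim = torch.sigmoid((t * g[:, None, None, :]).sum(-1, keepdim=True))   # token-global similarity, (B, H, W, 1)
        t = t + sim * g[:, None, None, :]                                      # integrate global context into tokens
        t = self.dwconv(t.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)             # local fusion
        x = x + t                                                              # token-mixer residual
        x = x + self.mlp(self.norm2(x))                                        # channel-mixer residual
        return x

For example, FCViTBlockSketch(64)(torch.randn(2, 14, 14, 64)) returns a tensor of the same shape.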

Figure 3: Visual comparison of FCViT-B12 similarity maps and ViT-B attention maps. We plot all the outputs of the last block for the two models (8 groups for FCViT and 12 heads for ViT). Compared to ViT, the results indicate that: 1) FCViT focuses more on the objects; and 2) FCViT presents more diversity than multi-head attention, whose attention maps from different heads are nearly identical.


Image Classification

1. Requirements

torch>=1.7.0; torchvision>=0.8.0; pyyaml; apex-amp (if you want to use fp16); timm (pip install git+https://github.com/rwightman/pytorch-image-models.git@9d6aad44f8fd32e89e5cca503efe3ada5071cc2a)

Data preparation: ImageNet with the following folder structure; you can extract ImageNet with this script. A minimal loading sanity check is sketched after the directory tree below.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
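
This layout is the standard class-per-folder format, so it can be loaded directly with torchvision's ImageFolder. Below is a minimal loading sketch with placeholder paths and common ImageNet transform values; the repository's own scripts use the timm data pipeline instead.

import torch
from torchvision import datasets, transforms

# Common ImageNet validation preprocessing (not necessarily the repo's exact settings).
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_set = datasets.ImageFolder("/path/to/imagenet/val", transform=val_transform)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=128, shuffle=False, num_workers=8)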

2. FCViT Models

| Model | #Params | Image resolution | Top-1 Acc. (%) | Download |
| --- | --- | --- | --- | --- |
| FCViT-tiny | 4.6M | 224 | 74.9 | download |
| FCViT-B12 | 14M | 224 | 80.9 | download |
| FCViT-B24 | 25.7M | 224 | 82.5 | download |
| FCViT-B48 | 49.1M | 224 | 83.6 | download |
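
To use a downloaded checkpoint outside validate.py, a loading sketch along the following lines should work. The import path and the checkpoint format (a plain state_dict vs. a dict with a "state_dict" key) are assumptions here; adapt them to the repository's model definitions.

import torch
from models import fcvit_tiny  # assumption: constructors are named after the variants above

model = fcvit_tiny()
ckpt = torch.load("/path/to/fcvit_tiny.pth", map_location="cpu")
state_dict = ckpt["state_dict"] if isinstance(ckpt, dict) and "state_dict" in ckpt else ckpt
model.load_state_dict(state_dict)
model.eval()

# Single forward pass on a dummy 224x224 batch; expect (1, 1000) logits for ImageNet-1k.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))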

3. Validation

To evaluate our FCViT models, run:

MODEL=fcvit_tiny #{tiny, b12, b24, b48}
python3 validate.py /path/to/imagenet  --model $MODEL -b 128 --checkpoint {/path/to/checkpoint} 

4. Train

We show how to train FCViT on 8 GPUs. The learning rate scales linearly with the total batch size: lr = batch_size / 1024 * 1e-3. For example, with 8 GPUs and 128 images per GPU, the total batch size is 1024 and the learning rate is 1e-3 (for a batch size of 1024, a learning rate of 2e-3 sometimes gives slightly better performance); the rule is also sketched in code after the command below.

MODEL=fcvit_tiny # fcvit_{tiny, b12, b24, b48}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.1, 0.2] corresponding to models [tiny, b12, b24, b48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp
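
The scaling rule is easy to check in isolation (a small sketch, not part of the training scripts): with 8 GPUs and a per-GPU batch size of 128, as in the command above, the total batch size is 1024 and the learning rate stays at 1e-3.

def scaled_lr(batch_per_gpu, num_gpus, base_lr=1e-3, base_batch=1024):
    # lr = total_batch_size / 1024 * 1e-3
    return batch_per_gpu * num_gpus / base_batch * base_lr

print(scaled_lr(128, 8))  # 0.001
print(scaled_lr(128, 4))  # 0.0005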

5. Detection and Segmentation

For detection and segmentation tasks, please refer to [detection & instance segmentation] and [semantic segmentation].


Acknowledgment

Our implementation is mainly based on the following codebases. We sincerely thank the authors for their wonderful work.

poolformer, pytorch-image-models, mmdetection, mmsegmentation.


Citation

@article{ma2022fcvit,
  author      = {Ma, Xu and Wang, Huan and Qin, Can and Li, Kunpeng and Zhao, Xingchen and Fu, Jie and Fu, Yun},
  title       = {A Close Look at Spatial Modeling: From Attention to Convolution},
  publisher   = {arXiv},
  year        = {2022},
}