microsoft / FocalNet

[NeurIPS 2022] Official code for "Focal Modulation Networks"
MIT License

Focal Modulation Networks

This is the official PyTorch implementation of FocalNets:

"Focal Modulation Networks" by Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan and Jianfeng Gao.


News

Introduction

We propose FocalNets (Focal Modulation Networks), an attention-free architecture that outperforms state-of-the-art (SoTA) self-attention (SA) methods across various vision benchmarks. SA is a first-interaction, last-aggregation (FILA) process, as shown above. Our focal modulation inverts this order by aggregating first and interacting last (FALI). This inversion brings several merits:

Before getting started, see how our FocalNets have learned to perceive images and where to modulate!

Finally, FocalNets are built with convolutional and linear layers, but go beyond them by proposing a new modulation mechanism that is simple, generic, effective and efficient. We hereby recommend:

Focal-Modulation May be What We Need for Visual Modeling!

Getting Started

Benchmarking

Image Classification on ImageNet-1K

| Model | Depth | Dim | Kernels | #Params. (M) | FLOPs (G) | Throughput (imgs/s) | Top-1 | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FocalNet-T | [2,2,6,2] | 96 | [3,5] | 28.4 | 4.4 | 743 | 82.1 | ckpt/config/log |
| FocalNet-T | [2,2,6,2] | 96 | [3,5,7] | 28.6 | 4.5 | 696 | 82.3 | ckpt/config/log |
| FocalNet-S | [2,2,18,2] | 96 | [3,5] | 49.9 | 8.6 | 434 | 83.4 | ckpt/config/log |
| FocalNet-S | [2,2,18,2] | 96 | [3,5,7] | 50.3 | 8.7 | 406 | 83.5 | ckpt/config/log |
| FocalNet-B | [2,2,18,2] | 128 | [3,5] | 88.1 | 15.3 | 280 | 83.7 | ckpt/config/log |
| FocalNet-B | [2,2,18,2] | 128 | [3,5,7] | 88.7 | 15.4 | 269 | 83.9 | ckpt/config/log |

| Model | Depth | Dim | Kernels | #Params. (M) | FLOPs (G) | Throughput (imgs/s) | Top-1 | Download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FocalNet-T | 12 | 192 | [3,5,7] | 5.9 | 1.1 | 2334 | 74.1 | ckpt/config/log |
| FocalNet-S | 12 | 384 | [3,5,7] | 22.4 | 4.3 | 920 | 80.9 | ckpt/config/log |
| FocalNet-B | 12 | 768 | [3,5,7] | 87.2 | 16.9 | 300 | 82.4 | ckpt/config/log |
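
The Kernels column appears to list the depth-wise kernel size used at each focal level. A minimal sketch, assuming each level enlarges the kernel by a fixed step of 2; the function and parameter names (`focal_kernel_sizes`, `focal_window`, `focal_levels`, `focal_factor`) are illustrative and are not stated in this README:

```python
def focal_kernel_sizes(focal_window, focal_levels, focal_factor=2):
    """Kernel size per focal level, assuming a constant growth step.

    focal_window: kernel size at the first focal level (assumed name).
    focal_levels: number of focal levels (assumed name).
    focal_factor: kernel growth per level (assumed to be 2).
    """
    return [focal_window + focal_factor * level for level in range(focal_levels)]


print(focal_kernel_sizes(3, 2))  # [3, 5]
print(focal_kernel_sizes(3, 3))  # [3, 5, 7]
```

Under this assumption, a first-level window of 3 with two or three levels reproduces the `[3,5]` and `[3,5,7]` entries listed for the ImageNet-1K models.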

ImageNet-22K Pretraining

| Model | Depth | Dim | Kernels | #Params. (M) | Download |
| --- | --- | --- | --- | --- | --- |
| FocalNet-L | [2,2,18,2] | 192 | [5,7,9] | 207 | ckpt/config |
| FocalNet-L | [2,2,18,2] | 192 | [3,5,7,9] | 207 | ckpt/config |
| FocalNet-XL | [2,2,18,2] | 256 | [5,7,9] | 366 | ckpt/config |
| FocalNet-XL | [2,2,18,2] | 256 | [3,5,7,9] | 366 | ckpt/config |
| FocalNet-H | [2,2,18,2] | 352 | [3,5,7] | 687 | ckpt/config |
| FocalNet-H | [2,2,18,2] | 352 | [3,5,7,9] | 689 | ckpt/config |

NOTE: We reorder the class names in ImageNet-22K so that the first 1K logits can be used directly for evaluation on ImageNet-1K. Note that the 851st class (label=850) of ImageNet-1K is missing from ImageNet-22K. Please refer to this labelmap. More discussion can be found in this issue.
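
As a rough illustration of the note above, evaluating a 22K-pretrained model on ImageNet-1K amounts to keeping only the first 1K logits before taking the top-1 prediction. The sketch below is plain Python with a hypothetical function name. How the missing class (label=850) should be treated is not specified here, so the sketch only counts the affected samples and leaves that decision to the caller.

```python
def top1_from_22k_logits(logits, labels, num_1k=1000, missing_label=850):
    """Top-1 accuracy on ImageNet-1K labels from reordered 22K logits.

    logits: one list of class scores per sample, ordered so that the
            first `num_1k` entries correspond to the ImageNet-1K classes.
    labels: ImageNet-1K ground-truth labels, one per sample.
    Returns (accuracy, number of samples whose label is `missing_label`).
    """
    correct = 0
    n_missing = 0
    for scores, label in zip(logits, labels):
        head = scores[:num_1k]  # drop every logit beyond the first 1K
        pred = max(range(len(head)), key=head.__getitem__)  # argmax
        correct += int(pred == label)
        n_missing += int(label == missing_label)
    return correct / len(labels), n_missing
```

With a tensor library the slicing step would typically be a single indexing operation over the class dimension; the loop form is used here only to keep the example dependency-free.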

Object Detection on COCO

| Backbone | Kernels | Lr Schd | #Params. (M) | FLOPs (G) | box mAP | mask mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FocalNet-T | [9,11] | 1x | 48.6 | 267 | 45.9 | 41.3 | ckpt/config/log |
| FocalNet-T | [9,11] | 3x | 48.6 | 267 | 47.6 | 42.6 | ckpt/config/log |
| FocalNet-T | [9,11,13] | 1x | 48.8 | 268 | 46.1 | 41.5 | ckpt/config/log |
| FocalNet-T | [9,11,13] | 3x | 48.8 | 268 | 48.0 | 42.9 | ckpt/config/log |
| FocalNet-S | [9,11] | 1x | 70.8 | 356 | 48.0 | 42.7 | ckpt/config/log |
| FocalNet-S | [9,11] | 3x | 70.8 | 356 | 48.9 | 43.6 | ckpt/config/log |
| FocalNet-S | [9,11,13] | 1x | 72.3 | 365 | 48.3 | 43.1 | ckpt/config/log |
| FocalNet-S | [9,11,13] | 3x | 72.3 | 365 | 49.3 | 43.8 | ckpt/config/log |
| FocalNet-B | [9,11] | 1x | 109.4 | 496 | 48.8 | 43.3 | ckpt/config/log |
| FocalNet-B | [9,11] | 3x | 109.4 | 496 | 49.6 | 44.1 | ckpt/config/log |
| FocalNet-B | [9,11,13] | 1x | 111.4 | 507 | 49.0 | 43.5 | ckpt/config/log |
| FocalNet-B | [9,11,13] | 3x | 111.4 | 507 | 49.8 | 44.1 | ckpt/config/log |

| Backbone | Kernels | Method | Lr Schd | #Params. (M) | FLOPs (G) | box mAP | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FocalNet-T | [11,9,9,7] | Cascade Mask R-CNN | 3x | 87.1 | 751 | 51.5 | ckpt/config/log |
| FocalNet-T | [11,9,9,7] | ATSS | 3x | 37.2 | 220 | 49.6 | ckpt/config/log |
| FocalNet-T | [11,9,9,7] | Sparse R-CNN | 3x | 111.2 | 178 | 49.9 | ckpt/config/log |

Semantic Segmentation on ADE20K

| Backbone | Kernels | Method | #Params. (M) | FLOPs (G) | mIoU | mIoU (MS) | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FocalNet-T | [9,11] | UPerNet | 61 | 944 | 46.5 | 47.2 | ckpt/config/log |
| FocalNet-T | [9,11,13] | UPerNet | 61 | 949 | 46.8 | 47.8 | ckpt/config/log |
| FocalNet-S | [9,11] | UPerNet | 83 | 1035 | 49.3 | 50.1 | ckpt/config/log |
| FocalNet-S | [9,11,13] | UPerNet | 84 | 1044 | 49.1 | 50.1 | ckpt/config/log |
| FocalNet-B | [9,11] | UPerNet | 124 | 1180 | 50.2 | 51.1 | ckpt/config/log |
| FocalNet-B | [9,11,13] | UPerNet | 126 | 1192 | 50.5 | 51.4 | ckpt/config/log |

Visualizations

There are three steps in our FocalNets:

  1. Contextualization with depth-wise conv;
  2. Multi-scale aggregation with gating mechanism;
  3. Modulator derived from context aggregation and projection.
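
The three steps above can be sketched in PyTorch roughly as follows. This is a simplified illustration under stated assumptions, not the code shipped in this repo: the class name `FocalModulationSketch`, its argument names and defaults, and the channels-last input layout are all chosen here for the example.

```python
import torch
import torch.nn as nn


class FocalModulationSketch(nn.Module):
    """Illustrative focal modulation layer (hypothetical name and defaults)."""

    def __init__(self, dim, focal_window=3, focal_levels=2, focal_factor=2):
        super().__init__()
        self.dim = dim
        self.focal_levels = focal_levels
        # A single linear layer produces the query, the initial context and
        # one gate per focal level plus one gate for the global context.
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        # Step 1: depth-wise convolutions with growing kernel sizes.
        self.focal_layers = nn.ModuleList()
        for level in range(focal_levels):
            k = focal_window + focal_factor * level
            self.focal_layers.append(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2,
                          groups=dim, bias=False),
                nn.GELU(),
            ))
        self.act = nn.GELU()
        # Step 3: a 1x1 convolution turns aggregated context into the modulator.
        self.h = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x has shape (batch, height, width, dim).
        q, ctx, gates = torch.split(
            self.f(x).permute(0, 3, 1, 2),
            (self.dim, self.dim, self.focal_levels + 1), dim=1)
        # Step 2: gated aggregation over focal levels, then global context.
        ctx_all = torch.zeros_like(ctx)
        for level, layer in enumerate(self.focal_layers):
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        ctx_global = self.act(ctx.mean(dim=(2, 3), keepdim=True))
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        # Step 3: the modulator interacts with the query element-wise.
        modulator = self.h(ctx_all)
        out = (q * modulator).permute(0, 2, 3, 1)
        return self.proj(out)


layer = FocalModulationSketch(dim=96)
x = torch.randn(1, 56, 56, 96)
print(layer(x).shape)  # torch.Size([1, 56, 56, 96])
```

Aggregation happens before the query is touched, and the only query-context interaction is the final element-wise product, which is the aggregate-first, interact-last ordering described in the Introduction.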

We visualize them one by one.

Yellow colors represent higher values. FocalNets evidently learn to gather more local context at earlier stages and more global context at later stages.

From left to right, the images are the input image, the gating maps for focal levels 1, 2, 3 and the global context. Clearly, our model has learned where to gather context depending on the visual content at different locations.

The modulator derived from our model automatically learns to focus on the foreground regions.

To produce the visualizations yourself, please refer to the visualization notebook.

Citation

If you find this repo useful for your project, please consider citing it with the following BibTeX entry:

@article{yang2022focal,
      title={Focal Modulation Networks}, 
      author={Jianwei Yang and Chunyuan Li and Xiyang Dai and Jianfeng Gao},
      journal={Advances in Neural Information Processing Systems (NeurIPS)},
      year={2022}
}

Acknowledgement

Our codebase builds on Swin Transformer and Focal Transformer. To achieve SoTA object detection performance, we rely heavily on the most advanced method, DINO, and on advice from its authors. We thank the authors for the nicely organized code!

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.