# Rotary Position Embedding for Vision Transformer

**[Byeongho Heo](https://sites.google.com/view/byeongho-heo/home), [Song Park](https://8uos.github.io/), [Dongyoon Han](https://sites.google.com/site/dyhan0920/), [Sangdoo Yun](https://sangdooyun.github.io/)**

[NAVER AI LAB](https://naver-career.gitbook.io/en/teams/clova-cic/ai-lab)

[![Paper](https://img.shields.io/badge/Paper-arxiv-green)](https://arxiv.org/abs/2403.13298) [![Paper](https://img.shields.io/badge/Paper-ECCV_2024-blue)](https://www.ecva.net/papers/eccv_2024/papers_ECCV/html/1584_ECCV_2024_paper.php) [![Paper](https://img.shields.io/badge/Weights-HuggingFace-red)](https://huggingface.co/collections/naver-ai/rope-vit-670e367fa2d547b705335153)

Official PyTorch implementation of RoPE-ViT, "Rotary Position Embedding for Vision Transformer" (ECCV 2024).

## Abstract

Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines to apply RoPE into ViT, promising improved backbone performance with minimal extra computational overhead.
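For intuition, here is a minimal sketch (our notation, not the paper's) of the axial way RoPE extends to a 2D patch grid. Each two-channel pair $(q_{2t}, q_{2t+1})$ of a query or key at patch position $(p_x, p_y)$ is rotated by

$$
R(\theta)=\begin{pmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{pmatrix},
\qquad
\theta_t=
\begin{cases}
p_x\,\omega_t & \text{for the first half of the pairs},\\
p_y\,\omega_t & \text{for the second half},
\end{cases}
$$

with $\omega_t$ a geometric frequency schedule as in the original 1D RoPE. Since $R(\theta_m)^{\top}R(\theta_n)=R(\theta_n-\theta_m)$, the attention score between two patches depends only on their relative offset, which is why the same weights keep working when the inference resolution grows.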

## Updates

## Getting Started

You can find the RoPE implementations in each model's folder; a standalone sketch of the axial variant follows below.
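For orientation, here is a self-contained sketch of the axial variant. This is illustrative code with our own names, not the repo's implementation; the frequency base of 100 and the even channel split between the two axes are assumptions:

```python
import torch

def build_axial_rope_freqs(num_h, num_w, head_dim, base=100.0):
    """Complex rotation factors for axial 2D RoPE on a num_h x num_w patch grid.

    Half of the channel pairs rotate with the x coordinate, the other half
    with the y coordinate. Returns a (num_h * num_w, head_dim // 2) complex tensor.
    """
    # Geometric frequency schedule; the base value is an assumption here.
    freqs = 1.0 / (base ** (torch.arange(0, head_dim // 2, 2).float() / (head_dim // 2)))
    ang_x = torch.outer(torch.arange(num_w).float(), freqs)  # (num_w, head_dim // 4)
    ang_y = torch.outer(torch.arange(num_h).float(), freqs)  # (num_h, head_dim // 4)
    ang = torch.cat(
        [
            ang_x[None, :, :].expand(num_h, num_w, -1),  # angles from the x position
            ang_y[:, None, :].expand(num_h, num_w, -1),  # angles from the y position
        ],
        dim=-1,
    ).reshape(num_h * num_w, -1)
    return torch.polar(torch.ones_like(ang), ang)  # e^{i * ang}

def apply_rope(x, freqs):
    """Rotate the channel pairs of x with shape (..., num_patches, head_dim)."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_c * freqs).flatten(-2).type_as(x)

# Inside attention, queries and keys of the patch tokens get rotated
# (class-token handling is omitted for brevity):
q = torch.randn(1, 8, 196, 64)  # (batch, heads, 14*14 patches, head_dim)
k = torch.randn(1, 8, 196, 64)
freqs = build_axial_rope_freqs(14, 14, 64)
q, k = apply_rope(q, freqs), apply_rope(k, freqs)
attn = (q @ k.transpose(-2, -1)) * 64 ** -0.5  # position-aware attention logits
```

The mixed variant replaces the fixed per-axis schedule with learnable 2D frequencies so that diagonal directions can also be encoded; see each folder for the implementations actually used for the numbers below.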

## Performances

*(Figures: performance comparison of DeiT-III vs. RoPE-ViT and of Swin Transformer vs. RoPE-ViT; see the tables below for the corresponding numbers.)*

## Pre-trained weights

### DeiT-III (400 epochs)

| Model Name | Top-1 (224) | Top-1 (384) | Weights |
|---|---|---|---|
| deit_small_patch16_LS | 80.4 | 79.4 | HF hub / Google drive |
| rope_axial_deit_small_patch16_LS | 80.9 | 80.0 | HF hub / Google drive |
| rope_mixed_deit_small_patch16_LS | 80.9 | 81.8 | HF hub / Google drive |
| rope_axial_ape_deit_small_patch16_LS | 80.7 | 81.2 | HF hub / Google drive |
| rope_mixed_ape_deit_small_patch16_LS | 80.9 | 81.7 | HF hub / Google drive |
| deit_base_patch16_LS | 83.4 | 82.8 | HF hub / Google drive |
| rope_axial_deit_base_patch16_LS | 83.6 | 83.9 | HF hub / Google drive |
| rope_mixed_deit_base_patch16_LS | 83.8 | 84.4 | HF hub / Google drive |
| rope_axial_ape_deit_base_patch16_LS | 83.7 | 83.8 | HF hub / Google drive |
| rope_mixed_ape_deit_base_patch16_LS | 83.8 | 84.3 | HF hub / Google drive |
| deit_large_patch16_LS | 84.6 | 84.2 | HF hub / Google drive |
| rope_axial_deit_large_patch16_LS | 84.7 | 85.1 | HF hub / Google drive |
| rope_mixed_deit_large_patch16_LS | 84.8 | 85.6 | HF hub / Google drive |
| rope_axial_ape_deit_large_patch16_LS | 84.7 | 85.1 | HF hub / Google drive |
| rope_mixed_ape_deit_large_patch16_LS | 84.9 | 85.5 | HF hub / Google drive |
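The checkpoints above are mirrored on the HF hub (badge at the top of this page). A rough loading sketch follows; it assumes the `rope_*` names have been registered with `timm` by importing this repo's DeiT model definitions first, and the HF repo id and filename below are placeholders rather than verified names:

```python
import timm
import torch
from huggingface_hub import hf_hub_download

# Assumes this repo's DeiT model file has been imported beforehand so that
# the rope_* variants are registered with timm.
model = timm.create_model("rope_mixed_deit_small_patch16_LS", img_size=384)

# Placeholder repo id / filename -- check the HF collection for the real ones.
path = hf_hub_download("naver-ai/rope_mixed_deit_small_patch16_LS", "pytorch_model.bin")
state = torch.load(path, map_location="cpu")
model.load_state_dict(state.get("model", state))
model.eval()

# The point of RoPE: weights trained at 224px extrapolate to 384px inputs
# (this model scores 81.8 top-1 at 384 in the table above).
with torch.no_grad():
    logits = model(torch.randn(1, 3, 384, 384))
```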

### Swin Transformer (300 epochs)

| Model Name | Top-1 (224) | Top-1 (384) | Weights |
|---|---|---|---|
| swin_tiny_patch4_window7_224 | 81.2 | 78.9 | |
| swin_rope_axial_tiny_patch4_window7_224 | 81.3 | 79.2 | HF hub / Google drive |
| swin_rope_mixed_tiny_patch4_window7_224 | 81.4 | 79.5 | HF hub / Google drive |
| swin_small_patch4_window7_224 | 82.9 | 81.0 | |
| swin_rope_axial_small_patch4_window7_224 | 83.1 | 80.9 | HF hub / Google drive |
| swin_rope_mixed_small_patch4_window7_224 | 83.0 | 81.4 | HF hub / Google drive |
| swin_base_patch4_window7_224 | 83.3 | 81.2 | |
| swin_rope_axial_base_patch4_window7_224 | 83.6 | 81.8 | HF hub / Google drive |
| swin_rope_mixed_base_patch4_window7_224 | 83.7 | 82.1 | HF hub / Google drive |

## How to cite

```bibtex
@inproceedings{heo2024ropevit,
    title={Rotary Position Embedding for Vision Transformer},
    author={Heo, Byeongho and Park, Song and Han, Dongyoon and Yun, Sangdoo},
    year={2024},
    booktitle={European Conference on Computer Vision (ECCV)},
}
```

## License

This project is distributed under Apache-2.0,
except for the files below, which originated from https://github.com/meta-llama/codellama.

RoPE-ViT
Copyright (c) 2024-present NAVER Cloud Corp.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.