Model description

MaxViT: Multi-Axis Vision Transformer is a paper from Google AI, published at ECCV 2022. It introduces a new attention module, "multi-axis attention," which combines blocked local attention and sparse global attention for efficient and scalable spatial interactions at arbitrary input resolutions. The model demonstrates strong performance on various vision tasks, including image classification and object detection.

I think it would be a great addition to Hugging Face, and I would be happy to contribute it.
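To illustrate the idea, here is a minimal NumPy sketch (not the official implementation; function names and the toy sizes are my own) of the two partitioning schemes behind multi-axis attention: block partitioning groups tokens into non-overlapping local windows, while grid partitioning groups strided tokens that span the whole image, giving sparse global interactions.

```python
import numpy as np

def block_partition(x, p):
    """Split an (H, W, C) feature map into local p x p windows."""
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    # -> (num_windows, p*p, C); attention runs within each window (local)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def grid_partition(x, g):
    """Split an (H, W, C) feature map into a g x g grid of strided tokens."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    # Group tokens by their position *within* each cell, so every group
    # contains g*g tokens strided across the full image (sparse global)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)

x = np.arange(8 * 8).reshape(8, 8, 1)
blocks = block_partition(x, 4)   # 4 windows of 16 contiguous tokens
grid = grid_partition(x, 4)      # 4 groups of 16 stride-2 tokens
print(blocks.shape, grid.shape)  # (4, 16, 1) (4, 16, 1)
```

Running attention first over the block axis and then over the grid axis gives both local and global mixing in a single block, at linear rather than quadratic cost in the number of pixels.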
cc: @alara @NielsRogge
Open source status
Provide useful links for the implementation
Code and Weights: