pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License
15.98k stars 6.92k forks

crossvit vs vision transformer #8598

Open Navoditamathur opened 3 weeks ago

Navoditamathur commented 3 weeks ago

🚀 The feature

Implement the CrossViT model for fine-grained classification.

Motivation, pitch

CrossViT combines feature representations computed at more than one patch scale, so a single model can use both fine and coarse views of an image. Adding CrossViT to torchvision would let users apply multi-scale feature fusion to image classification, object detection, and other computer vision tasks.

Key Points:

Multi-Scale Representation: CrossViT uses two separate branches with different image patch sizes, allowing the model to capture both fine-grained and coarse-grained features. This dual-branch architecture is meant to improve the model's ability to represent complex image structures.

Cross-Attention Mechanism: The core innovation of CrossViT lies in its cross-attention mechanism, where features from one branch are fused with features from the other. This interaction lets the two scales exchange information, improving the model's ability to detect patterns at different granularities.

Real-World Applications: CrossViT has shown promise in tasks ranging from image classification to object detection, making it a versatile choice for real-world applications such as medical imaging, remote sensing, and autonomous vehicles. PyTorch's support for deployment on different platforms (e.g., mobile and embedded systems) means that CrossViT could be used in diverse environments. It shows strong performance in scenarios where multi-scale feature extraction is crucial, such as fine-grained image classification or tasks requiring both global context and local details.
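
To make the dual-branch and cross-attention ideas above concrete, here is a minimal PyTorch sketch. It is a simplification under stated assumptions, not the reference CrossViT implementation: the class names `PatchEmbed` and `CrossAttentionFusion`, the patch sizes (16 and 32), the embedding widths (64 and 128), and the choice to let only the class token of one branch attend to the patch tokens of the other are all illustrative. Positional embeddings, layer norms, and the transformer encoder blocks inside each branch are omitted.

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to `dim`."""

    def __init__(self, patch_size: int, dim: int, in_chans: int = 3):
        super().__init__()
        # A strided convolution is a common way to cut and embed patches.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # (B, 1, dim)
        return torch.cat([cls, tokens], dim=1)            # (B, 1 + N, dim)


class CrossAttentionFusion(nn.Module):
    """Let the class token of branch A attend to the patch tokens of branch B."""

    def __init__(self, dim_a: int, dim_b: int, num_heads: int = 4):
        super().__init__()
        self.to_b = nn.Linear(dim_a, dim_b)  # project A's class token into B's width
        self.attn = nn.MultiheadAttention(dim_b, num_heads, batch_first=True)
        self.to_a = nn.Linear(dim_b, dim_a)  # project the fused token back

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        cls_a = self.to_b(tokens_a[:, :1])                 # (B, 1, dim_b)
        kv = torch.cat([cls_a, tokens_b[:, 1:]], dim=1)    # query token + B's patches
        fused, _ = self.attn(cls_a, kv, kv)                # (B, 1, dim_b)
        new_cls = tokens_a[:, :1] + self.to_a(fused)       # residual update
        # Patch tokens of branch A pass through unchanged in this sketch.
        return torch.cat([new_cls, tokens_a[:, 1:]], dim=1)


torch.manual_seed(0)
images = torch.randn(2, 3, 224, 224)

small_branch = PatchEmbed(patch_size=16, dim=64)   # 14 x 14 = 196 fine patches
large_branch = PatchEmbed(patch_size=32, dim=128)  # 7 x 7 = 49 coarse patches

tok_s = small_branch(images)  # (2, 197, 64)
tok_l = large_branch(images)  # (2, 50, 128)

fuse_large = CrossAttentionFusion(dim_a=128, dim_b=64)
out_l = fuse_large(tok_l, tok_s)  # (2, 50, 128)
print(tok_s.shape, tok_l.shape, out_l.shape)
```

In this sketch only the class token crosses between branches, so the attention cost grows linearly with the number of patch tokens in the other branch rather than quadratically. A symmetric module would update the small branch from the large one in the same way.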

Alternatives

No response

Additional context

No response

abhi-glitchhg commented 3 weeks ago

There are so many vision transformer variants that I think it's better to use the timm library. It has implementations of many vision models.
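
For readers who want to try that route, a minimal sketch follows. It assumes timm is installed and that its model registry contains CrossViT variants under names matching `crossvit*`; the helper name `find_crossvit_models` and the `num_classes=200` value in the comment are hypothetical and only for illustration.

```python
def find_crossvit_models(pattern: str = "crossvit*"):
    """Return timm model names matching `pattern`, or [] if timm is not installed."""
    try:
        import timm
    except ImportError:
        return []
    # timm.list_models accepts a wildcard filter over registered model names.
    return timm.list_models(pattern)


names = find_crossvit_models()
print(names)

# If a name is listed, a model can be built for a custom label set, for example:
#   import timm
#   model = timm.create_model(names[0], pretrained=False, num_classes=200)
```

The helper returns an empty list instead of raising when timm is missing, so it can be used to check availability before committing to a model name.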

NicolasHug commented 3 weeks ago

Hi @Navoditamathur

Thank you for opening this issue. We're not planning on adding new models to torchvision at this point. I agree with @abhi-glitchhg that other repos like timm might be a better venue for that.