ra1ph2 / Vision-Transformer

Implementation of the Vision Transformer from scratch, with performance compared against standard CNNs (ResNets) and a pre-trained ViT on CIFAR10 and CIFAR100.

Vision-Transformer

Open In Colab

Implementation of the ViT model in PyTorch, from the paper 'An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale' by Google Research.

Model Architecture

(Figure: model architecture)
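
As a rough orientation for the diagram above, here is a minimal sketch of the ViT forward pass built from standard PyTorch modules. All hyperparameters (patch size 4, embedding dimension 192, 6 layers, 8 heads) are illustrative assumptions for CIFAR-scale images, not necessarily the values used in the notebook.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Minimal ViT sketch: patch embedding -> [CLS] + position embedding -> encoder -> head."""
    def __init__(self, image_size=32, patch_size=4, num_classes=10,
                 dim=192, depth=6, heads=8, mlp_dim=384):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: non-overlapping patches projected to `dim` channels
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=mlp_dim,
                                           activation="gelu",
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                          # x: (B, 3, H, W)
        x = self.patch_embed(x)                    # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)           # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                  # classify from the [CLS] token

logits = MiniViT()(torch.randn(2, 3, 32, 32))      # (2, 10)
```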

Paper Description

Aim

Methodology

(Figure: methodology)

Transformer Encoder

(Figure: Transformer encoder)
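
A minimal sketch of one pre-norm Transformer encoder block as described in the ViT paper (LayerNorm, multi-head self-attention, residual; LayerNorm, MLP, residual). The dimensions below are illustrative, not the repository's exact settings.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=192, heads=8, mlp_dim=384, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(),
            nn.Dropout(dropout), nn.Linear(mlp_dim, dim), nn.Dropout(dropout))

    def forward(self, x):                                    # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # self-attention + residual
        x = x + self.mlp(self.norm2(x))                      # MLP + residual
        return x

out = EncoderBlock()(torch.randn(2, 65, 192))                # 64 patch tokens + 1 [CLS] token
```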

Testing

Why do we need an attention mechanism?

(Figure: attention)

Attention Mechanism

(Figure: attention mechanism)
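
A minimal sketch of the scaled dot-product attention used throughout the model, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Tensor shapes are illustrative.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise token similarity
    weights = scores.softmax(dim=-1)                # attention weights, each row sums to 1
    return weights @ v, weights

q = k = v = torch.randn(2, 65, 64)
out, attn = scaled_dot_product_attention(q, k, v)   # out: (2, 65, 64), attn: (2, 65, 65)
```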

Multi-Head Attention

(Figure: multi-head attention)
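
A minimal sketch of multi-head attention: project the input to Q, K, V, split them into `heads` subspaces, attend in each head independently, then concatenate and project back. Names and shapes are illustrative assumptions, not the repo's exact code.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, dim=192, heads=8):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_head = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, self.d_head)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each (B, heads, N, d_head)
        attn = (q @ k.transpose(-2, -1) / self.d_head ** 0.5).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)   # concatenate the heads
        return self.proj(out)

y = MultiHeadAttention()(torch.randn(2, 65, 192))            # (2, 65, 192)
```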

Datasets

Due to the limited compute available on Google Colab, we chose to train and test on these two datasets: CIFAR10 and CIFAR100.
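
A minimal sketch of loading the two datasets with torchvision; the normalization statistics and batch size here are illustrative assumptions, not the settings used in the notebook.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(cifar10_train, batch_size=128, shuffle=True)
```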

Major Components Implemented

Results

Attention Map Visualisation

(Figure: attention map visualization)
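
A minimal sketch of turning one layer's attention weights into a spatial map: take the [CLS] token's attention over the patch tokens, average across heads, and reshape to the patch grid. This is a simplification for illustration; the notebook may instead combine attention across all layers (e.g. attention rollout).

```python
import torch

def cls_attention_map(attn, grid_size):
    # attn: (B, heads, N+1, N+1) attention weights, [CLS] token at index 0
    cls_to_patches = attn[:, :, 0, 1:]            # (B, heads, N) -- drop the CLS->CLS entry
    cls_to_patches = cls_to_patches.mean(dim=1)   # average over heads -> (B, N)
    return cls_to_patches.reshape(-1, grid_size, grid_size)

attn = torch.rand(2, 8, 65, 65).softmax(dim=-1)   # dummy weights for an 8x8 patch grid
maps = cls_attention_map(attn, grid_size=8)       # (2, 8, 8); upsample to overlay on the image
```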

Patch Embedding

(Figure: patch embedding)
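
A minimal sketch of patch embedding: a Conv2d with kernel size equal to the stride is equivalent to splitting the image into non-overlapping patches and applying a shared linear projection to each. The sizes below are illustrative for CIFAR's 32x32 images.

```python
import torch
import torch.nn as nn

patch_size, dim = 4, 192
patch_embed = nn.Conv2d(in_channels=3, out_channels=dim,
                        kernel_size=patch_size, stride=patch_size)

images = torch.randn(2, 3, 32, 32)
tokens = patch_embed(images)                 # (2, 192, 8, 8): one vector per 4x4 patch
tokens = tokens.flatten(2).transpose(1, 2)   # (2, 64, 192): a sequence of patch tokens
```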

Position Embedding

(Figure: position embedding)
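
A minimal sketch of prepending a learnable [CLS] token and adding learnable 1D position embeddings to the patch tokens, as in the ViT paper. The shapes assume the 64-patch CIFAR setup from the patch-embedding sketch above.

```python
import torch
import torch.nn as nn

num_patches, dim = 64, 192
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

patch_tokens = torch.randn(2, num_patches, dim)
cls = cls_token.expand(patch_tokens.size(0), -1, -1)    # one [CLS] token per image
x = torch.cat([cls, patch_tokens], dim=1) + pos_embed   # (2, 65, 192), fed to the encoder
```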

Results for Different Model Variations

(Figure: results table)

Inference from Results

Train vs Test Accuracy Graphs (CIFAR10)

(Figure: train vs test accuracy on CIFAR10)

Train vs Test Accuracy Graphs (CIFAR100)

(Figure: train vs test accuracy on CIFAR100)

Future Scope

Presentation

Presentation can be accessed here.

Group Members

Name ID
Akshit Khanna 2017A7PS0023P
Vishal Mittal 2017A7PS0080P
Raghav Bansal 2017A3PS0196P

References