eroshacinas commented 9 months ago

After discussions with @alanahmet, @SuryaKrishna02, @Mkrolick, and @sulphatet, here are our proposed subdivisions for learning common vision transformers, specifically the Shifted Windows (SWIN) architecture.

Introduction

This section will outline the limitations of ViT and the rationale for using SWIN over ViT.

[ ] Background on Transformers. (A brief recap of how transformers work in CV, since we assume this will primarily be covered in the preceding Transformers Architecture + ViT chapter)
[ ] Limitations of ViT and the emergence of SWIN Transformer. (Self-attention cost that grows quadratically with the number of patches, absolute positional encoding, etc.)
[ ] Motivation for SWIN Transformer. (The problems it addresses from ViT)

SWIN Transformer Architecture and Its Advantages

This section will cover the theory behind SWIN's advantages and why it can perform better than ViT.

[ ] Overview
[ ] Window-based self-attention (Linear complexity with respect to image size and its efficiency when scaling)
[ ] Hierarchical Representation / MSA and how it enhances feature representation
[ ] Relative position bias
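
As a rough illustration of the complexity point above, here is a small sketch that evaluates the two cost estimates given in the Swin paper: `4hwC^2 + 2(hw)^2 C` for global multi-head self-attention and `4hwC^2 + 2M^2 hwC` for window-based attention with window size `M`. The function names and the example values (96 channels, 7x7 windows, square patch grids) are only illustrative assumptions, not part of the proposal.

```python
def msa_cost(h, w, c):
    """Estimated cost of global self-attention on an h x w patch grid with c channels."""
    n = h * w  # number of patches (tokens)
    # 4*n*c^2: the Q, K, V and output projections; 2*n^2*c: attention scores and weighted sum
    return 4 * n * c**2 + 2 * n**2 * c


def wmsa_cost(h, w, c, m):
    """Estimated cost of window-based self-attention with non-overlapping m x m windows."""
    n = h * w
    # Each token only attends to the m*m tokens in its window, so the n^2 term becomes m^2 * n
    return 4 * n * c**2 + 2 * m**2 * n * c


if __name__ == "__main__":
    # Illustrative values only: 96 channels, 7x7 windows, growing patch grids
    for side in (56, 112, 224):
        global_cost = msa_cost(side, side, 96)
        window_cost = wmsa_cost(side, side, 96, 7)
        print(f"{side}x{side} patches: global / windowed = {global_cost / window_cost:.1f}")
```

The windowed estimate grows linearly with the number of patches, while the global estimate has a quadratic term, so the gap widens as the input resolution increases.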

Deconstructing SWIN

This section will go into the technical (and coding) aspects of the architecture, primarily breaking down the key layers to understand how they are implemented and why.

[ ] Deconstruct key layers (Patch embedding, W-MSA / SW-MSA, Relative Position Bias, and Patch Merging)
[ ] Implement the key layers mentioned above
[ ] Plot attention maps of W-MSA / SW-MSA to get a better understanding of the model's perception
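
To give a feel for what deconstructing these layers could look like, below is a minimal NumPy sketch of a few of the data-movement steps: window partitioning, the cyclic shift used before SW-MSA, the relative position index used to look up the bias table, and the 2x2 gathering step of patch merging. It is a sketch under assumptions, not the reference implementation: the function names are hypothetical, a single unbatched `(H, W, C)` feature map is assumed, and the attention computation, the shifted-window attention mask, LayerNorm, and linear projections are all omitted.

```python
import numpy as np


def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes together
    return x.reshape(-1, window_size * window_size, C)


def cyclic_shift(x, shift):
    """Roll the feature map up and to the left by `shift` pixels (SW-MSA pre-step)."""
    return np.roll(x, shift=(-shift, -shift), axis=(0, 1))


def relative_position_index(window_size):
    """Index into a (2M-1)^2 bias table for every pair of positions in an M x M window."""
    M = window_size
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))  # (2, M, M)
    flat = coords.reshape(2, -1)                     # (2, M*M)
    rel = flat[:, :, None] - flat[:, None, :]        # (2, M*M, M*M), values in [-(M-1), M-1]
    rel = rel.transpose(1, 2, 0) + (M - 1)           # shift offsets to start at 0
    return rel[..., 0] * (2 * M - 1) + rel[..., 1]   # flatten (row, col) offset to one index


def patch_merging_gather(x):
    """Concatenate each 2x2 neighborhood along channels: (H, W, C) -> (H/2, W/2, 4C).

    The LayerNorm and the linear 4C -> 2C reduction that follow in Swin are omitted here.
    """
    x0 = x[0::2, 0::2]
    x1 = x[1::2, 0::2]
    x2 = x[0::2, 1::2]
    x3 = x[1::2, 1::2]
    return np.concatenate([x0, x1, x2, x3], axis=-1)


if __name__ == "__main__":
    feature_map = np.arange(8 * 8 * 3).reshape(8, 8, 3)
    print(window_partition(feature_map, 4).shape)
    print(window_partition(cyclic_shift(feature_map, 2), 4).shape)
    print(relative_position_index(7).shape)
    print(patch_merging_gather(feature_map).shape)
```

Partitioning the shifted map with the same `window_partition` call is what lets SW-MSA reuse the regular window attention code; the masking that keeps wrapped-around pixels from attending to each other would be added on top of this.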

Application of SWIN

This section is where SWIN will be fine-tuned to classify a custom dataset via Hugging Face.

[ ] Finetune SWIN, ViT, and CNN-based architectures for classification of a custom dataset.
[ ] Record each model's performance and compare
[ ] Discussion of SWIN for other downstream tasks like object detection and segmentation

Conclusion

[ ] Wrap up
[ ] Advancements on SWIN
[ ] Advancements on Computer Vision models in general (e.g. self-supervised methods)

Please let us know your suggestions on this proposal.

johko commented 9 months ago

Hey @eroshacinas thanks for the outline and all the effort you already put into it.

Maybe my description of the chapter was a bit misleading, so now it is very Swin-focused, but I just wanted to give that as one potential model you can cover. In general I think the chapter should cover more than one popular model, but it does not need to go into as much detail as you did here for Swin.

What I think would be great is to have some models that showcase popular/interesting architecture choices, like:

Swin https://huggingface.co/docs/transformers/model_doc/swinv2
MobileViT(v2) https://huggingface.co/docs/transformers/model_doc/mobilevitv2
DiNAT https://huggingface.co/docs/transformers/model_doc/dinat
CvT https://huggingface.co/docs/transformers/model_doc/cvt

But those are just some examples from roaming through the transformers library that I find interesting; feel free to suggest others.

eroshacinas commented 9 months ago

Hello @johko, apologies for the misunderstanding; I thought we needed an actual implementation of the key features. Thanks for clearing that up! We'll reflect on these changes and take your suggestions into account.

johko commented 9 months ago

No need for apologies @eroshacinas - I'm super grateful for everyone who is contributing and misunderstandings happen in such big projects. It is good that we could clarify this one quite early on :hugs:

merveenoyan commented 9 months ago

Hello @eroshacinas! @johko my 2 cents: it would be fine if they contributed this under a Swin section in pre-trained models/backbones/common architectures. However, it might look non-standard if other sections fall short, since this one seems comprehensive. Do you think we should come up with a common structure for each chapter?

eroshacinas commented 9 months ago

Hi @merveenoyan! Our team decided to follow @johko's suggestion to focus more on the theory (of the key features) and less on the actual implementation of the architecture. That said, I'm open to going more in-depth, as in our original outline, once we establish the theory behind the key features, since that's where it gets interesting.

johko / computer-vision-course

Common Vision Transformers (SWIN) Chapter #41
