johko / computer-vision-course

This repo is the home base of a community-driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face Discord: hf.co/join/discord
MIT License

Common Vision Transformers (SWIN) Chapter #41

Closed: eroshacinas closed 3 months ago

eroshacinas commented 9 months ago

After discussions with @alanahmet, @SuryaKrishna02, @Mkrolick, and @sulphatet, here are our proposed subdivisions for learning common vision transformers, specifically the Shifted Windows (SWIN) architecture.

Introduction

This section will outline the limitations of ViT and the rationale for using SWIN over ViT.

SWIN Transformer Architecture and Its Advantages

This section will cover the theory behind SWIN's advantages and why they lead to better performance than ViT.

Deconstructing SWIN

This section will go into the technical (and coding) side of the architecture, primarily breaking down the key layers to understand how each is implemented and why it is designed that way.

Application of SWIN

This section is where SWIN will be fine-tuned to classify a custom dataset using Hugging Face.

Conclusion

Please let us know your suggestions on this proposal.
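
To make the "Deconstructing SWIN" item more concrete, here is a minimal sketch of the shifted-window idea itself. It is a pure-Python toy, not the layout of any real implementation: the helper names `window_partition` and `cyclic_shift`, the 4x4 grid of patch indices, and the window size of 2 are all assumptions chosen for illustration.

```python
def window_partition(grid, window_size):
    """Split a 2D grid (list of rows) into non-overlapping square windows."""
    height, width = len(grid), len(grid[0])
    assert height % window_size == 0 and width % window_size == 0
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            rows = grid[top:top + window_size]
            windows.append([row[left:left + window_size] for row in rows])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` cells, wrapping around."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


# Hypothetical 4x4 grid of patch indices (0..15), window size 2.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
window_size = 2

# Regular partition: attention would be computed inside each window only.
regular = window_partition(grid, window_size)

# Shifted partition: roll by half a window, then partition again, so the
# new windows straddle the borders of the regular ones.
shifted = window_partition(cyclic_shift(grid, window_size // 2), window_size)

print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

The first shifted window holds patches 5, 6, 9, and 10, one from each of the four regular windows, which is how alternating the two partitions lets information cross window borders. The real model also masks attention between patches that end up adjacent only because of the wrap-around. That step is left out of this sketch.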

johko commented 9 months ago

Hey @eroshacinas, thanks for the outline and all the effort you have already put into it.

Maybe my description of the chapter was a bit misleading, so the outline is now very Swin-focused, but I only meant Swin as one potential model you could cover. In general, I think the chapter should cover more than one popular model, but it does not need to go into as much detail as you did here for Swin.

What I think would be great is to have some models that showcase popular/interesting architecture choices, like:

But those are just some examples from roaming through the transformers library that I find interesting; feel free to suggest others.

eroshacinas commented 9 months ago

Hello @johko, apologies for the misunderstanding. I thought we needed an actual implementation of the key features. Thanks for clearing that up! We'll reflect on these changes and take your suggestions into account.

johko commented 9 months ago

No need for apologies @eroshacinas - I'm super grateful to everyone who is contributing, and misunderstandings happen in such big projects. It is good that we could clarify this one quite early on :hugs:

merveenoyan commented 9 months ago

Hello @eroshacinas! @johko, my 2 cents: it would be fine for them to contribute this under a Swin section in pre-trained models/backbones/common architectures. However, it might look non-standard if other sections fall short, since this one seems comprehensive. Do you think we should agree on how each chapter is structured?

eroshacinas commented 9 months ago

Hi @merveenoyan! Our team decided to follow @johko's suggestion to focus more on the theory (of the key features) and less on the actual implementation of the architecture. That said, I'm open to going more in-depth, as in our original outline, once we have established the theory behind the key features, since that's where it gets interesting.