Transformers Architecture + ViT

sazio commented 9 months ago

Here's our (w/ @Anindyadeep, @kaustubh-s1) outline for the chapter on Transformers and Vision Transformers 🤗

Chapter Layout

Introduction
- [ ] Convolution & CNNs : limits, inductive bias, difference with attention welcome transformers
- [ ] Pre-requisites E.g. attention mechanism (see a nice blogpost here for more refs)
Transformers and Vision Transformers
- [ ] The origin of transformers + commonalities and differences between the original transformer (NLP applications) and ViT (may be something visual)
- [ ] Patches, i.e. Make attention work with visual data
- [ ] Technical section w/ how the encoding process is done (add vanilla ViT code)
Pre-Trained Models, Finetuning etc
- [ ] Welcome timm (not clear: do we need to stick to timm or transformers for vision?)
Interpretability + visualizing effects of inductive biases
- [ ] #35
  - CNNs → feature maps or feature viz (maximally activating stimuli) à là circuits thread
  - Transformers → attention maps (on patches)
ViTs at scale and real-world scenarios
- [ ] How ViT got dominant in this space and how it is used in other areas (possible applications w/ mini-projects)
  - segmentation
  - object detection
  - etc ..
Research
- [ ] Foundation models for ViTs
- [ ] Few Shot Applications
- [ ] more (any ideas?)
Other resources
- [ ] References
  - Vaswani et al 2017, Attention is all you need
  - Al, Dosovitskiy et al 2021, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,
  - More articles and blogs
  - Repositories
Conclusions
- [ ] Wrapping up
- [ ] more ideas (?)

johko commented 9 months ago

Thanks for the outline @sazio

Some of my thoughts: Introduction Looks great, especially the connection to CNNs :+1:

Transformers and Vision Transformers Also looks very good and covers the most important parts imo

Pre-Trained Models, Finetuning etc It is great if you can stick to HF libraries like transformers or timm, but if it is too difficult for whatever reason, you can also use alternatives. And which kind of models do you think about here? Just different Vanilla ViT ones or do you also want to cover stuff like DeiT, BeiT, etc. which basically build up on vision transformers? In general look out to not overlap too much with the following chapter about various Transformer architectures (I think they might not have their syllabus ready yet)

Interpretability + visualizing effects of inductive biases Wouldn't have thought about this, but love it :heart_eyes:

ViTs at scale and real-world scenarios Biggest concern I have here is again the overlap with the following chapter, but giving some real world use-cases definitely is a good idea

Research Can you elaborate on the "Foundation models for ViTs" part? I think in general you can keep the research section quite short and maybe state some of the most promising trends

In total I would say that is definitely a syllabus you can start working with. Probably there will be some changes and new ideas while working on it and we'll see how it develops over time :slightly_smiling_face:

kausmeows commented 9 months ago

Great @johko we'll get on with it. Were waiting for some confirmation on this front.

By foundation models in ViT i believe he meant something like- DINO (https://arxiv.org/abs/2104.14294) Although this would come way later (if at all, as this might seem like too much for a majority audience) and considering the requirements i don't think is a priority rn?

Also, we thought of just Vanilla ViT for now but can definitely extend it to DeiT etc. In several iterations that is.

hwaseem04 commented 9 months ago

Went through your proposed curriculum, and it is really amazing.

Interpretability + visualizing effects of inductive biases CNNs → feature maps or feature viz (maximally activating stimuli) à là circuits thread Transformers → attention maps (on patches)

Just my suggestion, you can also look into this recent interpretation for ViT. For example: CVPR-2023

When it comes to ViT, the intermediate interpretations are not well explored as the field is emerging, it would be really helpful to the community if you can add the above part in your curriculum

Anindyadeep commented 9 months ago

Thanks, @johko for your thoughts.

We also had additional thoughts to flow the chapter from very beginner stuff to slowly carrying out some advanced concepts. In the advanced concepts (as an additional part), we wanted to show the full code of ins and outs of the working architecture and training of a vanilla ViT. So according to your viewpoint, should we put it in the last (as a separate or an additional section) or should we put it inside Transformers and Vision Transformers section only?

The overall idea is to assume that the reader might not have any knowledge on transformers. And the reader might have different priorities. So the objective is to not to overwhelm readers but also to provide these concepts for those who are interested in digging deeper.

johko commented 9 months ago

@Anindyadeep I think it is good to separate the advanced part in some way. There are two options I see right now:

As you mentioned make it an own section later on, something like an Advanced CV section
For now, let the advanced part only exist in a notebook. With the strutucture of separate .mdx and notebook folders, this would take it out of the direct flow and you could mention it in one of your sections, so people who are really interested can already dig into it.

I do prefer the second option right now.

@kaustubh-s1 thanks for the clarification in regard to foundation models - makes sense :slightly_smiling_face: Also covering Vanilla ViT is totally fine, I think that really is the core of this section. Expanding it to closely related models and training methods like DeiT and FlexiViT is an option, but definitely not urgently needed right now

sazio commented 9 months ago

Went through your proposed curriculum, and it is really amazing.

Interpretability + visualizing effects of inductive biases CNNs → feature maps or feature viz (maximally activating stimuli) à là circuits thread Transformers → attention maps (on patches)

Just my suggestion, you can also look into this recent interpretation for ViT. For example: CVPR-2023

When it comes to ViT, the intermediate interpretations are not well explored as the field is emerging, it would be really helpful to the community if you can add the above part in your curriculum

Thank you, @hwaseem04 ! I wasn't aware of this resource ! Looks great 💪

sazio commented 9 months ago

@Anindyadeep I think it is good to separate the advanced part in some way. There are two options I see right now:

As you mentioned make it an own section later on, something like an Advanced CV section

For now, let the advanced part only exist in a notebook. With the strutucture of separate .mdx and notebook folders, this would take it out of the direct flow and you could mention it in one of your sections, so people who are really interested can already dig into it.

I do prefer the second option right now.

@kaustubh-s1 thanks for the clarification in regard to foundation models - makes sense 🙂 Also covering Vanilla ViT is totally fine, I think that really is the core of this section. Expanding it to closely related models and training methods like DeiT and FlexiViT is an option, but definitely not urgently needed right now

Thank you @johko for your comments on the outline as well as your suggestions! Regarding these two options, the second one looks more reasonable, I agree! I think we can try to stick to it 💪

lunarflu commented 9 months ago

X-posting from discord since I couldn't find everyone on discord, saw some cool animations for ViT in one of our recent community blogposts I thought you guys might like! If you feel it's appropriate for the course, might be a nice cross-pollination of different community efforts to add them - but no pressure if it doesn't quite fit, I leave it to you 🤗 https://huggingface.co/blog/MarkusStoll/embeddings-during-fine-tuning-of-vision-transform hilkFy1bkFzrNmvH194JU

kausmeows commented 9 months ago

Thanks for sharing this here @lunarflu. This looks really nice 🙌

markus-stoll commented 9 months ago

thanks, let me know if you need any help - e.g. a different resolution or some more information

johko / computer-vision-course

Transformers Architecture + ViT #34

Chapter Layout