Closed: sazio closed this issue 3 months ago
Thanks for the outline @sazio
Some of my thoughts:

Introduction: Looks great, especially the connection to CNNs :+1:

Transformers and Vision Transformers: Also looks very good and covers the most important parts imo.
Pre-Trained Models, Finetuning etc: It is great if you can stick to HF libraries like transformers or timm, but if that is too difficult for whatever reason, you can also use alternatives. Which kinds of models are you thinking about here? Just different vanilla ViT ones, or do you also want to cover models like DeiT, BEiT, etc., which basically build on vision transformers? In general, take care not to overlap too much with the following chapter on various Transformer architectures (I think they might not have their syllabus ready yet).
Interpretability + visualizing effects of inductive biases: Wouldn't have thought about this, but love it :heart_eyes:

ViTs at scale and real-world scenarios: My biggest concern here is again the overlap with the following chapter, but giving some real-world use cases definitely is a good idea.

Research: Can you elaborate on the "Foundation models for ViTs" part? I think in general you can keep the research section quite short and maybe state some of the most promising trends.
In total I would say that is definitely a syllabus you can start working with. Probably there will be some changes and new ideas while working on it and we'll see how it develops over time :slightly_smiling_face:
Great @johko, we'll get on with it. We're waiting for some confirmation on this front.
By foundation models in ViT I believe he meant something like DINO (https://arxiv.org/abs/2104.14294). Although this would come way later (if at all, as it might seem like too much for a majority of the audience), and considering the requirements I don't think it is a priority right now.
Also, we thought of covering just the vanilla ViT for now, but we can definitely extend it to DeiT etc. in later iterations.
Went through your proposed curriculum, and it is really amazing.
Interpretability + visualizing effects of inductive biases: CNNs → feature maps or feature viz (maximally activating stimuli) à la circuits thread; Transformers → attention maps (on patches)
Just my suggestion: you can also look into this recent interpretability work for ViT. For example: CVPR-2023
When it comes to ViT, intermediate interpretations are not well explored, as the field is still emerging. It would be really helpful to the community if you could add the above part to your curriculum.
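On the attention-maps idea above: a common way to aggregate per-layer patch attention into a single relevance map is attention rollout. A minimal NumPy sketch, assuming you already have per-layer attention matrices (e.g. from a ViT forward pass with `output_attentions=True` in transformers); the function and variable names are illustrative:

```python
import numpy as np

def attention_rollout(attentions):
    """Aggregate per-layer attention maps into token-to-token relevance.

    attentions: list of (num_heads, seq_len, seq_len) arrays, one per
    layer, ordered from the first to the last transformer block.
    Row 0 of the result is the [CLS] token's relevance over all tokens.
    """
    seq_len = attentions[0].shape[-1]
    rollout = np.eye(seq_len)
    for layer_attn in attentions:
        attn = layer_attn.mean(axis=0)                  # average over heads
        attn = attn + np.eye(seq_len)                   # model the residual path
        attn = attn / attn.sum(axis=-1, keepdims=True)  # keep rows stochastic
        rollout = attn @ rollout                        # mix through the layer
    return rollout

# Toy check: three layers of uniform attention over 5 tokens, 2 heads each.
layers = [np.full((2, 5, 5), 0.2) for _ in range(3)]
cls_map = attention_rollout(layers)[0, 1:]  # CLS attention over the 4 patches
```

Reshaping `cls_map` to the patch grid and upsampling gives the familiar attention-overlay visualizations on the input image.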
Thanks, @johko for your thoughts.
We also had additional thoughts about flowing the chapter from very beginner material to gradually more advanced concepts. In the advanced part (as an additional section), we wanted to show the full code, the ins and outs, of the architecture and training of a vanilla ViT. So in your view, should we put it at the end (as a separate or additional section), or inside the Transformers and Vision Transformers section?
The overall idea is to assume that the reader might not have any knowledge of transformers, and that different readers have different priorities. So the objective is not to overwhelm readers, but still to provide these concepts for those who are interested in digging deeper.
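For that "ins and outs of a vanilla ViT" part, the very first step, turning an image into a sequence of patch embeddings, can be sketched in plain NumPy (shapes and names here are illustrative assumptions, not the final chapter code):

```python
import numpy as np

def patchify(img, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = img.shape
    gh, gw = H // patch_size, W // patch_size
    patches = img.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, C)
    return patches.reshape(gh * gw, patch_size * patch_size * C)

def embed_patches(img, patch_size, w_proj, cls_token, pos_emb):
    """Linear projection + [CLS] token + learned position embeddings."""
    x = patchify(img, patch_size) @ w_proj      # (num_patches, d)
    x = np.concatenate([cls_token, x], axis=0)  # (num_patches + 1, d)
    return x + pos_emb                          # broadcast add

# Toy sizes: an 8x8 RGB image with 4x4 patches → 4 patches of dim 48.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8, 3))
w = rng.normal(size=(48, 16))
cls = np.zeros((1, 16))
pos = np.zeros((5, 16))
tokens = embed_patches(img, 4, w, cls, pos)     # (5, 16)
```

From here the token sequence goes through standard transformer encoder blocks, which is where the advanced section would pick up.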
Thank you, @hwaseem04! I wasn't aware of this resource! Looks great :muscle:
@Anindyadeep I think it is good to separate the advanced part in some way. There are two options I see right now:
- As you mentioned, make it its own section later on, something like an Advanced CV section
- For now, let the advanced part exist only in a notebook. With the structure of separate .mdx and notebook folders, this would take it out of the direct flow, and you could mention it in one of your sections, so people who are really interested can already dig into it.
I do prefer the second option right now.
@kaustubh-s1 thanks for the clarification regarding foundation models - makes sense :slightly_smiling_face: Also, covering vanilla ViT is totally fine; I think that really is the core of this section. Expanding it to closely related models and training methods like DeiT and FlexiViT is an option, but definitely not urgently needed right now.
Thank you @johko for your comments on the outline as well as your suggestions! Regarding these two options, the second one looks more reasonable, I agree! I think we can try to stick to it :muscle:
Cross-posting from Discord since I couldn't find everyone there: I saw some cool animations for ViT in one of our recent community blog posts that I thought you might like! If you feel it's appropriate for the course, it might be a nice cross-pollination of different community efforts to add them - but no pressure if it doesn't quite fit, I leave it to you :hugs:
https://huggingface.co/blog/MarkusStoll/embeddings-during-fine-tuning-of-vision-transform
Thanks for sharing this here @lunarflu. This looks really nice :slightly_smiling_face:
Thanks, let me know if you need any help - e.g. a different resolution or some more information.
Here's our (w/ @Anindyadeep, @kaustubh-s1) outline for the chapter on Transformers and Vision Transformers :hugs:
Chapter Layout
- Introduction
- Transformers and Vision Transformers
- Pre-Trained Models, Finetuning etc
  - timm (not clear: do we need to stick to timm or transformers for vision?)
- Interpretability + visualizing effects of inductive biases
  - circuits thread
- ViTs at scale and real-world scenarios
- Research
- Other resources
- Conclusions