johko / computer-vision-course

This repo is the home base of a community-driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face discord: hf.co/join/discord

Trends and New Architectures - Draft Outline #37

Closed mmhamdy closed 3 months ago

mmhamdy commented 9 months ago

Hello 👋,

This is a draft outline for the Trends and New Architectures chapter. I think it'd be better to call it alternative rather than new. Below I'll give a brief overview of the chapter content.

🔹 On Innovation: An Introduction

🔹 Case Study: ViT vs. Image Transformer

Image Transformer was an early attempt by the authors of Attention Is All You Need to introduce transformers into computer vision. It would be interesting to compare it to the now-established ViT (both models come from Google Brain).

🔹 Why Alternative Architectures?

  • [ ] The limitations of CNNs
  • [ ] The limitations of ViTs

🔹 Hiera

Paper: Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. A hierarchical vision transformer introduced by Meta's FAIR. It improves on modern hierarchical vision transformers not by adding new components, but by questioning the need for the many vision-specific components in these architectures. Meta wrote a Twitter thread about it.

  • [ ] Overview
  • [ ] Why it matters
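To make the "without bells and whistles" point concrete, here is a toy sketch of the design philosophy only; `ToyHieraStage` is a made-up illustration, not Hiera's actual implementation (which relies on mask unit attention and MAE-style pretraining):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyHieraStage(nn.Module):
    """A caricature of Hiera's recipe: vanilla transformer blocks with plain
    2x2 pooling between stages, instead of vision-specific components such as
    relative position biases, convolutions, or shifted windows."""

    def __init__(self, dim, out_dim, depth, num_heads):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
             for _ in range(depth)]
        )
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, x, h, w):
        # x: (batch, h * w, dim) -- a flattened grid of patch tokens.
        for blk in self.blocks:
            x = blk(x)
        # Downsample the token grid 2x2 to form the next hierarchy level.
        x = x.transpose(1, 2).reshape(x.shape[0], -1, h, w)
        x = F.max_pool2d(x, kernel_size=2)
        h, w = x.shape[-2:]
        x = self.proj(x.flatten(2).transpose(1, 2))
        return x, h, w

stage = ToyHieraStage(dim=96, out_dim=192, depth=2, num_heads=3)
tokens, h, w = stage(torch.randn(1, 14 * 14, 96), h=14, w=14)  # -> (1, 49, 192)
```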

🔹 Hyena

Paper: Multi-Dimensional Hyena for Spatial Inductive Bias. Initially introduced in NLP, the Hyena layer offers a replacement for the transformer's self-attention with subquadratic complexity. A variant of it, the Multi-Dimensional Hyena (Hyena N-D) layer, boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT.

  • [ ] Overview
  • [ ] Why it matters
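As a rough illustration of where the subquadratic complexity comes from: Hyena replaces attention with long convolutions (implicitly parameterized in the real layer, which also adds gating), and a long convolution can be evaluated with FFTs in O(L log L). Below is a minimal sketch of just that operator; `fft_long_conv` is an illustrative name, not the library's API:

```python
import torch

def fft_long_conv(u, k):
    """Causal long convolution via FFT: the O(L log L) core operation that
    lets Hyena avoid self-attention's O(L^2) cost.

    u: (batch, seq_len, dim) input sequence
    k: (seq_len, dim) long filter (produced by a small implicit MLP in Hyena)
    """
    L = u.shape[1]
    # Zero-pad to length 2L so circular FFT convolution equals linear convolution.
    u_f = torch.fft.rfft(u, n=2 * L, dim=1)
    k_f = torch.fft.rfft(k, n=2 * L, dim=0)
    y = torch.fft.irfft(u_f * k_f.unsqueeze(0), n=2 * L, dim=1)
    return y[:, :L]

y = fft_long_conv(torch.randn(2, 1024, 64), torch.randn(1024, 64))  # (2, 1024, 64)
```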

🔹 I-JEPA

Paper: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. The first AI model based on Yann LeCun's vision for more human-like AI. The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a non-generative approach to self-supervised learning from images. It delivers strong performance on multiple computer vision tasks, and it's much more computationally efficient than other widely used computer vision models.

  • [ ] Overview
  • [ ] Why it matters
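A rough sketch of what "non-generative" means in practice; the modules below are hypothetical stand-ins, and the real method adds careful multi-block masking. The key point is that the loss compares predicted and target representations in latent space, never pixels:

```python
import torch
import torch.nn.functional as F

def ijepa_step(encoder, target_encoder, predictor, patches, ctx_idx, tgt_idx):
    """One simplified I-JEPA-style step. `encoder`, `target_encoder`, and
    `predictor` are hypothetical modules; `patches` is (batch, n_patches, dim);
    ctx_idx / tgt_idx select the visible context block and the target blocks."""
    ctx_repr = encoder(patches[:, ctx_idx])        # encode the context only
    with torch.no_grad():                          # targets provide no gradient
        tgt_repr = target_encoder(patches)[:, tgt_idx]
    pred = predictor(ctx_repr, tgt_idx)            # predict target *embeddings*
    return F.smooth_l1_loss(pred, tgt_repr)        # loss in latent space

@torch.no_grad()
def ema_update(encoder, target_encoder, momentum=0.996):
    # The target encoder is an exponential moving average of the context encoder.
    for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
        tp.mul_(momentum).add_(p, alpha=1 - momentum)
```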

🔹 RMT

Paper: RMT: Retentive Networks Meet Vision Transformers. Retentive Network (RetNet) is a new architecture that ambitiously aims to replace Transformers and become the new foundation architecture for large language models. Inspired by RetNet, RMT extends the retention mechanism into a 2D form and introduces it to visual tasks.

  • [ ] Overview
  • [ ] Why it matters
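For intuition, a toy sketch of 1D retention in its parallel form (leaving out RetNet's xPos rotations and normalization): attention's softmax is replaced by a fixed causal decay, which also admits an equivalent recurrent form with O(1) cost per token. RMT, roughly speaking, generalizes this decay from 1D token distance to 2D spatial (Manhattan) distance:

```python
import torch

def parallel_retention(q, k, v, gamma=0.96):
    """Toy 1D retention, parallel form: attention-like scores weighted by an
    explicit decay gamma^(n - m) instead of a softmax. q, k, v: (B, L, D)."""
    L = q.shape[1]
    n = torch.arange(L)
    dist = (n[:, None] - n[None, :]).clamp(min=0)          # relative distance n - m
    decay = (gamma ** dist) * (n[:, None] >= n[None, :])   # causal decay matrix D
    scores = (q @ k.transpose(-2, -1)) * decay
    return scores @ v

out = parallel_retention(torch.randn(1, 8, 16), torch.randn(1, 8, 16), torch.randn(1, 8, 16))
```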

🔹 Trends And Research Directions

  • [ ] Open Vocabulary Learning: aims to enable models to recognize and classify objects in images and videos without being explicitly trained on those categories. This is in contrast to traditional computer vision approaches, which require a large dataset of labeled images for each category.
  • [ ] General foundation models for computer vision: In NLP, language models predicting the next token have proven to be a good foundation model that can be fine-tuned for various tasks. In computer vision, which model and loss objective have the potential to serve as the foundation model for different computer vision tasks?
  • [ ] Domain-specific models: CV models that have been trained on a domain-specific dataset or task achieve higher accuracy and performance on that task than general computer vision models. This has many applications, for example, in medicine and healthcare.

🔹 Summary

This part offers a summary of the chapter and points to other research work that was not covered in it.


Notes:

  • This is just a glimpse of the various new and amazing work done in computer vision. I'm sure there is a lot to add to this chapter and improve it, so your feedback is highly appreciated.
  • We try to make the chapter as short and interesting as possible, but the trends section still needs more work in my opinion.

Other Resources

lunarflu commented 9 months ago

Great job @mmhamdy! 🤗 It's very thorough, a nice strong draft! 🔥

Some small comments:

I think it'd be better to call it alternative rather than new.

IMO it's good to keep "New Architectures", assuming the focus is on the latest improvements, releases and ideas. For me, "Alternative" would be better if comparing older architectures among themselves.

On innovation: An introduction

I would do "On Innovation: An Introduction" (maybe good to emphasize capitals especially for titles)

Case study: ViT vs Image Transformer

I would do "Case Study: ViT vs. Image Transformer

Image Transformer is an early attempt by the authors of attention is all you need to introduce

I would do '...authors of "Attention Is All You Need" to introduce...' (we've all probably heard of the paper, but it's good to be consistent; maybe even link to it to keep things super clear, especially if beginners are following along. In a way, we could think of this as "an optional side exploration": people can click and supplement their knowledge very easily if they feel like it. This condenses a lot more info into the course if we do more of it, if you think that sounds reasonable.)

Why alternative Architectures

I would do "Why Alternative Architectures โ“ " (optional whether your team prefers emoji / normal question mark, emojis can stand out and inject little bits of fun here and there, while also conveying ideas fast and simply)

A hierarchical vision transformer that was introduced by Meta's FAIR.

Is there some blogpost by Meta on this maybe? If so, could be cool to link, same idea as above where people can do extra exploration if they want to, and a blogpost might have different details or explanations to help understanding.

The first AI model based on Yann LeCunโ€™s vision for more human-like AI.

Could be cool to link to places where Yann lays out his vision, or if it exists, one really good / main one (same concept)

Retentive Network (RetNet) is a new architecture

I would link to RetNet (same concept)

that ambitiously aims to replace Transformer and

I would do "Transformers"

Overall, it's up to you which changes to keep 🤗 TL;DR: since you mentioned you want to keep the chapter short, links might be a cool way to reference a lot of related content without taking up more screen space.

mmhamdy commented 9 months ago

Thank you, @lunarflu, for taking the time to read this draft and for providing such amazing feedback 🙂

IMO it's good to keep "New Architectures", assuming the focus is on the latest improvements, releases and ideas. For me, "Alternative" would be better if comparing older architectures among themselves.

My reason for preferring "Alternative" over "New" is durability. This is a fast-changing field and what's new today may not be new tomorrow. "Alternative" here is compared to the mainstream approach. There's a higher chance for these architectures to stay alternative than new, which keeps this course somewhat fresh without quickly getting out of date. But anyway, it's not a big deal for me.

In a way, maybe we could think of this as "an optional side exploration", people can click and supplement their knowledge very easily if they feel like it, condenses a lot more info into the course if we do more of this and if you think it sounds reasonable)

This part about ViT vs. Image Transformer is really dear to me 😄. I'm using it as the opening act for this story. I like to think about why a certain trend or work gains steam while others don't. This case is interesting because both models came from the same lab and have overlapping authors. It won't be much, just a short section that could be considered part of the introduction.

Is there some blogpost by Meta on this maybe? If so, could be cool to link, same idea as above where people can do extra exploration if they want to, and a blogpost might have different details or explanations to help understanding.

They didn't write a blog post about it, but they wrote a thread. Will add it above.

Could be cool to link to places where Yann lays out his vision, or if it exists, one really good / main one (same concept)

Yeah, sure. It's in this paper. Will add it above 👍

Thanks again 👋

youssefadr commented 7 months ago

Hello! I'll take the RMT chapter if that's okay!

lulmer commented 7 months ago

🔹 Why Alternative Architectures?

  • [ ] The limitations of CNNs
  • [ ] The limitations of ViTs

I like this part. I think a lot of the motivation behind these alternative architectures comes from current limitations; we could write a mini-chapter on how costly it is to train and run inference with a Transformer model.
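A quick back-of-the-envelope on why this bites especially hard in vision: with a fixed patch size, the token count grows with the square of the image side, and vanilla self-attention then grows with the square of the token count.

```python
# ViT-style tokenization with 16x16 patches: attention materializes an
# (N x N) map per head, so its size grows with the fourth power of image side.
patch = 16
for side in (224, 448, 896):
    n_tokens = (side // patch) ** 2
    print(f"{side:>3}px -> {n_tokens:>4} tokens -> {n_tokens ** 2:>9,} attention entries/head")
# 224px ->  196 tokens ->    38,416 attention entries/head
# 448px ->  784 tokens ->   614,656 attention entries/head
# 896px -> 3136 tokens -> 9,834,496 attention entries/head
```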

🔹 Hiera

Paper: Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. A hierarchical vision transformer introduced by Meta's FAIR. It improves on modern hierarchical vision transformers not by adding new components, but by questioning the need for the many vision-specific components in these architectures. Meta wrote a Twitter thread about it.

  • [ ] Overview
  • [ ] Why it matters

🔹 Hyena

Paper: Multi-Dimensional Hyena for Spatial Inductive Bias. Initially introduced in NLP, the Hyena layer offers a replacement for the transformer's self-attention with subquadratic complexity. A variant of it, the Multi-Dimensional Hyena (Hyena N-D) layer, boosts the performance of various Vision Transformer architectures, such as ViT, Swin, and DeiT.

  • [ ] Overview
  • [ ] Why it matters

🔹 I-JEPA

Paper: Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. The first AI model based on Yann LeCun's vision for more human-like AI. The Image-based Joint-Embedding Predictive Architecture (I-JEPA) is a non-generative approach to self-supervised learning from images. It delivers strong performance on multiple computer vision tasks, and it's much more computationally efficient than other widely used computer vision models.

  • [ ] Overview
  • [ ] Why it matters

🔹 RMT

Paper: RMT: Retentive Networks Meet Vision Transformers. Retentive Network (RetNet) is a new architecture that ambitiously aims to replace Transformers and become the new foundation architecture for large language models. Inspired by RetNet, RMT extends the retention mechanism into a 2D form and introduces it to visual tasks.

  • [ ] Overview
  • [ ] Why it matters

I do agree with the architectures and the plan you mentioned (small overview and why it matters). While I don't know them all in depth, I have read quite a bit about RetNet, because its claim is huge IMO, so I would be comfortable writing about RMT. I saw that Hiera and Hyena try to solve the computational problems of ViT too.

🔹 Trends And Research Directions

  • [ ] Open Vocabulary Learning: aims to enable models to recognize and classify objects in images and videos without being explicitly trained on those categories. This is in contrast to traditional computer vision approaches, which require a large dataset of labeled images for each category. (A minimal zero-shot example is sketched right after this list.)
  • [ ] General foundation models for computer vision: In NLP, language models predicting the next token have proven to be a good foundation model that can be fine-tuned for various tasks. In computer vision, which model and loss objective have the potential to serve as the foundation model for different computer vision tasks?
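As one concrete, minimal instance of the open-vocabulary idea, here is zero-shot classification with CLIP via Hugging Face Transformers (one possible model choice among several): the "classes" are free-text prompts chosen at inference time, with no category-specific training data.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# The label set is just text -- swap in any categories without retraining.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```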

Here we could mention the work done by Nvidia on open-vocabulary panoptic segmentation with Stable Diffusion used as a foundation model. The paper: ODISE.
We could mention DINOv2, SAM, and maybe try to showcase the multiple directions (ViT-based, diffusion-based, other) being pursued for CV foundation models in the literature (also linking with the I-JEPA chapter).

  • [ ] Domain-specific models: CV models that have been trained on a domain-specific dataset or task achieve higher accuracy and performance on that task than general computer vision models. This has many applications, for example, in medicine and healthcare.

Here we could also mention the space industry, and more particularly remote sensing. Maybe MedSAM; but when talking about domain-specific CV, it is often just fine-tuning famous architectures/weights, possibly with additional work on custom loss functions. I am wondering if it would not be more appropriate in the chapter about fine-tuning ViTs. What do you think?
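To make that point concrete, the fine-tuning route is often just a few lines. A hedged sketch using the standard Transformers API, with a hypothetical domain label set:

```python
from transformers import ViTForImageClassification

# Adapting a general-purpose checkpoint to a hypothetical domain-specific
# label set (e.g., four diagnostic classes): the classification head is
# re-initialized to match the new number of labels.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=4,
    ignore_mismatched_sizes=True,
)
```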

Notes:

  • This is just a glimpse of the various new and amazing work done in computer vision. I'm sure there is a lot to add to this chapter and improve it, so your feedback is highly appreciated.
  • We try to make the chapter as short and interesting as possible, but the trends section still needs more work in my opinion.

This unit is by definition about the latest advances in research, so it will never be fully covered, but I think we will do a good job if we can cover the most impactful papers that came out this year. My concern is: Do we follow the hands-on philosophy of HF courses? Usually there are code snippets using HF libraries, but since we are discussing very emerging topics and experimental implementations, what kind of code will we be able to provide?

mmhamdy commented 7 months ago

Hey @lulmer, that's an awesome review; I'm glad you took the time to write it 🙂

Here we could mention the work done by Nvidia on open-vocabulary panoptic segmentation with Stable Diffusion used as a foundation model. The paper: ODISE.

This looks interesting, we can look into it.

We could mention DINOv2, SAM, and maybe try to showcase the multiple directions (ViT-based, diffusion-based, other) being pursued for CV foundation models in the literature (also linking with the I-JEPA chapter).

I don't think DINO and SAM are a fit among these guys; they are already established and almost mainstream.

Maybe MedSAM; but when talking about domain-specific CV, it is often just fine-tuning famous architectures/weights, possibly with additional work on custom loss functions. I am wondering if it would not be more appropriate in the chapter about fine-tuning ViTs. What do you think?

You're right, but this part is about models pretrained exclusively on domain data, not just fine-tuned versions of a general model.

Do we follow the hands-on philosophy of HF courses? Usually there are code snippets using HF libraries, but since we are discussing very emerging topics and experimental implementations, what kind of code will we be able to provide?

I don't think this chapter will be hands-on in style, as it is more about looking forward to the possible future of computer vision.

lulmer commented 7 months ago

I will write something about Hyena and derived architectures, if that's okay.

farrosalferro commented 7 months ago

I'm also interested in writing about RMT, if that's okay. Actually, there is another paper (a preprint from ICLR 2024) that also utilizes the Retentive Network, called ViR (https://arxiv.org/abs/2310.19731). I think it represents the retention mechanism applied to vision better than RMT does, since RMT replaces RetNet's gating function with a softmax (which, I think, results in an architecture quite similar to the Transformer's MSA). However, I think it is a good idea to include both architectures for comparison. If ViR is going to be included in this unit, would it be possible for me to take that task? I'm currently trying to implement it. But if not, I'll do another architecture! Keen to learn new things. Thank you!

youssefadr commented 7 months ago

@farrosalferro Hello, yes, no problem for me. There is also @lulmer, who wants to work on the RMT chapter.

lulmer commented 7 months ago

@farrosalferro I saw the ViR paper too; I think it is a good idea to mention it. I will stick to the Hyena and SSM architectures, so I'll leave RMT to the two of you.

farrosalferro commented 7 months ago

Thanks guys! I am sorry if I missed the information, but do we have a group of some sort? And how about the fork? Could you provide me with the link? Thank you!

mmhamdy commented 7 months ago

Hi @farrosalferro, I sent you an invitation to join the repo.