johko / computer-vision-course

This repo is the home base of a community-driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face Discord: hf.co/join/discord

Unit 4, Chapter 1 Fusion of Text and Vision: Draft Outline #54

snehilsanyal closed this issue 2 months ago

snehilsanyal commented 8 months ago

Hey fellow CV Course Contributors and Reviewers šŸ¤—

This issue discusses an initial draft for the chapter Fusion of Text and Vision, which is part of Unit 4: Multimodal Models. Since this is an introductory section, we plan to keep code-related content light and put more stress on concepts, setting the stage for later sections in the unit. We would like this unit to be short and crisp: at most 3 sections, nothing more, unless other additions like Spaces/demos are required.

Thought process: the previous unit is Unit 3 on Vision Transformers, and the next unit is Unit 5 on Generative Models. Content in this unit will therefore use Unit 3's transformer models (and not traditional approaches to the tasks, so we will refrain from adding too many historical aspects) and will also form a precursor for later sections as well as for Unit 5 on Generative Models.

1. Introduction

2. Multimodal Tasks and Models

A brief overview of different tasks and models, with more emphasis on the tasks that will be taken up later in the course in sections like #29 and #28.

Briefly mention the tasks and models (task, input and output, and models with links or Spaces). We can include other examples like text-to-speech and speech-to-text among the tasks, adding a one-liner that refers to the HF Audio Course ("For more information on this, refer to the HF Audio Course"). After this, focus on Vision + Text/Audio.

Tasks and Models (each task, its input and output, and around 3-4 model names to go with it):

We can also create an infographic, like a chart or hierarchy, that divides the models into different categories such as text + vision, text + vision + audio, more than 3 modalities, etc.

Mention everything related to tasks on vision + X (audio, text) here, and focus on Vision Language Models (text + vision) in the next section.
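As a rough illustration of how light the code for such a task can stay in this introductory chapter, here is a minimal sketch of image captioning (input: image, output: text) with the transformers pipeline; the checkpoint and file path below are example placeholders, not fixed choices for the chapter:

```python
from transformers import pipeline

# Image captioning: input is an image, output is a short text caption.
# The checkpoint is one example; any image-to-text model on the Hub works.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# "path/to/image.jpg" is a placeholder for a local image file (a URL also works).
print(captioner("path/to/image.jpg"))
# e.g. [{'generated_text': 'two birds sitting on a branch'}]
```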

3. Vision Language Models

References:

  1. Awesome Self-Supervised Multimodal Learning
  2. HF Tasks
  3. Multimodal Machine Learning Course, CMU
  4. Meta's ImageBind
  5. Multimodal Machine Learning: A Survey and Taxonomy
  6. Recent blog by Chip Huyen

Please feel free to share your views on the outline šŸ¤— šŸš€ šŸ”„

merveenoyan commented 8 months ago

Hello @snehilsanyal šŸ‘‹ Overall I think it's very cool. Please note that we also have this issue on Multimodal Models, so it would be nice if you could explain how this outline fits in with it :) Note: apparently I missed the issue mentions above!

johko commented 8 months ago

Hey @snehilsanyal ,

thanks for the detailed outline and all the thoughts you put into it.

I really like your intro, giving an intuition about what multimodal data is and why it is important šŸ‘

Regarding the tasks, I have a few additions you can consider:

As the field is constantly moving at a high pace, there are always new tasks and new names for them, so feel free to include whatever you see fit.

For the models part, it would be great to focus on some models that are included in the transformers library, but I also totally understand that you don't want to skip things like LLaVA and GPT-4V. Again, do whatever feels like it makes the most sense to you and what people would like to read/learn about šŸ™‚

johko commented 8 months ago

And one paper that I can recommend for a very detailed overview (~100 pages) is this one: https://arxiv.org/pdf/2210.09263.pdf

ATaylorAerospace commented 8 months ago

@snehilsanyal One addition to this chapter that might be very useful is text and vision use cases. Examples could be...

snehilsanyal commented 8 months ago

Hey @merveenoyan, thanks for your comments šŸ¤— Yes, sure, and thanks for pointing it out. We have described in the outline how it relates to #29, so we will create the content in line with whatever is being done in that issue, so that everything has a good flow and stays in sync.

snehilsanyal commented 8 months ago

Hey @johko, thank you for your comments šŸ¤—, really glad that you liked our outline. I followed issue #29 very closely and also read your commentary there; much of it was summarized and incorporated into this outline so that everything is in sync. We will look into the recent tasks you suggested and include them in the content šŸ¤—

Regarding models, we plan to include all types of models, as that is educational, but we will stick to those that have ready implementations available through the transformers library, for example already available (or developed by us) Spaces, demos, or examples. So yes, it will be a mix where people can read and learn about multimodality in general, but since the course is about CV and by HF, we will include models that are already present in the HF ecosystem.
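To make "ready implementation" concrete, here is a minimal sketch of zero-shot image classification with CLIP through the transformers pipeline; the checkpoint, file path, and labels are example placeholders:

```python
from transformers import pipeline

# Zero-shot image classification: input is an image plus candidate text labels,
# output is a score per label. The checkpoint is one example CLIP model.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# "path/to/image.jpg" is a placeholder for a local image file.
print(classifier("path/to/image.jpg", candidate_labels=["a cat", "a dog", "a parrot"]))
```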

And thanks for the suggested paper :D We will go through it, check if something is interesting, and add it to the content; lol, we might need to divide the pages amongst the group šŸ˜†

charchit7 commented 8 months ago

Thank you for your comments šŸ¤— @johko @merveenoyan :) We'll update accordingly.

ratan commented 7 months ago

Very nice, detailed outline and flow captured here. We may also include a speech-text scenario, like the Whisper models.
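For reference, a minimal sketch of speech-to-text with Whisper via the transformers pipeline; the checkpoint and file path are example placeholders:

```python
from transformers import pipeline

# Speech-to-text: input is audio, output is a transcription.
# The checkpoint is one example; other Whisper sizes work the same way.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "path/to/audio.wav" is a placeholder for a local audio file.
print(asr("path/to/audio.wav"))  # e.g. {'text': '...'}
```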