johko / computer-vision-course

This repo is the home base of a community-driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face Discord: hf.co/join/discord

Unit 4, Chapter 1 Fusion of Text and Vision: Draft Outline #54

snehilsanyal closed this issue 2 months ago

snehilsanyal commented 8 months ago

Hey fellow CV Course Contributors and Reviewers šŸ¤—

This issue discusses an initial draft for the chapter Fusion of Text and Vision, which is part of Unit 4: Multimodal Models. Since this is an introductory section, we plan to keep code-related content light and put more stress on concepts, setting the stage for later sections in the unit. We would like this unit to be short and crisp: at most 3 sections, nothing more, unless other additions like Spaces/demos are required.

Thought process: the previous unit is Unit 3 on Vision Transformers, and the next unit is Unit 5 on Generative Models. Content in this unit will therefore use Unit 3's transformer models (and not traditional approaches to the tasks, so we will refrain from adding too many historical aspects) and will also form a precursor for later sections as well as for Unit 5 on Generative Models.

1. Introduction

2. Multimodal Tasks and Models

A brief overview of different tasks and models, with more emphasis on the tasks that will be taken up later in the course in sections like #29 and #28.

Briefly mention the tasks and models (task, input and output, and models with links or Spaces). We can include other examples like text-to-speech and speech-to-text among the tasks, adding a one-liner that refers to the HF Audio Course ("For more information on this, refer to the HF Audio Course"). After this, focus on Vision + Text/Audio.

Tasks and Models (each task, its input and output, and around 3-4 model names to go with it):

We can also create an infographic, like a chart or hierarchy, that divides the models into different categories such as text + vision, text + vision + audio, more than 3 modalities, etc.

Mention everything related to tasks on vision + X (audio, text) here, and focus on Vision Language Models (text + vision) in the next section.
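As a rough illustration of how light the code for such a task can stay in this introductory chapter, here is a minimal sketch of image captioning (input: image, output: text) with the transformers pipeline; the checkpoint and file path below are example placeholders, not fixed choices for the chapter:

```python
from transformers import pipeline

# Image captioning: input is an image, output is a short text caption.
# The checkpoint is one example; any image-to-text model on the Hub works.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# "path/to/image.jpg" is a placeholder for a local image file (a URL also works).
print(captioner("path/to/image.jpg"))
# e.g. [{'generated_text': 'two birds sitting on a branch'}]
```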

3. Vision Language Models

References:

  1. Awesome Self-Supervised Multimodal Learning
  2. HF Tasks
  3. Multimodal Machine Learning Course, CMU
  4. Meta's ImageBind
  5. Multimodal Machine Learning: A Survey and Taxonomy
  6. Recent blog by Chip Huyen

Please feel free to share your views on the outline šŸ¤— šŸš€ šŸ”„

merveenoyan commented 8 months ago

Hello @snehilsanyal šŸ‘‹ Overall I think it's very cool. Please note that we also have this issue on Multimodal Models, so it would be nice if you could explain how this outline fits in with it :) Note: apparently I missed the issue mentions above!

johko commented 8 months ago

Hey @snehilsanyal ,

thanks for the detailed outline and all the thoughts you put into it.

I really like your intro, giving an intuition about what multimodal data is and why it is important šŸ‘

Regarding the tasks, I have a few additions you can consider:

As the field is constantly moving at a high pace, there are always new tasks and new names for them, so feel free to include whatever you see fit.

For the models part, it would be great to focus on some models that are included in the transformers library, but I also totally understand that you don't want to skip things like LLaVA and GPT-4V. Again, do whatever feels like it makes the most sense to you and what people would like to read/learn about šŸ™‚

johko commented 8 months ago

And one paper that I can recommend for a very detailed overview (~100 pages) is this one: https://arxiv.org/pdf/2210.09263.pdf

ATaylorAerospace commented 8 months ago

@snehilsanyal One addition to this chapter that might be very useful is text and vision use cases. Examples could be...

snehilsanyal commented 8 months ago

Hey @merveenoyan, thanks for your comments šŸ¤— Yes, sure, and thanks for pointing it out. We have described in the outline how it relates to #29, so we will create the content in line with whatever is being done in that issue, so that everything has a good flow and stays in sync.

snehilsanyal commented 8 months ago

Hey @johko, thank you for your comments šŸ¤—, really glad that you liked our outline. I followed issue #29 very closely and also read your commentary there; much of it was summarized and incorporated into this outline so that everything is in sync. We will look into the recent tasks you suggested and include them in the content šŸ¤—

Regarding models, we plan to include all types of models, as that is educational, but we will stick to those that have ready implementations available through the transformers library, for example already available (or developed by us) Spaces, demos, or examples. So yes, it will be a mix where people can read and learn about multimodality in general, but since the course is about CV and by HF, we will include models that are already present in the HF ecosystem.
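To make "ready implementation" concrete, here is a minimal sketch of zero-shot image classification with CLIP through the transformers pipeline; the checkpoint, file path, and labels are example placeholders:

```python
from transformers import pipeline

# Zero-shot image classification: input is an image plus candidate text labels,
# output is a score per label. The checkpoint is one example CLIP model.
classifier = pipeline(
    "zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# "path/to/image.jpg" is a placeholder for a local image file.
print(classifier("path/to/image.jpg", candidate_labels=["a cat", "a dog", "a parrot"]))
```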

And thanks for the suggested paper :D We will go through it, check if something is interesting, and add it to the content; lol, we might need to divide the pages amongst the group šŸ˜†

charchit7 commented 8 months ago

Thank you for your comments šŸ¤— @johko @merveenoyan :) We'll update accordingly.

ratan commented 7 months ago

Very nice, detailed outline and flow captured here. We may also include a speech-text scenario, like the Whisper models.
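For reference, a minimal sketch of speech-to-text with Whisper via the transformers pipeline; the checkpoint and file path are example placeholders:

```python
from transformers import pipeline

# Speech-to-text: input is audio, output is a transcription.
# The checkpoint is one example; other Whisper sizes work the same way.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# "path/to/audio.wav" is a placeholder for a local audio file.
print(asr("path/to/audio.wav"))  # e.g. {'text': '...'}
```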