johko / computer-vision-course

This repo is the home base of a community-driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face Discord: hf.co/join/discord

Multimodal Models - CLIP and relatives #29

Closed · pedrogengo closed this 3 months ago

pedrogengo commented 9 months ago

Hello!

Inspired by #19 and #28, my fellow collaborators and I have also outlined a course curriculum for our section, but we would like some input and feedback from the HF team before we finalize it and start working on it. This is our chosen structure so far.

- Introduction
- CLIP
- Losses / self-supervised learning
- Relatives
- Practical applications & challenges

References:

@mattmdjaga @froestiago

merveenoyan commented 9 months ago

Hello 👋 I think I'll comment on every chapter if it's ok.

The introduction seems fine.

Introduction

- Motivation for multimodality
- History of multimodal models
- Self-supervised learning enabling multi-modality

CLIP

It's very CLIP-focused, so it would be nice to be less specific IMO. I think it's thanks to CLIP that we have so many multimodal models these days, but maybe keep it brief? Not sure, we can also decide during the writing process.

- Intro to CLIP (ELI5)
- Theory behind CLIP (contrastive loss, embeddings, etc.)
- Variations of CLIP backbones
- How tokenisation and embeddings work in CLIP
- Applications of CLIP:
  - Search and retrieval
  - Zero-shot classification (see the sketch below)
  - CLIP guidance (using CLIP in other models to guide generation, e.g. DALL-E, SD)
- Fine-tuning CLIP (OpenCLIP and other variants?)
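To make the zero-shot classification item above concrete, here is a minimal sketch of what a hands-on snippet for this part could look like. It assumes the standard CLIPModel / CLIPProcessor API from transformers with the public openai/clip-vit-base-patch32 checkpoint; the image path and label set are placeholders.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public CLIP checkpoint; "cat.jpg" is a placeholder image path
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# Encode the image and all candidate labels in one batch
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The same pattern (prompted label names scored against an image) is what the dedicated zero-shot section could then expand on.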

This section is nice.

Losses / self-supervised learning

- Contrastive (see the sketch below)
- Non-contrastive
- Triplet
- One or two other ones
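For the contrastive entry above, a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) objective in plain PyTorch might look like this. It assumes batch-aligned image and text embeddings as inputs, so it is only an illustration of the idea rather than the exact loss of any particular implementation.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so the dot product is a cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) similarity matrix between every image and every text
    logits = image_embeds @ text_embeds.t() / temperature

    # Matching image/text pairs sit on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

For the triplet entry, PyTorch already ships torch.nn.TripletMarginLoss, which could serve as a ready-made comparison point in the same chapter.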

Relatives

This section is nice, maybe make sure it doesn't overlap with the section where we talk about existing architectures or foundation models.

- ImageBind
- BLIP
- OWL-ViT
- Flamingo (IDEFICS)
- LLaVA

Practical applications & challenges

Maybe keep this brief and explain more in the Computer Vision in the Wild section, WDYT? Also pinging @johko

Applications

- Image search engine based on textual prompts (see the sketch below)
- Downstream tasks on embeddings, e.g. classification, clustering
- Visual question answering systems

Challenges

- Data bias / out-of-distribution data
- Hard to get enough data -> leads to using noisy internet data
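For the image search engine item above, a minimal retrieval sketch built on CLIP embeddings could look like the following; the gallery paths and query are placeholders, and a real course example would likely add a proper vector index (e.g. FAISS) for larger image collections.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder gallery of images to search over
image_paths = ["beach.jpg", "mountain.jpg", "city.jpg"]
images = [Image.open(p) for p in image_paths]

with torch.no_grad():
    # Embed the gallery once and normalize for cosine similarity
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeds = model.get_image_features(**image_inputs)
    image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

    # Embed the text query with the same model
    text_inputs = processor(text=["a sunny day at the seaside"], return_tensors="pt", padding=True)
    text_embeds = model.get_text_features(**text_inputs)
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Rank gallery images by cosine similarity to the query
scores = (text_embeds @ image_embeds.T).squeeze(0)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```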

johko commented 9 months ago

Hey,

thanks for the great outline @pedrogengo. Here are my thoughts:

Introduction

I think you can keep the introduction shorter, as we have a chapter "Connecting Text and Vision", which (I suppose) will talk about most of the things you mentioned. Maybe your introduction can focus on model history (which you already planned as one point), covering a bit of what happened before CLIP. Of course, if you want to make sure, feel free to reach out to someone from the other group to see what they plan on covering.

CLIP

These are all totally valid points to cover, but as Merve also said, try not to get too carried away with it.

Losses / self-supervised learning

Really nice idea to cover that here, love it ❤️

Relatives

The related models seem a bit one-sided to me; BLIP, IDEFICS and LLaVA basically cover the same task. Maybe you can also focus on models that are available in transformers (which would rule out ImageBind and LLaVA). Some alternative suggestions from my side:

but those are just some suggestions, feel free to have a look at the transformers docs in the multimodal section: https://huggingface.co/docs/transformers

Applications

Looks good overall. Keep in mind that we do have a dedicated Zero-Shot Computer Vision section, so you don't necessarily need to cover these kinds of applications, plus you might already cover some cases in the section about models above.

Challenges

Looking good 👍

Hope that helps you :)

ahmadmustafaanis commented 9 months ago

Thought:

Relatives

- ImageBind
- BLIP
- OWL-ViT
- Flamingo (IDEFICS)
- LLaVA

Maybe we can divide it into better sections like the ones below and add models under each:

  1. Foundational Models
  2. VQA
  3. Image Captioning
  4. Video Captioning
  5. Diffusion Models (we already have a chapter for this)
johko commented 9 months ago

Maybe we can divide it into better sections like the ones below and add models under each:

  1. Foundational Models
  2. VQA
  3. Image Captioning
  4. Video Captioning
  5. Diffusion Models (we already have a chapter for this)

In general a good idea, but the main problem I see with this is that many new models focus on the "foundation" part, so most of them are able to perform many tasks at once by now. I don't know of many models that focus only on things like VQA or image captioning.

I think the most important part here is to cover models that are good representatives for different common architectures or training strategies, so people taking the course get an overview of what is out there.