Unit 4 - CLIP and relatives - Models - OWL-ViT & CLIP

johko / computer-vision-course

This repo is the homebase of a community driven course on Computer Vision with Neural Networks. Feel free to join us on the Hugging Face discord: hf.co/join/discord

MIT License


Unit 4 - CLIP and relatives - Models - OWL-ViT & CLIP #139

Closed mattmdjaga closed 6 months ago

mattmdjaga commented 6 months ago

This section is part of the showcase of CLIP relatives. It covers CLIP and OWL-ViT, an open-vocabulary object detection model. The aim is to briefly introduce each model, show how to use it, and provide resources for finding out more about the models.

The whole section structure:

Models:

CLIP (Image & text embedding) @pedro
Donut or Nougat (Document Analysis)
GroupViT or OneFormer (Segmentation)
BLIP (Vision language with text generation)
OWL-ViT (Vision Language object detection)

pedrogengo commented 6 months ago

@mattmdjaga can you change the PR title to reflect we are putting CLIP and OWL here?

mattmdjaga commented 6 months ago

I added a link to my YOLO mention. However, I linked https://github.com/ultralytics/ultralytics instead of the YOLO paper, as I think it's more relevant to the topic. Not sure if everyone else agrees with that?

merveenoyan commented 6 months ago

@mattmdjaga you can add the YOLO paper and put ultralytics in the references. If you can commit the rest of my suggestions, I'll give another review and we can merge 😊

mattmdjaga commented 6 months ago

> I think we can get this merged @mattmdjaga and if you feel like it you can add a small PyTorch implementation of CLIP in another PR or this PR, what do you think?

Yeah, I'm down to add a PyTorch CLIP implementation. It can be in another PR, as I might first need to work on other parts of this chapter which haven't been completed yet.

merveenoyan commented 6 months ago

@mattmdjaga sure, no worries! this one is good as is