mlfoundations / open_clip

An open source implementation of CLIP.

How to use ViT-bigG-14 / laion2b_s39b_b160k to caption images in a given folder - your readme is not helpful #667

Closed FurkanGozukara closed 11 months ago

FurkanGozukara commented 11 months ago

I want to use ViT-bigG-14 / laion2b_s39b_b160k to generate captions for a given folder of images

and save them with the same file name.

Thank you so much

You only have this example, which is not helpful:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_s34b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-B-32')

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
gabrielilharco commented 11 months ago

This model cannot do image captioning; it can only contrast images with captions.

FurkanGozukara commented 11 months ago

This model cannot do image captioning; it can only contrast images with captions.

Which model do you suggest for image captioning, i.e., generating text descriptions of an image?

gabrielilharco commented 11 months ago

In this codebase we have some coca models. There is sample code at https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb
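For reference, a minimal sketch in the spirit of that notebook, adapted to the original question: caption every image in a folder and save each caption under the same base name. The folder path, file extensions, and .txt output are illustrative choices, not part of the open_clip API.

import os
import torch
from PIL import Image
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

# the mscoco_finetuned checkpoint is the one tuned for captioning
model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14",
    pretrained="mscoco_finetuned_laion2b_s13b_b90k",
)
model = model.to(device).eval()

image_dir = "images"  # illustrative path
for fname in os.listdir(image_dir):
    if not fname.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    image = transform(Image.open(os.path.join(image_dir, fname)).convert("RGB"))
    image = image.unsqueeze(0).to(device)

    with torch.no_grad(), torch.cuda.amp.autocast():
        generated = model.generate(image)

    # strip the special tokens around the generated caption
    caption = (
        open_clip.decode(generated[0])
        .split("<end_of_text>")[0]
        .replace("<start_of_text>", "")
        .strip()
    )

    # save the caption next to the image: same base name, .txt extension
    out_path = os.path.join(image_dir, os.path.splitext(fname)[0] + ".txt")
    with open(out_path, "w") as f:
        f.write(caption)
    print(fname, "->", caption)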

FurkanGozukara commented 11 months ago

In this codebase we have some coca models. There is sample code at https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb

Amazing, thank you so much.

Where can I get the full list of coca_ViT-L-14 and the other models? Is that the best one?

gabrielilharco commented 11 months ago

You can see all of our pretrained models with open_clip.list_pretrained(). Look for the coca_ prefix. See https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv for some zero-shot classification and retrieval results.
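For example, a small sketch that filters the registry for the CoCa checkpoints:

import open_clip

# list_pretrained() returns (model_name, pretrained_tag) pairs;
# keep only the CoCa models, which support caption generation
coca_models = [
    (name, tag)
    for name, tag in open_clip.list_pretrained()
    if name.startswith("coca_")
]
print(coca_models)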

FurkanGozukara commented 11 months ago

You can see all of our pretrained models with open_clip.list_pretrained(). Look for the coca_ prefix. See https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv for some zero-shot classification and retrieval results.

Thank you so much. This looks best in that file:

coca_ViT-L-14 | laion2b_s13b_b90k

FurkanGozukara commented 11 months ago

@gabrielilharco one final question:

Do any of these models support image caption generation, other than the coca ones?

[('RN50', 'openai'),
 ('RN50', 'yfcc15m'),
 ('RN50', 'cc12m'),
 ('RN50-quickgelu', 'openai'),
 ('RN50-quickgelu', 'yfcc15m'),
 ('RN50-quickgelu', 'cc12m'),
 ('RN101', 'openai'),
 ('RN101', 'yfcc15m'),
 ('RN101-quickgelu', 'openai'),
 ('RN101-quickgelu', 'yfcc15m'),
 ('RN50x4', 'openai'),
 ('RN50x16', 'openai'),
 ('RN50x64', 'openai'),
 ('ViT-B-32', 'openai'),
 ('ViT-B-32', 'laion400m_e31'),
 ('ViT-B-32', 'laion400m_e32'),
 ('ViT-B-32', 'laion2b_e16'),
 ('ViT-B-32', 'laion2b_s34b_b79k'),
 ('ViT-B-32', 'datacomp_xl_s13b_b90k'),
 ('ViT-B-32', 'datacomp_m_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_clip_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_laion_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_image_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_text_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_basic_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_s128m_b4k'),
 ('ViT-B-32', 'datacomp_s_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_clip_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_laion_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_image_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_text_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_basic_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_s13m_b4k'),
 ('ViT-B-32-256', 'datacomp_s34b_b86k'),
 ('ViT-B-32-quickgelu', 'openai'),
 ('ViT-B-32-quickgelu', 'laion400m_e31'),
 ('ViT-B-32-quickgelu', 'laion400m_e32'),
 ('ViT-B-16', 'openai'),
 ('ViT-B-16', 'laion400m_e31'),
 ('ViT-B-16', 'laion400m_e32'),
 ('ViT-B-16', 'laion2b_s34b_b88k'),
 ('ViT-B-16', 'datacomp_xl_s13b_b90k'),
 ('ViT-B-16', 'datacomp_l_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_clip_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_laion_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_image_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_text_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_basic_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_s1b_b8k'),
 ('ViT-B-16-plus-240', 'laion400m_e31'),
 ('ViT-B-16-plus-240', 'laion400m_e32'),
 ('ViT-L-14', 'openai'),
 ('ViT-L-14', 'laion400m_e31'),
 ('ViT-L-14', 'laion400m_e32'),
 ('ViT-L-14', 'laion2b_s32b_b82k'),
 ('ViT-L-14', 'datacomp_xl_s13b_b90k'),
 ('ViT-L-14', 'commonpool_xl_clip_s13b_b90k'),
 ('ViT-L-14', 'commonpool_xl_laion_s13b_b90k'),
 ('ViT-L-14', 'commonpool_xl_s13b_b90k'),
 ('ViT-L-14-336', 'openai'),
 ('ViT-H-14', 'laion2b_s32b_b79k'),
 ('ViT-g-14', 'laion2b_s12b_b42k'),
 ('ViT-g-14', 'laion2b_s34b_b88k'),
 ('ViT-bigG-14', 'laion2b_s39b_b160k'),
 ('roberta-ViT-B-32', 'laion2b_s12b_b32k'),
 ('xlm-roberta-base-ViT-B-32', 'laion5b_s13b_b90k'),
 ('xlm-roberta-large-ViT-H-14', 'frozen_laion5b_s13b_b90k'),
 ('convnext_base', 'laion400m_s13b_b51k'),
 ('convnext_base_w', 'laion2b_s13b_b82k'),
 ('convnext_base_w', 'laion2b_s13b_b82k_augreg'),
 ('convnext_base_w', 'laion_aesthetic_s13b_b82k'),
 ('convnext_base_w_320', 'laion_aesthetic_s13b_b82k'),
 ('convnext_base_w_320', 'laion_aesthetic_s13b_b82k_augreg'),
 ('convnext_large_d', 'laion2b_s26b_b102k_augreg'),
 ('convnext_large_d_320', 'laion2b_s29b_b131k_ft'),
 ('convnext_large_d_320', 'laion2b_s29b_b131k_ft_soup'),
 ('convnext_xxlarge', 'laion2b_s34b_b82k_augreg'),
 ('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_rewind'),
 ('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_soup'),
 ('coca_ViT-B-32', 'laion2b_s13b_b90k'),
 ('coca_ViT-B-32', 'mscoco_finetuned_laion2b_s13b_b90k'),
 ('coca_ViT-L-14', 'laion2b_s13b_b90k'),
 ('coca_ViT-L-14', 'mscoco_finetuned_laion2b_s13b_b90k'),
 ('EVA01-g-14', 'laion400m_s11b_b41k'),
 ('EVA01-g-14-plus', 'merged2b_s11b_b114k'),
 ('EVA02-B-16', 'merged2b_s8b_b131k'),
 ('EVA02-L-14', 'merged2b_s4b_b131k'),
 ('EVA02-L-14-336', 'merged2b_s6b_b61k'),
 ('EVA02-E-14', 'laion2b_s4b_b115k'),
 ('EVA02-E-14-plus', 'laion2b_s9b_b144k')]
FurkanGozukara commented 11 months ago

It turns out using ViT-bigG-14 / laion2b_s39b_b160k for image captioning is possible > https://twitter.com/GozukaraFurkan/status/1711933282529452115

[attached image]

a drawing of a man with curly hair and glasses, portrait of ultra realistic, benjamin vnuk, silvain sarrailh, master artist, with long curly, a beautiful artwork illustration, sharp high detail illustration, by Fedot Sychkov, handsome girl, 8 k illustration, portrait illustration

gabrielilharco commented 11 months ago

Looks like you are using BLIP2 there?

FurkanGozukara commented 11 months ago

Looks like you are using BLIP2 there?

BLIP2 + ViT-bigG-14 / laion2b_s39b_b160k

BLIP2 alone produces nothing like that.
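A plausible reading of "BLIP2 + ViT-bigG-14" is a CLIP-Interrogator-style pipeline: BLIP2 writes a base caption and the contrastive CLIP model only ranks candidate style phrases against the image, with the top-scoring phrases appended. A minimal sketch of that ranking step with open_clip; the base caption, phrase list, image path, and top-k choice are illustrative assumptions, not part of any open_clip API.

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)
tokenizer = open_clip.get_tokenizer("ViT-bigG-14")

base_caption = "a drawing of a man with curly hair and glasses"  # e.g. from BLIP2
candidates = [  # illustrative modifier phrases
    "portrait of ultra realistic",
    "sharp high detail illustration",
    "8 k illustration",
    "oil painting",
    "watercolor",
]

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = tokenizer(candidates)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    sims = (image_features @ text_features.T).squeeze(0)

# append the phrases the image scores highest against
top = sims.topk(3).indices.tolist()
print(base_caption + ", " + ", ".join(candidates[i] for i in top))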