Closed FurkanGozukara closed 11 months ago
This model cannot do image captioning, only contrast images and captions.
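To make "contrast images and captions" concrete, here is a minimal sketch of what a contrastive checkpoint can do: score a set of candidate captions against one image. The helper names `rank_captions` and `sort_by_score` are hypothetical, the default checkpoint is simply one pair from the `open_clip.list_pretrained()` output, and the image path and captions are placeholders.

```python
def sort_by_score(captions, scores):
    """Pair each caption with its score and sort from best to worst match."""
    return sorted(zip(captions, scores), key=lambda pair: pair[1], reverse=True)


def rank_captions(image_path, captions,
                  model_name="ViT-B-32", pretrained="laion2b_s34b_b79k"):
    """Return (caption, probability) pairs for one image, best match first."""
    # Heavy imports are done lazily so sort_by_score stays dependency-free.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    model.eval()
    tokenizer = open_clip.get_tokenizer(model_name)

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(captions)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Normalise so the dot product is a cosine similarity.
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)[0]

    return sort_by_score(captions, probs.tolist())


# Example call (placeholder path and captions):
# rank_captions("photo.jpg", ["a diagram", "a dog", "a cat"])
```

The model only chooses among captions you supply; it never writes one itself, which is why it cannot be used for captioning on its own.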
Which model do you suggest for image captioning, i.e. generating text descriptions of images?
In this codebase we have some coca models. There is sample code at https://colab.research.google.com/github/mlfoundations/open_clip/blob/master/docs/Interacting_with_open_coca.ipynb
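For readers who do not want to open the notebook, a minimal sketch of caption generation with a CoCa checkpoint might look like the following. It assumes the `open_clip_torch`, `torch`, and `Pillow` packages are installed; `caption_image` and `clean_caption` are hypothetical helper names, and the checkpoint shown is one of the `coca_` entries returned by `open_clip.list_pretrained()`.

```python
def clean_caption(decoded):
    """Strip the tokenizer's start/end markers from a decoded caption."""
    return decoded.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()


def caption_image(image_path,
                  model_name="coca_ViT-L-14",
                  pretrained="mscoco_finetuned_laion2b_s13b_b90k"):
    """Generate one caption for one image with a CoCa model."""
    # Heavy imports are done lazily so clean_caption stays dependency-free.
    import torch
    import open_clip
    from PIL import Image

    model, _, transform = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    model.eval()

    image = transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        generated = model.generate(image)  # token ids, one row per image

    return clean_caption(open_clip.decode(generated[0]))


# Example call (placeholder path):
# print(caption_image("cat.jpg"))
```

The model is loaded on every call here for simplicity; when captioning many images it is cheaper to load it once and reuse it.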
amazing thank you so much
Where can I get the full list of models such as coca_ViT-L-14? Is that the best model?
You can see all of our pretrained models with open_clip.list_pretrained(). Look for the coca_ prefix. See https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv for some zero-shot classification and retrieval results.
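As a small sketch of that lookup, the pairs returned by `open_clip.list_pretrained()` can be filtered by architecture prefix. The helper names `checkpoints_with_prefix` and `list_coca_checkpoints` are hypothetical; only `open_clip.list_pretrained()` comes from the library.

```python
def checkpoints_with_prefix(pairs, prefix="coca_"):
    """Keep the (architecture, pretrained_tag) pairs whose architecture starts with prefix."""
    return [(arch, tag) for arch, tag in pairs if arch.startswith(prefix)]


def list_coca_checkpoints():
    """Return only the CoCa entries from open_clip's pretrained registry."""
    import open_clip  # imported lazily so the helper above has no dependencies
    return checkpoints_with_prefix(open_clip.list_pretrained())


# Example call:
# for arch, tag in list_coca_checkpoints():
#     print(arch, tag)
```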
Thank you so much. This one looks best in that file:
coca_ViT-L-14 | laion2b_s13b_b90k
@gabrielilharco 1 final question
do any of these models support image caption generation other than coca ones?
[('RN50', 'openai'),
('RN50', 'yfcc15m'),
('RN50', 'cc12m'),
('RN50-quickgelu', 'openai'),
('RN50-quickgelu', 'yfcc15m'),
('RN50-quickgelu', 'cc12m'),
('RN101', 'openai'),
('RN101', 'yfcc15m'),
('RN101-quickgelu', 'openai'),
('RN101-quickgelu', 'yfcc15m'),
('RN50x4', 'openai'),
('RN50x16', 'openai'),
('RN50x64', 'openai'),
('ViT-B-32', 'openai'),
('ViT-B-32', 'laion400m_e31'),
('ViT-B-32', 'laion400m_e32'),
('ViT-B-32', 'laion2b_e16'),
('ViT-B-32', 'laion2b_s34b_b79k'),
('ViT-B-32', 'datacomp_xl_s13b_b90k'),
('ViT-B-32', 'datacomp_m_s128m_b4k'),
('ViT-B-32', 'commonpool_m_clip_s128m_b4k'),
('ViT-B-32', 'commonpool_m_laion_s128m_b4k'),
('ViT-B-32', 'commonpool_m_image_s128m_b4k'),
('ViT-B-32', 'commonpool_m_text_s128m_b4k'),
('ViT-B-32', 'commonpool_m_basic_s128m_b4k'),
('ViT-B-32', 'commonpool_m_s128m_b4k'),
('ViT-B-32', 'datacomp_s_s13m_b4k'),
('ViT-B-32', 'commonpool_s_clip_s13m_b4k'),
('ViT-B-32', 'commonpool_s_laion_s13m_b4k'),
('ViT-B-32', 'commonpool_s_image_s13m_b4k'),
('ViT-B-32', 'commonpool_s_text_s13m_b4k'),
('ViT-B-32', 'commonpool_s_basic_s13m_b4k'),
('ViT-B-32', 'commonpool_s_s13m_b4k'),
('ViT-B-32-256', 'datacomp_s34b_b86k'),
('ViT-B-32-quickgelu', 'openai'),
('ViT-B-32-quickgelu', 'laion400m_e31'),
('ViT-B-32-quickgelu', 'laion400m_e32'),
('ViT-B-16', 'openai'),
('ViT-B-16', 'laion400m_e31'),
('ViT-B-16', 'laion400m_e32'),
('ViT-B-16', 'laion2b_s34b_b88k'),
('ViT-B-16', 'datacomp_xl_s13b_b90k'),
('ViT-B-16', 'datacomp_l_s1b_b8k'),
('ViT-B-16', 'commonpool_l_clip_s1b_b8k'),
('ViT-B-16', 'commonpool_l_laion_s1b_b8k'),
('ViT-B-16', 'commonpool_l_image_s1b_b8k'),
('ViT-B-16', 'commonpool_l_text_s1b_b8k'),
('ViT-B-16', 'commonpool_l_basic_s1b_b8k'),
('ViT-B-16', 'commonpool_l_s1b_b8k'),
('ViT-B-16-plus-240', 'laion400m_e31'),
('ViT-B-16-plus-240', 'laion400m_e32'),
('ViT-L-14', 'openai'),
('ViT-L-14', 'laion400m_e31'),
('ViT-L-14', 'laion400m_e32'),
('ViT-L-14', 'laion2b_s32b_b82k'),
('ViT-L-14', 'datacomp_xl_s13b_b90k'),
('ViT-L-14', 'commonpool_xl_clip_s13b_b90k'),
('ViT-L-14', 'commonpool_xl_laion_s13b_b90k'),
('ViT-L-14', 'commonpool_xl_s13b_b90k'),
('ViT-L-14-336', 'openai'),
('ViT-H-14', 'laion2b_s32b_b79k'),
('ViT-g-14', 'laion2b_s12b_b42k'),
('ViT-g-14', 'laion2b_s34b_b88k'),
('ViT-bigG-14', 'laion2b_s39b_b160k'),
('roberta-ViT-B-32', 'laion2b_s12b_b32k'),
('xlm-roberta-base-ViT-B-32', 'laion5b_s13b_b90k'),
('xlm-roberta-large-ViT-H-14', 'frozen_laion5b_s13b_b90k'),
('convnext_base', 'laion400m_s13b_b51k'),
('convnext_base_w', 'laion2b_s13b_b82k'),
('convnext_base_w', 'laion2b_s13b_b82k_augreg'),
('convnext_base_w', 'laion_aesthetic_s13b_b82k'),
('convnext_base_w_320', 'laion_aesthetic_s13b_b82k'),
('convnext_base_w_320', 'laion_aesthetic_s13b_b82k_augreg'),
('convnext_large_d', 'laion2b_s26b_b102k_augreg'),
('convnext_large_d_320', 'laion2b_s29b_b131k_ft'),
('convnext_large_d_320', 'laion2b_s29b_b131k_ft_soup'),
('convnext_xxlarge', 'laion2b_s34b_b82k_augreg'),
('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_rewind'),
('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_soup'),
('coca_ViT-B-32', 'laion2b_s13b_b90k'),
('coca_ViT-B-32', 'mscoco_finetuned_laion2b_s13b_b90k'),
('coca_ViT-L-14', 'laion2b_s13b_b90k'),
('coca_ViT-L-14', 'mscoco_finetuned_laion2b_s13b_b90k'),
('EVA01-g-14', 'laion400m_s11b_b41k'),
('EVA01-g-14-plus', 'merged2b_s11b_b114k'),
('EVA02-B-16', 'merged2b_s8b_b131k'),
('EVA02-L-14', 'merged2b_s4b_b131k'),
('EVA02-L-14-336', 'merged2b_s6b_b61k'),
('EVA02-E-14', 'laion2b_s4b_b115k'),
('EVA02-E-14-plus', 'laion2b_s9b_b144k')]
It turns out that using ViT-bigG-14/laion2b_s39b_b160k for image captioning is possible: https://twitter.com/GozukaraFurkan/status/1711933282529452115
a drawing of a man with curly hair and glasses, portrait of ultra realistic, benjamin vnuk, silvain sarrailh, master artist, with long curly, a beautiful artwork illustration, sharp high detail illustration, by Fedot Sychkov, handsome girl, 8 k illustration, portrait illustration
Looks like you are using BLIP2 there?
BLIP2 + ViT-bigG-14/laion2b_s39b_b160k
BLIP2 on its own produces nothing like that
I want to use ViT-bigG-14 / laion2b_s39b_b160k to generate captions for a given folder of images
and save each caption under the same file name as its image
Thank you so much
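A rough sketch of the folder workflow described in this request follows. Note the assumption it makes: as stated earlier in the thread, contrastive checkpoints such as ViT-bigG-14 cannot generate text by themselves, so this sketch swaps in a CoCa checkpoint as the caption generator. `caption_folder`, `load_coca_captioner`, and `IMAGE_SUFFIXES` are hypothetical names, and any function that maps an image path to a string can be passed as `captioner`.

```python
from pathlib import Path

# File extensions treated as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".bmp"}


def caption_folder(folder, captioner, overwrite=False):
    """Write <image stem>.txt next to every image in folder; return the files written."""
    written = []
    for image_path in sorted(Path(folder).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        out_path = image_path.with_suffix(".txt")  # same file name, .txt extension
        if out_path.exists() and not overwrite:
            continue  # keep captions that already exist
        out_path.write_text(captioner(image_path), encoding="utf-8")
        written.append(out_path)
    return written


def load_coca_captioner(model_name="coca_ViT-L-14",
                        pretrained="mscoco_finetuned_laion2b_s13b_b90k"):
    """Load a CoCa checkpoint once and return a function: image path -> caption."""
    import torch
    import open_clip
    from PIL import Image

    model, _, transform = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    model.eval()

    def captioner(image_path):
        image = transform(Image.open(image_path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            generated = model.generate(image)
        text = open_clip.decode(generated[0])
        return text.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()

    return captioner


# Example call (placeholder folder):
# caption_folder("my_images", load_coca_captioner())
```

Passing the captioner in as a function keeps the file handling separate from the model, so a different generator (for example a BLIP2 pipeline) could be substituted without touching `caption_folder`.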
You only have this example which is not helpful