rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Add the option to select the openclip model #284

Closed · barinov274 closed this 10 months ago

barinov274 commented 1 year ago

There are quite a few OpenCLIP models, but I specifically need laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K. I looked at the load_model function: it parses the --clip_model argument, and if the string starts with open_clip:, it calls the load_open_clip function, which actually loads the OpenCLIP model:

        clip_model = clip_model[len("open_clip:") :]
        model, preprocess = load_open_clip(clip_model, use_jit, device, clip_cache_path)

That's all fine, but I would then expect the string after open_clip: to be parsed further, so that after a model name like ViT-L-14 I could specify which pretrained checkpoint I want to download. Instead, I saw this:

    pretrained = dict(open_clip.list_pretrained())
    checkpoint = pretrained[clip_model]
    model, _, preprocess = open_clip.create_model_and_transforms(
        clip_model, pretrained=checkpoint, device=device, jit=use_jit, cache_dir=clip_cache_path
    )
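
To illustrate why that lookup is a problem, here is a minimal sketch (not code from the repo): Python's dict() keeps only the last pretrained tag it sees for each architecture, so every other checkpoint for that architecture becomes unreachable.

    import open_clip

    pairs = open_clip.list_pretrained()  # [(model_name, pretrained_tag), ...]
    collapsed = dict(pairs)              # keeps only one tag per model name

    vit_l14_tags = [tag for model, tag in pairs if model == "ViT-L-14"]
    print(vit_l14_tags)            # several tags are available for ViT-L-14
    print(collapsed["ViT-L-14"])   # but only whichever tag happened to be listed last survives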

So a model the user knows nothing about gets downloaded to their machine, and they have no way to pick the one they actually want. Meanwhile, look at the wide variety of models the open_clip library offers:

 ('RN50', 'yfcc15m'),
 ('RN50', 'cc12m'),
 ('RN50-quickgelu', 'openai'),
 ('RN50-quickgelu', 'yfcc15m'),
 ('RN50-quickgelu', 'cc12m'),
 ('RN101', 'openai'),
 ('RN101', 'yfcc15m'),
 ('RN101-quickgelu', 'openai'),
 ('RN101-quickgelu', 'yfcc15m'),
 ('RN50x4', 'openai'),
 ('RN50x16', 'openai'),
 ('RN50x64', 'openai'),
 ('ViT-B-32', 'openai'),
 ('ViT-B-32', 'laion400m_e31'),
 ('ViT-B-32', 'laion400m_e32'),
 ('ViT-B-32', 'laion2b_e16'),
 ('ViT-B-32', 'laion2b_s34b_b79k'),
 ('ViT-B-32', 'datacomp_m_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_clip_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_laion_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_image_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_text_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_basic_s128m_b4k'),
 ('ViT-B-32', 'commonpool_m_s128m_b4k'),
 ('ViT-B-32', 'datacomp_s_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_clip_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_laion_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_image_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_text_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_basic_s13m_b4k'),
 ('ViT-B-32', 'commonpool_s_s13m_b4k'),
 ('ViT-B-32-quickgelu', 'openai'),
 ('ViT-B-32-quickgelu', 'laion400m_e31'),
 ('ViT-B-32-quickgelu', 'laion400m_e32'),
 ('ViT-B-16', 'openai'),
 ('ViT-B-16', 'laion400m_e31'),
 ('ViT-B-16', 'laion400m_e32'),
 ('ViT-B-16', 'laion2b_s34b_b88k'),
 ('ViT-B-16', 'datacomp_l_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_clip_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_laion_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_image_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_text_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_basic_s1b_b8k'),
 ('ViT-B-16', 'commonpool_l_s1b_b8k'),
 ('ViT-B-16-plus-240', 'laion400m_e31'),
 ('ViT-B-16-plus-240', 'laion400m_e32'),
 ('ViT-L-14', 'openai'),
 ('ViT-L-14', 'laion400m_e31'),
 ('ViT-L-14', 'laion400m_e32'),
 ('ViT-L-14', 'laion2b_s32b_b82k'),
 ('ViT-L-14', 'datacomp_xl_s13b_b90k'),
 ('ViT-L-14', 'commonpool_xl_clip_s13b_b90k'),
 ('ViT-L-14', 'commonpool_xl_laion_s13b_b90k'),
 ('ViT-L-14', 'commonpool_xl_s13b_b90k'),
 ('ViT-L-14-336', 'openai'),
 ('ViT-H-14', 'laion2b_s32b_b79k'),
 ('ViT-g-14', 'laion2b_s12b_b42k'),
 ('ViT-g-14', 'laion2b_s34b_b88k'),
 ('ViT-bigG-14', 'laion2b_s39b_b160k'),
 ('roberta-ViT-B-32', 'laion2b_s12b_b32k'),
 ('xlm-roberta-base-ViT-B-32', 'laion5b_s13b_b90k'),
 ('xlm-roberta-large-ViT-H-14', 'frozen_laion5b_s13b_b90k'),
 ('convnext_base', 'laion400m_s13b_b51k'),
 ('convnext_base_w', 'laion2b_s13b_b82k'),
 ('convnext_base_w', 'laion2b_s13b_b82k_augreg'),
 ('convnext_base_w', 'laion_aesthetic_s13b_b82k'),
 ('convnext_base_w_320', 'laion_aesthetic_s13b_b82k'),
 ('convnext_base_w_320', 'laion_aesthetic_s13b_b82k_augreg'),
 ('convnext_large_d', 'laion2b_s26b_b102k_augreg'),
 ('convnext_large_d_320', 'laion2b_s29b_b131k_ft'),
 ('convnext_large_d_320', 'laion2b_s29b_b131k_ft_soup'),
 ('convnext_xxlarge', 'laion2b_s34b_b82k_augreg'),
 ('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_rewind'),
 ('convnext_xxlarge', 'laion2b_s34b_b82k_augreg_soup'),
 ('coca_ViT-B-32', 'laion2b_s13b_b90k'),
 ('coca_ViT-B-32', 'mscoco_finetuned_laion2b_s13b_b90k'),
 ('coca_ViT-L-14', 'laion2b_s13b_b90k'),
 ('coca_ViT-L-14', 'mscoco_finetuned_laion2b_s13b_b90k'),
 ('EVA01-g-14', 'laion400m_s11b_b41k'),
 ('EVA01-g-14-plus', 'merged2b_s11b_b114k'),
 ('EVA02-B-16', 'merged2b_s8b_b131k'),
 ('EVA02-L-14', 'merged2b_s4b_b131k'),
 ('EVA02-L-14-336', 'merged2b_s6b_b61k'),
 ('EVA02-E-14', 'laion2b_s4b_b115k'),
 ('EVA02-E-14-plus', 'laion2b_s9b_b144k')]

I've proposed a commit so you can choose a model by typing clip-retrieval inference --clip_model "open_clip:ViT-L-14 | datacomp_xl_s13b_b90k" ... And even if you don't set a checkpoint after "|", you'll still see a line telling you which model is being loaded: print(f"Loading OpenClip model {model} with {checkpoint} checkpoint")
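
A rough sketch of the idea (illustrative only, not the exact code in the commit):

    import open_clip

    def load_open_clip(clip_model, use_jit=True, device="cuda", clip_cache_path=None):
        # Accept "ViT-L-14 | datacomp_xl_s13b_b90k"; fall back to the old
        # lookup when no checkpoint is given after "|".
        if "|" in clip_model:
            model_name, checkpoint = (part.strip() for part in clip_model.split("|", 1))
        else:
            model_name = clip_model.strip()
            checkpoint = dict(open_clip.list_pretrained())[model_name]
        print(f"Loading OpenClip model {model_name} with {checkpoint} checkpoint")
        model, _, preprocess = open_clip.create_model_and_transforms(
            model_name, pretrained=checkpoint, device=device, jit=use_jit, cache_dir=clip_cache_path
        )
        return model, preprocess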

rom1504 commented 1 year ago

Can you fix the lint?

And maybe add one test case here: https://github.com/barinov274/clip-retrieval/blob/patch-1/tests/test_clip_inference/test_mapper.py#L9
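
Something along these lines (just a sketch; the import path and test name are assumptions, and it would need to be adapted to how that test file builds its mapper). Using a small ViT-B-32 checkpoint keeps the download manageable:

    import pytest

    from clip_retrieval.load_clip import load_open_clip  # assumed import path

    @pytest.mark.parametrize("spec", ["ViT-B-32|laion2b_s34b_b79k"])
    def test_open_clip_checkpoint_selection(spec):
        # The "model|checkpoint" spec should load the requested pretrained weights.
        model, preprocess = load_open_clip(spec, use_jit=False, device="cpu", clip_cache_path=None)
        assert callable(preprocess)
        assert hasattr(model, "encode_text") and hasattr(model, "encode_image")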

raunakdoesdev commented 12 months ago

pls merge

rom1504 commented 10 months ago

That seems important, but there is no test.

Also, I am not convinced by the " | " syntax; ";" may be better.

Spaces have bad properties for shell arguments.

rom1504 commented 10 months ago

Thanks, I merged this into https://github.com/rom1504/clip-retrieval/pull/314