rom1504 / clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
https://rom1504.github.io/clip-retrieval/
MIT License

Questions about usage and training #215

Closed: justlike-prog closed this issue 1 year ago

justlike-prog commented 1 year ago

Hi, I have two questions:

Thanks!

rom1504 commented 1 year ago

Hi,

Yes, searching image to text is already supported.

If you pass single words as captions it will work, though you may find that using prompt templates like https://github.com/LAION-AI/CLIP_benchmark/blob/main/clip_benchmark/datasets/en_zeroshot_classification_templates.json works better.
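
For example, the usual recipe is to embed the tag with each template, then average and re-normalize. A rough, untested sketch using open_clip (the two template strings are just illustrative; the linked file has many more):

```python
# Untested sketch: embedding a single tag via prompt templates, assuming open_clip.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

templates = ["a photo of a {}.", "a close-up photo of a {}."]  # illustrative only
tag = "dog"

with torch.no_grad():
    # Encode every templated prompt for the tag.
    tokens = tokenizer([t.format(tag) for t in templates])
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    # Average the prompt embeddings and re-normalize to get one tag embedding.
    tag_embedding = emb.mean(dim=0)
    tag_embedding = tag_embedding / tag_embedding.norm()
```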

justlike-prog commented 1 year ago

Ok, got it, thanks. But generally speaking, it would also be possible to train a CLIP model from scratch on single tags (single strings), right? Obviously one would lose some semantic information. I would like to build a tag-based system rather than a sentence-based one (for easier usability: no need to design correct prompts). Just wondering about your thoughts on it.

I was thinking of something like "A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics", but I figured I could also use CLIP instead of CCA to put the images and tags into the same latent space.

rom1504 commented 1 year ago

yes it should work

maybe you can even fine-tune an existing model, or even freeze the image tower and train only the text tower; openclip supports that and it's much faster to train (up to 10x for the same data)
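
If you go the "freeze the image tower" route, something like this should work (untested sketch with open_clip; the training loop and data loading are omitted, and I think open_clip's training script also exposes a --lock-image flag for the same thing, check its --help):

```python
# Untested sketch: load a pretrained CLIP, freeze the image tower,
# and train only the text tower. Assumes open_clip is installed.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)

# Disable gradients for every parameter of the image tower.
for param in model.visual.parameters():
    param.requires_grad = False

# Optimize only what still requires gradients (text tower + logit scale).
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)

# ... standard contrastive training loop over (image, tag) batches goes here ...
```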

justlike-prog commented 1 year ago

Ok that's a great tip actually, thanks!

Just to understand the workflow correctly: given an image embedding as a query, would I retrieve only one tag, or each tag separately, or would I need to actually query the space for nearest-neighbour image embeddings and check their tags? In my case each image would have multiple tags.

rom1504 commented 1 year ago

You can build an index of tags (encoded with the text encoder), then query it with an image embedding. One knn query gives you k results, hence k tags.
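
Rough, untested sketch of that (assumes open_clip and faiss; the tag list and image path are placeholders):

```python
# Untested sketch: build a faiss index of tag embeddings, then query it
# with an image embedding to get the k nearest tags.
import faiss
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

tags = ["dog", "cat", "beach", "sunset", "car"]  # placeholder tag vocabulary

with torch.no_grad():
    # Encode all tags once and L2-normalize so inner product == cosine similarity.
    tag_emb = model.encode_text(tokenizer(tags))
    tag_emb = tag_emb / tag_emb.norm(dim=-1, keepdim=True)

# Flat inner-product index over the tag embeddings.
index = faiss.IndexFlatIP(tag_emb.shape[1])
index.add(tag_emb.cpu().numpy())

with torch.no_grad():
    # Encode the query image and normalize it the same way.
    image = preprocess(Image.open("query.jpg")).unsqueeze(0)  # placeholder path
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

# One knn query returns the k nearest tags for the image.
k = 3
scores, ids = index.search(img_emb.cpu().numpy(), k)
print([tags[i] for i in ids[0]])
```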

justlike-prog commented 1 year ago

Alright, thanks!

Could you point me to where in the code the image-to-text part is done? I wanted to implement the feature for my own model but am struggling a bit. Also, would it create a new text for the image, or would it take whichever of the texts it trained on is nearest in the latent space?

rom1504 commented 1 year ago

https://github.com/mlfoundations/open_clip is the training code for CLIP

You may be interested in reading https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c to understand more about semantic search.
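
Note that CLIP itself never generates text: "image to text" just means ranking existing candidate texts (whatever set you index, e.g. your tags) by embedding similarity. At its core it is something like this (untested sketch with open_clip; the candidate texts and image path are placeholders, and this is not how clip-retrieval's code is literally structured):

```python
# Untested sketch: rank a fixed set of candidate texts for an image.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

candidates = ["a dog on a beach", "a city street at night"]  # placeholder texts

with torch.no_grad():
    image = preprocess(Image.open("query.jpg")).unsqueeze(0)  # placeholder path
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(tokenizer(candidates))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each candidate text; highest wins.
similarity = (img_emb @ txt_emb.T).squeeze(0)
print(candidates[similarity.argmax().item()])
```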