Hi,
Yes, searching from image to text is already supported.
If you pass single words as captions it will work, although you may find that using prompt templates like https://github.com/LAION-AI/CLIP_benchmark/blob/main/clip_benchmark/datasets/en_zeroshot_classification_templates.json works better.
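For example, a minimal sketch (not from this repo; it assumes open_clip, the ViT-B-32 laion2b_s34b_b79k checkpoint, and an example tag "dog") of averaging a few template prompts into one tag embedding:

```python
# Sketch: embed a single tag by averaging over a few prompt templates.
# The templates and the tag "dog" are illustrative, not from the linked file.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

templates = ["a photo of a {}.", "a close-up photo of a {}.", "a drawing of a {}."]
tag = "dog"  # example tag

with torch.no_grad():
    tokens = tokenizer([t.format(tag) for t in templates])
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)   # normalize each prompt embedding
    tag_embedding = emb.mean(dim=0)              # average over templates
    tag_embedding = tag_embedding / tag_embedding.norm()
```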
Ok, got it, thanks. But generally speaking, would it also be possible to train a CLIP model on single tags (single strings) from scratch? Obviously one would lose semantic information. I would like to build a tag-based system rather than a sentence-based one (for easier usability, so I don't have to design the right prompts). Just wondering what your thoughts on that are.
I had something like A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics in mind, but I figured I could also use CLIP instead of CCA to put the images and tags into the same latent space.
Yes, it should work.
Maybe you can even fine-tune an existing model, or even freeze the image tower and train only the text tower; openclip supports that and it is much faster to train (up to 10x for the same data).
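A minimal sketch of the frozen-image-tower idea (assumptions: open_clip with a pretrained ViT-B-32 checkpoint; this freezes the tower manually rather than using openclip's training script options):

```python
# Sketch: load a pretrained CLIP, freeze the image tower, train only the text tower.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")

for p in model.visual.parameters():   # image tower: no gradients, stays frozen
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]  # text tower + logit scale
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```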
Ok that's a great tip actually, thanks!
Just to understand the workflow correctly: given an image embedding as a query, would I retrieve only one tag in this situation, or all of them separately, or would I need to query the space, look for the nearest-neighbour image embeddings, and check their tags? In my case each image would have multiple tags.
You can build an index of tags (encoded with the text encoder) and then query it with an image embedding. A single kNN query returns k results, hence k tags.
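A minimal sketch of that index-and-query step (assuming faiss and numpy; the random arrays below stand in for real, L2-normalized CLIP text and image embeddings):

```python
# Sketch: index tag embeddings, query with an image embedding, get k tags back.
import faiss
import numpy as np

tags = ["dog", "cat", "car", "tree"]                       # example tag vocabulary
dim = 512
tag_embeddings = np.random.rand(len(tags), dim).astype("float32")
tag_embeddings /= np.linalg.norm(tag_embeddings, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)       # inner product == cosine similarity on unit vectors
index.add(tag_embeddings)

image_embedding = np.random.rand(1, dim).astype("float32")
image_embedding /= np.linalg.norm(image_embedding)

k = 3
scores, ids = index.search(image_embedding, k)
print([tags[i] for i in ids[0]])     # the k nearest tags for this image
```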
Alright, thanks!
Could you point me to where in the code the image-to-text part is done? I wanted to implement the feature for my own model but am struggling a bit. Also, would it generate a new text for the image, or would it take the text nearest to it in the latent space from among those it was trained on?
https://github.com/mlfoundations/open_clip is the training code for CLIP.
You may be interested in reading https://rom1504.medium.com/semantic-search-with-embeddings-index-anything-8fb18556443c to understand more about semantic search.
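Putting the pieces together, here is a minimal end-to-end sketch of image-to-tag retrieval (assumptions: open_clip with a pretrained ViT-B-32 checkpoint, a local image "query.jpg", and the example tags below; retrieval picks the nearest existing tag rather than generating new text):

```python
# Sketch: encode an image and a tag vocabulary with CLIP, return the nearest tag.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

tags = ["dog", "cat", "car", "tree"]  # example tag vocabulary

with torch.no_grad():
    text_emb = model.encode_text(tokenizer(tags))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("query.jpg")).unsqueeze(0)
    img_emb = model.encode_image(image)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

similarity = (img_emb @ text_emb.T).squeeze(0)   # cosine similarities, one per tag
print(tags[similarity.argmax().item()])          # nearest tag in the shared space
```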
Hi, I have 2 questions:
Thanks!