Hello, although this is not currently enabled with this repo, we think this is feasible. Say you have an image x and a huge bank of possible captions t_1, ..., t_n. You could get image/text features by running everything through CLIP, e.g. f_im = model.encode_image(x) and text features f_j = model.encode_text(t_j) for j in 1, ..., n. Then you could use a nearest-neighbors library like faiss to find the nearest neighbor to f_im among f_1, ..., f_n. That nearest neighbor could be used as the generated words for that image.
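If it helps, here's a minimal sketch of that recipe, assuming open_clip's create_model_and_transforms / tokenize API plus faiss; the model name, pretrained tag, caption bank, and image path below are just placeholders:

```python
import faiss
import numpy as np
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
model = model.to(device).eval()

captions = ["a photo of a mushroom", "a marble sculpture", "a spaceship"]  # t_1, ..., t_n

with torch.no_grad():
    # f_j = model.encode_text(t_j) for every caption in the bank
    text_features = model.encode_text(open_clip.tokenize(captions).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)  # unit length -> inner product == cosine

    # f_im = model.encode_image(x)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Index the caption features, then look up the captions nearest to the image feature
index = faiss.IndexFlatIP(text_features.shape[-1])
index.add(text_features.cpu().numpy().astype(np.float32))
scores, ids = index.search(image_features.cpu().numpy().astype(np.float32), 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {captions[i]}")
```

With a real caption bank you'd encode the captions in batches and build/save the index once, then only encode new images at query time.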
Alternatively, if you don't have a bank of possible captions, you can try performing an automated search for the prompts that maximize agreement with the image. A good starting point is the method from Shin et al., 2020 (https://arxiv.org/abs/2010.15980):
https://ucinlp.github.io/autoprompt/
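To be clear, Shin et al.'s AutoPrompt does a gradient-guided token search; purely to illustrate the "search for prompts that maximize agreement" idea, here is a much cruder, hedged stand-in - a greedy word-by-word search over an assumed vocabulary list (vocab, max_words, and the untemplated prompt are all made up for the sketch):

```python
import torch
import open_clip

@torch.no_grad()
def greedy_prompt_search(model, image_feat, vocab, max_words=5, device="cpu"):
    """Greedily append whichever vocab word most increases CLIP image-text similarity."""
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)  # (1, d)
    prompt = []
    for _ in range(max_words):
        candidates = [" ".join(prompt + [w]) for w in vocab]
        text_feat = model.encode_text(open_clip.tokenize(candidates).to(device))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)  # (n, d)
        scores = (text_feat @ image_feat.T).squeeze(-1)               # cosine similarity per candidate
        prompt.append(vocab[scores.argmax().item()])
    return " ".join(prompt)
```

This brute-forces the whole vocabulary at every step, so it only makes sense for small word lists; AutoPrompt's gradient-based candidate selection is what makes the real search tractable.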
With a bit of GitHub digging for "faiss clip", I got a hit on this repo by @ps-auxw. It seems like he has done it: https://github.com/ps-auxw/CLI-P
I'll ask if @ps-auxw can adapt his repo to use this open_clip.
UPDATE - so if anyone's interested, there's a neat way to install faiss using just pip: https://pypi.org/project/faiss-gpu/
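A quick sanity check after installing (assuming a CUDA machine; use the faiss-cpu package otherwise):

```python
import faiss

# On the faiss-gpu build this should print >= 1 if the GPU is visible
print(faiss.get_num_gpus())
```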
This repo above has 2 steps.
"Say you have an image x and a huge bank of possible captions t_1, ..., t_n." Here is said dataset of ~12 million captions: https://github.com/google-research-datasets/conceptual-12m
UPDATE 2. This repo - https://github.com/johndpope/rerank/blob/main/data/prepare_data.py - seems to take captions (from a variety of datasets) and images, index them, and include some retrieval via k-nearest neighbors: https://github.com/RitaRamo/rerank/blob/993fb49df843ba8c5a3567aa97c0e5382ecbe48e/src/toolkit/data/datasets.py (def retrieve_nearest_for_train_query(self, query_img, k=2))
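At a glance, a helper like that just queries a prebuilt nearest-neighbor index for the k closest caption embeddings. A hedged guess at its shape (not the rerank repo's actual code; the index and caption list are assumptions):

```python
import numpy as np

def retrieve_nearest_captions(index, captions, query_feat, k=2):
    """Return the k captions whose embeddings sit closest to query_feat in a
    prebuilt faiss index (a guess at what retrieve_nearest_for_train_query does)."""
    query = np.asarray(query_feat, dtype=np.float32).reshape(1, -1)
    _, ids = index.search(query, k)
    return [captions[i] for i in ids[0]]
```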
UPDATE 3. Found this, which looks more scalable than option 2: https://github.com/rom1504/clip-retrieval
I believe this does it: https://github.com/dzryk/clip-grams
Great, thanks for linking!
So - I've been looking into some code for VQGAN: https://github.com/mehdidc/feed_forward_vqgan_clip and https://github.com/nerdyrodent/VQGAN-CLIP
These let the user pass a prompt to style / generate an image. Here are some examples using code from @nerdyrodent: https://github.com/nerdyrodent/VQGAN-CLIP/issues/13
Must see - https://twitter.com/e08477/status/1418440857578098691?s=21. There are only 4 images generated with a prompt, e.g. mushroom, spaceship, volcano, old English house on a hill (might be wrong). But then as you look down, these have predicate prompts that style / shape the image differently, e.g.
Mushroom + marble sculpture.
What I want is to give an image to CLIP and have it tell me what it thinks the words should be. Is this feasible / achievable? Does this repo provide any way into this? Does it need dimensionality reduction? It's like the t-SNE problem (showing word2vec in 2 dimensions), but under the hood it's 512 dimensions? I've yet to look at the code - maybe it will become clearer.
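For what it's worth on the dimensionality question: the ViT-B/32 CLIP models emit 512-dimensional features, and faiss can do exact nearest-neighbor search in that space directly, so no t-SNE-style reduction is needed for retrieval (only for visualization). A quick check, with the model name assumed:

```python
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
with torch.no_grad():
    dummy = torch.zeros(1, 3, 224, 224)      # ViT-B-32 expects 224x224 inputs
    print(model.encode_image(dummy).shape)   # torch.Size([1, 512])
```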