mlfoundations / open_clip

An open source implementation of CLIP.

Generating prompts from an image #1

Closed. johndpope closed this issue 3 years ago.

johndpope commented 3 years ago

So, I've been looking into some code for VQGAN: https://github.com/mehdidc/feed_forward_vqgan_clip and https://github.com/nerdyrodent/VQGAN-CLIP

They let the user pass a prompt to style / generate an image. Here are some examples using code from @nerdyrodent: https://github.com/nerdyrodent/VQGAN-CLIP/issues/13

Must see: https://twitter.com/e08477/status/1418440857578098691?s=21. There, only 4 images are generated from a base prompt, e.g. mushroom, spaceship, volcano, old English house on a hill (might be wrong). But then as you look down, these have predicate prompts that style / shape the image differently.

Mushroom + marble sculpture.

What I want is to give an image to CLIP and have it tell me what it thinks the words should be. Is this feasible / achievable? Does this repo provide any way into this? Does it need dimensionality reduction? It feels like a t-SNE problem (show word2vec in 2 dimensions?), except under the hood it's 512 dimensions. I'm yet to look at the code; maybe it will become clearer.

mitchellnw commented 3 years ago

Hello, although this is not currently supported by this repo, we think it is feasible.

Say you have an image x and a huge bank of possible captions t_1,...,t_n.

You could get image and text features by running everything through CLIP, e.g. f_im = model.encode_image(x) and f_j = model.encode_text(t_j) for j in 1,...,n.

Then you could use a nearest-neighbor library like faiss to find the nearest neighbor to f_im among f_1,...,f_n.

The caption of that nearest neighbor could be used as the generated words for that image.
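For concreteness, here is a minimal sketch of that pipeline. It is illustrative only, not something this repo ships: it assumes the current open_clip create_model_and_transforms / get_tokenizer API plus the faiss package, and the "query.jpg" path and the tiny caption list are placeholders.

```python
# Illustrative sketch only (not part of this repo). Assumes the current
# open_clip create_model_and_transforms / get_tokenizer API and faiss;
# "query.jpg" and the tiny caption list are placeholders.
import faiss
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# The caption bank t_1, ..., t_n (toy example)
captions = ["a mushroom", "a spaceship", "a volcano",
            "an old english house on a hill"]

with torch.no_grad():
    # f_j = model.encode_text(t_j), L2-normalized so inner product = cosine similarity
    text_feats = F.normalize(model.encode_text(tokenizer(captions)), dim=-1)
    # f_im = model.encode_image(x)
    image = preprocess(Image.open("query.jpg")).unsqueeze(0)
    img_feat = F.normalize(model.encode_image(image), dim=-1)

# Nearest-neighbor search over the caption bank with faiss
index = faiss.IndexFlatIP(int(text_feats.shape[1]))
index.add(text_feats.numpy().astype("float32"))
scores, ids = index.search(img_feat.numpy().astype("float32"), 3)
for score, idx in zip(scores[0], ids[0]):
    print(f"{captions[idx]}  (cosine similarity {score:.3f})")
```

Normalizing both sides turns the inner-product index into cosine similarity, which is how CLIP compares image and text embeddings.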

gabrielilharco commented 3 years ago

Alternatively, if you don't have a bank of possible captions, you can try performing an automated search for the prompts that maximize agreement with the image. A good starting point is the method from Shin et al., 2020 (https://arxiv.org/abs/2010.15980).
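As a rough illustration of the idea (a much simpler stand-in for the gradient-guided search of Shin et al., not their actual method), one could greedily grow a prompt from a small candidate vocabulary, keeping whichever word most increases CLIP similarity with the image. The sketch below assumes the current open_clip API; the vocabulary and "query.jpg" are placeholders.

```python
# Toy stand-in for automated prompt search: greedy hill-climbing over a small
# candidate vocabulary, much simpler than the gradient-guided search of
# Shin et al., 2020. Assumes the current open_clip API; the vocabulary and
# "query.jpg" are placeholders.
import torch
import torch.nn.functional as F
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

vocab = ["mushroom", "marble", "sculpture", "spaceship", "volcano",
         "house", "hill", "painting", "photograph"]

with torch.no_grad():
    image = preprocess(Image.open("query.jpg")).unsqueeze(0)
    img_feat = F.normalize(model.encode_image(image), dim=-1)

def clip_score(prompt: str) -> float:
    """Cosine similarity between the image and a candidate prompt."""
    with torch.no_grad():
        txt_feat = F.normalize(model.encode_text(tokenizer([prompt])), dim=-1)
    return float(img_feat @ txt_feat.T)

words = []
for _ in range(3):  # grow the prompt one word at a time
    words.append(max(vocab, key=lambda w: clip_score(" ".join(words + [w]))))

print("found prompt:", " ".join(words), "score:", clip_score(" ".join(words)))
```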

johndpope commented 3 years ago

https://ucinlp.github.io/autoprompt/

With a bit of GitHub digging for "faiss clip", I got a hit on this repo by @ps-auxw. It seems like he has done it: https://github.com/ps-auxw/CLI-P

I'll ask if @ps-auxw can integrate his repo with this open_clip.

UPDATE: if anyone's interested, there's a neat way to install faiss using just pip: https://pypi.org/project/faiss-gpu/

The repo above has 2 steps:

  1. build-index.py / builds the index just from images.
  2. query-index.py / queries the index built in step 1.

"Say you have an image x and a huge bank of possible captions t_1,...,t_n." Here is said dataset / 12million captions.... https://github.com/google-research-datasets/conceptual-12m

UPDATE 2: This repo, https://github.com/johndpope/rerank/blob/main/data/prepare_data.py, seems to take captions (from a variety of datasets) and images, index them, and include some retrieval via k-nearest neighbors, e.g. retrieve_nearest_for_train_query(self, query_img, k=2) in https://github.com/RitaRamo/rerank/blob/993fb49df843ba8c5a3567aa97c0e5382ecbe48e/src/toolkit/data/datasets.py

UPDATE 3: Found this, which looks more scalable than option 2: https://github.com/rom1504/clip-retrieval

johndpope commented 3 years ago

I believe this does it: https://github.com/dzryk/clip-grams

mitchellnw commented 3 years ago

Great, thanks for linking!