salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

feature extraction on images only #67

Closed nikky4D closed 2 years ago

nikky4D commented 2 years ago

I want to process a folder of images that I will use for comparing to an input text (which will be given at a different time). How do I use your colab to extract features from images and then at a later time, compare them to an input text? All the examples involve passing in an image and a text at the same time.

LiJunnan1992 commented 2 years ago

Hi, please refer to the "Feature Extraction" part of the demo notebook for image feature extraction.
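
Roughly, that part of the notebook boils down to the following (a minimal sketch, not a verbatim copy; the checkpoint path and image file below are placeholders):

```python
import torch
from PIL import Image
from torchvision import transforms
from models.blip import blip_feature_extractor

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Placeholder checkpoint path; point this at the BLIP weights you downloaded.
model = blip_feature_extractor(pretrained='checkpoints/model_base.pth',
                               image_size=224, vit='base')
model.eval()
model = model.to(device)

# Same normalization constants the repo uses in its transforms.
preprocess = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)

caption = 'a placeholder caption'  # does not affect the result when mode='image'
with torch.no_grad():
    # [CLS] embedding of the vision encoder: vision_width-dim (768 for ViT-B, 1024 for ViT-L)
    image_feature = model(image, caption, mode='image')[0, 0]
```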

nikky4D commented 2 years ago

Thanks. But my question is for retrieval against a large dataset of images.

For example, I have a large dataset of images. From the colab I can extract image features for my dataset and save them.

But at test time, I get an incoming text. Using the colab, I can process the text and get the text features out. Since I only want to compare an incoming text with the image features in my dataset, how would I pass in the image features (not the image itself) and compare them to the text features?

woctezuma commented 2 years ago

Notebook

Input:

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L43-L47

ITM:

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L50-L58

ITC:

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L60-L67
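
At a high level, the linked forward(image, caption, match_head) returns 2-way match logits for match_head='itm' and a cosine-similarity score of projected [CLS] features for match_head='itc'. A rough usage sketch (assumes a loaded BLIP_ITM model and a preprocessed image batch on the same device):

```python
import torch

with torch.no_grad():
    itm_output = model(image, caption, match_head='itm')              # (B, 2) match logits
    itm_score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1]  # probability that image and text match
    itc_score = model(image, caption, match_head='itc')               # cosine similarity of projected features
```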

nikky4D commented 2 years ago

I see now, thank you. Would this be the expected pipeline with blip_itm.py, then?

First, process the image dataset offline and save the projected image features. Then, at test time, encode the incoming text and compare it against the saved features, along the lines of the sketch below.
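
A rough sketch of what I have in mind (using the ITC branch of blip_itm.py; the checkpoint path and preprocessing are placeholders):

```python
import torch
import torch.nn.functional as F
from models.blip_itm import blip_itm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Placeholder checkpoint; use retrieval weights that match your vit / image_size.
model = blip_itm(pretrained='checkpoints/model_base_retrieval_coco.pth',
                 image_size=384, vit='base')
model.eval()
model = model.to(device)

# --- Offline: encode the image dataset once and save the projected features ---
@torch.no_grad()
def encode_images(images):
    # images: preprocessed (B, 3, 384, 384) tensor
    image_embeds = model.visual_encoder(images.to(device))
    image_feats = F.normalize(model.vision_proj(image_embeds[:, 0, :]), dim=-1)
    return image_feats.cpu()  # (B, embed_dim); save these to disk, e.g. with torch.save

# --- Online: encode the incoming text and rank the saved image features ---
@torch.no_grad()
def rank_images(caption, image_feats):
    text = model.tokenizer(caption, padding='max_length', truncation=True,
                           max_length=35, return_tensors='pt').to(device)
    text_output = model.text_encoder(text.input_ids,
                                     attention_mask=text.attention_mask,
                                     return_dict=True, mode='text')
    text_feat = F.normalize(model.text_proj(text_output.last_hidden_state[:, 0, :]), dim=-1)
    sims = image_feats.to(device) @ text_feat.t()    # cosine similarities, (B, 1)
    return sims.squeeze(1).argsort(descending=True)  # dataset indices, best match first
```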

woctezuma commented 2 years ago

That is what I would do. Maybe wait for confirmation from the paper authors, though.

On a side-note, similar projects exist with CLIP, e.g. rom1504/clip-retrieval.

nikky4D commented 2 years ago

Quick question: in the demo.ipynb, in the Feature Extraction section, you take the [0,0] location of the feature, giving a vector of size 1024. But in the Image-Text Matching section, in blip_itm.py, there is no extraction of just the [0,0] location. Is there a reason for this difference?

woctezuma commented 2 years ago

ITM is a bit special, maybe look at ITC instead.

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L27

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L43

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L35

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L63

We can see that image_embeds[:,0,:] is accessed in order to convert embeddings into features via vision_proj().

This way, you don't have to rely on the dimension of the embeddings:

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip.py#L204

and can directly specify the dimension of the features instead:

https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L17

Moreover, it would be interesting to check the dimension of image_embeds[:,0,:]. I believe it could be the same as vision_width, i.e. 1024 for ViT-L.
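
For instance, a quick check could look like this (a sketch; assumes model is a BLIP_ITM instance built with vit='large' and image_size=384, and image is a preprocessed (1, 3, 384, 384) tensor on the same device):

```python
import torch

with torch.no_grad():
    image_embeds = model.visual_encoder(image)

print(image_embeds.shape)           # expected: (1, 577, 1024), i.e. 576 patch tokens + 1 [CLS]
print(image_embeds[:, 0, :].shape)  # expected: (1, 1024) == vision_width for ViT-L
```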

nikky4D commented 2 years ago

> On a side-note, similar projects exist with CLIP, e.g. rom1504/clip-retrieval.

Thanks for the reference for retrieval. I checked it out. It looks very applicable for this.

> ITM is a bit special, maybe look at ITC instead.
>
> https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L63
>
> We can see that image_embeds[:,0,:] is accessed in order to convert embeddings into features via vision_proj().
>
> Moreover, it would be interesting to check the dimension of image_embeds[:,0,:]. I believe it could be the same as vision_width, i.e. 1024 for ViT-L.

Thanks for the pointers. Looking closely at the code, ITM uses the entire embedding (image_embeds, of shape 577x1024), while ITC only uses the first vector of that 577x1024 matrix, i.e. image_embeds[:,0,:], a 1x1024 vector, as the feature. This is similar to the CLIP setup, which may be why the demo only pulls the [0,0] location for the features.

Thank you for the help/clarification. I'll close this now.

linhlt-it-ee commented 1 year ago

How do I extract multimodal features for a list of (image, text) pairs instead of only one image with the BLIP model?

LiJunnan1992 commented 1 year ago

> How do I extract multimodal features for a list of (image, text) pairs instead of only one image with the BLIP model?

@linhlt-it-ee You can do a batch forward pass.
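
For example, a rough sketch with the demo's feature extractor (assumes model, device, and preprocessing as in the notebook; images is a list of preprocessed tensors and captions a list of strings):

```python
import torch

image_batch = torch.stack(images).to(device)  # (B, 3, 224, 224)

with torch.no_grad():
    # One forward pass over the whole batch. Note: the tokenizer call inside
    # BLIP_Base.forward may need padding enabled if the captions differ in length.
    multimodal_embeds = model(image_batch, captions, mode='multimodal')  # (B, seq_len, 768)
    multimodal_feats = multimodal_embeds[:, 0, :]  # one [CLS] feature per (image, text) pair
```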

linhlt-it-ee commented 1 year ago

I cannot load a checkpoint from my fine-tuning because of a dimension mismatch: 197x768 instead of 577x768. Can someone tell me which configuration option I got wrong?

LiJunnan1992 commented 1 year ago

Hi @linhlt-it-ee, you may want to check out our LAVIS library, which provides better support for feature extraction: https://github.com/salesforce/LAVIS
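
A minimal sketch of LAVIS feature extraction, based on its documented example (the image file and caption below are placeholders):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model, vis_processors, txt_processors = load_model_and_preprocess(
    name='blip_feature_extractor', model_type='base', is_eval=True, device=device)

raw_image = Image.open('example.jpg').convert('RGB')
image = vis_processors['eval'](raw_image).unsqueeze(0).to(device)
text_input = txt_processors['eval']('a placeholder caption')

sample = {'image': image, 'text_input': [text_input]}
features_image = model.extract_features(sample, mode='image')
features_text = model.extract_features(sample, mode='text')

# Projected, L2-normalized [CLS] features for ITC-style similarity:
similarity = (features_image.image_embeds_proj[:, 0, :]
              @ features_text.text_embeds_proj[:, 0, :].t())
```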