Closed nikky4D closed 2 years ago

I want to process a folder of images that I will use for comparing to an input text (which will be given at a different time). How do I use your colab to extract features from images and then, at a later time, compare them to an input text? All the examples involve passing in an image and a text at the same time.
Hi, please refer to the "Feature Extraction" part of the demo notebook for image feature extraction.
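For reference, that part of demo.ipynb looks roughly like this (paraphrased from memory; the checkpoint URL and the exact image size are placeholders, check the notebook for the real values):

```python
import torch
from PIL import Image
from torchvision import transforms
from models.blip import blip_feature_extractor  # from this repo

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# image preprocessing as in the demo
image_size = 224
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(Image.open('example.jpg').convert('RGB')).unsqueeze(0).to(device)

model = blip_feature_extractor(pretrained='<checkpoint url or path>',
                               image_size=image_size, vit='base')
model = model.eval().to(device)

caption = 'a woman sitting on the beach with a dog'
with torch.no_grad():
    image_feature = model(image, caption, mode='image')[0, 0]            # [768] for ViT-B (1024 for ViT-L)
    text_feature = model(image, caption, mode='text')[0, 0]
    multimodal_feature = model(image, caption, mode='multimodal')[0, 0]
```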
Thanks. But my question is about retrieval against a large dataset of images.

For example, I have a large dataset of images. From the colab I can extract image features for my dataset and save them.

But at test time, I get an incoming text. Using the colab, I can process the text and get out the text features. As I only want to compare an incoming text with the image features in my dataset, how would I pass in the image features (not the image itself) and compare them to the text features?
The relevant pieces are in blip_itm.py: the input processing (encode the image into `image_embeds` and tokenize the caption), the ITM head, and the ITC head.
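Roughly, the three pieces look like this (a paraphrase of `BLIP_ITM.forward` in `models/blip_itm.py`; see the permalinks further down the thread for the exact code at that commit):

```python
import torch
import torch.nn.functional as F

def blip_itm_forward(model, image, caption, match_head='itm'):
    """Paraphrase of BLIP_ITM.forward; variable names follow the repo, details may differ slightly."""
    # Input: encode the image, build its attention mask, and tokenize the caption
    image_embeds = model.visual_encoder(image)                       # [B, num_patches+1, vision_width]
    image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
    text = model.tokenizer(caption, padding='max_length', truncation=True,
                           max_length=35, return_tensors="pt").to(image.device)

    if match_head == 'itm':
        # ITM: the text encoder cross-attends over the *full* image_embeds, then a 2-way match head
        output = model.text_encoder(text.input_ids,
                                    attention_mask=text.attention_mask,
                                    encoder_hidden_states=image_embeds,
                                    encoder_attention_mask=image_atts,
                                    return_dict=True)
        return model.itm_head(output.last_hidden_state[:, 0, :])     # logits for (no-match, match)

    elif match_head == 'itc':
        # ITC: project only the [CLS] embeddings of image and text, then cosine similarity
        text_output = model.text_encoder(text.input_ids, attention_mask=text.attention_mask,
                                         return_dict=True, mode='text')
        image_feat = F.normalize(model.vision_proj(image_embeds[:, 0, :]), dim=-1)
        text_feat = F.normalize(model.text_proj(text_output.last_hidden_state[:, 0, :]), dim=-1)
        return image_feat @ text_feat.t()                             # [B_image, B_text] similarity matrix
```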
I see now. Thank you. Would this be the expected pipeline then with `blip_itm.py`?

First, process the image dataset: run each image through the visual encoder and save the resulting `image_embeds`.

Then at test time, do the following (a code sketch follows below):

- process the incoming `text`
- load the saved `image_embeds` and get the corresponding `image_atts`
- pass `image_embeds`, `image_atts`, and `text` to either the itm head or the itc head

That is what I would do. Maybe wait for confirmation from the paper authors, though.
On a side-note, similar projects exist with CLIP, e.g. rom1504/clip-retrieval.
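A minimal sketch of that precompute-then-compare pipeline, assuming an already-loaded `BLIP_ITM` model from `blip_itm.py` (the `dataloader`, `query_text`, and file paths are illustrative placeholders, not from the repo):

```python
import torch
import torch.nn.functional as F

device = next(model.parameters()).device                # `model` is an assumed, loaded BLIP_ITM instance

# ---- Offline: encode every image in the dataset once and save the embeddings ----
all_embeds = []
with torch.no_grad():
    for images in dataloader:                           # `dataloader` yields preprocessed image batches
        image_embeds = model.visual_encoder(images.to(device))  # [B, num_patches+1, vision_width]
        all_embeds.append(image_embeds.cpu())
torch.save(torch.cat(all_embeds), 'image_embeds.pt')    # illustrative path

# ---- Test time: score an incoming text against the saved image embeddings ----
image_embeds = torch.load('image_embeds.pt').to(device)                   # [N, L, vision_width]
image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(device)
text = model.tokenizer(query_text, padding='max_length', truncation=True,
                       max_length=35, return_tensors="pt").to(device)

with torch.no_grad():
    # ITC head: cheap, CLIP-style cosine similarity over projected [CLS] features
    text_output = model.text_encoder(text.input_ids, attention_mask=text.attention_mask,
                                     return_dict=True, mode='text')
    text_feat = F.normalize(model.text_proj(text_output.last_hidden_state[:, 0, :]), dim=-1)  # [1, 256]
    image_feat = F.normalize(model.vision_proj(image_embeds[:, 0, :]), dim=-1)                # [N, 256]
    itc_scores = (image_feat @ text_feat.t()).squeeze(1)                  # [N] similarity to the query

    # ITM head: heavier cross-attention over the full embeddings (e.g. to rerank the ITC top-k)
    n = image_embeds.size(0)
    output = model.text_encoder(text.input_ids.repeat(n, 1),
                                attention_mask=text.attention_mask.repeat(n, 1),
                                encoder_hidden_states=image_embeds,
                                encoder_attention_mask=image_atts,
                                return_dict=True)
    itm_scores = torch.softmax(model.itm_head(output.last_hidden_state[:, 0, :]), dim=1)[:, 1]  # [N]
```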
Quick question: in demo.ipynb, in the Feature Extraction section, you take the `[0,0]` location of the feature, giving a vector of size 1024. But in the Image-Text Matching section, in `blip_itm.py`, there is no extraction of just the `[0,0]` location. Is there a reason for this difference?
ITM is a bit special, maybe look at ITC instead.
https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L27
https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L43
https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L35
https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip_itm.py#L63
We can see that `image_embeds[:,0,:]` is accessed in order to convert embeddings into features via `vision_proj()`.

This way, you don't have to rely on the dimension of the embeddings:
https://github.com/salesforce/BLIP/blob/48211a1594f1321b00f14c9f7a5b4813144b2fb9/models/blip.py#L204
and can directly specify the dimension of the features.

Moreover, it would be interesting to check the dimension of `image_embeds[:,0,:]`. I believe it could be the same as `vision_width`, i.e. 1024 for ViT-L.
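For example, a quick shape check along those lines (assuming a loaded ViT-L `BLIP_ITM` model at 384x384 input; the exact sizes depend on the backbone and image size):

```python
import torch

# `model` is an assumed, loaded BLIP_ITM instance (ViT-L, image_size=384) and `image` a
# preprocessed [1, 3, 384, 384] tensor, both set up as in the demo notebook.
with torch.no_grad():
    image_embeds = model.visual_encoder(image)
    print(image_embeds.shape)           # expected: [1, 577, 1024]  (577 = 24*24 patches + [CLS])
    print(image_embeds[:, 0, :].shape)  # expected: [1, 1024]  == vision_width for ViT-L
    image_feat = model.vision_proj(image_embeds[:, 0, :])
    print(image_feat.shape)             # expected: [1, 256]   == embed_dim of the ITC projection
```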
Thanks for the reference for retrieval. I checked it out. It looks very applicable for this.
Thanks for the pointers. Looking closely at the code, ITM appears to use the entire embedding, i.e. the full `image_embeds` of shape 577x1024. ITC only uses the first vector of that 577x1024 matrix, i.e. `image_embeds[:,0,:]`, a 1x1024 vector, as the feature. This is similar to the CLIP setup, which may be why only the `[0,0]` location is pulled in the feature-extraction code.
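In practice that means the ITC features can be precomputed once for the whole dataset (one small projected vector per image instead of the full 577x1024 matrix) and scored against a query text with a single matrix product. A hedged sketch, where `model` (a loaded BLIP_ITM), `images` (a preprocessed batch), and `query_text` are assumed placeholders:

```python
import torch
import torch.nn.functional as F

# Storage trade-off: the full embeddings are 577x1024 floats per image, while an ITC feature
# is a single 256-d vector, so for large datasets it is far cheaper to precompute and store
# only the projected features (CLIP-style). All names below are assumed placeholders.
device = next(model.parameters()).device

with torch.no_grad():
    # offline, for each batch of preprocessed images:
    image_embeds = model.visual_encoder(images.to(device))                        # [B, 577, 1024]
    image_feats = F.normalize(model.vision_proj(image_embeds[:, 0, :]), dim=-1)   # [B, 256] -> save these

    # query time, for one incoming text:
    text = model.tokenizer(query_text, padding='max_length', truncation=True,
                           max_length=35, return_tensors="pt").to(device)
    text_output = model.text_encoder(text.input_ids, attention_mask=text.attention_mask,
                                     return_dict=True, mode='text')
    text_feat = F.normalize(model.text_proj(text_output.last_hidden_state[:, 0, :]), dim=-1)  # [1, 256]
    scores = image_feats @ text_feat.t()                                           # [B, 1] ITC similarities
```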
Thank you for the help/clarification. I'll close this now.
How do I extract multimodal features for a list of (image, text) pairs, instead of only one image, with the BLIP model?
@linhlt-it-ee You can do a batched forward pass (see the sketch below).
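A sketch of one way to batch this, calling the model's components directly so the captions can be padded as a batch; as far as I can tell it mirrors the multimodal branch of `BLIP_Base.forward` in `models/blip.py` (`model`, `transform`, `images`, and `captions` are assumed to be set up as in the demo):

```python
import torch

# `model` is a blip_feature_extractor (BLIP_Base) model, `transform` the demo's image
# preprocessing; `images` is a list of PIL images, `captions` a matching list of strings.
device = next(model.parameters()).device
batch_size = 8

multimodal_feats = []
with torch.no_grad():
    for i in range(0, len(images), batch_size):
        image_batch = torch.stack([transform(img) for img in images[i:i + batch_size]]).to(device)
        caption_batch = captions[i:i + batch_size]

        image_embeds = model.visual_encoder(image_batch)                      # [B, L, vision_width]
        image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(device)
        text = model.tokenizer(caption_batch, padding=True, return_tensors="pt").to(device)
        text.input_ids[:, 0] = model.tokenizer.enc_token_id                   # encoder token, as in blip.py

        output = model.text_encoder(text.input_ids,
                                    attention_mask=text.attention_mask,
                                    encoder_hidden_states=image_embeds,
                                    encoder_attention_mask=image_atts,
                                    return_dict=True)
        multimodal_feats.append(output.last_hidden_state[:, 0, :].cpu())      # [B, hidden_dim]
multimodal_feats = torch.cat(multimodal_feats)                                # [len(images), hidden_dim]
```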
I cannot load the checkpoint from my fine-tuning because of a dimension mismatch: 197x768 instead of 577x768. Can someone tell me which configuration I got wrong?
Hi @linhlt-it-ee, you may want to check out our LAVIS library, which provides better support for feature extraction: https://github.com/salesforce/LAVIS
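For reference, feature extraction in LAVIS looks roughly like this (based on the LAVIS docs; output attribute names such as `image_embeds_proj` may differ slightly between versions):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip_feature_extractor", model_type="base", is_eval=True, device=device)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text_input = txt_processors["eval"]("a picture of a dog")
sample = {"image": image, "text_input": [text_input]}

features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")

print(features_image.image_embeds.shape)        # full patch-level image embeddings
print(features_image.image_embeds_proj.shape)   # projected low-dim features for ITC-style matching
print(features_text.text_embeds_proj.shape)
```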