gregor-ge / mBLIP


Does mBLIP provide a feature extraction function? #11

Open kairaun opened 10 months ago

kairaun commented 10 months ago

I was previously using BLIP2 (from LAVIS). It has a feature extraction function that lets me calculate similarity. Does mBLIP have such a function?

gregor-ge commented 10 months ago

Hi,

It does not, but if I remember correctly, BLIP2 uses the stage-1 Q-Former (before aligning to an LLM) for this, so mBLIP is not relevant here.

Are you asking because you need a multilingual feature extractor? The Q-Former from LAVIS won't help you there (it's based on BERT and only really works for English), but I can point you to our other work, Babel-ImageNet, where we benchmark various multilingual feature extractors so you can pick the best one for your use case.

kairaun commented 10 months ago

I'm working on a side project, a search system based on BLIP2 (from LAVIS), that extracts features from my input (text or image), computes similarity against my database (features of many images), and sorts by similarity so I can find the image closest to the input description.

I have completed the above part and it looks good. Now, because I want to support input in multiple languages, I want to try mBLIP to see if it provides such a function.

I have solved the image feature part through get_image_features, but I'm still trying to figure out how to extract text features.
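
For reference, the similarity and sorting part I mean looks roughly like this (a minimal sketch; the embeddings below are random placeholders standing in for the precomputed features):

```python
import torch
import torch.nn.functional as F

# Placeholder: precomputed image embeddings for the database, shape (num_images, dim).
# In practice these come from the image feature extractor; random here for illustration.
db_image_feats = F.normalize(torch.randn(10_000, 768), dim=-1)

# Placeholder: embedding of a single query (text or image), shape (1, dim).
query_feat = F.normalize(torch.randn(1, 768), dim=-1)

# Cosine similarity of the query against every database image, then take the best matches.
similarity = query_feat @ db_image_feats.T            # shape (1, num_images)
scores, indices = similarity.squeeze(0).topk(k=10)    # 10 best-matching images
print(list(zip(indices.tolist(), scores.tolist())))
```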

gregor-ge commented 10 months ago

I see. To answer your question: No, mBLIP cannot give you text features for retrieval.

But to help with your problem, I think you want to use a strong CLIP model to get the image and text features (https://github.com/mlfoundations/open_clip).

There are two main reasons for this, even in an English-only setting: 1) BLIP2 retrieval uses a re-ranking step, where you compute the joint similarity of the top-128 images with the query to get the final ranking. If you use only the features, state-of-the-art CLIP models like SigLIP will work a lot better (https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_retrieval_results.csv). 2) CLIP models are trained on orders of magnitude more data, so they will work better in practice than BLIP2.

For your multilingual setting, CLIP models are your only reasonable option. You can find many models benchmarked here (https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_multilingual_retrieval_results.csv) and in our Babel-ImageNet. Your best choice depends on how you want to do this.
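
To make the basic flow concrete, here is a minimal open_clip sketch (the checkpoint name and image paths are just placeholders; for non-English queries you would swap in one of the multilingual checkpoints from the table linked above):

```python
import torch
import open_clip
from PIL import Image

# Example English checkpoint; pick a multilingual one from the linked results table if needed.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Offline: encode and L2-normalize all database images once.
image_paths = ["img1.jpg", "img2.jpg"]  # placeholder paths
images = torch.stack([preprocess(Image.open(p)) for p in image_paths])
with torch.no_grad():
    db_feats = model.encode_image(images)
    db_feats /= db_feats.norm(dim=-1, keepdim=True)

# Online: encode the text query and rank images by cosine similarity.
with torch.no_grad():
    text_feats = model.encode_text(tokenizer(["a dog playing in the snow"]))
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

ranking = (text_feats @ db_feats.T).squeeze(0).argsort(descending=True)
print([image_paths[i] for i in ranking.tolist()])
```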

kairaun commented 10 months ago

Okay, thank you for your help. I will think about how to adjust it. If I have any questions, I will come back to you for advice.