huggingface / transformers


Add zero-shot classification task for BLIP-2 #25300

Open youssefadr opened 1 year ago

youssefadr commented 1 year ago

Feature request

I would like to add support for the zero-shot image classification task using BLIP-2, computing text-image similarities from the normalized embeddings that would be exposed by the BLIP-2 feature extractor.

The idea is to enable calling the zero-shot classification pipeline with BLIP-2 by implementing the `get_image_feature` and `get_text_features` methods.
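For concreteness, here is a minimal sketch of what the resulting zero-shot usage could look like, assuming hypothetical `get_image_features` and `get_text_features` methods on `Blip2ForConditionalGeneration` (neither exists yet, and the exact names vary slightly in this thread); the rest follows the usual CLIP-style recipe:

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

inputs = processor(images=image, text=labels, padding=True, return_tensors="pt")

with torch.no_grad():
    # Proposed methods; these do NOT exist on Blip2ForConditionalGeneration yet.
    image_embeds = model.get_image_features(pixel_values=inputs.pixel_values)
    text_embeds = model.get_text_features(input_ids=inputs.input_ids)

# CLIP-style zero-shot classification: normalize, cosine similarity, softmax
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
probs = (image_embeds @ text_embeds.T).softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```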

I would love more guidance, if possible, on the criteria for accepting the PR.

Motivation

This is related to the discussion on this issue on the Hub, and to the comment left by @NielsRogge here: https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/3#64cbe5e487ec96aa473a1f54 .

Your contribution

I would like to submit a PR to contribute this feature.

NielsRogge commented 1 year ago

Yes, so ideally you can add `get_image_feature` and `get_text_feature` to the `Blip2ForConditionalGeneration` class. For that, you can refer to the original implementation.
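For orientation, a rough sketch of what the image-side method could look like, loosely following LAVIS's `Blip2Qformer.extract_features`; the `vision_proj` head is an assumption here, since `Blip2ForConditionalGeneration` defines no such projection layer today:

```python
import torch
import torch.nn.functional as F

def get_image_features(self, pixel_values):
    """Sketch of a proposed method on Blip2ForConditionalGeneration."""
    # Vision encoder -> patch embeddings
    image_embeds = self.vision_model(pixel_values=pixel_values).last_hidden_state
    image_attention_mask = torch.ones(
        image_embeds.size()[:-1], dtype=torch.long, device=image_embeds.device
    )
    # Q-Former: learned query tokens cross-attend to the image embeddings
    query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_outputs = self.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_attention_mask,
    )
    # ASSUMPTION: a `vision_proj` head into the shared ITC space, as in LAVIS's
    # Blip2Qformer; the HF class has no such layer yet, so it would need adding.
    return F.normalize(self.vision_proj(query_outputs.last_hidden_state), dim=-1)
```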

ayushtues commented 1 year ago

@youssefadr let me know if you need any help with this PR; I am also in need of multimodal feature extraction from the `Blip2Qformer`.

youssefadr commented 1 year ago

Hello, thanks for your message, I will tackle it this week 👍

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

youssefadr commented 1 year ago

Sorry, I have been caught up with work. Will finalize the PR today!


JhonDan1999 commented 10 months ago

> Yes, so ideally you can add `get_image_feature` and `get_text_feature` to the `Blip2ForConditionalGeneration` class. For that, you can refer to the original implementation.

Hi, I want to know if this has been done, because I am trying to use `get_image_feature` but I am getting this error: `AttributeError: 'Blip2ForConditionalGeneration' object has no attribute 'get_image_feature'`.

and I cannot use `Blip2Model`, because I have to use `load_in_8bit`, which comes with `Blip2ForConditionalGeneration`.
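One possible workaround, sketched under the assumption that bitsandbytes and accelerate are installed: `load_in_8bit` is an argument to `from_pretrained` rather than something specific to `Blip2ForConditionalGeneration`, so it should also work with `Blip2Model`, which does expose `get_image_features` (returning the raw vision encoder outputs, not projected ITC embeddings):

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# load_in_8bit is a from_pretrained argument (via bitsandbytes), so it is not
# tied to Blip2ForConditionalGeneration.
model = Blip2Model.from_pretrained(
    "Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Cast inputs to fp16, since 8-bit layers compute in half precision
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

with torch.no_grad():
    # Returns the raw vision encoder outputs, not LAVIS-style projected features
    vision_outputs = model.get_image_features(**inputs, return_dict=True)
image_embeds = vision_outputs.last_hidden_state
```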

NielsRogge commented 10 months ago

Hi, no this feature hasn't been added yet.

JhonDan1999 commented 10 months ago

> Hi, no this feature hasn't been added yet.

Thank you for your prompt response. I have the following questions and would appreciate your input:

Q1: Is there any way to extract the features of an image using BLIP-2 from Hugging Face checkpoints with `load_in_8bit`?

Q2: Does the feature extraction in this notebook https://github.com/salesforce/LAVIS/blob/main/examples/blip2_feature_extraction.ipynb work in the same way as `get_image_feature`?

Q3: If I want to extract or convert an image into a vector so it can be used by another model, do you have any recommendation for the best way to do this, other than the CLIP model, which did not give me good results?
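On Q2: the notebook relies on LAVIS's `extract_features`, which returns Q-Former outputs projected into the ITC space and normalized, so it is not equivalent to a method that returns raw encoder states. A sketch of that path, assuming the `lavis` package is installed (mirroring the linked notebook):

```python
import requests
from PIL import Image
from lavis.models import load_model_and_preprocess

# BLIP-2 feature extractor from LAVIS (the model the notebook uses)
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device="cpu"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
image_t = vis_processors["eval"](image).unsqueeze(0)

features = model.extract_features({"image": image_t}, mode="image")
# image_embeds_proj: normalized, ITC-projected features (one per query token)
image_embeds = features.image_embeds_proj
```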

kirillsemenov1314 commented 9 months ago

@youssefadr hi, please let me know if help is needed here; I would love to give it a try and push things forward. It would actually be my first contribution, but I'm quite familiar with the BLIP-2 model.