huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

BLIP-2 request: If it's even possible, can you please provide an official example script of how to get the text (caption) features and image features into the same vector space (e.g. for cross-modal retrieval/search using BLIP-2 models, similar to what we can already do with CLIP)? Thanks in advance. #25245

Closed. wingz1 closed this 1 year ago.

wingz1 commented 1 year ago

System Info

linux, python 3.8+, pytorch '1.13.0+cu116'

Who can help?

@sgugger

Information

Tasks

Reproduction

N/A

Expected behavior

N/A

amyeroberts commented 1 year ago

Hi @wingz1, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve the github issues for feature requests and bug reports.

There are code examples of how to use BLIP and BLIP-2 in the docs. Both have an API similar to CLIP's and implement the same methods, e.g. get_text_features and get_image_features, which return similar outputs.
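For reference, here is a minimal sketch of those calls (the checkpoint name and image path are placeholders, and the exact output objects depend on the checkpoint; as discussed below, the two outputs do not come from a shared projection space the way CLIP's do):

```python
# Minimal sketch, assuming the Salesforce/blip2-opt-2.7b checkpoint and a local image.
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2Model

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Vision encoder output (hidden states + pooled output)
    image_outputs = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Language-model output (e.g. logits for the OPT variant), not a pooled embedding
    text_outputs = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

print(image_outputs.last_hidden_state.shape)  # (batch, num_patches + 1, vision_hidden_size)
```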

wingz1 commented 1 year ago

Thanks, I figured as much -- I will check the forums! Indeed those methods do exist in BLIP-2, but their outputs don't share the same dimensionality, nor do they mean the same thing as the equivalent calls in CLIP, due to how the model is set up.

ydshieh commented 1 year ago

Not really a complete answer, but from the following lines in the modeling file, you can pass the Q-Former output through language_projection to get the same dimension. Whether that really lands in the same space as meaningful text/image features is, however, very questionable.

(and yes, further questions on this topic should go to the forum)

self.language_projection = nn.Linear(config.qformer_config.hidden_size, config.text_config.hidden_size)

language_model_inputs = self.language_projection(query_output)

inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1)
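To make that path concrete, here is a rough sketch (not an official recipe; the mean pooling at the end is purely an illustrative assumption, and whether the result is comparable to CLIP-style embeddings is exactly the open question above):

```python
# Rough sketch: project the Q-Former's query output through language_projection,
# mirroring the lines quoted above. Checkpoint name, image path and the pooling
# step are assumptions for illustration only.
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2Model

processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("example.jpg")  # placeholder path
pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    # Vision encoder -> Q-Former -> language_projection, as in Blip2Model.forward
    image_embeds = model.vision_model(pixel_values=pixel_values).last_hidden_state
    image_attention_mask = torch.ones(image_embeds.shape[:-1], dtype=torch.long)
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_output = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_attention_mask,
    ).last_hidden_state
    projected = model.language_projection(query_output)  # (batch, num_query_tokens, lm_hidden_size)
    image_feature = projected.mean(dim=1)  # naive pooling -- an assumption, not part of BLIP-2
```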

ayushtues commented 1 year ago

Hi, I think multimodal embeddings are something lacking in the current implementation: we can't extract embeddings obtained by passing both text and image to the Q-Former. In fact, the Q-Former in HF doesn't even take text input_ids as input here.

The original Q-Former implementation, by contrast, did take text input_ids along with the image, and this can be used to extract multimodal embeddings, as done in its extract_features function here.
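For comparison, a rough sketch of the original LAVIS feature-extraction flow (based on the LAVIS documentation's BLIP-2 example; exact attribute names and availability may vary across LAVIS versions):

```python
# Sketch of multimodal feature extraction with the original (LAVIS) BLIP-2,
# assuming the lavis package and its blip2_feature_extractor checkpoint.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a cat")
sample = {"image": image, "text_input": [text]}

# mode="multimodal" runs the Q-Former on image and text jointly;
# "image" / "text" give unimodal (projected) features usable for retrieval.
multimodal_features = model.extract_features(sample, mode="multimodal")
image_features = model.extract_features(sample, mode="image")
text_features = model.extract_features(sample, mode="text")
print(multimodal_features.multimodal_embeds.shape)
```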

amyeroberts commented 1 year ago

@ayushtues Indeed, it seems that wasn't included when the model was first added to the library. @NielsRogge - was there a reason for not including this?

If there wasn't a specific reason - it seems like a useful addition :) @ayushtues would you be interested in opening a PR to add this? This would mean you get the github contribution for adding the feature.

NielsRogge commented 1 year ago

A similar request for it is here: https://github.com/huggingface/transformers/issues/25300

ayushtues commented 1 year ago

I was working on integrating BlipDiffusion into diffusers (https://github.com/huggingface/diffusers/pull/4388), which also needs multimodal features. I made a local copy of Blip2Qformer and was modifying it in that PR, but having the change integrated into HF will make it much cleaner.

amyeroberts commented 1 year ago

Great - let's add it to transformers then! :)

ayushtues commented 1 year ago

@youssefadr is picking this up, as discussed in https://github.com/huggingface/transformers/issues/25300. Happy to help him if needed!

youssefadr commented 1 year ago

@ayushtues Yes, I'll open a PR this week asap!

jpizarrom commented 1 year ago

Hi @youssefadr

I hope it is fine that I opened a draft PR #25612 to share some progress on multimodal features. I started trying to contribute to huggingface this week :)

The weights of the original BLIP-2 ITM model are converted into a Blip2ForImageTextRetrieval class. The idea of adding Blip2ForImageTextRetrieval hasn't really been discussed yet, though. wdyt?

Feel free to use what I did if it makes sense. Please let me know whether it makes sense for me to keep trying to implement Blip2ForImageTextRetrieval: maybe you are already working on this part, or maybe the class isn't really necessary.
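(As a side note on the retrieval use case itself: once text and image features live in a shared projected space, whichever class ends up exposing them, the retrieval step is just a similarity ranking. A generic illustration with placeholder tensors:)

```python
# Generic illustration, not tied to any particular transformers class:
# with L2-normalized features in a shared space, retrieval is a cosine-similarity ranking.
import torch
import torch.nn.functional as F

image_feats = F.normalize(torch.randn(10, 256), dim=-1)  # 10 images, 256-d projected features (placeholder)
text_feats = F.normalize(torch.randn(3, 256), dim=-1)    # 3 captions in the same space (placeholder)

similarity = text_feats @ image_feats.T         # (3, 10) cosine similarities
best_image_per_caption = similarity.argmax(-1)  # text-to-image retrieval
print(best_image_per_caption)
```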

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.