Hi @wingz1, thanks for raising an issue!
This is a question best placed in our forums. We try to reserve GitHub issues for feature requests and bug reports.
There are code examples of how to use BLIP and BLIP-2 in the docs. Both have a similar API to CLIP and have the same methods, e.g. get_text_features and get_image_features, implemented, and they return similar outputs.
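For reference, a minimal usage sketch of those methods on the HF side (the checkpoint id, prompt, and image path are placeholders, not taken from this thread):

```python
import torch
from PIL import Image
from transformers import Blip2Model, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(images=image, text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Vision-encoder outputs (per-patch hidden states + pooled output), not a CLIP-style projected vector.
    image_outputs = model.get_image_features(pixel_values=inputs["pixel_values"])
    # Language-model-side outputs for the text, again not a pooled embedding.
    text_outputs = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Q-Former query outputs, i.e. what later gets projected into the language model's space.
    qformer_outputs = model.get_qformer_features(pixel_values=inputs["pixel_values"])
```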
Thanks, I figured that -- I will check the forums! Indeed those methods do exist in BLIP-2, but their outputs don't share the same dimensionality or mean the same thing as the equivalent calls in CLIP, due to how the model is set up.
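For contrast, a quick sketch of the CLIP side, where both methods return pooled, projected embeddings of shape (batch_size, projection_dim) in a shared space (checkpoint id and image path are again placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
clip_inputs = clip_processor(images=image, text=["a photo of a cat"], return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = clip.get_text_features(
        input_ids=clip_inputs["input_ids"], attention_mask=clip_inputs["attention_mask"]
    )
    image_emb = clip.get_image_features(pixel_values=clip_inputs["pixel_values"])

# Both tensors are (batch_size, projection_dim) and directly comparable via cosine similarity,
# which is not the case for the BLIP-2 outputs above.
```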
Not really a useful answer, but from the following lines in the modeling file, you can go through language_projection to get the same dimension. It's super questionable, though, whether this is the same space with meaningful text/image features.
(and yes, further questions on this topic should go to the forum)
self.language_projection = nn.Linear(config.qformer_config.hidden_size, config.text_config.hidden_size)

language_model_inputs = self.language_projection(query_output)

inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1)
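A minimal sketch of following those lines by hand with Blip2Model's submodules (the submodule names come from modeling_blip_2.py; the checkpoint id and image path are placeholders):

```python
import torch
from PIL import Image
from transformers import Blip2Model, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

pixel_values = processor(images=Image.open("example.jpg"), return_tensors="pt")["pixel_values"]

with torch.no_grad():
    # Frozen vision encoder -> per-patch image embeddings.
    image_embeds = model.vision_model(pixel_values=pixel_values).last_hidden_state
    image_atts = torch.ones(image_embeds.shape[:-1], dtype=torch.long)

    # Q-Former cross-attends its learned query tokens to the image embeddings.
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_output = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
    ).last_hidden_state  # (batch, num_query_tokens, Q-Former hidden size)

    # Project into the language model's embedding dimension, as in the lines quoted above.
    language_model_inputs = model.language_projection(query_output)  # (batch, num_query_tokens, LM hidden size)
```

Whether those projected vectors actually live in a semantically meaningful shared text/image space is exactly the open question above.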
Hi, I think multimodal embeddings are something lacking in the current implementation: we can't extract embeddings obtained by passing both text and image to the Q-Former. In fact, the Q-Former in HF doesn't even take text input_ids as input here.
The original Q-Former implementation, by contrast, did take text input_ids as input here, along with the image, and this can be used to extract multimodal embeddings, as done in the extract_features function here.
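For comparison, this is roughly how the original LAVIS implementation exposes multimodal features via extract_features (sketched from memory of its feature-extraction example, so verify against the LAVIS docs; the model name and checkpoint type are the ones used there):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP-2 first-stage model with the Q-Former ITM/ITC heads.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a cat")

# mode="multimodal" feeds both the image and the text input_ids through the Q-Former.
features = model.extract_features({"image": image, "text_input": [text]}, mode="multimodal")
multimodal_embeds = features.multimodal_embeds  # (batch, num_query_tokens, hidden size)
```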
@ayushtues Indeed, it seems that wasn't included when the model was first added to the library. @NielsRogge - was there a reason for not including this?
If there wasn't a specific reason - it seems like a useful addition :) @ayushtues would you be interested in opening a PR to add this? This would mean you get the github contribution for adding the feature.
A similar request for it is here: https://github.com/huggingface/transformers/issues/25300
I was working on integrating BlipDiffusion into diffusers (https://github.com/huggingface/diffusers/pull/4388/), which also needs multimodal features. I made a local copy of Blip2Qformer and was modifying it in that PR, but having the change integrated into HF would make it much cleaner.
Great - let's add it into transformers then :) !
@youssefadr is picking this up as discussed in https://github.com/huggingface/transformers/issues/25300, happy to help him if needed
@ayushtues Yes, I'll open a PR this week asap!
Hi @youssefadr
I hope it is fine that I opened a draft PR #25612 to share some progress on multimodal features. I started trying to contribute to Hugging Face this week :)
In that PR, the weights of the original BLIP-2 ITM model are converted into a Blip2ForImageTextRetrieval class. The idea of adding Blip2ForImageTextRetrieval hasn't been discussed at all yet, wdyt?
Feel free to use what I did, if it makes sense. Please also let me know whether I should keep trying to implement Blip2ForImageTextRetrieval: maybe you are already working on this part, or maybe the class isn't really necessary.
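A hypothetical sketch of what such a Blip2ForImageTextRetrieval API could look like (the class name, checkpoint id, and argument/output conventions are assumptions based on this proposal and the original BLIP-2 ITM/ITC heads, not a finalized interface):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForImageTextRetrieval  # proposed class

processor = Blip2Processor.from_pretrained("Salesforce/blip2-itm-vit-g")  # placeholder checkpoint id
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g")

inputs = processor(images=Image.open("example.jpg"), text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Image-text matching head: match / no-match logits for each image-text pair.
    itm_outputs = model(**inputs, use_image_text_matching_head=True)
    # Contrastive (ITC) head: similarity scores between projected image and text embeddings.
    itc_outputs = model(**inputs, use_image_text_matching_head=False)
```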
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
linux, python 3.8+, pytorch '1.13.0+cu116'
Who can help?
@sgugger
Information
Tasks
Reproduction
N/A
Expected behavior
N/A