haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
19.54k stars · 2.15k forks

[Question] Is it possible to generate embedding (feature extraction) similar to BLIP-2? #692

Open zmtbnv opened 11 months ago

zmtbnv commented 11 months ago

Question

BLIP-2 (via LAVIS) allows extracting unimodal features like:

from lavis.models import load_model_and_preprocess

model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)
sample = {"image": image, "text_input": [caption]}

features_image = model.extract_features(sample, mode="image")
features_text = model.extract_features(sample, mode="text")
print(features_image.image_embeds.shape)
# torch.Size([1, 32, 768])
print(features_text.text_embeds.shape)
# torch.Size([1, 12, 768])

Is it possible to do the same with LLaVA?
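For context, LLaVA does not expose an extract_features helper like LAVIS, but image features can in principle be taken from its CLIP vision tower and the mm_projector that maps them into the LLM embedding space (in the repo these are reachable via model.get_vision_tower() and model.get_model().mm_projector, or combined in model.encode_images). The sketch below illustrates that same pattern with a tiny, randomly initialized CLIP encoder so it runs without downloading any checkpoint; the sizes (hidden_size=64, projector output 256, 32x32 input) are illustrative stand-ins, not LLaVA's real dimensions.

```python
import torch
from transformers import CLIPVisionConfig, CLIPVisionModel

# Tiny randomly initialized vision encoder standing in for LLaVA's
# CLIP ViT-L/14 tower (no pretrained weights; shapes are illustrative).
cfg = CLIPVisionConfig(
    hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4,
    image_size=32, patch_size=8,
)
vision_tower = CLIPVisionModel(cfg)

# Stand-in for LLaVA's mm_projector, which maps patch features
# into the LLM's embedding space.
mm_projector = torch.nn.Linear(64, 256)

pixel_values = torch.randn(1, 3, 32, 32)  # one dummy image
with torch.no_grad():
    out = vision_tower(pixel_values)
    # Drop the CLS token and keep per-patch features, as LLaVA does.
    patch_feats = out.last_hidden_state[:, 1:]
    projected = mm_projector(patch_feats)

print(patch_feats.shape)  # torch.Size([1, 16, 64])  -- 4x4 patches
print(projected.shape)    # torch.Size([1, 16, 256])
```

With a real checkpoint such as liuhaotian/llava-v1.5-7b loaded through the repo's load_pretrained_model, the same two steps would yield per-patch CLIP features and their projection into the LLM space; note these are patch-level embeddings, not the 32 learned query tokens that BLIP-2's Q-Former produces.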

Slinene commented 9 months ago

same question

saisurbehera commented 7 months ago

Any answers?

wenxuanmou commented 1 month ago

Same question. Have you got a solution? Thanks

sreebhattacharyya commented 4 days ago

+1. Did anyone find a solution that is reasonably straightforward?