haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Image and Text Embeddings For Downstream Tasks #1642

Open dipikakhullar opened 3 months ago

dipikakhullar commented 3 months ago

Question

How can I get the image and text embeddings for another task, and what size are these embeddings? Here is what I know: the vision output shape is torch.Size([1, 576, 4096]) and the text output shape is torch.Size([1, 128, 4096]).

I got the dimensions of both the vision and text embeddings from the model configuration: the vision embeddings are set to 4096 as per `hidden_size`, and the text embeddings are set to 1024 as per `mm_hidden_size`.

But the last dimension of the text output shape and the `mm_hidden_size` value (1024) do not match up. Also, 576 × 4096 seems very large.
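For context on the shape bookkeeping: in LLaVA's config, `hidden_size` is the language model's embedding width and `mm_hidden_size` is the vision tower's output width, and the multimodal projector maps vision features from the latter to the former, so both streams end up 4096-dimensional by the time they reach the LLM. A minimal sketch (random stand-in weights, not real LLaVA weights or its actual projector API) of how the reported shapes fit together:

```python
import numpy as np

# Assumptions (hypothetical stand-ins, not loaded from a real checkpoint):
#   - CLIP ViT-L/14 at 336px -> (336 / 14) ** 2 = 576 patch tokens
#   - mm_hidden_size = 1024  (vision tower output width)
#   - hidden_size    = 4096  (LLM embedding width)
mm_hidden_size, hidden_size = 1024, 4096
num_patches = (336 // 14) ** 2  # 576

# Vision tower output: one image -> 576 patch features of width 1024.
vision_features = np.random.randn(1, num_patches, mm_hidden_size)

# Stand-in for the multimodal projector: a linear map 1024 -> 4096.
projector = np.random.randn(mm_hidden_size, hidden_size)

# After projection, vision tokens match the LLM's embedding width,
# which is why the observed vision output is [1, 576, 4096].
projected = vision_features @ projector
print(projected.shape)  # (1, 576, 4096)
```

Under these assumptions, 576 × 4096 is simply 576 patch tokens at the LLM's width, and the text embeddings would also be `hidden_size` (4096), not `mm_hidden_size` (1024).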

wenxuanmou commented 2 months ago

Same question. Have you found a solution? Thanks.