haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Image and Text Embeddings For Downstream Tasks #1642

Open dipikakhullar opened 3 months ago

dipikakhullar commented 3 months ago

Question

How can I get the image and text embeddings for another task, and what size are these embeddings? Here is what I know: the vision output shape is torch.Size([1, 576, 4096]) and the text output shape is torch.Size([1, 128, 4096]).

I got the dimensions of both the vision and text embeddings from the model configuration: the vision embeddings are set to 4096 as per `hidden_size`, and the text embeddings are set to 1024 as per `mm_hidden_size`.

But the last dimension of the text output shape and the `mm_hidden_size` value (1024) do not match up. Also, 576 × 4096 seems very large.
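For context on the shape bookkeeping: in LLaVA's config, `hidden_size` is the language model's embedding width and `mm_hidden_size` is the vision tower's output width, and the multimodal projector maps vision features from the latter to the former, so both streams end up 4096-dimensional by the time they reach the LLM. A minimal sketch (random stand-in weights, not real LLaVA weights or its actual projector API) of how the reported shapes fit together:

```python
import numpy as np

# Assumptions (hypothetical stand-ins, not loaded from a real checkpoint):
#   - CLIP ViT-L/14 at 336px -> (336 / 14) ** 2 = 576 patch tokens
#   - mm_hidden_size = 1024  (vision tower output width)
#   - hidden_size    = 4096  (LLM embedding width)
mm_hidden_size, hidden_size = 1024, 4096
num_patches = (336 // 14) ** 2  # 576

# Vision tower output: one image -> 576 patch features of width 1024.
vision_features = np.random.randn(1, num_patches, mm_hidden_size)

# Stand-in for the multimodal projector: a linear map 1024 -> 4096.
projector = np.random.randn(mm_hidden_size, hidden_size)

# After projection, vision tokens match the LLM's embedding width,
# which is why the observed vision output is [1, 576, 4096].
projected = vision_features @ projector
print(projected.shape)  # (1, 576, 4096)
```

Under these assumptions, 576 × 4096 is simply 576 patch tokens at the LLM's width, and the text embeddings would also be `hidden_size` (4096), not `mm_hidden_size` (1024).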

wenxuanmou commented 2 months ago

Same question. Have you found a solution? Thanks.