Lala-chick opened 8 months ago
I was wondering the same question today, but on LLaVA-1.6 👀
same question
same question!
same question, any solutions?
same question!
I’ve been looking into this for a while now. It definitely seems to be possible. See: https://arxiv.org/html/2404.01331v1#S4.F2 for an example :)
That paper cites another work that is extremely insightful, with code examples for raw CLIP models. I'm assuming the same technique can be applied to LLaVA-based models as well.
I'll be doing some more digging, but if anyone else has figured this out, please reach out!
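In the meantime, here's a rough sketch of the general idea, in case it helps anyone. This is NOT LLaVA's own code — the function name, the `img_start` offset, and the toy tensors are all my own placeholders. With a real Hugging Face LLaVA checkpoint you'd run the forward pass with `output_attentions=True`, pick one layer's attention tensor, and feed it in; here I just use random numbers so the shapes are clear. 576 patches corresponds to LLaVA-1.5's 24×24 image-token grid:

```python
import numpy as np

def image_attention_heatmap(attn, img_start, num_patches):
    """Average attention onto the image-patch token span and
    reshape it into a square patch grid.

    attn: (num_heads, seq_len, seq_len) attention weights from one layer.
    img_start: index of the first image-patch token in the sequence
               (depends on your prompt template -- placeholder here).
    num_patches: number of image tokens (must be a perfect square).
    """
    side = int(num_patches ** 0.5)
    assert side * side == num_patches, "expected a square patch grid"
    # Mean over heads, then over query positions: how much attention
    # each key (token) receives on average.
    per_token = attn.mean(axis=0).mean(axis=0)
    heatmap = per_token[img_start:img_start + num_patches].reshape(side, side)
    return heatmap / heatmap.max()  # normalize to [0, 1] for display

# Toy demo with random "attention" standing in for real model outputs.
rng = np.random.default_rng(0)
num_heads, seq_len = 8, 600
attn = rng.random((num_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax

heat = image_attention_heatmap(attn, img_start=5, num_patches=576)
print(heat.shape)  # (24, 24)
```

From there you'd upsample the 24×24 grid to the image resolution and overlay it (e.g. with matplotlib). Note that raw attention averaging is the crudest variant — the repo linked below does something more principled, closer to attention rollout/relevancy propagation.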
take a look at this repo here: https://github.com/zjysteven/VLM-Visualizer
Have there been any updates on this?
Question
Hello. Thank you for sharing such an impressive model. While using LLaVA, I would like to see which parts of the image the model focuses on for a given prompt. Could you provide any guidance?