Lala-chick opened 8 months ago
I was wondering the same question today, but on LLaVA-1.6 👀
same question
same question!
same question, any solutions?
same question!
I’ve been looking into this for a while now. It definitely seems to be possible. See: https://arxiv.org/html/2404.01331v1#S4.F2 for an example :)
That paper cites another work that is extremely insightful, with code examples for raw CLIP models. I'm assuming the same technique can be applied to LLaVA-based models as well.
I'll be doing some more digging, but if anyone else has figured this out, please reach out!
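In the meantime, here's a rough sketch of the general idea, in case it helps anyone. This is NOT LLaVA's own code — the function name, the `img_start` offset, and the toy tensors are all my own placeholders. With a real Hugging Face LLaVA checkpoint you'd run the forward pass with `output_attentions=True`, pick one layer's attention tensor, and feed it in; here I just use random numbers so the shapes are clear. 576 patches corresponds to LLaVA-1.5's 24×24 image-token grid:

```python
import numpy as np

def image_attention_heatmap(attn, img_start, num_patches):
    """Average attention onto the image-patch token span and
    reshape it into a square patch grid.

    attn: (num_heads, seq_len, seq_len) attention weights from one layer.
    img_start: index of the first image-patch token in the sequence
               (depends on your prompt template -- placeholder here).
    num_patches: number of image tokens (must be a perfect square).
    """
    side = int(num_patches ** 0.5)
    assert side * side == num_patches, "expected a square patch grid"
    # Mean over heads, then over query positions: how much attention
    # each key (token) receives on average.
    per_token = attn.mean(axis=0).mean(axis=0)
    heatmap = per_token[img_start:img_start + num_patches].reshape(side, side)
    return heatmap / heatmap.max()  # normalize to [0, 1] for display

# Toy demo with random "attention" standing in for real model outputs.
rng = np.random.default_rng(0)
num_heads, seq_len = 8, 600
attn = rng.random((num_heads, seq_len, seq_len))
attn /= attn.sum(axis=-1, keepdims=True)  # rows sum to 1, like softmax

heat = image_attention_heatmap(attn, img_start=5, num_patches=576)
print(heat.shape)  # (24, 24)
```

From there you'd upsample the 24×24 grid to the image resolution and overlay it (e.g. with matplotlib). Note that raw attention averaging is the crudest variant — the repo linked below does something more principled, closer to attention rollout/relevancy propagation.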
take a look at this repo here: https://github.com/zjysteven/VLM-Visualizer
Have there been any updates on this?
Question
Hello. Thank you for sharing such an impressive model. While using LLaVA, I would like to see which parts of the image the model focuses on for a given prompt. Could you provide any guidance?