First of all, thank you for your interest in our work and for your thoughtful questions.
Can I presume that in Figure 3, the bright square regions in the bottom-left image correspond to the spike regions in the attention plot? -Yes, that's correct. In Figure 3, the top image is the original image, and the bottom image highlights the spike attention patches. Therefore, the bright square regions in the bottom-left part of the image correspond to the attention spike regions.
I would like to know if you have conducted any case study on how the spike regions are located with regard to the question?
-Although we did not include a case study of spike regions for different questions on the same image in the paper, we have empirically confirmed that the spike regions do not change dynamically even when different questions are posed for a fixed image.
For example, asking a question related to "yacht" instead of "bench" (or anything else) in Figure 3 does not significantly change the spike regions. However, since the content of the text query affects the attention distribution to some extent, minor variations in the spike regions can still occur.
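For reference, here is a minimal sketch (not the authors' code) of how such spike patches could be located from a decoder attention map; the `attn` input format, the 24x24 patch grid, and the z-score threshold are all assumptions made for illustration:

```python
import numpy as np

def spike_patches(attn, grid=24, z_thresh=3.0):
    """Locate "spike" image patches from attention weights.

    attn: array of shape (num_text_tokens, grid * grid) holding the attention
          from generated text tokens to the image tokens (e.g., averaged over
          heads and layers) -- an assumed input format, not the paper's exact one.
    Returns a (grid, grid) boolean mask; True marks patches whose mean received
    attention is an outlier (more than z_thresh standard deviations above the mean).
    """
    per_patch = attn.mean(axis=0)                        # mean attention per image token
    z = (per_patch - per_patch.mean()) / (per_patch.std() + 1e-8)
    return (z > z_thresh).reshape(grid, grid)

# Example: the bright squares in the bottom image of Figure 3 would correspond
# to the True cells of such a mask.
mask = spike_patches(np.random.rand(32, 576))
```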
Do you explore any distributional pattern of blind tokens on the downscaled v.s. the grid portion? -To the best of our knowledge, the image grid slicing and downscaling you mentioned are applied in the LLaVA-HD version. We used the LLaVA-1.5 7B version for our experiments, so our analysis of the LLaVA model was conducted on the 576 image tokens produced by openai/clip-vit-large-patch14-336.
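As a side note, the 576 figure comes directly from the vision encoder's patch grid: a 336x336 input with 14x14 patches gives a 24x24 grid of patch tokens. A quick check (a sketch assuming the standard Hugging Face transformers package and checkpoint name):

```python
from transformers import CLIPVisionConfig

cfg = CLIPVisionConfig.from_pretrained("openai/clip-vit-large-patch14-336")
grid = cfg.image_size // cfg.patch_size   # 336 // 14 = 24
print(grid * grid)                        # 576 image tokens per image in LLaVA-1.5
```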
If you have any further questions, please feel free to ask. Thank you.
Dear author, Thank you for the detailed explanations!
First, thank you for sharing this insightful work! Your finding that LVLMs' attention is usually biased toward a certain portion of the image tokens is really interesting.

Can I presume that in Figure 3, the bright square regions in the bottom-left image correspond to the spike regions in the attention plot?

I would like to know if you have conducted any case study on how the spike regions are located with regard to the question? e.g., in Figure 3 the question is about the "bench", but not many spike regions seem to focus on the bench, which would explain your experiments in Figure 2. (Please correct me if I'm wrong.)

As far as I know, LLaVA-1.5 and following versions take a downscaled version of the original image, together with grid-sliced sub-images, and concatenate them as "image tokens". Do you explore any distributional pattern of blind tokens on the downscaled vs. the grid portion? e.g., does a blind token in a grid slice also appear in the downscaled image, or vice versa?
Thank you for your efforts, and I hope your work is accepted at your submitted venue!