thunlp / LLaVA-UHD

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

[Question] `image_features` not matched to input text #11

Open sibosutd opened 3 months ago


https://github.com/thunlp/LLaVA-UHD/blob/69e75d0cc6bc4d6000045f08f94852d2d465cd91/llava_uhd/train/llava-uhd/adapt_llava.py#L169-L173

  1. In the code snippet above, the value of `cur_image_idx` does not change within a single batch. This implies that `cur_image_features` is identical for every image in the same batch, which seems unusual. Could you confirm whether this is the intended behavior?

  2. I am also confused by the line `for j in range(5):` and the expression `j*16`. Based on the settings used in the Resampler, I would expect `image_features` to have shape `[batch_size*8, 64, 5120]`. Can you clarify why the image features are selected using `for j in range(5):` and `j*16`?
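
To make both points concrete, here is a minimal sketch in plain Python. It is not the code from `adapt_llava.py`: the helper names `loop_offsets` and `select_per_sample`, the toy feature list, and the constant `NUM_QUERIES = 64` (the token count I expect from the Resampler) are assumptions made only for illustration.

```python
# Hypothetical illustration only; this is NOT the code from adapt_llava.py.
NUM_QUERIES = 64  # assumed number of tokens per image slice (from the expected shape)
STEP = 16         # the stride in the expression j*16
NUM_ITERS = 5     # the loop bound in `for j in range(5):`


def loop_offsets(num_iters=NUM_ITERS, step=STEP):
    """Return the offsets produced by j*step for j in range(num_iters)."""
    return [j * step for j in range(num_iters)]


def select_per_sample(image_features, advance_index):
    """Pick one feature entry per sample, optionally advancing cur_image_idx."""
    cur_image_idx = 0
    picked = []
    for _ in range(len(image_features)):
        picked.append(image_features[cur_image_idx])
        if advance_index:
            cur_image_idx += 1  # move on to the next image's features
    return picked


offsets = loop_offsets()
print(offsets)                   # [0, 16, 32, 48, 64]
print(NUM_ITERS * STEP)          # 80, which differs from the expected 64

features = ["img0", "img1", "img2"]  # stand-ins for per-image feature tensors
print(select_per_sample(features, advance_index=False))  # ['img0', 'img0', 'img0']
print(select_per_sample(features, advance_index=True))   # ['img0', 'img1', 'img2']
```

If the index is never advanced, every sample receives the first image's features, which is the behavior described in point 1. For point 2, five steps of 16 span 80 positions rather than 64, so the loop does not obviously line up with the shape I expect; which axis `j*16` actually indexes is exactly what I would like clarified.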