shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

What should key_position be on mPLUG-Owl2? #7

Closed BillChan226 closed 5 months ago

BillChan226 commented 7 months ago

I'm trying to apply OPERA to mPLUG-Owl2, but I'm stuck on deciding which values I should assign to the "image_start", "image_end", and "response_start" keys. I tried to copy the setup code from LLaVA-1.5, but it didn't work. I'm also having a hard time understanding why "NUM_IMAGE_TOKENS = 576" for LLaVA-1.5.

Could you kindly give me a hint on what these three values represent and how to set them for mPLUG-Owl2? Thank you!

shikiw commented 7 months ago

Hi,

Thanks for your appreciation! You may kindly check out this issue https://github.com/shikiw/OPERA/issues/2#issuecomment-1851351679 for an example of how to set the "image_start", "image_end", and "response_start" keys. Similarly, mPLUG-Owl2 also has image tokens obtained from the Visual Abstractor and text tokens obtained from the Text Embedding Layer, which together make up all of the input tokens for the Language Decoder. You may specify the token indexes before generation, as in https://github.com/shikiw/OPERA/blob/dba0dda9457a3234d22ef4b60ea38b74a02d3905/minigpt4/models/mini_gpt4.py#L375-L380
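For illustration, here is a minimal sketch of how those indexes could be computed for mPLUG-Owl2, assuming the decoder input is laid out as [text tokens before image] + [Visual Abstractor tokens] + [text tokens after image]. The variable names and shapes below are hypothetical stand-ins, not code from the repo; for the exact off-by-one convention, follow the MiniGPT-4 snippet linked above.

```python
import torch

# Hypothetical shapes standing in for the real mPLUG-Owl2 embeddings (assumed).
embs_before_image = torch.zeros(1, 35, 4096)   # text tokens preceding the image block
image_embs = torch.zeros(1, 64, 4096)          # Visual Abstractor output (64 query tokens, assumed)
embs_after_image = torch.zeros(1, 20, 4096)    # text tokens following the image block
inputs_embeds = torch.cat([embs_before_image, image_embs, embs_after_image], dim=1)

key_position = {
    "image_start": embs_before_image.shape[1],                      # index of the first image token
    "image_end": embs_before_image.shape[1] + image_embs.shape[1],  # one past the last image token
    "response_start": inputs_embeds.shape[1],                       # generated tokens begin after the prompt
}
print(key_position)  # {'image_start': 35, 'image_end': 99, 'response_start': 119}
```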

For LLaVA-1.5, "NUM_IMAGE_TOKENS = 576" because it adopts CLIP ViT-L/14 (at 336x336 input resolution) as the vision encoder: the image is split into a 24x24 grid of 14x14 patches, so the encoder outputs 576 image tokens as the visual representation.
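The arithmetic, spelled out:

```python
# LLaVA-1.5 vision encoder: CLIP ViT-L/14 at 336x336 input resolution.
image_size = 336
patch_size = 14
num_image_tokens = (image_size // patch_size) ** 2
print(num_image_tokens)  # 24 * 24 = 576
```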